CMSC 330, Spring 2013
Organization of Programming Languages
In this project, you will implement classes and methods for working with regular expressions. When you are done with this project, you will be able to compile regular expressions to NFAs and DFAs, use them to recognize strings. You will also conduct an experiment comparing the performance of different regular expression implementations.
To get you started, we've supplied you several files containing code that you'll use for this project. The file example.rb gives some Ruby code showing how to use the other files. We discuss the other three files in turn.
Regular expressions (re.rb)
The file re.rb contains five classes that we will use to represent regular expressions as abstract syntax trees (ASTs): REEpsilon (the empty string), REConst (one character), REStar (Kleene closure), REUnion (disjunction), and REConcat (concatenation). Each of these classes has a constructor that takes the appropriate number of arguments depending on how many subexpressions each syntactic form has. For example, here are some regular expressions along with code to create them using these classes.
The regular expression corresponding to r4 above is depicted below; this figure illustrates why we call it an abstract syntax tree:
Each of the five classes also has accessor methods to get the child expressions. These are created with the Ruby method attr_accessor :name, which creates a getter method name to retrieve field @name and a setter method name= to write to it. For example, REUnion has two fields, expr1 and expr2, so we could write r3.expr1 (r3 is defined in the table above) to get the left disjunct (REConst.new("a") in this case) and r3.expr2 to get the right disjunct (REConst.new("b")).
We also included some utility methods in these classes: An == method to compare two regular expressions structurally, a hash method to construct hash keys for regexps, and a to_s method to present a regular expression in textual form. This last method adds lots of extra parentheses so that precedence is clear.
Parsing regular expressions (parser.rb)
As you have no doubt guessed, constructing regular expression ASTs by hand is tedious and error-prone. To help make it easier for you to test your project, we've included a class Parser in parser.rb to translate the more usual string representation of a regular expression into the appropriate AST. To use Parser, all you need to do is instantiate it and invoke its parse(s) method, e.g., r = Parser.new.parse("abc") will give you the AST r4 above.
This particular parser is not very sophisticated, and we do not guarantee that it is bug free. Regular expressions in this language are over upper- and lowercase alphabetic characters. As usual, Kleene closure is *, and binds more tightly than conjunction (represented by juxtaposition, as usual), which itself binds more tightly than disjunction, which is |.
Finite automata (fa.rb)
Finally, fa.rb contains a class NFA that we will use to represent non-deterministic and deterministic finite automata (recall that a DFA is just a special case of an NFA). An NFA instance contains three fields to represent the key parts of the NFA: @start, the initial state; @final, an array containing the final states; and @delta, a hash containing the transitions.
In these NFAs, a "state" is an arbitrary Ruby object. For example, the file example.rb uses integers to represent states. In our formal translation of regexps to NFAs, we labeled states with regexps, and you can do the same here as well. (Note that states cannot be nil.)
The set of transitions @delta has, roughly speaking, type Hash<State, Hash<String, Array<State>>>: it takes a state and and produces a hash that takes a symbol (a string of length 1) and produces an array of states. Thus, @delta[s1][x] is an array containing all the states that we transition to from state s1 on input symbol x. We may examine this data structure directly during grading, so please do not change the representation of @delta.
The class NFA includes several methods to help construct and work with NFAs. The method set_start(s) sets the NFA's start state to s. The method make_final(s) adds s to the list of final states. The method add_trans(s1, sym, s2) adds a transition from s1 to s2 on symbol sym. Note that although sym is a string, it must always have length 1. Next, the method states returns an array containing all the states in the NFA. This includes the start state, all the final states, and any state appearing in a transition. The method to_s gives a readable representation of an NFA. Finally, accept?(s) returns an accepting state if the NFA accepts s, and nil otherwise.
Part 1: Implement check() and trans()
For the first part of the project, you will implement the check(e) and trans(e) functions from slides 61 and 68, respectively. Both of these functions work by recursively visiting the AST of the regular expression. There are several ways to visit ASTs in an object-oriented language, but we'll do something simple for this project: For each of the five regular expression classes, you will implement methods check and trans. (Thus, instead of check(e) as we wrote in the slides, in this project you'll call e.check, and similarly for trans.) For the compound regular expressions (REStar, REUnion, REConcat), these will likely recursively call check and trans on the subexpressions. For example, REUnion#check will call @expr1.check and @expr2.check.
Your method check should return true if self accepts the empty string, and false otherwise.
Your method trans should return an Array where each element is itself an Array whose first component is a String (a symbol) and whose second component is a regexp. See the code for REConst#trans for an example (we've already filled this in for you).
Part 2: Translate regular expressions to NFAs
Next, implement the method NFA.of_re(r), which, given a regexp r returns an NFA accepting the same language. Use the reduce algorithm given on slide 72.
Part 3: NFAs to DFAs
Next, write a method NFA#is_dfa? that returns true iff self is a deterministic finite automaton, and write a method NFA#to_dfa that returns a new DFA that accepts the same language as self (which should not be modified). Use the algorithm on slide 81 for NFA#to_dfa. You do not need to (and should not) minimize the resulting DFA. Both is_dfa? and to_dfa should work correctly if self has an implicit dead state (i.e., if there are states that do not include transitions for all possible inputs).
Part 4: Regexp bake off
In parts 1-3, you developed code to generate DFAs from regexps. For the last part of the project, your job is to develop and conduct an experiment that answers the following question: Which implementation performs regular expression matching the fastest: (1) regexps translated to NFAs, as implemented in parts 1-2; (2) regexps translated to DFAs, as implemented in parts 1-3; or (3) native regexps in Ruby?
To answer this question, at a minimum you will need to pick at least six regular expressions, match the regexps against a large number of different strings under each of the three approaches (enough so that it takes at least a second or so), and time the results. Be sure to add ^ and $ to the Ruby regexp to make it do exact matching, so that you're comparing times for the same thing. For example, if you run your implementation on the regexp "330", you should use the Ruby regexp /^330$/.
On a Unix machine, you can use the "time" command to run a program and measure how long it took (use "real," the wall-clock time, as your measure). In Ruby, you can use Time.now.usec to get a high-resolution (microsecond) time. To help produce more accurate results, run your time trials for each variant at least 3 times and average the results.
Some things you will need to think about for this experiment are: If you match a single regexp r against many different strings, you can amortize the cost of translating r to an NFA/DFA, and only do that once. How does that affect the results? Does the particular regular expression matter? For example, does it matter if the regexp is long? Includes many disjunctions? Includes Kleene closure? Does it matter if the string being matched is long?
Write up your experiment and results in a short (1-2 page) plain text or pdf file called experiment.txt or experiment.pdf, and submit it along with the rest of your project file. Your writeup should contain the following:
This part of the project will worth approximately 20% of the project grade, and will be graded on a scale of 0-3 points:
Hints and Tips
Public TestsThe zip file containing the public tests has several Ruby programs in it; the public tests are in the files public_check_ol.rb, public_is_dfa_ol.rb, etc. Once you have copied your fa.rb and re.rb files into the p2_public_tests directory, you should be able to run the files there. For example,
% ruby public_check_ol.rb test check on re-ol-2 PASSED test check on re-ol-3 PASSEDIf you look inside the test files, you will see that they invoke the script tester.rb, which can be called in several ways. Here are the possible sets of arguments:
false true false false false false true false true false(This output is compared against that in outputs/check-re-ol-2.out.) You can see that the input file is a regular expression file inputs/re-ol-2 in which each line has a regular expression on it. For each regexp, it will print whether that regexp accepts the empty string (by calling the check method on the NFA your code constructed from the regexp).
As another example, the test file public_of_re_ol.rb calls ruby tester.rb of_re inputs/re-ol-2 inputs/str-ol-2 and thus tries to match each string in inputs/str-ol-2 to each regular expression in inputs/re-ol-2 (and the output is compared against the file outputs/of_re-ol-2.out).
You are advised to construct your own tests by modifying these public tests to have different strings to match, different regular expressions to match against, etc.
You may either submit over the web or under command line interface.
Submitting over the webPut your files (re.rb, fa.rb, and write-up) in a zip archive. Don't put things under a directory but on the top level.
Submit your archive directly to the submit server by clicking on the submit link in the column "web submission".
Next, use the submit dialog to submit your schedule.rb file directly.
Select your file using the "Browse" button, then press the "Submit project!" button.
Submitting in command line interfaceMake a directory and put your files in it. Download .submit and submit.jar in Project Files to the same directory. Then run the following command under that directory:
The first time you submit this way you will be asked to enter your directory ID and password. All files in the directory (and its subdirectories) will then be put in a jar file and submitted to the submit server. If your submission is successful you will see the message:
The Campus Senate has adopted a policy asking students to include the following statement on each assignment in every course: "I pledge on my honor that I have not given or received any unauthorized assistance on this assignment." Consequently your program is requested to contain this pledge in a comment near the top.
Please carefully read the academic honesty section of the course syllabus. Any evidence of impermissible cooperation on projects, use of disallowed materials or resources, or unauthorized use of computer accounts, will be submitted to the Student Honor Council, which could result in an XF for the course, or suspension or expulsion from the University. Be sure you understand what you are and what you are not permitted to do in regards to academic integrity when it comes to project assignments. These policies apply to all students, and the Student Honor Council does not consider lack of knowledge of the policies to be a defense for violating them. Full information is found in the course syllabus---please review it at this time.