CMSC 330, Spring 2013

Organization of Programming Languages

Project 2

Due 11:59pm Thu, Feb 28, 2013

Introduction

In this project, you will implement classes and methods for working with regular expressions. When you are done with this project, you will be able to compile regular expressions to NFAs and DFAs, use them to recognize strings. You will also conduct an experiment comparing the performance of different regular expression implementations.

Project files

  • Code skeleton zipfile contains the following files:
  • Files for command line submission (see "How to Submit" for usage)
  • Public tests: copy your re.rb and fa.rb into the p2_public_tests directory and run the tests.

Code Overview

To get you started, we've supplied you several files containing code that you'll use for this project. The file example.rb gives some Ruby code showing how to use the other files. We discuss the other three files in turn.

Regular expressions (re.rb)

The file re.rb contains five classes that we will use to represent regular expressions as abstract syntax trees (ASTs): REEpsilon (the empty string), REConst (one character), REStar (Kleene closure), REUnion (disjunction), and REConcat (concatenation). Each of these classes has a constructor that takes the appropriate number of arguments depending on how many subexpressions each syntactic form has. For example, here are some regular expressions along with code to create them using these classes.

RegexpRuby code
empty stringr1 = REEpsilon.new
cr2 = REConst.new("c")
a|br3 = REUnion.new(REConst.new("a"), REConst.new("b"))
abcr4 = REConcat.new(REConst.new("a"), REConcat(REConst.new("b"), REConst.new("c"))
a*br5 = REConcat.new(REStar.new(REConst.new("a")), REConst.new("b"))

The regular expression corresponding to r4 above is depicted below; this figure illustrates why we call it an abstract syntax tree:

AST for regexp /abc/

Each of the five classes also has accessor methods to get the child expressions. These are created with the Ruby method attr_accessor :name, which creates a getter method name to retrieve field @name and a setter method name= to write to it. For example, REUnion has two fields, expr1 and expr2, so we could write r3.expr1 (r3 is defined in the table above) to get the left disjunct (REConst.new("a") in this case) and r3.expr2 to get the right disjunct (REConst.new("b")).

We also included some utility methods in these classes: An == method to compare two regular expressions structurally, a hash method to construct hash keys for regexps, and a to_s method to present a regular expression in textual form. This last method adds lots of extra parentheses so that precedence is clear.

Parsing regular expressions (parser.rb)

As you have no doubt guessed, constructing regular expression ASTs by hand is tedious and error-prone. To help make it easier for you to test your project, we've included a class Parser in parser.rb to translate the more usual string representation of a regular expression into the appropriate AST. To use Parser, all you need to do is instantiate it and invoke its parse(s) method, e.g., r = Parser.new.parse("abc") will give you the AST r4 above.

This particular parser is not very sophisticated, and we do not guarantee that it is bug free. Regular expressions in this language are over upper- and lowercase alphabetic characters. As usual, Kleene closure is *, and binds more tightly than conjunction (represented by juxtaposition, as usual), which itself binds more tightly than disjunction, which is |.

Finite automata (fa.rb)

Finally, fa.rb contains a class NFA that we will use to represent non-deterministic and deterministic finite automata (recall that a DFA is just a special case of an NFA). An NFA instance contains three fields to represent the key parts of the NFA: @start, the initial state; @final, an array containing the final states; and @delta, a hash containing the transitions.

In these NFAs, a "state" is an arbitrary Ruby object. For example, the file example.rb uses integers to represent states. In our formal translation of regexps to NFAs, we labeled states with regexps, and you can do the same here as well. (Note that states cannot be nil.)

The set of transitions @delta has, roughly speaking, type Hash<State, Hash<String, Array<State>>>: it takes a state and and produces a hash that takes a symbol (a string of length 1) and produces an array of states. Thus, @delta[s1][x] is an array containing all the states that we transition to from state s1 on input symbol x. We may examine this data structure directly during grading, so please do not change the representation of @delta.

The class NFA includes several methods to help construct and work with NFAs. The method set_start(s) sets the NFA's start state to s. The method make_final(s) adds s to the list of final states. The method add_trans(s1, sym, s2) adds a transition from s1 to s2 on symbol sym. Note that although sym is a string, it must always have length 1. Next, the method states returns an array containing all the states in the NFA. This includes the start state, all the final states, and any state appearing in a transition. The method to_s gives a readable representation of an NFA. Finally, accept?(s) returns an accepting state if the NFA accepts s, and nil otherwise.

Part 1: Implement check() and trans()

For the first part of the project, you will implement the check(e) and trans(e) functions from slides 61 and 68, respectively. Both of these functions work by recursively visiting the AST of the regular expression. There are several ways to visit ASTs in an object-oriented language, but we'll do something simple for this project: For each of the five regular expression classes, you will implement methods check and trans. (Thus, instead of check(e) as we wrote in the slides, in this project you'll call e.check, and similarly for trans.) For the compound regular expressions (REStar, REUnion, REConcat), these will likely recursively call check and trans on the subexpressions. For example, REUnion#check will call @expr1.check and @expr2.check.

Your method check should return true if self accepts the empty string, and false otherwise.

Your method trans should return an Array where each element is itself an Array whose first component is a String (a symbol) and whose second component is a regexp. See the code for REConst#trans for an example (we've already filled this in for you).

Part 2: Translate regular expressions to NFAs

Next, implement the method NFA.of_re(r), which, given a regexp r returns an NFA accepting the same language. Use the reduce algorithm given on slide 72.

Part 3: NFAs to DFAs

Next, write a method NFA#is_dfa? that returns true iff self is a deterministic finite automaton, and write a method NFA#to_dfa that returns a new DFA that accepts the same language as self (which should not be modified). Use the algorithm on slide 81 for NFA#to_dfa. You do not need to (and should not) minimize the resulting DFA. Both is_dfa? and to_dfa should work correctly if self has an implicit dead state (i.e., if there are states that do not include transitions for all possible inputs).

Part 4: Regexp bake off

In parts 1-3, you developed code to generate DFAs from regexps. For the last part of the project, your job is to develop and conduct an experiment that answers the following question: Which implementation performs regular expression matching the fastest: (1) regexps translated to NFAs, as implemented in parts 1-2; (2) regexps translated to DFAs, as implemented in parts 1-3; or (3) native regexps in Ruby?

To answer this question, at a minimum you will need to pick at least six regular expressions, match the regexps against a large number of different strings under each of the three approaches (enough so that it takes at least a second or so), and time the results. Be sure to add ^ and $ to the Ruby regexp to make it do exact matching, so that you're comparing times for the same thing. For example, if you run your implementation on the regexp "330", you should use the Ruby regexp /^330$/.

On a Unix machine, you can use the "time" command to run a program and measure how long it took (use "real," the wall-clock time, as your measure). In Ruby, you can use Time.now.usec to get a high-resolution (microsecond) time. To help produce more accurate results, run your time trials for each variant at least 3 times and average the results.

Some things you will need to think about for this experiment are: If you match a single regexp r against many different strings, you can amortize the cost of translating r to an NFA/DFA, and only do that once. How does that affect the results? Does the particular regular expression matter? For example, does it matter if the regexp is long? Includes many disjunctions? Includes Kleene closure? Does it matter if the string being matched is long?

Write up your experiment and results in a short (1-2 page) plain text or pdf file called experiment.txt or experiment.pdf, and submit it along with the rest of your project file. Your writeup should contain the following:

  1. Your experimental design. Include enough information here that we could reproduce the experiment if we wanted to. List the regular expressions you use, how many times you run them, describe the kinds of strings you match against, etc. (Tip: you can use the wget url command on Linuxlab to download web pages; these might be a good source of large amounts of text to search through.)
  2. A description of the equipment on which you ran the experiment (if you used a linuxlab machine, just give us the name of the machine; if you used your own machine, describe its processor, speed, available memory, and operating system).
  3. The results of the experiment. Summarize the results of the measurements you took. For example, if you ran several different regexps against large amounts of text, you could report the average search time for each regexp and each implementation (NFAs, DFAs, Ruby internal) per input line.
  4. A very brief analysis speculating why the experimental result is as reported.

This part of the project will worth approximately 20% of the project grade, and will be graded on a scale of 0-3 points:

  • 3 points - solid experiment well above and beyond the minimum
  • 2 points - solid, but minimal experiment
  • 1 point - problems with the experiment
  • 0 points - serious issues with experiment
We will not grade you on grammar or spelling, unless there are problems to the point that we cannot understand your writeup.

Hints and Tips

  • Several of the project parts depend on each other. When we grade your project, we will plug in correct versions of any necessary methods; for example, when grading your regexp to nfa conversion, we will "monkey patch" your code with our versions of check() and trans(). While this should help make grading equitable, it also means you need to be extra careful not to change any of the interfaces we supply you with, since we will be relying on those in grading.
  • Beware that a hash cannot itself be used as a key in another hash. For example, if you had h = { } and then g = { }; g[h] = 1 then Ruby will complain. This issue may arise if you use hashes to represent sets; note that the NFA-to-DFA reduction takes a set of NFA states and combines them into a single DFA state. If the latter is represented as a hash-based set, then you will not be able to store it as the key for a hash.

Public Tests

The zip file containing the public tests has several Ruby programs in it; the public tests are in the files public_check_ol.rb, public_is_dfa_ol.rb, etc. Once you have copied your fa.rb and re.rb files into the p2_public_tests directory, you should be able to run the files there. For example,
% ruby public_check_ol.rb 
test check on re-ol-2 PASSED
test check on re-ol-3 PASSED
If you look inside the test files, you will see that they invoke the script tester.rb, which can be called in several ways. Here are the possible sets of arguments:
  • check re-file

    The file re-file is expected to contain regular expressions, one per line. For each, line it will call our parser (in parser.rb) which in turn will invoke your code to read in the regular expression and convert it to an NFA. Then it will call the check method on each NFA and print the result.

  • trans re-file

    The file re-file file is read in and each regular expression is converted to an NFA, as above, but this time it will invoke the trans method on each NFA and then will print the resulting set of transitions.

  • of_re re-file str-file

    After reading in each regular expression in re-file and converting it to an NFA, it will check whether the strings in str-file (one per line) are matched by each NFA. It will print the results of the match, and will print whether the matching results differ from what Ruby's matcher would do, if you had typed in the regular expression directly.

  • is_dfa nfa-file

    Reads in each NFA from the input file and invokes (and prints the output of) your is_dfa? method on the NFA.

  • to_dfa nfa-file str-file

    Converts each NFA in the input file to a DFA (using your to_dfa method) and then attempts to accept each string in str-file using the DFA. Makes sure that the behavior for the DFA is the same as the original NFA it was converted from.
The public tests themselves invoke tester.rb and then compare its output with an output file in the outputs/ directory. As an example, the test public_check_ol.rb will invoke ruby tester.rb check inputs/re-ol-2, which produces
false
true
false
false
false
false
true
false
true
false
(This output is compared against that in outputs/check-re-ol-2.out.) You can see that the input file is a regular expression file inputs/re-ol-2 in which each line has a regular expression on it. For each regexp, it will print whether that regexp accepts the empty string (by calling the check method on the NFA your code constructed from the regexp).

As another example, the test file public_of_re_ol.rb calls ruby tester.rb of_re inputs/re-ol-2 inputs/str-ol-2 and thus tries to match each string in inputs/str-ol-2 to each regular expression in inputs/re-ol-2 (and the output is compared against the file outputs/of_re-ol-2.out).

You are advised to construct your own tests by modifying these public tests to have different strings to match, different regular expressions to match against, etc.

Submission

You may either submit over the web or under command line interface.

Submitting over the web

Put your files (re.rb, fa.rb, and write-up) in a zip archive. Don't put things under a directory but on the top level.

Submit your archive directly to the submit server by clicking on the submit link in the column "web submission".

Next, use the submit dialog to submit your schedule.rb file directly.

Select your file using the "Browse" button, then press the "Submit project!" button.

Submitting in command line interface

Make a directory and put your files in it. Download .submit and submit.jar in Project Files to the same directory. Then run the following command under that directory:

java -jar submit.jar

The first time you submit this way you will be asked to enter your directory ID and password. All files in the directory (and its subdirectories) will then be put in a jar file and submitted to the submit server. If your submission is successful you will see the message:

Successful submission # received for project 2

Academic Integrity

The Campus Senate has adopted a policy asking students to include the following statement on each assignment in every course: "I pledge on my honor that I have not given or received any unauthorized assistance on this assignment." Consequently your program is requested to contain this pledge in a comment near the top.

Please carefully read the academic honesty section of the course syllabus. Any evidence of impermissible cooperation on projects, use of disallowed materials or resources, or unauthorized use of computer accounts, will be submitted to the Student Honor Council, which could result in an XF for the course, or suspension or expulsion from the University. Be sure you understand what you are and what you are not permitted to do in regards to academic integrity when it comes to project assignments. These policies apply to all students, and the Student Honor Council does not consider lack of knowledge of the policies to be a defense for violating them. Full information is found in the course syllabus---please review it at this time.

Valid HTML 4.01!

Web Accessibility