Assignment #3: Perl


Abbreviation expansion                           DUE: March 9, 2000 - 6:00pm

Many interactive programs, such as text editors, allow a user to define abbreviations for commonly used strings. Once an abbreviation has been defined, it can be automatically translated into its expanded form. For example, if EPA is defined as an abbreviation for Environmental Protection Agency, then the sentence:

Fred's Chemical Company and Taco Shack has been fined by the EPA for illegally dumping toxic waste.

gets expanded into:

Fred's Chemical Company and Taco Shack has been fined by the Environmental Protection Agency for illegally dumping toxic waste.

Your assignment is to write a Perl program that rewrites input lines, replacing abbreviations with their expanded forms.

Each input line will contain zero or more strings (non-empty sequences of characters), delimited by whitespace (spaces, tabs, etc.). Input lines will be of four types:

  1. New context lines have the form:
    #CONTEXT#
    This means that a new local context for the input text is started.
  2. Definition lines are in the form:
    String1 #LOCAL_DEF# String2 String3 ··· StringN
    or
    String1 #GLOBAL_DEF# String2 String3 ··· StringN
    This means that String1 becomes a local or global abbreviation, respectively, for String2 String3 ··· StringN (i.e. String1 in the input expands to String2 String3 ··· StringN in the output).
  3. Undefinition lines are in the form:
    #LOCAL_UNDEF# String
    or
    #GLOBAL_UNDEF# String
    This removes the abbreviation represented by String from the current set of local or global abbreviations, respectively.
  4. Any other line of text. None of the markers #CONTEXT#, #LOCAL_DEF#, #GLOBAL_DEF#, #LOCAL_UNDEF# or #GLOBAL_UNDEF# will appear on a regular text line.
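
As a rough illustration of the four line types above, the following sketch classifies one input line by splitting it on whitespace. The helper name classify_line and the kind labels it returns are hypothetical, chosen only for this example; this is one possible approach, not the required one.

```perl
use strict;
use warnings;

# Hypothetical helper: classify one input line.
# Returns a kind label followed by the relevant strings.
sub classify_line {
    my ($line) = @_;

    # split ' ' splits on runs of whitespace and ignores leading whitespace
    my @words = split ' ', $line;

    # Type 1: the line consists of the single string #CONTEXT#
    return ('context') if @words == 1 && $words[0] eq '#CONTEXT#';

    # Type 2: String1 #LOCAL_DEF# ... or String1 #GLOBAL_DEF# ...
    if (@words >= 2 && $words[1] =~ /^#(LOCAL|GLOBAL)_DEF#$/) {
        return ('def', lc($1), $words[0], @words[2 .. $#words]);
    }

    # Type 3: #LOCAL_UNDEF# String or #GLOBAL_UNDEF# String
    if (@words == 2 && $words[0] =~ /^#(LOCAL|GLOBAL)_UNDEF#$/) {
        return ('undef', lc($1), $words[1]);
    }

    # Type 4: anything else is ordinary text
    return ('text', @words);
}

print join('|', classify_line("foo #LOCAL_DEF# foo bar baz")), "\n";  # def|local|foo|foo|bar|baz
print join('|', classify_line("  #GLOBAL_UNDEF#   foo ")), "\n";      # undef|global|foo
```

Because the line is split on whitespace first, extra whitespace before, between and after the required strings does not affect the classification.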

When your program encounters a line of type 1, a new local abbreviation context is created. Abbreviations are classified as either local or global. All local abbreviation definitions become undefined when a new context is entered (i.e. a #CONTEXT# line is reached), while global abbreviations remain defined until explicitly undefined (as described below). A #CONTEXT# line is not required at the beginning of an input data file.

When your program encounters a line of type 2, it should create a new global or local abbreviation for String1 and produce no output. A global abbreviation will be applied to all input lines between the definition line and a subsequent undefinition line, while a local abbreviation will apply to all input lines between the definition line and either a subsequent undefinition or a new context line. If one or more of the strings in String2 String3 ··· StringN is currently an abbreviation, replace each such string with its expansion before storing the expansion for String1. If an abbreviation of the same kind (global or local) already exists for String1, delete the old expansion only after determining the new expansion.

When your program encounters a line of type 3, it should delete the specified global or local abbreviation.
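
The handling of line types 1, 2 and 3 described above can be sketched with two hash tables, one per kind of abbreviation. The names %abbrev, expand_word, define, undefine and new_context are hypothetical. The sketch also assumes that, when the right-hand side of a definition is expanded, a local abbreviation is preferred over a global one; the text states that precedence only for lines of type 4.

```perl
use strict;
use warnings;

# Assumed storage: one table per kind; each value is an array ref of words.
my %abbrev = (local => {}, global => {});

# Return the expansion of a word (local first, then global),
# or the word itself if it is not currently an abbreviation.
sub expand_word {
    my ($word) = @_;
    for my $kind (qw(local global)) {
        return @{ $abbrev{$kind}{$word} } if exists $abbrev{$kind}{$word};
    }
    return ($word);
}

# Type 2: expand the right-hand side first, while any old definition
# of $name is still visible, then store the new expansion.
sub define {
    my ($kind, $name, @rhs) = @_;
    my @expansion = map { expand_word($_) } @rhs;
    $abbrev{$kind}{$name} = \@expansion;
    return;
}

# Type 3: remove one abbreviation of the given kind.
sub undefine {
    my ($kind, $name) = @_;
    delete $abbrev{$kind}{$name};
    return;
}

# Type 1: a new context discards every local abbreviation.
sub new_context {
    %{ $abbrev{local} } = ();
    return;
}

define('local',  'foo', qw(foo bar baz));
define('global', 'foo', qw(foo bar baz));
print "@{ $abbrev{global}{foo} }\n";   # foo bar baz bar baz
```

The two define calls mirror the second context of the sample input given later: the global definition is stored with the local expansion of foo already substituted.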

No other text is allowed on lines of types 1, 2 and 3 aside from that described, but whitespace before, in between and after the required strings is allowed.

When your program encounters a line of type 4, it should print out the input line, with each abbreviation expanded. Local abbreviations take precedence over global ones, meaning that the local abbreviation expansion should be used if both a global and local abbreviation are currently defined. Any abbreviation replaced must match an entire word, not just a substring (a part of a word). Words are delimited by whitespace.

No other words should be affected. Abbreviations within expanded text should be ignored. If the line contains no abbreviations, then it should be printed out unchanged (except that whitespace need not be preserved exactly).
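
The expansion rules for lines of type 4 might be sketched as follows. The tables %local and %global and the helper expand_line are hypothetical, and the table contents are filled in by hand purely for illustration (the entry nil stands for an abbreviation with an empty expansion). Each word is looked up exactly once, so text produced by an expansion is never expanded again.

```perl
use strict;
use warnings;

# Illustrative tables only; a real program would fill them from definition lines.
my %local  = (foo => 'foo bar baz');
my %global = (foo => 'foo bar baz bar baz', nil => '');

# Expand every whole word of a line, preferring local over global.
sub expand_line {
    my ($line) = @_;
    my @out;
    for my $word (split ' ', $line) {      # words are delimited by whitespace
        my $piece = exists $local{$word}  ? $local{$word}
                  : exists $global{$word} ? $global{$word}
                  :                         $word;
        push @out, $piece if length $piece;   # skip empty expansions
    }
    return join ' ', @out;
}

print expand_line("foo"), "\n";           # foo bar baz
print expand_line("foobar  bar"), "\n";   # foobar bar
```

Comparing whole words as hash keys, rather than substituting with a pattern, is what keeps foobar from being treated as containing the abbreviation foo.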

Lines of types 2 and 4 may also continue onto the following line, if the last character on the line is a backslash (\). This means that two or more input lines effectively translate to one abbreviation definition or output line. Remember that the newline at the end of an input line (e.g., after a line-ending backslash) counts as whitespace.
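
One way to handle continuation lines is to join physical lines into a single logical line before doing anything else. The sketch below uses a hypothetical helper, read_logical_line, and reads from an in-memory filehandle so that it can be run on its own; a real program would read from STDIN instead. It assumes that lines of types 1 and 3 never end in a backslash, which follows from the rule that no other text is allowed on them.

```perl
use strict;
use warnings;

# Hypothetical helper: read one logical line, joining physical lines
# for as long as the last character of the line is a backslash.
sub read_logical_line {
    my ($fh) = @_;
    my $line = <$fh>;
    return unless defined $line;        # end of input
    chomp $line;

    # Replace a trailing backslash with a space, because the newline
    # that followed it counts as whitespace, then append the next line.
    while ($line =~ s/\\$/ /) {
        my $next = <$fh>;
        last unless defined $next;      # backslash on the last line of input
        chomp $next;
        $line .= $next;
    }
    return $line;
}

# Demonstration on an in-memory filehandle instead of STDIN.
my $data = "baz \\\nfoobar\nfoo\n";
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
while (defined(my $logical = read_logical_line($fh))) {
    print join(' ', split ' ', $logical), "\n";
}
close $fh;
```

Run as written, this prints baz foobar followed by foo, matching the corresponding lines of the sample output.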

The output from your program should consist of one output line for each input line (or each set of input lines, all but the last ending in a backslash) of type 4. You can assume that strings on the left-hand side of all abbreviation definitions are single words, and that the right-hand side of a definition is a (possibly empty) sequence of words separated by whitespace. Also, don't worry if the output lines are very long.


For example:

foo
foo #LOCAL_DEF# foo bar baz
bar
foo
baz \
foobar
#CONTEXT#
foo
foo #LOCAL_DEF# foo bar baz
foo #GLOBAL_DEF# foo bar baz
bar
foo
#LOCAL_UNDEF# foo
foo
foobar bar

should output:

foo
bar
foo bar baz
baz foobar
foo
bar
foo bar baz
foo bar baz bar baz
foobar bar


Handing in the assignment

Instructions for submitting your work:

  1. Name your Perl source file assignment3.pl, and make it executable;
  2. Your Perl file should read its test data from stdin, with the input data file redirected on the command line, and write its output to stdout, which may also be redirected to a file (e.g., assignment3.pl <test_data.txt >test_output.txt);
  3. tar the file(s) for submission (e.g., tar cvf submit3.tar assignment3.pl);
  4. submit the tar file: ~al330001/bin/submit 3 submit3.tar.

Your work may not be graded if these procedures are not followed exactly.

A large penalty will be assessed if the required output format is not followed exactly.