CMSC 430, Spring 2009

Theory of Language Translation

Project 2 - C-- Scanner & Parser

Due 11:59pm Thursday, March 12th, 2009 (Extension to Mon, March 16th)

Introduction

This project is the first in the series of projects that you have to implement in order to build a compiler for 'C--', a variant of 'C'. Your compiler will scan, parse, type check, and eventually generate optimized Java class files from input C-- programs. In this project you will implement the compiler front end (scanner & parser) for C--.

Getting Started

Project 2 will be implemented in Java in Eclipse. Download the following file:
  • Eclipse project archive file p2.zip
You can import the project into Eclipse as an existing project in an archive file.

Before making any changes to your code, make sure you can run the code in StudentTests.java (applies skeleton scanner/parser to toy1.c and toy2.c). Remember that you may need to apply "Refresh" to the entire project so Eclipse can find the files generated by JLex and CUP.

Along with files used to make direct submissions to the submit server (submit.jar, .submit), you will find the following project files:

  • Scanner specification for JLex - mycc.lex
  • Parser specification for Cup - mycc.cup

C-- Language Syntax

  • Comments

    // Ignore to the end of line

  • Data types: int & void. There can be arrays of int.

    There are no explicit boolean variables, but expressions may have type boolean if they are the result of a comparison.

  • Operators include:

    • arithmetic: +, -, *, /
    • relational: ==, !=, <, >
    • logical: &&, ||, !

    All binary arithmetic and logical operators are left-associative. NOT (!) and UNARY_MINUS (-) are not associative. I.e., (- - 2) is illegal; you must use -(-2). Arithmetic operators require integer operands, logical operators require boolean operands, The relational operators == and != work on both bool & int. The relational operators < and > work on only int. Relational operators ==, !=, <, > all return a boolean.

    The operators have the following precedence:

    Highest ! (unary minus)
    * /
    + -
    == != > <
    Lowest && ||

    You have to assign the appropriate precedences in CUP.

    All math operators (+, -, *, /) and comparisons (<, >) are defined for only integers. Logical operators (&&, ||, !) apply only to results of comparisons (booleans). == and != apply to both booleans and integers.

  • Control Statements:

    C-- provides three control statements, if, if-else, and while.

       if ( condition )
       {
         statement_list;
       }
    
       if ( condition )
       {
         statement_list;
       }
       else
       {
         statement_list;
       }
    
       while ( condition )
       {
         statement_list;
       }
    

There can be any level of nesting of these control statements. Curly braces are required for all C-- control statements.

For example, the code

   if (x != 0)
     a = b + c;  

should produce a syntax error. The correct syntax would be

   if (x != 0)
   {
     a = b + c;	  
   }

There is no implicit type casting in conditional statements. For instance, "if (x)" is not valid if "x" is an int. You need to write "if ( x != 0)"

  • Functions:

    All functions must be declared with type int or void.

    Function calls in C-- can appear either on the RHS of an expression or as a statement:

       x = a + foo() + 10;  // foo must return an integer
    
       foo();               // foo may return either integer or void
    

    A function call will have either 0 or 1 arguments.

    Functions may have 'return' statements. There can be one or more return statements in the statement list. The syntax for the return statements are

       return; 	// when returns void
    
       return exp;  // when returns integer 
    

  • Program Syntax:

    A valid C-- program consists of zero or more global variable declarations, followed by one or more function definitions. The body of the functions may contain local variable declarations, followed by the code.

    The function structure will be

       [ int | void ] function_name( 0 or 1 parameters ) 
       { 
         local variable declarations;   // optional 
         statement list; 
       }
    

  • Forward Function Declarations:

    A function has to be defined before a call to it can be made.

    For instance, the following code should cause bar() to be marked as an undefined function.

       void foo()
       {
         bar(1);
       }
    

       void bar(int x)
       {
         y = x;
       }
    

    However, we do allow forward function declarations. They are similar to actual function declarations, except they do not contain a function body.

    For example, the following code contains a forward declaration of bar(), which allows it to be used before it is actually defined:

       void bar( int x );
    

       foo()
       {
         bar();
       }
    

       void bar( int y )
       {
         y = 1;
       }
    

    Forward declarations specify the function return type and parameter type. They must match with actual function declaration. The name of the parameter in the forward declaration is not relevant.

  • Built-in functions

    There are three built-in library functions: printInt(), printStr(), and printLn(). The compiler will automatically generate code for these functions if they are used. printInt() takes an int as an argument, printStr() takes a literal constant string, and printLn() has no parameters.

    For instance:

       int x; 
       printInt(1); printInt(x); printInt(x+1); 
       printStr("foo"); 
       printLn(); 
    

  • Scoping

    C-- applies nested lexical scoping, where every pair of curly brackets creates a new nested scope (and can include new variable declarations). Variables visible include those declared locally and in enclosing scopes. For instance:

    A:   int x;
         foo()
    B:      int x,y;
            if (x == 1)   // refers to x declared at B:
            {
    C:         int x,z;
               x = y+z;   // refers to x,z declared at C:,
                          //   y declared at B:
            } 
            else 
            {
               int i,j;
               x = 3;     // refers to x declared at B:
                          //   x,z at C: are not visible 
            }
         }
    

    Requirements

    Lexical Analyzer:

    The lexical analyzer should identify the keywords, operators, identifiers, strings, constants and other necessary characters correctly. You have to generate the lexical analyzer using JLex, the java lexical analyzer generator. The JLex specifications are very similar to the lex specifications. Read the JLex manual to learn JLex specifications and syntax.

    You may need to explicitly look for carriage returns on PC-based systems, if the %notunix command to the lexer is not working. You can do this by specifying "\r\n" instead of "\n" as the newline.

    Goals:

    • find all legal tokens
    • handle comments
    • Report the following simple scanning errors using parser.msg():
      • "illegal character "
      • "unterminated string"

    Tasks:

    • add comments
    • extend strings
    • extend numbers

    Parser:

    The parser is generated using the java parser generator, CUP. CUP specifications are similar to yacc/bison, taking an LALR grammar. Read the CUP manual to learn CUP specifications and syntaxes. Report simple syntax errors, and attempt to recover.

    Goals:

    • Accept all legal C-- syntax
    • Report the following simple syntax errors using "error" token and parser.msg():
      • "illegal statement"
      • "illegal expression"
      • "illegal variable declaration"
      • "missing semicolon"

    Tasks:

    • VOID functions
    • array declarations
    • multiple variable declarations
    • nested lexical scoping
    • arithmetic, logical, and relational operators
    • forward function declarations

    Project Files

    You have been given a toy parser which parses a subset of the C-- language. You need to enhance them to parse the full C-- language. The files are listed below, with a brief description. All your modifications will be to the files mycc.lex and mycc.cup.

    • mycc.lex

      The JLex specification file for implementing the scanner. Most of the basic tokens have been added for you. All you need to do is extend it for comments, strings, and better handling of identifiers and numbers.

    • mycc.cup

      The CUP specification file. You need to extend this file in order to implement the parser. Most of your changes should be to this file. A lot of the grammar and actions have been provided.

    • symTabEntry.java

      Code for implementing symbol table entries You should not need to modify this file.

    • symTab.java

      Code for implementing the symbol table You should not need to modify this file.

    • expNode.java

      Code for storing information with expression nodes You should not need to modify this file.

    • Yylex.java, parser.java, sym.java

      Files created by JLex & CUP

    • goAll, goAll.bat

      Script for compiling mycc & testing it on test*.in

    • goTest, goTest.bat

      Script testing mycc on test*.in

    • goJar, goJar.bat

      Scriptfile for putting mycc.cup & mycc.lex in p2.jar for submission.

    • toy*.c

      Some sample input C-- programs handled by the toy front end initially provided

    • test*.c

      Some sample input C-- programs you should try to parse

    • test*.log

      Log files created by mycc when using the goAll scripts

    • test*.out

      Some sample output files from a fully implemented front end

    Main Classes

    • parser

      Generated by CUP, this is the class which performs the actual parse. All action code in your CUP grammar are executed as methods of class "parser". Important public fields include globalSymTab, currSymTab, and curType. These fields are used to store information to be transferred between different actions in CUP. The "parser" class also contains main(), the starting point of the user code. You should use the static method msg() to report any scanner & parser error messages.

    • sym

      Generated by CUP, this class contains the constants for all the token in the grammar. Useful constants include sym.INT, sym.BOOL, and sym.VOID. Also used by JLex for the scanner.

    • Yylex

      Generated by JLex, this class implements the scanner.

    • symTab

      Stores all the symbols in a scope as a HashMap. The symbol table also keeps track of its parent. All local nested scopes are kept in a children list, except for function scopes, which are stored in the entry for the function symbol.

    • symTabEntry

      Stores information for each symbol. Generic information includes name (as a String) & type (sym.INT, etc). Also includes information specific to arrays & functions. Important to note the symTabEntry for a function stores the symTab for that function.

    • expNode

      Stores information as to the type and value of an expression. Helpful for keeping track of intermediate expressions (e.g., 1+2*4).

    Hints and Tips

    • When implementing your project in Eclipse, you will likely need to frequently apply the Refresh command in order for Eclipse to locate the latest files created by JLex, CUP, and your parser.

    • If you examine TestUtils.java, you will see that the test scripts
      • Ignore line numbers and white space
      • Ignores "Syntax error" warnings
      • Only checks for the presence of the following error messages (ignoring their location and line numbers)
        • "unterminated string"
        • "illegal expression"
        • "illegal variable declaration"
        • "missing semicolon"
      However, we reserve the right to examine the actual output of your parser when evaluating its correctness.

    • You may add checks for additional syntax errors if you wish, but it should not affect whether you pass any of the public/release/secret tests.

    Submission

    You can submit your project in two ways:
    • Submit your mycc.lex and mycc.cup files by first putting both files in a single .zip or .jar archive. The files should be at the top level of the archive, not in a subdirectory.

      You may create a jar archive containing mycc.cup & mycc.lex by going to the directory containing your project, then either execute goJar (or goJar.bat) or type the following command directly:

      jar cf p2.jar mycc.cup mycc.lex

      You can submit the resulting p2.jar archive to to the submit server by clicking on the submit link in the column "web submission".

      Select the p2.jar archive using the "Browse" button, then press the "Submit project!" button.

    • Submit directly by executing a Java program on a computer with Java and network access. Use the submit.jar file from the archive p2.zip, To submit, go to the directory containing your project, then either execute goSubmit (or goSubmit.bat) or type the following command directly:

      java -jar submit.jar

      You will be asked to enter your class account and password, then all files in the directory (and its subdirectories) will be put in a jar file and submitted to the submit server. If your submission is successful you will see the message:

      Successful submission # received for project 2

    Academic Integrity

    Please carefully read the academic honesty section of the course syllabus. In particular, remember that you are not allowed to show or copy any code for your project, either from the web or other students. Any evidence of impermissible cooperation on projects, use of disallowed materials or resources, or unauthorized use of computer accounts, will be submitted to the Student Honor Council, which could result in an XF for the course, or suspension or expulsion from the University. Be sure you understand what you are and what you are not permitted to do in regards to academic integrity when it comes to project assignments. These policies apply to all students, and the Student Honor Council does not consider lack of knowledge of the policies to be a defense for violating them. Full information is found in the course syllabus---please review it at this time.