CMSC 430 Project #2 - Scanner/Parser ************************************** Language Syntax ************************************** 1. Comments: // Ignore to the end of line 2. Data types: int & void. There can be arrays of int. There are no explicit boolean variables, but expressions may have type boolean if they are the result of a comparison. 3. Operators include: arithmetic +, -, *, / relational ==, !=, <, > logical &&, ||, ! All binary arithmetic and logical operators are left-associative. NOT (!) and UNARY_MINUS (-) are not associative. I.e., (- - 2) is illegal; you must use -(-2). Arithmetic operators require integer operands, logical operators require boolean operands, The relational operators == and != work on both bool & int. The relational operators < and > work on only int. Relational operators ==, !=, <, > all return a boolean. The operators have the following precedence: Highest ! (unary minus) * / + - == != > < Lowest && || You have to assign the appropriate precedences in CUP. All math operators (+, -, *, /) and comparisons (<, >) are defined for only integers. Logical operators (&&, ||, !) apply only to results of comparisons (booleans). == and != apply to both booleans and integers. 4. Control Statements: C-- provides three control statements, if, if-else, and while. i. if ( condition ) { statement_list; } ii. if ( condition ) { statement_list; } else { statement_list; } iii. while ( condition ) { statement_list; } There can be any level of nesting of these control statements. Curly braces are required for all C-- control statements. For example, the code if (x != 0) a = b + c; should produce a syntax error. The correct syntax would be if (x != 0) { a = b + c; } There is no implicit type casting in conditional statements. For instance, "if (x)" is not valid if "x" is an int. You need to write "if ( x != 0)" 5. Functions: All functions must be declared with type int or void. Function calls in C-- can appear either on the RHS of an expression or as a statement: x = a + foo() + 10; // foo must return an integer foo(); // foo may return either integer or void A function call will have either 0 or 1 arguments. Functions may have 'return' statements. There can be one or more return statements in the statement list. The syntax for the return statements are return; // when returns void return exp; // when returns integer 6. Program Syntax: A valid C-- program consists of zero or more global variable declarations, followed by one or more function definitions. The body of the functions may contain local variable declarations, followed by the code. The function structure will be [ int | void ] function_name( 0 or 1 parameters ) { local variable declarations; // optional statement list; } 7. Forward Function Declarations: A function has to be defined before a call to it can be made. For instance, the following code should cause bar() to be marked as an undefined function. void foo() { bar(1); } void bar(int x) { y = x; } However, we do allow forward function declarations. They are similar to actual function declarations, except they do not contain a function body. For example, the following code contains a forward declaration of bar(), which allows it to be used before it is actually defined: void bar( int x ); foo() { bar(); } void bar( int y ) { y = 1; } Forward declarations specify the function return type and parameter type. They must match with actual function declaration. The name of the parameter in the forward declaration is not relevant. 8. There are three built-in library functions: printInt(), printStr(), and printLn(). The compiler will automatically generate code for these functions if they are used. printInt() takes an int as an argument, printStr() takes a literal constant string, and printLn() has no parameters. For instance: int x; printInt(1); printInt(x); printInt(x+1); printStr("foo"); printLn(); 9. C-- applies nested lexical scoping, where every pair of curly brackets creates a new nested scope (and can include new variable declarations). Variables visible include those declared locally and in enclosing scopes. For instance: A: int x; foo() { B: int x,y; if (x == 1) // refers to x declared at B: { C: int x,z; x = y+z; // refers to x,z declared at C:, // y declared at B: } else { int i,j; x = 3; // refers to x declared at B: // x,z at C: are not visible } } ********************************************** Requirements ********************************************** Lexical Analyzer: The lexical analyzer should identify the keywords, operators, identifiers, strings, constants and other necessary characters correctly. You have to generate the lexical analyzer using JLex, the java lexical analyzer generator. The JLex specifications are very similar to the lex specifications. Read jlex_manual.html to know the jlex specifications and syntaxes. You may need to explicitly look for carriage returns on PC-based systems, if the %notunix command to the lexer is not working. You can do this by specifying "\r\n" instead of "\n" as the newline. Goals: 1) find all legal tokens 2) handle comments 3) report unterminated strings Tasks: add comments extend strings extend numbers Parser: The parser is generated using the java parser generator, CUP. CUP specifications are similar to yacc/bison, taking an LALR grammar. Read jcup_manual.html to know the cup specifications and syntaxes. Report simple syntax errors, and attempt to recover. Goals: 1) accept all legal C-- syntax 2) report simple syntax errors using "error" token examples: "illegal statement" "illegal expression" "illegal declaration" "missing semicolon" Tasks: VOID functions array declarations multiple variable declarations nested lexical scoping arithmetic, logical, and relational operators forward function declarations ********************************************************** Implementation ********************************************************* You have been given a toy parser which parses a subset of the C-- language. You need to enhance them to parse the full C-- language. The files are listed below, with a brief description: mycc.lex: The JLex specification file for implementing the scanner. Most of the basic tokens have been added for you. All you need to do is extend it for comments, strings, and better handling of identifiers and numbers. mycc.cup: The CUP specification file. You need to extend this file in order to implement the parser. Most of your changes should be to this file. A lot of the grammar and actions have been provided. symTabEntry.java: Code for implementing symbol table entries You should not need to modify this file. symTab.java: Code for implementing the symbol table You should not need to modify this file. expNode.java: Code for storing information with expression nodes You should not need to modify this file. Yylex.java, parser.java, sym.java: Files created by JLex & CUP go, go.bat: Scriptfile for compiling mycc. goAll, goAll.bat: Script for compiling mycc & testing it on test*.in goTest, goTest.bat: Script testing mycc on test*.in toy*.c: Some sample input C-- programs handled by the toy front end test*.c: Some sample input C-- programs you should try to parse test*.log: Log files created by mycc when using the goAll scripts test*.out: Some sample output files from a fully implemented front end ********************************************************** Main classes ********************************************************* 1) parser Generated by CUP, this is the class which performs the actual parse. All action code in your CUP grammar are executed as methods of class "parser". Important public fields include globalSymTab, currSymTab, and curType. These fields are used to store information to be transfered between different actions in CUP. The "parser" class also contains main(), the starting point of the user code. 2) sym Generated by CUP, this class contains the constants for all the token in the grammar. Useful constants include sym.INT, sym.BOOL, and sym.VOID. Also used by JLex for the scanner. 3) Yylex Generated by JLex, this class implements the scanner. 4) symTab Stores all the symbols in a scope as a HashMap. The symbol table also keeps track of its parent. All local nested scopes are kept in a children list, except for function scopes, which are stored in the entry for the function symbol. 5) symTabEntry Stores information for each symbol. Generic information includes name (as a String) & type (sym.INT, etc). Also includes information specific to arrays & functions. Important to note the symTabEntry for a function stores the symTab for that function. 6) expNode Stores information as to the type and value of an expression. Helpful for keeping track of intermediate expressions (e.g., 1+2*4). ********************************************************************