Computer Organization
CMSC 311
Fall 2002

Purpose

The overall goal of the project is to write an assembly language interpreter. Thus, it may seem odd to start off with a parsing/tokenizing project.

There are two reasons for giving this project. First, I think tokenizing is a useful skill everyone should have. Breaking up a line is just something that you tend to do when working with text.

Second, the rest of the projects require you to manipulate text. Many students probably used all sorts of "cin" tricks to read the input in, and ended up with code that was difficult to read, write, and fix.

The idea of these methods is to emulate features of the StringTokenizer in Java. Of course, this isn't the same thing, but it's close enough.

To process a file, you will read in a line at a time, then break the line up using the methods below into tokens. At that point, you will process the tokens.

Location of File

Go to ~chf11001/Proj/P0/, and there should be two files. Use Tokenizer.h and create Tokenizer.cpp.

Tokenizer Specifications

Tokenizer

Static Public Methods

static string leftTrim( const string & str ) ;
Given parameter str, removes any leading blank spaces. For example, if str is " hello, world ", then the string returned should be "hello, world ". Only the leading spaces have been removed. The remaining spaces are unchanged. If there are no leading spaces, just return the string as is (an empty string falls into this category).
static string rightTrim( const string & str ) ;
Given parameter str, removes any trailing blank spaces. For example, if str is " hello, world ", then the string returned should be " hello, world". Only the trailing spaces have been removed. The remaining spaces are unchanged. If there are no trailing spaces, just return the string as is (an empty string falls into this category).
static string trim( const string & str ) ;
Given parameter str, removes any leading AND trailing blank spaces. For example, if str is " hello, world ", then the string returned should be "hello, world". Only the leading and trailing spaces have been removed. The remaining spaces are unchanged. If there are no leading or trailing spaces, just return the string as is (an empty string falls into this category).

Use leftTrim and rightTrim to implement this static method.
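
One possible way to write the three trim methods is sketched below. For brevity they are shown as free functions rather than static members of Tokenizer, and the sketch assumes that "blank" means the space character only (not tabs).

```cpp
#include <string>

// Sketch only: free functions standing in for the Tokenizer static methods.
// Assumption: only ' ' counts as a blank.
std::string leftTrim(const std::string& str) {
    std::string::size_type first = str.find_first_not_of(' ');
    if (first == std::string::npos) {
        return "";  // empty string, or nothing but blanks
    }
    return str.substr(first);
}

std::string rightTrim(const std::string& str) {
    std::string::size_type last = str.find_last_not_of(' ');
    if (last == std::string::npos) {
        return "";  // empty string, or nothing but blanks
    }
    return str.substr(0, last + 1);
}

// trim is built from the other two, as the spec asks.
std::string trim(const std::string& str) {
    return leftTrim(rightTrim(str));
}
```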

static bool isBlankLine( const std::string & str ) ;
Returns true if the line consists of zero or more blanks. Returns false if there is a non-blank character.
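
A minimal sketch of this check, written as a free function and assuming that only the space character counts as a blank:

```cpp
#include <string>

// Sketch only. Assumption: only ' ' counts as a blank.
// A line is blank when no character other than a space can be found in it,
// which also covers the empty string.
bool isBlankLine(const std::string& str) {
    return str.find_first_not_of(' ') == std::string::npos;
}
```
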
static std::string stripComment( const std::string & str, const std::string & commentStr ) ;
Finds the leftmost occurrence of commentStr (that's not surrounded by single or double quotes) and removes everything from commentStr to the end of the string. For example, if the string were "add $r1, $r2, $r3 # adds", and the comment string were "#", then the return value would be: "add $r1, $r2, $r3 ".

If there is no comment string, just return the string itself.
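
One way to handle the quoting rule is to scan left to right while remembering whether you are currently inside a quoted region. The sketch below is a free function, and it assumes that a quote is closed by the next matching quote character and that an empty commentStr means there is nothing to strip.

```cpp
#include <string>

// Sketch only. Assumptions: a quoted region runs from a ' or " to the next
// matching quote character; an empty commentStr strips nothing.
std::string stripComment(const std::string& str, const std::string& commentStr) {
    if (commentStr.empty()) {
        return str;
    }
    char quote = '\0';  // the quote character we are inside, or '\0' if none
    for (std::string::size_type i = 0; i < str.size(); ++i) {
        char c = str[i];
        if (quote != '\0') {
            if (c == quote) {
                quote = '\0';  // closing quote
            }
        } else if (c == '\'' || c == '"') {
            quote = c;  // opening quote
        } else if (str.compare(i, commentStr.size(), commentStr) == 0) {
            return str.substr(0, i);  // drop the comment through end of string
        }
    }
    return str;  // no unquoted comment string found
}
```
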

static string getLeftToken( string & str, const string & delimiter ) ;
A delimiter is a separator. Often spaces are used to separate words. Sometimes commas are used. You might have colons, slashes, etc. I will also assume you might use more than one character as a delimiter. For example, you might use double colons.

The behavior of this method depends on what the delimiter is. If the delimiter is a single space or multiple spaces, then you should get the leftmost token by:

  • skipping leading white space in str
  • getting characters all the way up to the next white space character (or to the end of the string, if there is no additional white space).

For example, if you have " cat dog ", then once the call is made to getLeftToken(), it should return "cat", and the parameter str will be " dog ". This behaves very much like "cin" when reading in a string. If the string is empty or contains only blanks, then return the empty string.
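
The blank-delimiter case can be sketched on its own as follows. The name getLeftTokenBlank is hypothetical, chosen here only to separate this case from the general one. The sketch assumes that only the space character counts as white space, and that str is emptied when it holds no token (the spec does not say what happens to str in that case).

```cpp
#include <string>

// Sketch of the blank-delimiter case only; the helper name is hypothetical.
// Assumptions: only ' ' is white space; str is cleared if it has no token.
std::string getLeftTokenBlank(std::string& str) {
    std::string::size_type start = str.find_first_not_of(' ');
    if (start == std::string::npos) {
        str.clear();  // assumption: nothing useful is left in str
        return "";
    }
    std::string::size_type end = str.find_first_of(' ', start);
    if (end == std::string::npos) {
        end = str.size();  // token runs to the end of the string
    }
    std::string token = str.substr(start, end - start);
    str = str.substr(end);  // keep the rest, including its leading blank
    return token;
}
```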

If the delimiter is anything else, you should search for the first delimiter not enclosed in single or double quotes. You should read up to that delimiter and stop. Should the delimiter not exist, then just return the token.

For example, suppose the delimiter is ",". You have " ok "one, two, three" , four". The left (first) token should be "ok "one, two, three" ", and str should be " four".

The delimiter is not kept.

Also, "escaped" delimiters don't count. For example, if you have the string "paper \, scissors, rock", then the first (left) token is "paper , scissors" and str is "rock".

Notice that escaped delimiters are replaced by the actual delimiter. Backslashes can also be delimiters, which means they can also be escaped.
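
The non-blank case can be sketched as a single left-to-right scan that tracks quotes and escapes at the same time. The name getLeftTokenDelim is hypothetical. The sketch assumes that an escape is a backslash immediately followed by the delimiter, that nothing is trimmed from either the token or the remainder, and that str becomes empty when no delimiter is found.

```cpp
#include <string>

// Sketch of the non-blank delimiter case; the helper name is hypothetical.
// Assumptions: an escape is a backslash immediately followed by the
// delimiter; no trimming is done; str is emptied if no delimiter is found.
std::string getLeftTokenDelim(std::string& str, const std::string& delimiter) {
    std::string token;
    if (delimiter.empty()) {  // assumption: treat the whole string as one token
        token = str;
        str.clear();
        return token;
    }
    char quote = '\0';  // the quote character we are inside, or '\0' if none
    std::string::size_type i = 0;
    while (i < str.size()) {
        char c = str[i];
        if (quote != '\0') {
            if (c == quote) quote = '\0';  // closing quote
            token += c;
            ++i;
        } else if (c == '\\' &&
                   str.compare(i + 1, delimiter.size(), delimiter) == 0) {
            token += delimiter;  // escaped delimiter becomes a plain delimiter
            i += 1 + delimiter.size();
        } else if (str.compare(i, delimiter.size(), delimiter) == 0) {
            str = str.substr(i + delimiter.size());  // the delimiter is not kept
            return token;
        } else {
            if (c == '\'' || c == '"') quote = c;  // opening quote
            token += c;
            ++i;
        }
    }
    str.clear();  // no real delimiter: everything was the token
    return token;
}
```

Because this sketch does no trimming, the remainder keeps any blank that followed the delimiter.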

static string getRightToken( string & str, const string & delimiter ) ;
Similar to getLeftToken() except you get the token from the right side. Basically, process the string right to left, instead of left to right.
static int findRealDelimiter( const std::string & str, const std::string & delimiter ) ;
This is a useful helper function. It returns the index of the first character of a delimiter that is not surrounded by single or double quotes. If there is no such delimiter, return -1.
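
A possible sketch of this helper, written as a free function. It assumes a quoted region ends at the next matching quote character, and it deliberately ignores backslash escapes, which you may also need to handle here.

```cpp
#include <string>

// Sketch only. Assumptions: a quoted region runs to the next matching quote
// character; backslash escapes are NOT handled in this version.
int findRealDelimiter(const std::string& str, const std::string& delimiter) {
    if (delimiter.empty()) {
        return -1;  // assumption: an empty delimiter is never found
    }
    char quote = '\0';  // the quote character we are inside, or '\0' if none
    for (std::string::size_type i = 0; i < str.size(); ++i) {
        char c = str[i];
        if (quote != '\0') {
            if (c == quote) quote = '\0';  // closing quote
        } else if (c == '\'' || c == '"') {
            quote = c;  // opening quote
        } else if (str.compare(i, delimiter.size(), delimiter) == 0) {
            return static_cast<int>(i);  // first unquoted delimiter
        }
    }
    return -1;
}
```
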
static vector<string> split( const std::string & str, const std::string & delimiter, bool trimToken = true ) ;
Calls getLeftToken() repeatedly to break a string into tokens. As each token is pulled out, if trimToken is true, call trim() on the token to remove leading and trailing blanks.

Be careful. If the string was "h,," you would have 3 tokens. "h", "", and "". For delimiters that are not blanks, the number of tokens should be one more than the number of "real" delimiters.
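
To make the counting rule concrete, here is a deliberately simplified sketch. The name splitSimple is hypothetical, and the function ignores quotes, escapes, and trimming; it only illustrates that a non-blank delimiter always produces one more token than there are delimiters.

```cpp
#include <string>
#include <vector>

// Simplified sketch with a hypothetical name: no quotes, escapes, or
// trimming. It shows only the "one more token than delimiters" rule.
std::vector<std::string> splitSimple(const std::string& str,
                                     const std::string& delimiter) {
    std::vector<std::string> tokens;
    if (delimiter.empty()) {  // assumption: no delimiter means one token
        tokens.push_back(str);
        return tokens;
    }
    std::string::size_type start = 0;
    while (true) {
        std::string::size_type pos = str.find(delimiter, start);
        if (pos == std::string::npos) {
            tokens.push_back(str.substr(start));  // last token, may be empty
            break;
        }
        tokens.push_back(str.substr(start, pos - start));
        start = pos + delimiter.size();  // skip past the delimiter
    }
    return tokens;
}
```

Note that the final push happens even when nothing is left after the last delimiter, which is what yields the trailing empty token for "h,,".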

static std::string removeEnd( const std::string & str, char left, char right ) ;
Suppose you had a string like " [ cat ]", and you want to remove the brackets. There may be leading or trailing spaces. So, specify left as '[' and right as ']'. This returns " cat ".

More technically, it removes leading white space and left, should it exist. It removes trailing white space and right, should it exist. Should left not exist, then this function does a left trim. Should right not exist, then it does a right trim (see leftTrim and rightTrim above). Thus, if the string were " cat dog " and left and right were '[' and ']' respectively, then the result would be "cat dog".
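
The rule above can be sketched with a pair of indices that close in on the part of the string to keep. This is a free-function sketch, and it assumes that white space means the space character only.

```cpp
#include <string>

// Sketch only. Assumption: only ' ' counts as white space.
// begin/end mark the half-open range [begin, end) of characters to keep.
std::string removeEnd(const std::string& str, char left, char right) {
    std::string::size_type begin = str.find_first_not_of(' ');
    if (begin == std::string::npos) {
        return "";  // empty or all blanks
    }
    std::string::size_type end = str.find_last_not_of(' ') + 1;
    if (str[begin] == left) {
        ++begin;  // drop the left character
    }
    if (end > begin && str[end - 1] == right) {
        --end;  // drop the right character
    }
    return str.substr(begin, end - begin);
}
```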

How to Call Static Methods

To call a static method, you need to specify the class, followed by the double colons, followed by the method name and any arguments.

Here's an example:

   vector<string> tokens = Tokenizer::split( str, "," ) ;

See the class syllabus for policies concerning email
Last Modified: Wed Mar 13 20:20:09 EST 2002