CMSC 214
Computer Science II
Fall 2002


Writing a Tokenizer (Task 2)

Updates (Clarifying what a token is)

A token will not contain any escaped characters (such as \n). When you see the first occurrence of a double quote, you can assume that there will be a matching double quote, which is the next closest double quote afterwards (thus no nesting of double quotes). Similarly, if you see a single quote, it matches the nearest following single quote.

There can be double quotes within single quotes, and single quotes within double quotes, but no double quotes within double quotes, and no single quotes within single quotes.

For example, in ["can't*miss"*fun*'see"there'*] (where the brackets indicate the start and end of the string, and asterisks represent spaces), the three tokens are [can't*miss], [fun], [see"there].

You can assume no backslashes appear in the string.

Purpose

One of the odd features of Java is that it lacks simple I/O resembling either scanf() and printf() (from C) or cin and cout (from C++). In fact, many Java textbooks delay discussing simple I/O from users until very late in the book because text-based I/O is complicated in Java.

The reason such discussion is delayed is that folks who use Java tend to prefer reading in one line of text at a time (similar to using getline() (from C++) or gets() (from C)) and breaking the line up into tokens. Java supports (at least) two classes for tokenizing: StringTokenizer and StreamTokenizer.

The goal of this project is to learn how to break up a line into tokens. You may have seen a project like this when you took CMSC 106. However, such a project always seems difficult no matter where it's presented.

You can think of these tokens as "words". Teaching beginning Java students to write or use tokenizers is difficult: they have a hard time learning it while they're still struggling to understand functions and loops.

Fortunately, you're not a beginning programmer. A variation of this project is often used as the "strings" project in CMSC 106.

The purpose of writing this simple tokenizer is to let you read input a line at a time (using getline()), pass the line as an argument to a Tokenizer static method, and get a vector of strings (i.e., tokens) as a result.

Beyond the basic task of tokenizing, you will gain practice using:

The following is a specification for the Tokenizer class. This class is provided as Tokenizer.h.

Tokenizer Specifications

Tokenizer

Static Public Methods

NOTE: Asterisks represent spaces, to make them easier to see. They are not meant to represent actual asterisks.

static string leftTrim( const string & str ) ;
Given parameter str, removes any leading blank spaces. For example, if str is
"****hello,***world****", then the string returned should be "hello,***world****". Only the leading spaces have been removed. The remaining spaces are unchanged. If there are no leading spaces, just return the string as is (an empty string falls into this category).
static string rightTrim( const string & str ) ;
Given parameter str, removes any trailing blank spaces. For example, if str is
"****hello,***world****", then the string returned should be "****hello,***world". Only the trailing spaces have been removed. The remaining spaces are unchanged. If there are no trailing spaces, just return the string as is (an empty string falls into this category).
static string trim( const string & str ) ;
Given parameter str, removes any leading AND trailing blank spaces. For example, if str is
"****hello,***world****", then the string returned should be "hello,***world". Only the leading and trailing spaces have been removed. The remaining spaces are unchanged. If there are no leading or trailing spaces, just return the string as is (an empty string falls into this category).

Use leftTrim and rightTrim to implement this static method.
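
The three trimming methods described above can be sketched as follows. This is only a minimal illustration, not the required solution: the free functions whose names end in Sketch are hypothetical stand-ins for the Tokenizer static methods, and the sketch assumes that "blank" means only the space character.

```cpp
#include <string>

// Hypothetical stand-in for Tokenizer::leftTrim: drops leading spaces.
std::string leftTrimSketch(const std::string& str) {
    const std::string::size_type first = str.find_first_not_of(' ');
    if (first == std::string::npos) {
        return "";                    // empty, or nothing but spaces
    }
    return str.substr(first);         // keep everything from the first non-space
}

// Hypothetical stand-in for Tokenizer::rightTrim: drops trailing spaces.
std::string rightTrimSketch(const std::string& str) {
    const std::string::size_type last = str.find_last_not_of(' ');
    if (last == std::string::npos) {
        return "";                    // empty, or nothing but spaces
    }
    return str.substr(0, last + 1);   // keep everything up to the last non-space
}

// Hypothetical stand-in for Tokenizer::trim, built from the two helpers above.
std::string trimSketch(const std::string& str) {
    return rightTrimSketch(leftTrimSketch(str));
}
```

Because both helpers take a const reference and return a new string, the caller's string is never modified, which matches the signatures in the specification.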

static string getLeftToken( string & str ) ;
A token is defined in one of the following three ways:

  • a sequence of 1 or more non-blank characters, (CORRECTION) unless the string is in double quotes or single quotes, in which case, there may be blanks.
  • If the sequence begins with a single quote, then the token extends to the next occurring single quote. (In P1, you can assume there are no escaped characters between single quotes; in particular, there is no \').
  • If the sequence begins with a double quote, then the token extends to the next occurring double quote. (In P1, you can assume there are no escaped characters between double quotes; in particular, there is no \").
In order to get the first token, you need to skip over blank spaces (hint: see if any of the static methods you've implemented already does this), then read to the next blank or the end of string, whichever occurs first.

For example, if the parameter str was "*cat***in*the*hat*", then you should return "cat" (no spaces occur in the result).

Furthermore, str should be modified to contain the rest of the string after "cat". Thus, str should contain "***in*the*hat*". Notice that there are leading spaces.

If the string is empty or contains only blanks, return the empty string, and set str to the empty string as well.

If the token begins and ends in a single quote (or begins and ends in a double quote), you should strip the quotes. Thus, if you had "*'cat****in'*the*hat", the function should return "cat****in" and not "'cat****in'". (You may have to print this to see the single and double quotes clearly.)

You may assume that single open quotes have a matching single close quote.

You may assume that double open quotes have a matching double close quote.
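
The rules for getLeftToken can be sketched as follows. This is one possible reading of the specification, not the official solution: getLeftTokenSketch is a hypothetical free function, "blank" is assumed to mean the space character, and the handling of an unmatched quote (take the rest of the string) is an assumption, since the specification says that case will not occur.

```cpp
#include <string>

// Hypothetical stand-in for Tokenizer::getLeftToken.
// Returns the first token and leaves the unread remainder in str.
std::string getLeftTokenSketch(std::string& str) {
    const std::string::size_type start = str.find_first_not_of(' ');
    if (start == std::string::npos) {      // empty or only blanks
        str.clear();
        return "";
    }

    const char first = str[start];
    std::string token;
    std::string::size_type next;           // index of the first unread character

    if (first == '\'' || first == '"') {
        // Quoted token: runs to the next occurrence of the same quote.
        const std::string::size_type close = str.find(first, start + 1);
        if (close == std::string::npos) {  // assumption: should not happen
            token = str.substr(start + 1);
            next = str.size();
        } else {
            token = str.substr(start + 1, close - start - 1);  // quotes stripped
            next = close + 1;
        }
    } else {
        // Plain token: runs to the next blank or to the end of the string.
        std::string::size_type end = str.find(' ', start);
        if (end == std::string::npos) {
            end = str.size();
        }
        token = str.substr(start, end - start);
        next = end;
    }

    str.erase(0, next);                    // keep only the rest of the string
    return token;
}
```

With str holding " cat   in the hat " (real spaces in place of the asterisks), the sketch returns "cat" and leaves "   in the hat " in str.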

static vector<string> split( const string & str ) ;
This static method takes a string that may contain spaces, and splits it into tokens. The tokens are stored in a vector of strings. The tokens should be added one by one using the push_back() method. Declare a local variable of type vector<string>. Repeatedly add tokens until there are no more.

For example, if the string were "*cat*"in*the"*'red/white*hat'***", there would be three tokens: "cat", "in*the", and "red/white*hat".

Note that the string given in the previous example has literal quotes. The single quotes around 'red/white*hat' and the double quotes around "in*the" are real quotes, i.e., they would appear in an input file.

However, the outermost double quotes (the first and last characters of "*cat*"in*the"*'red/white*hat'***") are not part of the string; they only let you know where the string begins and ends.
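
The loop that collects tokens can be sketched as follows. To keep the example self-contained, it scans the string with an index instead of calling getLeftToken repeatedly, so treat it as an illustration of the token rules and of push_back rather than as the intended implementation. The name splitSketch is hypothetical, and "blank" is assumed to mean the space character.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for Tokenizer::split.
std::vector<std::string> splitSketch(const std::string& str) {
    std::vector<std::string> tokens;       // local vector that collects the result
    std::string::size_type pos = str.find_first_not_of(' ');

    while (pos != std::string::npos) {     // pos is the start of the next token
        const char first = str[pos];
        std::string::size_type end;

        if (first == '\'' || first == '"') {
            // Quoted token: ends at the next occurrence of the same quote.
            end = str.find(first, pos + 1);
            if (end == std::string::npos) {
                end = str.size();          // assumption: unmatched quote takes the rest
            }
            tokens.push_back(str.substr(pos + 1, end - pos - 1));
            if (end < str.size()) {
                ++end;                     // step past the closing quote
            }
        } else {
            // Plain token: ends at the next blank or at the end of the string.
            end = str.find(' ', pos);
            if (end == std::string::npos) {
                end = str.size();
            }
            tokens.push_back(str.substr(pos, end - pos));
        }

        pos = str.find_first_not_of(' ', end);  // skip blanks before the next token
    }
    return tokens;
}
```

The loop stops when no non-blank character remains, rather than when an empty token comes back, so an empty quoted token such as '' would not end the scan early.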

How to Call Static Methods

To call a static method, you need to specify the class name, followed by a double colon (::), followed by the method name and any arguments.

Here's an example:

   vector<string> tokens = Tokenizer::split( str ) ;
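
As a further illustration of the call syntax, here is a small sketch. The BlankCounter class and the blanksInSample function are hypothetical and are not part of the project; they only show that a static method is called through the class name, without creating an object.

```cpp
#include <string>

// Hypothetical class used only to illustrate the call syntax.
class BlankCounter {
public:
    // A static method belongs to the class itself, not to any one object.
    static int countBlanks(const std::string& s) {
        int count = 0;
        for (std::string::size_type i = 0; i < s.size(); ++i) {
            if (s[i] == ' ') {
                ++count;
            }
        }
        return count;
    }
};

// No BlankCounter object is created: the class name and :: select the method.
int blanksInSample() {
    return BlankCounter::countBlanks("cat in the hat");
}
```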

How to Test the Code

In order to test the code, you should create an input file and write a few words on each line.

Then, write a program that processes the input file line by line. Here's a sample program that uses the Tokenizer class.

#include <iostream>
#include <string>
#include <vector>

#include "Tokenizer.h"

using namespace std ;

int main()
{
   string line ;
   int count = 1 ;
   while ( getline( cin, line ) )
     {
        vector<string> token = Tokenizer::split( line ) ;
        // print out line information
        cout << "This is line " << count++ << endl ;
        cout << "----------------" << endl ;
        // print out tokens on line
        for ( int i = 0 ; i < static_cast<int>( token.size() ) ; i++ )
            cout << "   This is token " << i 
                 << ": " << token[ i ] << endl ;
     }
   return 0 ;
}

See the class syllabus for policies concerning email
Last Modified: Fri Sep 13 20:11:45 EDT 2002