CMSC 311
Fall 2002
Yes. An escaped delimiter should not be treated as a delimiter; it is a literal string. Thus, if comma is a delimiter, then \, is an escaped delimiter: do not split on it, and do not throw away the comma, which stays in the token as an ordinary character.
When you get a token (using getLeftToken() or getRightToken()), you should take care of all backslashed (escaped) sequences. For all escaped sequences except \n, you simply remove the backslash. A backslash followed by the character 'n' (which appears as \n, i.e. two separate characters) is replaced with a newline (a single character).
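The replacement rule above can be sketched as a small helper. This is only an illustration, not the required implementation: the name unescape() and its signature are made up for this example and do not come from the project description.

```cpp
#include <string>

// Hypothetical helper (not part of the project's required interface).
// Returns a copy of `token` with escape sequences resolved:
//   backslash + 'n'       -> a single newline character
//   backslash + any other -> that character, backslash dropped
std::string unescape(const std::string& token) {
    std::string out;
    out.reserve(token.size());
    for (std::string::size_type i = 0; i < token.size(); ++i) {
        if (token[i] == '\\' && i + 1 < token.size()) {
            ++i;  // skip the backslash and look at the escaped character
            out += (token[i] == 'n') ? '\n' : token[i];
        } else {
            // Ordinary character (a lone trailing backslash is kept as-is,
            // which is an assumption of this sketch).
            out += token[i];
        }
    }
    return out;
}
```

With this sketch, the two-character sequence \, becomes a plain comma, and six backslashes in a row become three.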
Well, not exactly "more" (a little more perhaps). I noticed from the project description that getLeftToken() left trims the token. So, I am going to follow that. I think I did that to handle white space for space as delimiter.
However, make getRightToken() find the rightmost delimiter and NOT trim the white space.
Thus, if the string were "***cat***,***dog**", then the rightmost token would be "***dog**", and the rest of the string would be "***cat***" (without the comma).
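Here is a rough sketch of the "rightmost delimiter, no trimming" behavior. It makes several simplifying assumptions that are not from the project description: the delimiter is a single character, quotes are ignored, the token is returned without replacing escaped characters, and when no real delimiter exists the whole string is returned as the token. The signature and the isEscaped() helper are made up for this example.

```cpp
#include <string>

// Hypothetical helper: true if the character at `pos` is preceded by an
// odd number of backslashes (i.e. it is escaped).
bool isEscaped(const std::string& s, std::string::size_type pos) {
    std::string::size_type count = 0;
    while (pos > count && s[pos - count - 1] == '\\') ++count;
    return count % 2 == 1;
}

// Sketch of getRightToken() for a single-character delimiter.
// Returns everything after the rightmost unescaped delimiter, untrimmed,
// and shrinks `str` to what was in front of that delimiter.
std::string getRightToken(std::string& str, char delim) {
    std::string::size_type pos = str.rfind(delim);
    // Walk left past any escaped occurrences of the delimiter.
    while (pos != std::string::npos && isEscaped(str, pos)) {
        pos = (pos == 0) ? std::string::npos : str.rfind(delim, pos - 1);
    }
    std::string token;
    if (pos == std::string::npos) {
        token = str;   // assumption: no delimiter means the whole string
        str.clear();
    } else {
        token = str.substr(pos + 1);  // white space is deliberately kept
        str.erase(pos);               // drops the delimiter as well
    }
    return token;
}
```

Applied to the example above (with real blanks in place of *), the token keeps its three leading and two trailing blanks, and the rest of the string keeps its blanks too.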
Yesterday, someone asked what we should do about escaped delimiters. Here is the part that's the usual stuff (i.e., not extra credit):
You may assume backslashes do not appear for anything else. Thus, \a is not a valid substring, unless "a" is a delimiter.
Suppose you have :: as a delimiter (two colons). Then, you should be able to handle the following: \::: is considered an escaped delimiter (thus not a true delimiter) followed by a colon. \:::: is treated as an escaped (non-true) delimiter followed by ::, which is a true delimiter.
\\\\\\ is really three escaped backslashes and gets translated to three real backslashes.
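To make the multi-character case concrete, here is a minimal left-to-right scan. The function name findUnescaped() is invented for this example, and the sketch ignores quotes entirely. It assumes a backslash escapes the whole delimiter when one follows it, and otherwise escapes just the next character.

```cpp
#include <string>

// Hypothetical helper: index of the first unescaped occurrence of `delim`
// in `s`, or -1 if there is none. Quotes are not handled in this sketch.
int findUnescaped(const std::string& s, const std::string& delim) {
    if (delim.empty()) return -1;
    std::string::size_type i = 0;
    while (i < s.size()) {
        if (s[i] == '\\') {
            if (s.compare(i + 1, delim.size(), delim) == 0) {
                i += 1 + delim.size();  // skip backslash + escaped delimiter
            } else {
                i += 2;                 // skip backslash + one escaped char
            }
        } else if (s.compare(i, delim.size(), delim) == 0) {
            return static_cast<int>(i); // a true delimiter starts here
        } else {
            ++i;
        }
    }
    return -1;
}
```

With "::" as the delimiter, this finds no true delimiter in \::: and finds one at index 3 in \::::, which matches the two cases described above.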
Finally, findRealDelimiter(), when looking for one or more blank spaces, should return the index of the first non-quoted blank that appears after a run of non-blank characters. If no such blank appears, it returns -1. Thus, [***cat***] will return 6 (where * stands for a blank, to make it easier to see in this FAQ, and the brackets show where the string starts and ends but are not part of the string itself).
If you have "***'cat**'*", then findRealDelimiter() returns 10 since it skips the first three blanks, ignores the item in single quotes, and finds the space at the end.
For any other delimiter, it just looks for the first non-quoted, non-escaped version of that delimiter. Only blanks used as delimiters have this weird feature.
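The blank-delimiter rule can be sketched like this. The name findBlankDelimiter() is made up for the example (the real method is findRealDelimiter(), whose exact signature is not shown here). Treating a backslash as escaping the next character is an assumption of this sketch.

```cpp
#include <string>

// Hypothetical sketch of the blank-delimiter case of findRealDelimiter().
// Returns the index of the first non-quoted blank that follows at least
// one non-blank character, or -1 if there is no such blank.
int findBlankDelimiter(const std::string& s) {
    std::string::size_type i = 0;
    while (i < s.size() && s[i] == ' ') ++i;  // skip leading blanks

    char quote = '\0';  // the quote character we are inside, if any
    for (; i < s.size(); ++i) {
        char c = s[i];
        if (c == '\\') {
            ++i;        // assumption: skip the escaped character too
            continue;
        }
        if (quote != '\0') {
            if (c == quote) quote = '\0';     // closing quote
        } else if (c == '\'' || c == '"') {
            quote = c;                         // opening quote
        } else if (c == ' ') {
            return static_cast<int>(i);        // first real blank
        }
    }
    return -1;
}
```

On the two examples above (with real blanks in place of *), this returns 6 and 10 respectively.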
Yes. Unfortunately, using space (blank, not TAB or newlines) as a delimiter has to be handled differently. Otherwise, it makes split() awkward to use. If " " is a delimiter, then getLeftToken() should skip over any blank spaces, then over any non-blank characters, and stop at the next blank space or the end of the string, whichever comes first.
Thus, if the string were "****cat***in*the*hat" where * represents a blank, then the first token is "cat", and the rest of the string is "**in*the*hat". The tokenizer should skip the first four spaces, read over the non-blank "cat", then stop at the first blank after "cat". It removes that blank, and the rest of the string is "**in*the*hat". Again, blanks within double or single quotes are not to be considered delimiters. You can assume there will be no \* (i.e., backslash followed by a blank), though you can handle it if you want.
Notice that it comes as "cat". If you didn't trim the token, then it would be "****cat".
If the delimiter contains more than one blank (and nothing else) assume it's just one blank space.
For any other delimiter, you simply look for the first "real" occurrence of the delimiter.
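For the blank-delimiter case described above, a stripped-down sketch might look like the following. It ignores quotes and escapes completely, and the name getLeftTokenBlank() and its signature are invented for this example. Returning an empty token for an all-blank string is also an assumption.

```cpp
#include <string>

// Hypothetical sketch of getLeftToken() when the delimiter is a blank.
// Skips leading blanks, returns the run of non-blank characters, and
// removes the token plus the single blank that ended it from `str`.
std::string getLeftTokenBlank(std::string& str) {
    std::string::size_type start = str.find_first_not_of(' ');
    if (start == std::string::npos) {  // nothing but blanks (assumption)
        str.clear();
        return "";
    }
    std::string::size_type end = str.find(' ', start);
    std::string token;
    if (end == std::string::npos) {
        token = str.substr(start);     // token runs to the end of string
        str.clear();
    } else {
        token = str.substr(start, end - start);
        str.erase(0, end + 1);         // remove only the one blank
    }
    return token;
}
```

On the "cat in the hat" example (with real blanks), the first call returns "cat" and leaves two blanks in front of "in the hat".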
Actually, it doesn't. While the rest are useful for tokenizing, this one static method is used for later projects to remove comments from the assembly code.
Therefore, you can assume that comments won't appear in methods dealing with tokenizing. (In any case, without knowing what the comment string is, you can't remove comments with other methods, so you have to make this assumption.)
getLeftToken() should get the first token (up to the first delimiter, should it exist), then replace the escaped characters in the token (replacing \' with ', \\ with \, \" with ", and \n with the newline). The rest of the string should remain unchanged (i.e. the substring after the token).
So split() merely calls getLeftToken() over and over, until the string is empty. It doesn't need to do any substitutions of escaped characters.
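The division of labor between the two methods can be sketched as follows. The getLeftToken() here is only a stand-in: it assumes a single-character, non-blank delimiter, ignores quotes, and has a made-up signature. The point is just that it does the escape replacement on the token and leaves the rest of the string untouched, while split() only loops.

```cpp
#include <string>
#include <vector>

// Stand-in for getLeftToken() (simplified, hypothetical signature).
// Returns the token up to the first unescaped delimiter, with escapes
// replaced, and removes the token and that delimiter from `str`.
std::string getLeftToken(std::string& str, char delim) {
    std::string token;
    std::string::size_type i = 0;
    for (; i < str.size(); ++i) {
        if (str[i] == '\\' && i + 1 < str.size()) {
            ++i;  // drop the backslash, keep (or translate) the next char
            token += (str[i] == 'n') ? '\n' : str[i];
        } else if (str[i] == delim) {
            break;  // first real delimiter
        } else {
            token += str[i];
        }
    }
    // Erase the token and, if one was found, the delimiter after it.
    str.erase(0, i < str.size() ? i + 1 : i);
    return token;
}

// split() does no substitution itself; it just keeps taking left tokens
// until nothing is left.
std::vector<std::string> split(std::string str, char delim) {
    std::vector<std::string> tokens;
    while (!str.empty()) {
        tokens.push_back(getLeftToken(str, delim));
    }
    return tokens;
}
```

For example, splitting a\,b,c\\,d on commas with this sketch yields three tokens: "a,b", "c\" (a single real backslash), and "d".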
You are encouraged to do so. You may find there are many convenient methods available, that will help you avoid doing work. You're always encouraged to read through the set of methods available to you.
Try this link. It's SGI (Silicon Graphics) reference to STL. It's not great, but it should do. There are various "find" and "replace" operations that might be useful.
Do you need to ask? Of course, you can! One useful function to write is one that replaces escaped characters.
See the class syllabus for policies concerning email
Last Modified: Wed Jun 5 10:38:37 EDT 2002