INFM743 - Development of Internet Applications

Week 4: Regular Expressions

Sample Files

Ken Lay's Email - one file with all of Ken Lay's messages from the Enron email collection.
nesting.pl - A file to illustrate nesting parentheses
Some sample code

Links

Regular expressions are a very powerful way to look for patterns in text. We are learning them as part of Perl, but regular expressions aren't Perl-specific: most programming languages support them, as do some operating systems and software packages.

Basics

Patterns are always contained between two / characters. We first saw these when we split lines:

    @lines = split(/\n/, $file);

In this case, the pattern is /\n/ - a new line, and since it's a pattern we put it between the /s.

We can use any series of characters as a pattern. For example, we may want to see if a line of text contains two lower case L characters in a row. In that case, our pattern is /ll/.

When we are trying to match a pattern or substitute text using regular expressions, we don't use an equals sign, but rather the =~ combination. The example above would look like this:

    $text = "The quick brown fox jumped over the lazy yellow dog.";
    if ($text =~ /ll/) {
        print "Yes\n";
    } else {
        print "No\n";
    }

This will print "Yes" because the characters "ll" appear in the word "yellow". If we used the pattern /lll/ (three ls), the pattern would NOT match because three ls do not appear in a row.

More Advanced Patterns

Start and End of a line - The beginning of a line is indicated with the ^ in a pattern. For example, the pattern /^The/ would check for a line beginning with "The". It would not match if "The" appeared only in the middle of the line. The end of the line is indicated with the $. The pattern /dog\.$/ would check for a line ending with "dog." (the backslash makes the period literal; an unescaped period is a wildcard, as described below). You can use both of these together as well. For example, the pattern /^yellow$/ would look for a line that contains only "yellow" with no other characters; the pattern says "the line starts, has the word yellow, and then ends."
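As a minimal sketch of the two anchors, the code below tests a handful of lines against each pattern. The sample lines are made up for illustration, and grep is used here simply to keep the elements of a list for which the pattern matches.

```perl
use strict;
use warnings;

# Hypothetical sample lines, chosen only to illustrate the anchors.
my @lines = (
    "The quick brown fox",
    "Over There",
    "yellow",
    "a yellow dog.",
);

# Lines that begin with "The".
my @starts_with_the = grep { /^The/ } @lines;

# Lines that end with "dog." (the period is escaped so it is literal).
my @ends_with_dog = grep { /dog\.$/ } @lines;

# Lines that contain nothing but "yellow".
my @only_yellow = grep { /^yellow$/ } @lines;

print "Starts with The: @starts_with_the\n";
print "Ends with dog.:  @ends_with_dog\n";
print "Only yellow:     @only_yellow\n";
```

Notice that "Over There" is not picked up by /^The/ even though it contains "The", and "a yellow dog." is not picked up by /^yellow$/.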

Wildcards - Very frequently, you will want to match patterns where you don't know what some of the characters are. In that case, you will use a wildcard character. In regular expressions, that is a period. The pattern /b.b/ will match anything with a b followed by any single character (a space, letter, number, punctuation mark, etc., though by default not a new line) followed by another b.
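Here is a small sketch of the period wildcard in action. The candidate strings are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical test strings for the pattern /b.b/.
my @candidates = ("bob", "b b", "b9b", "bb", "baab");

# Keep only the strings where exactly one character sits between two b's.
my @matched = grep { /b.b/ } @candidates;

print "Matched: ", join(", ", map { "'$_'" } @matched), "\n";
```

"bb" fails because the period must match one character, and "baab" fails because two characters separate the b's.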

There are also wildcards that are more specific:

\d - any digit
\D - any non-digit
\s - any white space character (spaces, tabs, new lines, etc)
\S - any non-space
\w - any word character (a letter, digit, or underscore)
\W - any non-word character
[brew your own] - square brackets can contain a set of characters, and any one of them may match. For example, [abcdefg0123] would match any single character listed in the brackets. You can also use ranges of characters here: [a-zA-Z0-9] would match any lower case letter, upper case letter, or digit. You can include anything you want in the brackets: [a-zA-Z0-5\-\. ] would match any letter, the digits 0 through 5, or a dash, period, or space.
[^brew your own] - you can use the ^ inside the square brackets to indicate "not" (I know this is confusing because the ^ also indicates the beginning of a line, but you just have to learn to be comfortable with it meaning something totally different when it's in square brackets.) So the pattern [^0-5] means NOT 0-5 and would match any character EXCEPT 0, 1, 2, 3, 4, or 5.
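To see how these wildcards behave, here is a rough sketch that checks single characters against several of them. The helper describe_char and its labels are hypothetical names made up for this example.

```perl
use strict;
use warnings;

# Return a list of labels saying which patterns match the given character.
# (describe_char is a made-up helper for this illustration.)
sub describe_char {
    my ($char) = @_;
    my @labels;
    push @labels, "digit"       if $char =~ /\d/;
    push @labels, "word"        if $char =~ /\w/;
    push @labels, "white space" if $char =~ /\s/;
    push @labels, "in set"      if $char =~ /[a-zA-Z0-5\-\. ]/;
    push @labels, "not 0-5"     if $char =~ /[^0-5]/;
    return @labels;
}

for my $char ("3", "7", "x", "_", " ", "-") {
    print "'$char': ", join(", ", describe_char($char)), "\n";
}
```

Note that "7" is a digit but is not in the set [a-zA-Z0-5\-\. ], and the underscore counts as a word character.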

Number of Occurrences - In addition to allowing for wildcard characters, you may also allow characters to occur a variable number of times. There are a few ways to do this:

* - The * indicates that the thing right before it is optional and can occur many times (zero or more). For example, /b.*b/ will match anything starting and ending with a b, with anything (including nothing) between the "b"s. So, it would match "bob" or "beelzebub" or "blathering slob".
+ - The + works the same as *, but it requires that the thing before it occur at least once, and it can occur multiple times. The pattern /ab+a/ will match anything starting and ending with "a" and with at least one "b" in the middle: "aba", "abba", "abbbbbbbbbba" would all match though "aa" would not.
{min,max} - You can use curly braces to specify a minimum and maximum number of times something can occur. /a[a-zA-Z]{1,4}z/ would look for anything starting with "a", ending with "z", and with between 1 and 4 letters occurring between them. You can also leave off the maximum value, but you should still include the comma. For example, /a[a-zA-Z]{2,}z/ would match any word starting with "a", ending in "z", and with at least 2 letters between them.
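The sketch below runs the four patterns from this list against a few made-up strings so you can compare what each one accepts.

```perl
use strict;
use warnings;

# Hypothetical test strings.
my @words = ("bb", "bob", "aa", "aba", "abbba", "abz", "abcdez", "abcdefz");

# grep keeps the strings for which the pattern matches.
my @star        = grep { /b.*b/ } @words;              # zero or more characters between b's
my @plus        = grep { /ab+a/ } @words;              # one or more b's between a's
my @one_to_four = grep { /a[a-zA-Z]{1,4}z/ } @words;   # 1 to 4 letters between a and z
my @two_or_more = grep { /a[a-zA-Z]{2,}z/ } @words;    # at least 2 letters between a and z

print "b.*b:             @star\n";
print "ab+a:             @plus\n";
print "a[a-zA-Z]{1,4}z:  @one_to_four\n";
print "a[a-zA-Z]{2,}z:   @two_or_more\n";
```

"bb" matches /b.*b/ because * allows nothing at all between the b's, while "aa" does not match /ab+a/ because + needs at least one b.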

Grouping

You may want to group patterns together and have them repeat. We do that with parentheses. Using parentheses also lets us retrieve the text that matched the pattern in Perl.

You can make a sub-pattern by enclosing it in parentheses. For example, say we are looking for an email address - we know that is something @ something dot something. You can have lots of "dot something"s at the end, though. My address - golbeck@cs.umd.edu - has an extra dot something, for example. So we want to require at least one "dot something" but allow many more. (Note: Matching email addresses can be very complicated, so for now we're just going to match on non-spaces. If you were really doing this, you would want to give a very precise list of characters.)

    /\S+\@\S+\.\S+/

This will match a series of non-spaces, followed by an @, followed by more non-spaces, a dot, and more non-spaces. Now, to allow for many dot-somethings, we're going to first put the last dot-something in parentheses:

    /\S+\@\S+(\.\S+)/

By themselves, those parentheses don't change what will match. However, we can now control the number of occurrences of that sub-pattern. In this case, we want at least one dot-something, but we will allow for many of them. Thus, we want the + character:

    /\S+\@\S+(\.\S+)+/

It's worth noting how the parentheses are different from the square brackets. With square brackets, we match on any one character in the brackets. With parentheses, we must match the exact pattern they contain.
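As a quick sketch, we can try the final pattern on a few strings. Apart from the address given above, the candidates are invented for illustration. They are written in single quotes so that Perl does not treat the @ as the start of an array name.

```perl
use strict;
use warnings;

# Candidate strings; only the first comes from the notes above.
my @candidates = (
    'golbeck@cs.umd.edu',
    'user@example.com',
    'user@localhost',
    'no-at-sign.example.com',
);

# Keep the strings that look like something @ something dot something.
my @looks_like_email = grep { /\S+\@\S+(\.\S+)+/ } @candidates;

print "$_\n" for @looks_like_email;
```

The third string fails because nothing after the @ contains a dot, and the fourth fails because it has no @ at all.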

There is an extra benefit of using the parentheses in Perl. We can use them to see the text that matched the pattern in parentheses. Perl assigns a special variable to this matching text: $1 holds the text matched by the first set of parentheses, $2 the second, and so on. So say we have a list of all the words in the English language, and we want to find a word that has all the vowels in order. Our pattern would be

    /[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*/

That is a bunch of non-vowels, then an a, then more non-vowels, then an e, and so on. We can check to see if our current word matches that pattern:

    if ($word =~ /[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*/ )

But if we want to know what the word is, we need a way of getting the text that matches. We can put the pattern inside parentheses. These parentheses go inside the slashes that contain the pattern. You will see them added in the example below. Once we do that, we can use the $1 variable to show us the text that matches the pattern:

    if ($word =~ /([^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*)/ ) {
        print "$1\n";
    }

(trivia - there are many words that match this pattern: abstemious (moderate or spare), abstentious (self-denying), acheilous (without a lip), adventious (rare form of adventitious), aerious (airy), affectious (obs., loving), alpestrious (obs., mountainous), anemious (growing in windy conditions), annelidous (pertaining to a worm), arsenious (containing arsenic), arterious (arterial), arteriosus (prolongation of right ventricle in mammals), caesious (bluish-gray), fracedinous (producing heat through putrefaction), gareisoun (obs., garrison), gravedinous (obs., drowsy), majestious (rare, majesty), materious (obs., material), parecious (proximity of reproductive organs in certain mosses), placentious (obs., complaisant), and tragedious (calamitous) all match).
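Here is a minimal sketch that runs the vowel pattern, with [^aeiou] between every pair of vowels, over a tiny word list and collects $1 for each match. The word list is a made-up stand-in for a real dictionary file.

```perl
use strict;
use warnings;

# A tiny hypothetical word list standing in for a full dictionary.
my @words = ("abstemious", "facetious", "education", "sequoia");

my @in_order;
for my $word (@words) {
    # The outer parentheses capture the matching text into $1.
    if ($word =~ /([^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*)/) {
        push @in_order, $1;
    }
}

print "$_\n" for @in_order;
```

"education" and "sequoia" contain all five vowels, but not in a-e-i-o-u order, so they are skipped.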

We can also search for a few things in a row. For example, say we want to look for sentences that have two words that start and end with an a. Our pattern would be

    / a[a-zA-Z]+a .* a[a-zA-Z]+a /

We can then put the two patterns that represent the words in parentheses:

    / (a[a-zA-Z]+a) .* (a[a-zA-Z]+a) /

Then, we can use Perl to get both words. $1 will get the first word, and $2 will get the second one:

    if ($line =~ / (a[a-zA-Z]+a) .* (a[a-zA-Z]+a) /) {
        print "The first word is $1\n";
        print "The second word is $2\n";
    }

Note that the spaces in the pattern are matched literally, so each word must have a space on both sides, and the two words must have something between them (the .* sits between two spaces). This gets more complicated with nested parentheses, and I try to avoid that, but I've included an example above to show you how it works if you're really curious.
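For the curious, here is a short sketch of how nested parentheses are numbered. This is not the nesting.pl sample file, just an illustration written for these notes. Perl numbers the groups by the position of each opening parenthesis, counting from the left, so the outer group is $1 and the inner groups follow.

```perl
use strict;
use warnings;

# Single quotes keep Perl from treating @cs as an array.
my $address = 'golbeck@cs.umd.edu';

my ($whole, $user, $host);
# Opening parentheses, left to right: the outer group is $1,
# the part before the @ is $2, and the part after it is $3.
if ($address =~ /((\S+)\@(\S+))/) {
    ($whole, $user, $host) = ($1, $2, $3);
    print "Whole address: $whole\n";
    print "Before the \@: $user\n";
    print "After the \@:  $host\n";
}
```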

Examples

Recall our hero saving the day with regular expressions. Now, you will be the hero. We have a huge collection of Ken Lay's emails. Let's save the day by searching for something formatted as an address. This won't be perfect, but it will do pretty well. Generally, addresses start with a set of numbers and are followed by some words (e.g. 123 Main St.). We can start building this pattern like this:

    /\d+ [a-zA-Z0-9\-,\. ]+/

That will look for at least one digit followed by a space, followed by some combination of letters, numbers, dashes, commas, periods, and spaces. Now this will pick up anything like that, including if someone says "I had 2 beers with dinner last night". This pattern would match "2 beers with dinner last night" as an address. We can improve it a bit by looking for lines where the whole thing is an address (like you'd see in an email signature). So let's add some code to say that the line should only contain this address, but allow for some optional white space before and after in case the address is indented. Notice we have added ^\s* to the beginning and \s*$ to the end:

    /^\s*\d+ [a-zA-Z0-9\-,\. ]+\s*$/

Now that we have this pattern, we want to search each line in the file. We will add parentheses around the pattern to allow us to see the part that matched.

    #!/usr/bin/perl
    # Stop with an error message if the file cannot be opened.
    open (FILE, "KenLayEmail.txt") or die "Cannot open KenLayEmail.txt: $!";
    while (<FILE>) {
        if ($_ =~ /^(\s*\d+ [a-zA-Z0-9\-,\. ]+\s*)$/) {
            print "$1\n";
        }
    }
    close FILE;

That's it. If you run this, you will see that there are lots of addresses that come up. Some other stuff will also appear, and you could potentially refine the pattern to improve the precision and recall.
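If you want to experiment without the email file, here is a rough sketch that applies the same idea to a few lines held in an array. The sample lines are invented, and as a small variation the parentheses are placed after the leading \s* so that $1 leaves out the indentation.

```perl
use strict;
use warnings;

# Hypothetical lines standing in for the contents of an email.
my @lines = (
    "I had 2 beers with dinner last night",
    "  1400 Smith Street",
    "Houston, TX 77002",
    "123 Main St.",
);

my @addresses;
for my $line (@lines) {
    # The whole line must look like an address; $1 skips the leading white space.
    if ($line =~ /^\s*(\d+ [a-zA-Z0-9\-,\. ]+)\s*$/) {
        push @addresses, $1;
    }
}

print "$_\n" for @addresses;
```

The "2 beers" sentence is no longer picked up, because the line does not start with the digits.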

Let's do one more example: searching for phone numbers. Say we're sticking with US numbers. They have an area code, prefix, and line number. We'll say they can be (xxx) xxx-xxxx, xxx-xxx-xxxx, xxx.xxx.xxxx, xxx xxx xxxx, or some combination with dashes, spaces, and periods between the numbers. We are going to build this two ways. First, we'll build a pattern for numbers that have the area code in parentheses. That pattern would be

    /\(\d\d\d\) \d\d\d[\.\- ]\d\d\d\d/

This says we start with three digits in parentheses (the parentheses are escaped with backslashes so they match literal parentheses instead of forming a group), followed by a space, followed by three digits, a dot, dash, or space, and four more digits. The other possibility would be for a pattern without the parentheses. That would be

    /\d\d\d[\.\- ]\d\d\d[\.\- ]\d\d\d\d/

Notice this is similar to the previous one, but there are no parentheses, and we've added a dot, dash, or space between the area code and prefix.

So we have two possible patterns and we want to match either one. We can combine them to say "match one or the other". To do that, we put a vertical bar between them (usually, the | is the shift of the \). This bar stands for "or". (Note to those of you who program in other languages - do NOT use || - that is for conditions, not regular expressions). So the patterns are combined:

    /\(\d\d\d\) \d\d\d[\.\- ]\d\d\d\d|\d\d\d[\.\- ]\d\d\d[\.\- ]\d\d\d\d/

Look closely and you will see the | in between them. Now we have a crazy looking pattern. Again, we will put it in parentheses in the code so we can print the result:

    #!/usr/bin/perl
    # Stop with an error message if the file cannot be opened.
    open (FILE, "KenLayEmail.txt") or die "Cannot open KenLayEmail.txt: $!";
    while (<FILE>) {
        if ($_ =~ /(\(\d\d\d\) \d\d\d[\.\- ]\d\d\d\d|\d\d\d[\.\- ]\d\d\d[\.\- ]\d\d\d\d)/) {
            print "$1\n";
        }
    }
    close FILE;

You will see this works beautifully in the code, returning lots and lots of phone numbers for us. On some systems, I could even teach you to have your Perl script automatically call those numbers :).
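To try the combined pattern without the email file, here is a small sketch that wraps it in a helper and runs it on a few strings. The helper name find_phone and the sample text (including the phone numbers) are made up for illustration.

```perl
use strict;
use warnings;

# Return the first phone number found in the text, or undef if there is none.
# (find_phone is a hypothetical helper written for this example.)
sub find_phone {
    my ($text) = @_;
    if ($text =~ /(\(\d\d\d\) \d\d\d[\.\- ]\d\d\d\d|\d\d\d[\.\- ]\d\d\d[\.\- ]\d\d\d\d)/) {
        return $1;
    }
    return undef;
}

for my $text ("Call (713) 555-0100 today",
              "Fax: 713.555.0199",
              "Office 713-555 0123",
              "Extension 5550") {
    my $phone = find_phone($text);
    print "$text => ", (defined $phone ? $phone : "no match"), "\n";
}
```

The first string is caught by the left side of the |, the next two by the right side, and the last has too few digits for either.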

Substitution

You can also use patterns to replace matching text with something else. The general syntax for this is:

    $string =~ s/pattern to substitute/thing to substitute it with/;

We always start with the variable storing the string and use the =~ operator. Then we have two parts separated by three / marks. The pattern to replace goes between the first two slashes. The thing to replace it with goes between the second two. We also always start with an "s" to indicate we are substituting.

For example, say we want to replace every number 1 with the word "one" in the string "That's 1 small step for a man, 1 giant leap for mankind." We can start by doing this:

    $string = "That's 1 small step for a man, 1 giant leap for mankind.";
    $string =~ s/1/one/;

If you try that code, you will see it replaces the first 1, but not the second. This is because the substitution only replaces the first instance. If we want to replace them all, we add a "g" to the end to indicate global replacement.

    $string = "That's 1 small step for a man, 1 giant leap for mankind.";
    $string =~ s/1/one/g;

There is another letter we can put with the g - an i. The i indicates that we should ignore the case, making our match case insensitive. You can use just the g, just the i, or both together.
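The sketch below puts the three variations side by side. Each substitution works on its own copy of the string so the results can be compared. The "Dog dog DOG" string is a made-up example for the i flag.

```perl
use strict;
use warnings;

my $string = "That's 1 small step for a man, 1 giant leap for mankind.";

# Without g: only the first 1 is replaced.
my $first_only = $string;
$first_only =~ s/1/one/;

# With g: every 1 is replaced.
my $all = $string;
$all =~ s/1/one/g;

# With g and i together: every "dog" is replaced, whatever its case.
my $pets = "Dog dog DOG";
$pets =~ s/dog/cat/gi;

print "$first_only\n";
print "$all\n";
print "$pets\n";
```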

We use both substitution and hashes in the example code which implements the ADFGX cipher. You can find more details about this in the exercises for this week.