Week 4: Regular Expressions
Regular expressions are a very powerful way to look for patterns in text. 
We are learning them as 
part of perl, but regular expressions aren't perl specific - most 
programming languages allow you to
use them, as well as some operating systems and software packages.
Basics
Patterns are usually contained between two / characters. We first saw these
when we split lines. 
@lines = split(/\n/,$file);
In this case, the pattern is /\n/ - a new line, and since it's a pattern 
we put it between
the /s.
We can use any series of characters as a pattern. For example, we may want 
to see if a line
of text contains two lower case L characters in a row. In that case, our 
pattern is /ll/. 
When we are trying to match a pattern or substitute text using regular 
expressions, we don't 
use an equals sign, but rather the =~ combination. The example above would 
look like this:
$text = "The quick brown fox jumped over the lazy yellow dog.";
if ($text =~ /ll/) {
	print "Yes\n";
} else {
	print "No\n";
}
This will print "Yes" because the characters "ll" appear in the word 
"yellow". If we used
the pattern /lll/ (three ls), the pattern would NOT match because three ls 
do not appear in a row.
More Advanced Patterns
Start and End of a line - The beginning of a line is indicated with 
the ^ in a pattern. For example, the pattern /^The/ would check for a line 
beginning with "The". It would not match if "The" appeared only in the 
middle of the line. The end of the line is indicated with the $. The 
pattern /dog\.$/ would check for a line ending with "dog." (the period is 
written as \. because a plain period has a special meaning in patterns, 
described under Wildcards). You can use both of these together as well. 
For example, the pattern /^yellow$/ would look for a line that contains 
only "yellow" with no other characters; the pattern says "the line starts, 
has the word yellow, and then ends."
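Here is a small sketch of the anchors in action. The helper name check_anchors and the sample lines are made up for illustration; the patterns are the ones discussed above, with the period in "dog." escaped as \. so it matches a literal period.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the lesson files): returns a list of
# descriptions, one for each anchored pattern that matches the line.
sub check_anchors {
    my ($line) = @_;
    my @hits;
    push @hits, "starts with The" if $line =~ /^The/;
    push @hits, "ends with dog."  if $line =~ /dog\.$/;
    push @hits, "is only yellow"  if $line =~ /^yellow$/;
    return @hits;
}

foreach my $line ("The lazy yellow dog.", "Feed The dog.", "yellow", "yellow dog") {
    my @hits = check_anchors($line);
    print "'$line': ", (@hits ? join(", ", @hits) : "no match"), "\n";
}
```

Notice that "Feed The dog." does not count as starting with "The", and "yellow dog" is not "only yellow", because the anchors pin the pattern to the start and end of the line.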
Wildcards - Very frequently, you will want to match patterns where 
you don't know what some of the characters are. In that case, you will use 
a wildcard character. In regular expressions, that is a period. The 
pattern /b.b/ will match anything with a b followed by any single 
character (including spaces, letters, numbers, punctuation, etc., but not 
a new line) followed by another b.
There are also wildcards that are more specific:
- \d - any digit
- \D - any non-digit
- \s - any white space character (spaces, tabs, new lines, etc)
- \S - any non-white-space character
- \w - any word character (a letter, digit, or underscore)
- \W - any non-word character
- [brew your own] - square brackets can contain a set of characters, and 
any one of them will match. For example, [abcdefg0123] would match any 
single character listed in the brackets. You can also use ranges of 
characters here: [a-zA-Z0-9] would match any lower case letter, upper case 
letter, or digit. You can include anything you want in the brackets: 
[a-zA-Z0-5\-\. ] would match any letter, the digits 0 through 5, or a 
dash, period, or space.
- [^brew your own] - you can use the ^ as the first character inside the 
square brackets to indicate "not" (I know this is confusing because the ^ 
also indicates the beginning of a line, but you just have to learn to be 
comfortable with it meaning something totally different when it's in 
square brackets.) So the pattern [^0-5] means NOT 0-5 and would match any 
character EXCEPT 0, 1, 2, 3, 4, or 5.
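To get a feel for the wildcards in the list above, here is a rough sketch that tests one character at a time against several of them. The helper describe_char and the sample characters are invented for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: lists which of a few character classes
# a single character belongs to.
sub describe_char {
    my ($char) = @_;
    my @labels;
    push @labels, "digit"      if $char =~ /\d/;
    push @labels, "whitespace" if $char =~ /\s/;
    push @labels, "word"       if $char =~ /\w/;
    push @labels, "in [a-f]"   if $char =~ /[a-f]/;
    push @labels, "not 0-5"    if $char =~ /[^0-5]/;
    return join(", ", @labels);
}

foreach my $char ("7", "3", "c", " ", "_", "-") {
    print "'$char': ", describe_char($char), "\n";
}
```

The digit 3 is the only sample that fails [^0-5], and the underscore counts as a word character even though it is not a letter or digit.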
Number of Occurrences - In addition to allowing for wildcard 
characters, you may also allow characters to occur a variable number of 
times. There are a few ways to do this:
-  * - the * indicates that the thing right before it is optional, and 
can occur many times (that is, zero or more times). For example, /b.*b/ 
will match anything starting and ending with a b, with anything (including 
nothing) between the "b"s. So, it would match "bob" or "beelzebub" or 
"blathering slob".
-  + - The + works the same as *, but it requires the thing before it 
occurs at least once, 
and it can occur multiple times. The pattern /ab+a/ will match anything 
starting and ending with "a" and
with at least one "b" in the middle: "aba", "abba", "abbbbbbbbbba" would 
all match though "aa"
would not.
- {min,max} - You can use curly braces to specify a minimum and maximum 
number of times something can occur. /a[a-zA-Z]{1,4}z/ would look for 
anything starting with "a", ending with "z", and with between 1 and 4 
letters occurring between them. You can also leave off the maximum value, 
but you should still include the comma. For example, /a[a-zA-Z]{2,}z/ 
would match any word starting with "a", ending in "z", and with at least 
2 letters between them.
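The sketch below tries the *, +, and {min,max} patterns from this list on a few short strings. The helper quantifier_report and the test words are assumptions made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the quantifier patterns (as text)
# that match the given word.
sub quantifier_report {
    my ($word) = @_;
    my @hits;
    push @hits, 'b.*b'             if $word =~ /b.*b/;
    push @hits, 'ab+a'             if $word =~ /ab+a/;
    push @hits, 'a[a-zA-Z]{1,4}z'  if $word =~ /a[a-zA-Z]{1,4}z/;
    push @hits, 'a[a-zA-Z]{2,}z'   if $word =~ /a[a-zA-Z]{2,}z/;
    return @hits;
}

foreach my $word ("bob", "bb", "abba", "aa", "abz", "abcz", "abcdefz") {
    my @hits = quantifier_report($word);
    print "$word: ", (@hits ? join("  ", @hits) : "no match"), "\n";
}
```

"bb" still matches /b.*b/ because * allows zero characters between the two b's, while "aa" fails /ab+a/ because + needs at least one b. "abz" has only one letter between a and z, so it fits {1,4} but not {2,}.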
Grouping
You may want to group patterns together and have them repeat. We do that 
with parentheses. Using parentheses also lets you retrieve the text that 
matched the pattern in perl. 
You can make a sub-pattern by enclosing it in parentheses. For example, 
say we are looking for an email address - 
we know that is something @ something dot something.  You can have lots of 
"dot something"s at the end, though. My 
address - golbeck@cs.umd.edu -  has an extra dot something, for example.
So we want to allow it to match as long as there is one "dot something", 
but also allow many more. (Note: Matching email addresses can be very 
complicated, so for now we're just going to match on non-spaces. If you 
were really doing this, you would want to give a very precise list of 
characters.) 
/\S+\@\S+\.\S+/ 
This will match a series of non-spaces, followed by an @, followed by more 
non-spaces, a dot, and more non-spaces.
Now, to allow for many dot-somethings, we're going to first put the last 
dot-something in parentheses:
/\S+\@\S+(\.\S+)/ 
By themselves, those parentheses don't change what will match. However, we 
can now repeat that whole sub-pattern. In this case, we want at least one 
dot-something, but we will allow for many of them. Thus, we want the + 
character:
/\S+\@\S+(\.\S+)+/ 
It's worth noting how the parentheses are different from the square 
brackets. With square brackets, we match
on any character in the brackets. With parentheses, we must match the 
exact pattern they contain.
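Here is a quick sketch of both ideas: the email pattern with a repeating group, and the difference between parentheses and square brackets. The helper names and sample strings are hypothetical, and the two small comparison patterns are anchored with ^ and $ so the whole string has to fit.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: 1 if the text contains something shaped like
# something@something.something, 0 otherwise (same loose pattern as above).
sub looks_like_email {
    my ($text) = @_;
    return $text =~ /\S+\@\S+(\.\S+)+/ ? 1 : 0;
}

# Parentheses: the exact sequence "ab", repeated one or more times.
sub repeats_ab {
    my ($text) = @_;
    return $text =~ /^(ab)+$/ ? 1 : 0;
}

# Square brackets: any mix of the single characters a and b.
sub only_a_or_b {
    my ($text) = @_;
    return $text =~ /^[ab]+$/ ? 1 : 0;
}

foreach my $text ('golbeck@cs.umd.edu', 'user@example.com', 'user@localhost') {
    print "$text: ", looks_like_email($text), "\n";
}
foreach my $text ("abab", "abba") {
    print "$text: (ab)+ ", repeats_ab($text), "  [ab]+ ", only_a_or_b($text), "\n";
}
```

"abba" is made of only a's and b's, so [ab]+ accepts it, but it is not "ab" repeated, so (ab)+ does not.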
There is an extra benefit of using the parentheses in perl. We can use 
them to see the text that matched the pattern in parentheses. Perl assigns 
this matching text to special variables: $1 holds the text matched by the 
first set of parentheses, $2 holds the text matched by the second, and so 
on. 
So say we have a list of all the words in the English language, and we 
want to find a word that has all the vowels in order. Our pattern would be 
/[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*/ - that is 
a bunch of non-vowels, then an a, then more non-vowels, then an e, and so 
on. We can check to see if our current word matches that pattern:
if ($word =~ /[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*/)
But if we want to know what the word is, we need a way of getting the text 
that matches. We can put the pattern inside parentheses. These parentheses 
go inside the slashes that contain the pattern. You will see them added in 
the example below. Once we do that, we can use the $1 variable to show us 
the text that matches the pattern:
if ($word =~ /([^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*)/) {
	print "$1\n";
}
(trivia - there are many words that match this pattern: abstemious 
(moderate or spare), abstentious (self-denying), acheilous (without a 
lip), adventious (rare form of adventitious), aerious (airy), affectious 
(obs., loving), alpestrious (obs., mountainous), anemious (growing in 
windy conditions), annelidous (pertaining to a worm), arsenious 
(containing arsenic), arterious (arterial), arteriosus (prolongation of 
right ventricle in mammals), caesious (bluish-gray), fracedinous 
(producing heat through putrefaction), gareisoun (obs., garrison), 
gravedinous (obs., drowsy), majestious (rare, majesty), materious (obs., 
material), parecious (proximity of reproductive organs in certain mosses), 
placentious (obs., complaisant), and tragedious (calamitous).)
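One thing to watch: without ^ and $, the vowel pattern can match part of a longer word, and then $1 holds only that part. The sketch below is one possible variation, not the version from the lesson: it assumes we want the whole word to fit, so it adds the anchors, and it lower-cases the word first with lc. The helper name and the word list are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the lower-cased word if the WHOLE word has
# each vowel exactly once, in order; returns undef otherwise.
sub vowels_in_order {
    my ($word) = @_;
    if (lc($word) =~ /^([^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*)$/) {
        return $1;
    }
    return undef;
}

foreach my $word ("Abstemious", "facetious", "education", "unabstemious") {
    my $match = vowels_in_order($word);
    print "$word: ", (defined($match) ? "matched $match" : "no match"), "\n";
}
```

With the anchors in place, "unabstemious" is rejected because of the extra u at the front. Without them, the pattern would have matched the "nabstemious" part of it.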
We can also search for a few things in a row. For example, say we want to 
look for sentences that have two
words that start and end with an a. Our pattern would be 
/ a[a-zA-Z]+a .* a[a-zA-Z]+a /
We can then put the two patterns that represent the words in parentheses:
/ (a[a-zA-Z]+a) .* (a[a-zA-Z]+a) /
Then, we can use perl to get both words. $1 will get the first word, and 
$2 will get the second one:
if ($line =~ / (a[a-zA-Z]+a) .* (a[a-zA-Z]+a) /) {
	print "The first word is $1\n";
	print "The second word is $2\n";
}
This gets more complicated with nested parentheses, and I try to
avoid that, but I've included an example above to show you how it works if 
you're really curious.
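As a rough illustration of nesting (a made-up example, not the one these notes refer to): Perl numbers the groups by counting opening parentheses from left to right. The date format and the helper split_date below are assumptions for the sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: pulls a yyyy-mm-dd date out of some text.
# The outer parentheses open first, so they are $1; the three inner
# groups open next, in order, so they are $2, $3, and $4.
sub split_date {
    my ($text) = @_;
    if ($text =~ /((\d\d\d\d)-(\d\d)-(\d\d))/) {
        return ($1, $2, $3, $4);
    }
    return ();
}

my ($whole, $year, $month, $day) = split_date("Sent on 2001-10-23 by fax");
print "whole: $whole\n";
print "year: $year, month: $month, day: $day\n";
```

So the outer group gives you the full date in $1, and the pieces inside it follow as $2, $3, and $4.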
Examples
Recall our hero saving the day 
with regular expressions. 
Now, you will be the hero. We have a 
huge collection of Ken Lay's emails. 
Let's save the day by searching for something formatted as an address. 
This won't be perfect, 
but it will do pretty well. Generally, addresses start with a set of 
numbers and are followed by some 
words (e.g. 123 Main St.). We can start building this pattern like this: 
/\d+ [a-zA-Z0-9\-,\. ]+/
That will look for at least one digit followed by a space, followed by 
some combination of letters, numbers,
dashes, commas, periods, and spaces. Now this will pick up anything like 
that including if someone says "I
had 2 beers with dinner last night". This pattern would match "2 beers 
with dinner last night" as an address.
We can improve it a bit by looking for lines where the whole thing is an 
address (like you'd see in an email 
signature). So let's add some code to say that the line should only 
contain this address, but allow for some
optional white space before and after in case the address is indented. 
Notice we have added ^\s* to the beginning
and \s*$ to the end:
/^\s*\d+ [a-zA-Z0-9\-,\. ]+\s*$/
Now that we have this pattern, we want to search each line in the file. We 
will add parentheses around the
pattern to allow us to see the part that matched.
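#!/usr/bin/perl
# Print every line of the file that looks like a street address.
open (FILE, "KenLayEmail.txt") or die "Cannot open KenLayEmail.txt: $!";
while (<FILE>) {
	if ($_ =~ /^(\s*\d+ [a-zA-Z0-9\-,\. ]+\s*)$/  ){
		print "$1\n";
	}
} 
close FILE;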
That's it. If you run this, you will see that there are lots of addresses 
that come up. Some other stuff will
also appear, and you could potentially refine the pattern to improve the 
precision and recall.
Let's do one more example: searching for phone numbers. Say we're sticking 
with US numbers. They have an area
code, prefix, and extension. We'll say they can be (xxx) xxx-xxxx, 
xxx-xxx-xxxx, xxx.xxx.xxxx, xxx xxx xxxx, or
some combination with dashes, spaces, and periods between the numbers. We 
are going to build this two ways.
First, we'll build a pattern for numbers that have the area code in 
parentheses. That pattern would be 
/\(\d\d\d\) \d\d\d[\.\- ]\d\d\d\d/. This says we start with three digits 
in parentheses, followed by a space, followed by three digits, a dot, 
dash, or space, and four more digits. The other possibility would be for a
pattern without the parentheses. That would be 
/\d\d\d[\.\- ]\d\d\d[\.\- ]\d\d\d\d/. Notice this is similar to the 
previous one, but there are no parentheses, and we've added a dot, dash, 
or space between the area code and prefix. 
So we have two possible patterns and we want to match either one. We can 
combine them to say "match one or the other".
To do that, we put a vertical bar between them (usually, the | is the 
shift of the \). This bar stands for "or" . (Note
to those of you who program in other languages - do NOT use || - that is 
for conditions, not regular expressions). 
So the patterns are combined: 
/\(\d\d\d\) \d\d\d[\.\- ]\d\d\d\d|\d\d\d[\.\- ]\d\d\d[\.\- ]\d\d\d\d/ 
Look closely and you will see the | in between them. Now we have a crazy 
looking pattern. Again, we will put it in parentheses in the code so we 
can print the result:
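#!/usr/bin/perl
# Print the first phone number found on each line of the file.
open (FILE, "KenLayEmail.txt") or die "Cannot open KenLayEmail.txt: $!";
while (<FILE>) {
	if ($_ =~ /(\(\d\d\d\) \d\d\d[\.\- ]\d\d\d\d|\d\d\d[\.\- ]\d\d\d[\.\- ]\d\d\d\d)/)
	{
		print "$1\n";
	}
}
close FILE;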
You will see this works beautifully in the code, returning lots and lots 
of phone numbers for us. On some systems,
I could even teach you to have your perl script automatically call those 
numbers :).
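If you don't have the email file handy, you can try the combined pattern on a few strings directly. This is only a sketch: the helper find_phone and the sample lines (with invented phone numbers) are assumptions, while the pattern is the same one used in the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the first phone number found in the line,
# or undef if nothing matches either alternative.
sub find_phone {
    my ($line) = @_;
    if ($line =~ /(\(\d\d\d\) \d\d\d[\.\- ]\d\d\d\d|\d\d\d[\.\- ]\d\d\d[\.\- ]\d\d\d\d)/) {
        return $1;
    }
    return undef;
}

foreach my $line ("Call (713) 555-0143 today",
                  "Fax: 713.555.0143",
                  "My cell is 713 555 0143",
                  "Order number 7135550143") {
    my $phone = find_phone($line);
    print "$line => ", (defined($phone) ? $phone : "no phone number"), "\n";
}
```

The last line has ten digits in a row with no separators, so neither side of the | matches it.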
Substitution
 You can also use patterns to substitute a pattern 
with something else. The general syntax for this is:
$string =~ s/pattern to substitute/thing to substitute it with/;
We always start with the variable storing the string and use the =~ 
operator. Then, we have two parts separated by three / marks. The pattern 
containing the thing to replace goes between the first two slashes. The 
thing to replace it with goes between the second two. We also always start 
it with an "s" to indicate we are substituting.
For example, say we want to substitute every number 1 with the word 
"one" in the string "That's 1 small step for a man, 1 giant leap for 
mankind."
We can start by doing this:
$string = "That's 1 small step for a man, 1 giant leap for mankind.";
$string =~ s/1/one/;
If you try that code, you will see it replaces the first 1, but not the 
second. This is because the substitution only replaces the first instance. 
If we want to replace them all, we add a "g" to the end to indicate 
global replacement.
$string = "That's 1 small step for a man, 1 giant leap for mankind.";
$string =~ s/1/one/g; 
There is another letter we can put with the g - an i. The i indicates that 
we should ignore the case, making our substitution case insensitive. You 
can use just the g, just the i, or both together.
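To see how the g and i letters change the result, here is a small sketch that runs the same substitution four ways. The helper normalize and the sample string are made up for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replaces "perl" with "Perl" using the requested
# letters after the last slash, and returns the changed copy.
sub normalize {
    my ($text, $flags) = @_;
    if    ($flags eq "")   { $text =~ s/perl/Perl/;   }
    elsif ($flags eq "g")  { $text =~ s/perl/Perl/g;  }
    elsif ($flags eq "i")  { $text =~ s/perl/Perl/i;  }
    elsif ($flags eq "gi") { $text =~ s/perl/Perl/gi; }
    return $text;
}

my $text = "PERL, perl, and perl";
foreach my $flags ("", "g", "i", "gi") {
    print "flags '$flags': ", normalize($text, $flags), "\n";
}
```

With no letters, only the first lower case "perl" changes. With just the i, only the first match in any case changes, which here is the "PERL" at the start. With both, every one is replaced.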
We use both substitution and hashes in the example code which implements 
the ADFGX cipher. You can find more details about this in the exercises 
for this week.