Week 4: Regular Expressions
Regular expressions are a very powerful way to look for patterns in text.
We are learning them as part of Perl, but regular expressions aren't
Perl-specific: most programming languages let you use them, as do some
operating systems and software packages.
Basics
Patterns are always contained between two / characters. We first saw these
when we split lines.
@lines = split(/\n/,$file);
In this case, the pattern is /\n/ - a new line, and since it's a pattern
we put it between
the /s.
We can use any series of characters as a pattern. For example, we may want
to see if a line
of text contains two lower case L characters in a row. In that case, our
pattern is /ll/.
When we are trying to match a pattern or substitute text using regular
expressions, we don't
use an equals sign, but rather the =~ combination. The example above would
look like this:
$text = "The quick brown fox jumped over the lazy yellow dog.";
if ($text =~ /ll/) {
print "Yes\n";
} else {
print "No\n";
}
This will print "Yes" because the characters "ll" appear in the word
"yellow". If we used
the pattern /lll/ (three ls), the pattern would NOT match because three ls
do not appear in a row.
More Advanced Patterns
Start and End of a line - The beginning of a line is indicated with
the ^ in a pattern. For example, the pattern /^The/ would check for a line
beginning with "The". It would not match if "The" appeared only in the
middle of the line. The end of the line is indicated with the $. The
pattern /dog\.$/ would check for a line ending with "dog." (the backslash
makes the period a literal period; without it, the period is a wildcard,
as described below). You can use both of these together as well. For
example, the pattern /^yellow$/ would look for a line that contains only
"yellow" with no other characters; the pattern says "the line starts, has
the word yellow, and then ends."
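Here is a small sketch that tries the three anchored patterns above on a
few lines. The sample lines and variable names are made up for
illustration. Note the backslash in /dog\.$/, which makes the period a
literal period.

```perl
use strict;
use warnings;

# Made-up sample lines, chosen only for illustration.
my @samples = (
    "The quick brown fox",
    "Over The hill",
    "yellow",
    "a lazy yellow dog.",
);

my (@starts_with_the, @ends_with_dog, @only_yellow);
foreach my $line (@samples) {
    push @starts_with_the, $line if $line =~ /^The/;      # "The" at the very start
    push @ends_with_dog,   $line if $line =~ /dog\.$/;    # literal "dog." at the end
    push @only_yellow,     $line if $line =~ /^yellow$/;  # nothing but "yellow"
}

print "Starts with The: ", join(" | ", @starts_with_the), "\n";
print "Ends with dog.:  ", join(" | ", @ends_with_dog), "\n";
print "Only yellow:     ", join(" | ", @only_yellow), "\n";
```

"Over The hill" is not reported by /^The/ because "The" is in the middle of
the line, and "a lazy yellow dog." is not reported by /^yellow$/ because
the line contains other characters.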
Wildcards - Very frequently, you will want to match patterns where you
don't know what some of the characters are. In that case, you will use a
wildcard character. In regular expressions, that is a period. The pattern
/b.b/ will match anything with a b followed by any single character
(including spaces, letters, numbers, punctuation, etc., but by default not
a new line) followed by another b.
There are also wildcards that are more specific:
- \d - any digit
- \D - any non-digit
- \s - any white space character (spaces, tabs, new lines, etc)
- \S - any non-space
- \w - any word character (a letter, digit, or underscore)
- \W - any non-word character
- [brew your own] - square brackets can contain a set of characters, and
any one of them can match. For example, [abcdefg0123] would match any
single character listed in the brackets. You can also use ranges of
characters here: [a-zA-Z0-9] would match any lower case letter, upper case
letter, or digit. You can include anything you want in the brackets; for
example, [a-zA-Z0-5\-\. ] would match any letter, the digits 0 through 5,
or a dash, period, or space.
- [^brew your own] - you can use the ^ inside the square brackets to
indicate "not". (I know this is confusing because the ^ also indicates the
beginning of a line, but you just have to learn to be comfortable with it
meaning something totally different when it's the first character inside
square brackets.) So the pattern [^0-5] means NOT 0-5 and would match any
character EXCEPT 0, 1, 2, 3, 4, or 5.
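To get a feel for the wildcards listed above, here is a minimal sketch.
The check() helper and the sample strings are my own inventions for this
example (check() is not a Perl built-in); it just reports whether a
pattern matches anywhere in a string.

```perl
use strict;
use warnings;

# check() is a hypothetical helper written for this sketch.
# It reports whether $pattern matches anywhere in $string.
sub check {
    my ($string, $pattern) = @_;
    my $result = ($string =~ /$pattern/) ? "match" : "no match";
    print "\"$string\" against /$pattern/: $result\n";
    return $result;
}

check("bob",   'b.b');       # match: the . stands for the o
check("b b",   'b.b');       # match: the . can also be a space
check("bb",    'b.b');       # no match: nothing sits between the two b's
check("room7", '\d');        # match: 7 is a digit
check("a_1",   '^\w\w\w$');  # match: letter, underscore, and digit are word characters
check("a-1",   '^\w\w\w$');  # no match: the dash is not a word character
check("7",     '[^0-5]');    # match: 7 is outside the range 0-5
check("3",     '[^0-5]');    # no match: 3 is inside the range 0-5
```

The patterns are written in single quotes so that Perl leaves the
backslashes alone before the pattern is used inside the slashes.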
Number of Occurrences - In addition to allowing for wildcard characters,
you may also allow characters to occur a variable number of times. There
are a few ways to do this:
- * - the * indicates that the thing right before it is optional and can
occur any number of times (zero or more). For example, /b.*b/ will match
anything starting and ending with a b, with anything (including nothing)
between the "b"s. So, it would match "bb", "bob", "beelzebub", or
"blathering slob".
- + - The + works the same as *, but it requires the thing before it to
occur at least once, and it can occur multiple times. The pattern /ab+a/
will match anything starting and ending with "a" and with at least one "b"
(and nothing else) in the middle: "aba", "abba", and "abbbbbbbbbba" would
all match, though "aa" would not.
- {min,max} - You can use curly braces to specify a minimum and maximum
number of times something can occur. /a[a-zA-Z]{1,4}z/ would look for
anything starting with "a", ending with "z", and with between 1 and 4
letters occurring between them. You can also leave off the maximum value,
but you should still include the comma. For example, /a[a-zA-Z]{2,}z/
would match any word starting with "a", ending in "z", and with at least 2
letters between them.
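The sketch below runs the four patterns from this list over a handful of
made-up words. One assumption to flag: I have added ^ and $ around each
pattern so that the whole word has to fit, rather than just some part of
it.

```perl
use strict;
use warnings;

# Made-up test words for illustration.
my @words = ("bb", "bob", "aa", "aba", "abbba", "az", "abz", "abcdez", "abcdefz");

my (@star, @plus, @range, @at_least);
foreach my $word (@words) {
    push @star,     $word if $word =~ /^b.*b$/;             # anything (or nothing) between b's
    push @plus,     $word if $word =~ /^ab+a$/;             # one or more b's between a's
    push @range,    $word if $word =~ /^a[a-zA-Z]{1,4}z$/;  # 1 to 4 letters between a and z
    push @at_least, $word if $word =~ /^a[a-zA-Z]{2,}z$/;   # 2 or more letters between a and z
}

print "b.*b            matched: @star\n";
print "ab+a            matched: @plus\n";
print "a[a-zA-Z]{1,4}z matched: @range\n";
print "a[a-zA-Z]{2,}z  matched: @at_least\n";
```

Notice that "aa" fails /^ab+a$/ because + needs at least one b, and "az"
fails both curly-brace patterns because there are no letters between the a
and the z.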
Grouping
You may want to group patterns together and have them repeat. We do that
with parentheses. Using parentheses also lets you retrieve the text that
matched the pattern in Perl.
You can make a sub-pattern by enclosing it in parentheses. For example,
say we are looking for an email address - we know that is something @
something dot something. You can have lots of "dot something"s at the end,
though. My address - golbeck@cs.umd.edu - has an extra dot something, for
example. So we want the pattern to match as long as there is at least one
"dot something", but also allow many more. (Note: Matching email addresses
can be very complicated, so for now we're just going to match on
non-spaces. If you were really doing this, you would want to give a very
precise list of characters.)
/\S+\@\S+\.\S+/
This will match a series of non-spaces, followed by an @, followed by more
non-spaces, a dot, and more non-spaces.
Now, to allow for many dot-somethings, we're going to first put the last
dot-something in parentheses:
/\S+\@\S+(\.\S+)/
By themselves, those parentheses don't change what will match. However, we
can now say how many times that whole sub-pattern may occur. In this case,
we want at least one dot-something, but we will allow for many of them.
Thus, we want the + character:
/\S+\@\S+(\.\S+)+/
It's worth noting how the parentheses are different from the square
brackets. With square brackets, we match
on any character in the brackets. With parentheses, we must match the
exact pattern they contain.
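As a quick check, here is a sketch that runs the grouped email pattern over
a few strings. Only the first address comes from the text above; the other
strings are made up for illustration. As already noted, this is a loose
pattern and not a real email validator.

```perl
use strict;
use warnings;

# Single quotes keep Perl from treating the @ as the start of an array name.
my @candidates = (
    'golbeck@cs.umd.edu',
    'someone@example.com',
    'no-at-sign.example.com',
    'missing@dot',
);

my @looks_like_email;
foreach my $text (@candidates) {
    if ($text =~ /\S+\@\S+(\.\S+)+/) {
        push @looks_like_email, $text;
        print "$text looks like an email address\n";
    } else {
        print "$text does not look like an email address\n";
    }
}
```

The last two strings fail for different reasons: one has no @ at all, and
the other has nothing that fits "dot something" after the @.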
There is an extra benefit of using the parentheses in Perl. We can use
them to see the text that matched the pattern in parentheses. Perl assigns
this matching text to special variables: $1 holds the text matched by the
first set of parentheses, $2 holds the text matched by the second, and so
on.
So say we have a list of all the words in the English language, and we
want to find a word that has all the vowels in order. Our pattern would be
/[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*/ - that is
a bunch of non-vowels, then an a, then more non-vowels, then an e, and so
on. We can check to see if our current word matches that pattern:
if ($word =~ /[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*/)
But if we want to know what the word is, we need a way of getting the text
that matches. We can put the pattern
inside parentheses. These parentheses go inside the slashes that contain
the pattern. You will see them added in
the example below. Once we do that, we can use the $1 variable to show us
the text that matches the pattern:
if ($word =~ /([^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*)/) {
    print "$1\n";
}
(Trivia - many words match this pattern: abstemious (moderate or spare),
abstentious (self-denying), acheilous (without a lip), adventious (rare
form of adventitious), aerious (airy), affectious (obs., loving),
alpestrious (obs., mountainous), anemious (growing in windy conditions),
annelidous (pertaining to a worm), arsenious (containing arsenic),
arterious (arterial), arteriosus (prolongation of right ventricle in
mammals), caesious (bluish-gray), fracedinous (producing heat through
putrefaction), gareisoun (obs., garrison), gravedinous (obs., drowsy),
majestious (rare, majesty), materious (obs., material), parecious
(proximity of reproductive organs in certain mosses), placentious (obs.,
complaisant), and tragedious (calamitous).)
We can also search for a few things in a row. For example, say we want to
look for sentences that have two
words that start and end with an a. Our pattern would be
/ a[a-zA-Z]+a .* a[a-zA-Z]+a /
We can then put the two patterns that represent the words in parentheses:
/ (a[a-zA-Z]+a) .* (a[a-zA-Z]+a) /
Then, we can use perl to get both words. $1 will get the first word, and
$2 will get the second one:
if ($line =~ / (a[a-zA-Z]+a) .* (a[a-zA-Z]+a) /) {
print "The first word is $1\n";
print "The second word is $2\n";
}
This gets more complicated with nested parentheses, and I try to
avoid that, but I've included an example above to show you how it works if
you're really curious.
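Here is a small sketch of how nested parentheses are numbered. The groups
are counted by where each opening parenthesis appears, reading left to
right, so the outer group gets $1 and the groups inside it get $2, $3, and
so on. The sample line and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Made-up line for illustration.
my $line = "Sent on 2001-10-23 at noon";

my ($date, $year, $month, $day);
# The outer parentheses open first, so they are $1 (the whole date).
# The three inner sets are $2, $3, and $4, in the order they open.
if ($line =~ /((\d\d\d\d)-(\d\d)-(\d\d))/) {
    ($date, $year, $month, $day) = ($1, $2, $3, $4);
    print "Whole date: $date\n";
    print "Year:       $year\n";
    print "Month:      $month\n";
    print "Day:        $day\n";
}
```

Copying $1 through $4 into named variables right away is a habit worth
picking up, since the next successful match will overwrite them.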
Examples
Recall our hero saving the day with regular expressions. Now, you will be
the hero. We have a huge collection of Ken Lay's emails. Let's save the
day by searching for something formatted as an address. This won't be
perfect, but it will do pretty well. Generally, addresses start with a set
of numbers and are followed by some words (e.g. 123 Main St.). We can
start building this pattern like this:
/\d+ [a-zA-Z0-9\-,\. ]+/
That will look for at least one digit followed by a space, followed by
some combination of letters, numbers,
dashes, commas, periods, and spaces. Now this will pick up anything like
that including if someone says "I
had 2 beers with dinner last night". This pattern would match "2 beers
with dinner last night" as an address.
We can improve it a bit by looking for lines where the whole thing is an
address (like you'd see in an email signature). So let's add some code to
say that the line should only contain this address, but allow for some
optional white space before and after in case the address is indented.
Notice we have added ^\s* to the beginning and \s*$ to the end:
/^\s*\d+ [a-zA-Z0-9\-,\. ]+\s*$/
Now that we have this pattern, we want to search each line in the file. We
will add parentheses around the
pattern to allow us to see the part that matched.
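#!/usr/bin/perl
# Print every line of the file that looks like a street address.
open (FILE, "KenLayEmail.txt") or die "Cannot open KenLayEmail.txt: $!\n";
while (<FILE>) {
    # $_ holds the current line; $1 is the text inside the parentheses
    if ($_ =~ /^(\s*\d+ [a-zA-Z0-9\-,\. ]+\s*)$/ ){
        print "$1\n";
    }
}
close FILE;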
That's it. If you run this, you will see that there are lots of addresses
that come up. Some other stuff will
also appear, and you could potentially refine the pattern to improve the
precision and recall.
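If you don't have the email file handy, this sketch shows the difference
the ^ and $ make, using three made-up lines instead of the file. The
variable names are my own.

```perl
use strict;
use warnings;

# Made-up lines for illustration; the third one is the false alarm from the text.
my @lines = (
    "123 Main St.",
    "   4500 Oak Ave, Suite 2  ",
    "I had 2 beers with dinner last night",
);

my (@unanchored, @anchored);
foreach my $line (@lines) {
    # First version: matches an address-like run anywhere in the line
    push @unanchored, $line if $line =~ /\d+ [a-zA-Z0-9\-,\. ]+/;
    # Second version: the whole line must be the address (plus optional white space)
    push @anchored,   $line if $line =~ /^\s*\d+ [a-zA-Z0-9\-,\. ]+\s*$/;
}

print "Unanchored pattern matched ", scalar(@unanchored), " of ", scalar(@lines), " lines\n";
foreach my $line (@anchored) {
    print "Anchored match: \"$line\"\n";
}
```

The unanchored pattern accepts all three lines, while the anchored one
rejects the "beers" sentence because the line does not start with the
digits.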
Let's do one more example: searching for phone numbers. Say we're sticking
with US numbers. They have an area
code, prefix, and extension. We'll say they can be (xxx) xxx-xxxx,
xxx-xxx-xxxx, xxx.xxx.xxxx, xxx xxx xxxx, or
some combination with dashes, spaces, and periods between the numbers. We
are going to build this two ways.
First, we'll build a pattern for numbers that have the area code in
parentheses. That pattern would be
/\(\d\d\d\) \d\d\d[\.\- ]\d\d\d\d/
This says we start with three digits in parentheses, followed by a space,
followed by three digits, a dot, dash, or space, and four more digits. The
backslashes in \( and \) tell Perl we mean actual parentheses in the text,
not a group. The other possibility would be for a pattern without the
parentheses. That would be
/\d\d\d[\.\- ]\d\d\d[\.\- ]\d\d\d\d/
Notice this is similar to the previous one, but there are no parentheses,
and we've added a dot, dash, or space between the area code and prefix.
So we have two possible patterns and we want to match either one. We can
combine them to say "match one or the other". To do that, we put a
vertical bar between them (on most keyboards, the | is Shift plus the \
key). This bar stands for "or". (Note to those of you who program in other
languages - do NOT use || - that is for conditions, not regular
expressions.) So the combined pattern is
/\(\d\d\d\) \d\d\d[\.\- ]\d\d\d\d|\d\d\d[\.\- ]\d\d\d[\.\- ]\d\d\d\d/
Look closely and you will see the | in between them. Now we have a crazy
looking pattern. Again, we will put it in parentheses in the code so we
can print the result:
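#!/usr/bin/perl
# Print the first phone number found on each line of the file.
open (FILE, "KenLayEmail.txt") or die "Cannot open KenLayEmail.txt: $!\n";
while (<FILE>) {
    # The outer parentheses capture whichever alternative matched into $1
    if ($_ =~ /(\(\d\d\d\) \d\d\d[\.\- ]\d\d\d\d|\d\d\d[\.\- ]\d\d\d[\.\- ]\d\d\d\d)/) {
        print "$1\n";
    }
}
close FILE;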
You will see this works beautifully in the code, returning lots and lots
of phone numbers for us. On some systems,
I could even teach you to have your perl script automatically call those
numbers :).
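You can also try the combined pattern on a few strings directly. In this
sketch the sample strings are made up (the 555-01xx numbers are fictional),
and I have written \d{3} and \d{4} in place of the repeated \d's. A single
number in curly braces means "exactly this many times", which is a
shorthand the notes above don't cover, so treat it as an optional variation
on the same pattern.

```perl
use strict;
use warnings;

# Made-up sample strings for illustration.
my @samples = (
    "(301) 555-0142",
    "301-555-0142",
    "301.555.0142",
    "Call 301 555 0142 after lunch",
    "555-0142",
);

my @found;
foreach my $text (@samples) {
    # Same two alternatives as above, joined by |, with {3} and {4} counts
    if ($text =~ /(\(\d{3}\) \d{3}[\.\- ]\d{4}|\d{3}[\.\- ]\d{3}[\.\- ]\d{4})/) {
        push @found, $1;
        print "Found: $1\n";
    } else {
        print "No phone number in: $text\n";
    }
}
```

The last string has no area code, so neither alternative matches it.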
Substitution
You can also use patterns to substitute a pattern with something else. The
general syntax for this is:
$string =~ s/pattern to substitute/thing to substitute it with/;
We always start with the variable storing the string and use the =~
operator. Then comes an "s" to indicate we are substituting, followed by
two parts separated by three / marks. The pattern for the thing to replace
goes between the first two slashes. The thing to replace it with goes
between the second and third slashes.
For example, say we want to replace every number 1 with the word "one" in
the string "That's 1 small step for a man, 1 giant leap for mankind."
We can start by doing this:
$string = "That's 1 small step for a man, 1 giant leap for mankind.";
$string =~ s/1/one/;
If you try that code, you will see it replaces the first 1, but not the
second. This is because the substitution only replaces the first instance.
If we want to replace them all, we add a "g" to the end to indicate
global replacement.
$string = "That's 1 small step for a man, 1 giant leap for mankind.";
$string =~ s/1/one/g;
There is another letter we can put with the g - an i. The i indicates that
we should ignore the case, making our substitution case insensitive. You
can use just the g, just the i, or both together.
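To see what each letter does, here is a sketch that runs the same
substitution four ways on one made-up string. Each version starts from a
fresh copy, because s/// changes the variable it is applied to.

```perl
use strict;
use warnings;

# Made-up string for illustration.
my $original = "Dog, dog, dog and DOG";

my $first = $original;
$first =~ s/dog/cat/;      # no letters: first lower case "dog" only

my $all = $original;
$all =~ s/dog/cat/g;       # g: every lower case "dog"

my $nocase = $original;
$nocase =~ s/dog/cat/i;    # i: first "dog" in any case, so "Dog" is replaced

my $both = $original;
$both =~ s/dog/cat/gi;     # g and i: every "dog" in any case

print "none: $first\n";
print "g:    $all\n";
print "i:    $nocase\n";
print "gi:   $both\n";
```

Note that the replacement text is used exactly as written, so "DOG" becomes
"cat", not "CAT".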
We use both substitution and hashes in the example code which implements
the ADFGX cipher. You can find more details about this in the exercises
for this week.