Homework 4

In 2015, Jeb Bush released all the email he received as governor. The release contained lots of personal information, including social security numbers, street addresses, and more.

Oops.

You're an investigative reporter who knows Perl! What regular expressions would you use to find:

  • Social security numbers (they take the form xxx-xx-xxxx)
  • Email addresses (this doesn't have to be perfect. Don't copy one online - make your own)
  • Street addresses

Copy this text into the text box at the Regex Tester and see what your patterns find. Put each pattern in parentheses so that the list of matches shows up on the right side of the tool. Add the "g" option after the last / to see all matches in the text. It's also fine to use the "i" option to make your patterns case insensitive.
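
The parentheses, "g", and "i" work the same way in a Perl script as they do in the tester. Here is a minimal sketch, assuming a made-up sample string and a simple phone-number pattern (both are illustrations only, not part of the assignment data):

```perl
use strict;
use warnings;

# Hypothetical sample text, not taken from the real emails.
my $text = "Call 555-1234 or 555-9876. ext. 12 or EXT. 34";

# Parentheses capture the part of each match we want back.
# With "g", a match in list context returns every captured piece.
my @phones = $text =~ /(\d{3}-\d{4})/g;

# With "i", case is ignored, so "ext." and "EXT." both match.
my @extensions = $text =~ /ext\.\s*(\d+)/gi;

print "phone: $_\n" for @phones;
print "extension: $_\n" for @extensions;
```

Without the "g", each match would stop after the first hit, which is why the tester shows only one result until you add it.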

Don't worry - your patterns won't be perfect. They will find incorrect things and miss correct ones. Just make them as good as you can.

For the exercises, work from a local copy of the file, saved under the same name as the file online.

Exercise 1

We searched through Ken Lay's emails for addresses and phone numbers. Now search them for email addresses. You can start with a simple pattern of non-space characters, but refine it as you see things that don't make sense. As you refine, be sure to allow for dots, dashes, and underscores but not other punctuation. Print out the addresses as you find them.
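
To show what "refining" can look like, here is a rough sketch that runs a loose pattern and a tighter one over the same lines. The sample lines and the variable names (@loose, @tight) are invented for illustration; in the exercise the lines would come from your local file, and your own pattern will probably end up different.

```perl
use strict;
use warnings;

# Hypothetical sample lines; in the exercise these would be read from the file.
my @lines = (
    'Contact <ken.lay@example.com>, or jane_doe@example.org.',
    'No address on this line.',
);

my (@loose, @tight);
for my $line (@lines) {
    # First attempt: any run of non-space characters around an @.
    push @loose, $line =~ /(\S+\@\S+)/g;

    # Refinement: only letters, digits, underscores (\w), dots, and dashes.
    push @tight, $line =~ /([\w.-]+\@[\w.-]+)/g;
}

print "loose: $_\n" for @loose;
print "tight: $_\n" for @tight;
```

The loose pattern drags in the angle brackets and the comma. The tighter one drops them but still keeps the period that ends the sentence, which is the kind of thing to keep refining.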

Exercise 2

Going through the Ken Lay emails, you will see that each message starts with a To: and a From:. Build a list like the one I supplied last week where, for each email, you print a line that lists the sender and recipient:

  bob@example.com,joe@example.com
  alice@example.com,eve@example.com
  joe@example.com,eve@example.com

For now, if there are multiple recipients, just use the first one. However, it is good to think about how you would do this if you had to print a line for all recipients.

Hint: The To: always comes after the From:. You will have to add a variable to keep track of whether you have already seen the From: of the message and are looking for the To:, or if you have seen a To: and are now looking for a From: again.
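
One possible shape for that bookkeeping is sketched below. It assumes each header sits on its own line beginning with From: or To:, and it uses made-up sample lines; the real file's layout may differ. The variable $sender is a hypothetical name that doubles as the "have I seen a From: yet?" marker.

```perl
use strict;
use warnings;

# Hypothetical sample messages, not taken from the real file.
my @lines = (
    'From: bob@example.com',
    'To: joe@example.com, eve@example.com',
    'Subject: lunch',
    'From: alice@example.com',
    'To: eve@example.com',
);

my $sender;   # defined only while we are waiting for a To: line
my @pairs;

for my $line (@lines) {
    if (!defined $sender) {
        # Looking for the start of a message.
        if ($line =~ /^From:\s*(\S+)/) {
            $sender = $1;
        }
    }
    elsif ($line =~ /^To:\s*([^\s,]+)/) {
        # First recipient only: stop at whitespace or a comma.
        push @pairs, "$sender,$1";
        undef $sender;   # go back to looking for a From:
    }
}

print "$_\n" for @pairs;
```

To handle every recipient instead of just the first, one option would be to pull all the addresses out of the To: line with a "g" match and print one line per address.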