CMSC 330, Summer 2012

Organization of Programming Languages

Project 1: Ruby Text Processing

Due Wednesday, June 6, 2012
11:59pm (23:59 EDT)

5/30 20:35 EDT: Some clarifications have been posted in the "Notes" section below

Introduction

When you surf the World Wide Web, you use a web browser, such as Firefox, to get web pages from a web server, such as Apache. For example, if you click the first link above, which has the URL http://www.mozilla.com/firefox, then your web browser will connect to the web server running on the machine www.mozilla.com and ask it for the web page firefox.

There's a lot more we could say about web browsers and web servers, but for this project, there's only one other thing you need to know: Most web servers are configured to log all the requests they get to a file. For example, the following is a line from the CS department web log that resulted from a request for the CMSC 330 main web page:

209.17.153.170 - - [03/Aug/2007:11:34:36 -0400] "GET /class/fall2007/cmsc330/ HTTP/1.1" 200 2488

From left to right, this log entry means: the request came from IP address 209.17.153.170; the date was August 3, 2007, at the time shown; the request was for the page /class/fall2007/cmsc330/, and the web browser understands HTTP version 1.1; the request was successful (status code 200); and 2488 bytes were sent from the web server to the web browser.
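The breakdown above can be sketched in Ruby by matching the sample line against a regular expression with named groups. This is only an illustration: the constant names and group names are hypothetical, and the pattern is deliberately loose (it does no range checking and is not the validation rule described later).

```ruby
# The sample entry from the CS department web log, copied from above.
LINE = '209.17.153.170 - - [03/Aug/2007:11:34:36 -0400] ' \
       '"GET /class/fall2007/cmsc330/ HTTP/1.1" 200 2488'

# Hypothetical, loose pattern: one named group per field, single spaces between.
ENTRY = /\A(?<ip>\S+) (?<ident>\S+) (?<user>\S+) \[(?<date>[^\]]+)\] (?<request>".*") (?<status>\d+) (?<bytes>\d+|-)\z/

m = ENTRY.match(LINE)
puts m[:ip]       # 209.17.153.170
puts m[:date]     # 03/Aug/2007:11:34:36 -0400
puts m[:request]  # "GET /class/fall2007/cmsc330/ HTTP/1.1"
puts m[:status]   # 200
puts m[:bytes]    # 2488
```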

As you can imagine, there are many reasons for the CS department, and anyone else who runs a web server, to maintain web logs. For example, we might want to know which web pages are the most popular on our web site, or what time of day our web server gets the most hits. In this project, you will write a Ruby program that parses web logs of the form shown above and reports various summary information.

What to Submit

You should submit a file weblog.rb containing your solution. You may submit other files, but they will be ignored during grading. We will run your solution by invoking

ruby weblog.rb <mode> <log-file-name>

where <mode> describes what the tool should do (see below), and <log-file-name> names the file containing the web log.
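One possible way to structure the script is a small table that maps each mode name to a handler. The sketch below assumes the four mode names described later in this document; the handler lambdas and the dispatch method are hypothetical placeholders, and the behavior for an unknown mode is an assumption, since the project description does not specify it.

```ruby
# Hypothetical handlers: each lambda stands in for a real mode implementation.
MODES = {
  "validate"   => ->(path) { "validate #{path}" },
  "bytes"      => ->(path) { "bytes #{path}" },
  "time"       => ->(path) { "time #{path}" },
  "popularity" => ->(path) { "popularity #{path}" }
}

# Looks up the handler for argv[0] and calls it with argv[1].
# Returns nil for an unknown mode or a missing file name (an assumption;
# the project description does not cover that case).
def dispatch(argv)
  mode, path = argv
  handler = MODES[mode]
  return nil if handler.nil? || path.nil?
  handler.call(path)
end

puts dispatch(["bytes", "sample.log"])
```

In a real script the call would be dispatch(ARGV), since Ruby places the command-line arguments in the ARGV array.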

Be sure to follow the project description below exactly. Your solution will be graded automatically, and so any deviation from the specification will result in losing points. In particular, if you have any debugging output in your program, be sure to turn it off before you submit your program.

You can access the project starter files at:

http://www.cs.umd.edu/~lam/cmsc330/summer2012/p1/p1.tar.gz

Copy this file to your home directory and unpack it with

tar -xzf p1.tar.gz

This should create a p1 directory that contains the starter file weblog.rb along with the public test files.

If you do your development inside the p1 directory, you can submit all files in that directory and below with the command

java -jar submit.jar

This command runs a Java program that looks in the current directory for a .submit file to find the project number, and then uploads all files within that directory to the submit server.

Alternately, you can use the web-based submit server interface at:

https://submit.cs.umd.edu/

Part 1: Validating log files

The first part of this project is to write a Ruby script that validates that an input file is in fact a web log and not, e.g., Aunt Bertha's secret apple pie recipe. We will select this task by passing the mode validate to your Ruby script. In particular, to test your solution to part 1, we will invoke your program with

ruby weblog.rb validate <log-file-name>

In response, your script should output exactly one line of text and then exit. That line should either contain the three letters yes followed by a newline if the log file is valid, or the two letters no followed by a newline otherwise.
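The yes/no decision can be sketched as a loop that reads one line at a time and stops at the first bad line. In this sketch the per-line check is passed in as a block and is only a stand-in; the method name verdict is hypothetical, and StringIO is used in place of a real file so the example is self-contained.

```ruby
require 'stringio'

# Returns "yes" if every line ends in a newline and passes the caller's check,
# and "no" otherwise. An input with zero lines yields "yes".
def verdict(io)
  io.each_line do |line|
    return "no" unless line.end_with?("\n")
    return "no" unless yield(line[0...-1])  # pass the line without its newline
  end
  "yes"
end

# Stand-in check for illustration only: the line must contain a space.
puts verdict(StringIO.new("a b\nc d\n")) { |l| l.include?(" ") }  # yes
puts verdict(StringIO.new("a b\nc d"))   { |l| l.include?(" ") }  # no (last line lacks a newline)
puts verdict(StringIO.new(""))           { |l| false }            # yes (zero lines)
```

With a real log file, File.open(path) { |f| verdict(f) { |l| ... } } would supply the lines in the same way.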

A valid log file contains zero or more lines of text, each of which ends in a newline. (Hint: There are Ruby methods to read in a single line of text at a time.) Each line must contain the following fields from left-to-right, with each field separated from the previous field with a single space. The left-most field has nothing in front of it, and the right-most field is followed only by the newline that ends the line:

  • The first field is a numeric IP address. The address contains four numbers in the range 0-255 separated by a period. (Note that by separated we mean that there are three periods total; if the period were a terminator, then there would be four periods.)
  • Next is a hyphen - (this could in theory be a different symbol, but it never happens). For the purposes of this project, you only need to verify that the hyphen is present.
  • Next is the name of the user requesting the page, which may contain any alphabetic characters (upper or lowercase), numbers, and underscore. The user name may be - if no user has been determined.
  • Next is the date the web page was requested. Here is the format for the date as described in the Apache documentation:
        [day/month/year:hour:minute:second zone]
        day = 2*digit
        month = 3*letter
        year = 4*digit
        hour = 2*digit
        minute = 2*digit
        second = 2*digit
        zone = (`+' | `-') 4*digit
    
    For this project, we will require a slightly stricter set of strings: The day must be in the range 01-31 (regardless of the month); the month must be in the range Jan-Dec; the hour must be in the range 00-23, and minutes and seconds must be in the range 00-59; and the zone (which indicates the time zone) will always be -0400.
  • Next is the request itself, which is an arbitrary string that begins and ends with double quotes. Inside of the string any occurrence of double quote must be escaped by being prefixed with backslash.
  • Next is the status code, a non-negative integer
  • Last is the number of bytes sent, either a non-negative integer or - if nothing was sent to the requester.

Any file whose lines do not conform to these rules is invalid. Extra whitespace between any fields (or at the beginning or end of the line) is not valid.

Hint: It might be easier to write a regular expression that accepts a line in a more general format, and then check after it's been split into pieces that the various range constraints have been satisfied. Use Rubular to help test your regular expressions.
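The two-step approach in the hint might look like the following sketch: a general pattern first, then range checks on the captured pieces. All names here are hypothetical. Two points are assumptions rather than part of the specification: month names are taken to be capitalized exactly as Jan through Dec, and a backslash inside the request is treated as escaping whatever character follows it.

```ruby
MONTHS = %w[Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec]

# General shape of a line, built from three pieces; ranges are checked afterwards.
HEAD  = /(?<ip>\d+\.\d+\.\d+\.\d+) - (?<user>[A-Za-z0-9_]+|-)/
DATE  = %r{\[(?<day>\d\d)/(?<mon>[A-Za-z]{3})/\d{4}:(?<hour>\d\d):(?<min>\d\d):(?<sec>\d\d) -0400\]}
TAIL  = /"(?:[^"\\]|\\.)*" \d+ (?:\d+|-)/
LOOSE = /\A#{HEAD} #{DATE} #{TAIL}\z/

# Expects a line with its trailing newline already removed.
def valid_line?(line)
  m = LOOSE.match(line)
  return false unless m
  m[:ip].split(".").all? { |n| n.to_i <= 255 } &&
    m[:day].to_i.between?(1, 31) &&
    MONTHS.include?(m[:mon]) &&
    m[:hour].to_i <= 23 && m[:min].to_i <= 59 && m[:sec].to_i <= 59
end

SAMPLE = '209.17.153.170 - - [03/Aug/2007:11:34:36 -0400] ' \
         '"GET /class/fall2007/cmsc330/ HTTP/1.1" 200 2488'

puts valid_line?(SAMPLE)                      # true
puts valid_line?(SAMPLE.sub("209", "256"))    # false (octet out of range)
puts valid_line?(SAMPLE.sub(" 200", "  200")) # false (extra space)
```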

Part 2: Gathering statistics

The next part of the project is to add additional modes to your Ruby script that summarize various aspects of the web log. You will write three new modes, as described below. You may assume that we will only use these new modes on log files that are valid.

Mode: bytes

In this mode, you should output the total number of bytes sent by the web server across all log entries. This size should be reported in the largest appropriate unit (bytes, KB, MB, GB) and truncated to an integer. Remember that 1024 bytes = 1 KB, and 1024 KB = 1 MB, etc. So if the total size is 1337 bytes, your program should output "1 KB". If the total size is 42000000 bytes, your program should output "40 MB". Any size larger than 1 GB should be reported in GB. There should be a single space between the number and the unit.

The output should be a single line of text containing this number with units as described above, terminated by a newline. For example, a sample run of your script might look like:

% ruby weblog.rb bytes sample.log
238 KB

Hint: Remember that the web server uses - to indicate no bytes sent by the web server.
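The unit conversion can be sketched with repeated integer division, which truncates at each step and so gives the same result as truncating once at the end. The method names below are hypothetical, and the list of byte fields is made-up illustration data rather than output from a real log.

```ruby
UNITS = ["bytes", "KB", "MB", "GB"]

# Adds up the last field of each entry, counting "-" as zero bytes.
def total_bytes(fields)
  fields.inject(0) { |sum, f| sum + (f == "-" ? 0 : f.to_i) }
end

# Divides by 1024 until the value is below 1024 or the unit is GB.
def human_size(total)
  unit = 0
  while total >= 1024 && unit < UNITS.length - 1
    total /= 1024  # integer division drops the fractional part
    unit += 1
  end
  "#{total} #{UNITS[unit]}"
end

puts human_size(1337)                             # 1 KB
puts human_size(42000000)                         # 40 MB
puts human_size(total_bytes(["12", "-", "0"]))    # 12 bytes
```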

Mode: time

In this mode, you should produce a histogram indicating the total number of requests that were served in each possible hour of the day, totaled across all requests on all days. You should ignore the time zone part of the log entry (recall that they're always -0400 for this project). Your output should consist of 24 lines listing the two-digit hour, followed by space, followed by the number of requests (which may be zero), followed by a newline. For example, a sample run of your script might look like:

% ruby weblog.rb time sample.log
00 8
01 3
02 0
...
23 5

This output means that 8 requests were served between 00:00:00 and 00:59:59, inclusive, in time zone -0400; 3 requests were served between 01:00:00 and 01:59:59, inclusive; and so on.
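A histogram like this can be sketched with a 24-element array of counters indexed by hour. The method name and the three sample lines are hypothetical; the sketch assumes the lines are already known to be valid, as the mode description allows.

```ruby
# Returns 24 strings of the form "HH count", one per hour of the day.
def hourly_histogram(lines)
  counts = Array.new(24, 0)
  lines.each do |line|
    # Capture the two-digit hour that follows the [day/month/year: prefix.
    hour = line[/\[\d\d\/[A-Za-z]{3}\/\d{4}:(\d\d):/, 1]
    counts[hour.to_i] += 1 if hour
  end
  counts.each_with_index.map { |n, h| format("%02d %d", h, n) }
end

# Made-up entries for illustration only.
lines = [
  '1.2.3.4 - - [03/Aug/2007:11:34:36 -0400] "GET /a HTTP/1.1" 200 10',
  '1.2.3.4 - - [04/Aug/2007:11:00:00 -0400] "GET /b HTTP/1.1" 200 -',
  '1.2.3.4 - - [04/Aug/2007:00:59:59 -0400] "GET /a HTTP/1.1" 404 0'
]
puts hourly_histogram(lines)
```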

Mode: popularity

In this mode, you should produce a list containing the top-ten most common request strings (not web pages) received by the web server, across all entries in the log file. The output should contain at most ten lines (Hint: You need to handle the case where there are fewer than ten different requests), one line per popular request. Each line should contain the total number of times the request was received, followed by a space, followed by the request string, which should still include quotes. The lines should be sorted from most popular to least popular. For example, a sample run of your script might look like:

% ruby weblog.rb popularity access-log
5 "GET /hcil/_includes/publications-2-col.html HTTP/1.0"
3 "GET /class/fall2005/cmsc412/ HTTP/1.1"
2 "GET /class/fall2005/cmsc417/ HTTP/1.1"
...
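Counting request strings fits naturally into a Hash with a default value of zero. The sketch below uses a hypothetical method name and made-up log lines, assumes valid input, and relies on the request being the first quoted string on each line. The order of tied entries is unspecified here, since sort_by is not a stable sort.

```ruby
# Returns up to `limit` strings of the form "count request", most popular first.
def top_requests(lines, limit = 10)
  counts = Hash.new(0)
  lines.each do |line|
    # First double-quoted string on the line; \" inside it is allowed.
    request = line[/"(?:[^"\\]|\\.)*"/]
    counts[request] += 1 if request
  end
  counts.sort_by { |_req, n| -n }.first(limit).map { |req, n| "#{n} #{req}" }
end

# Made-up entries for illustration only.
lines = [
  '1.2.3.4 - - [03/Aug/2007:11:34:36 -0400] "GET /a HTTP/1.1" 200 10',
  '1.2.3.4 - - [03/Aug/2007:11:35:00 -0400] "GET /b HTTP/1.1" 200 20',
  '5.6.7.8 - - [03/Aug/2007:12:00:00 -0400] "GET /a HTTP/1.1" 200 10'
]
puts top_requests(lines)
# 2 "GET /a HTTP/1.1"
# 1 "GET /b HTTP/1.1"
```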

Hints and Tips

  • This project is non-trivial, in part because you will probably be writing in Ruby for the first time, so be sure to start right away, and come to office hours if you get stuck.
  • Follow good program development practices: Test each part of your program as you develop it. Start developing a simplified solution and then add features as you are sure that earlier parts work. Test early and often, and re-run your tests as you add new features to be sure you didn't break anything.
  • Before you get too far, review the Ruby class reference, and look for classes and methods that might be helpful. For example, the Array and Hash classes will come in handy. Finding the right class might save you a lot of time and make your program easier to develop.
  • If you write methods that should return a true or false value, remember that a Ruby 0 is not false.
  • Ruby has an integrated debugger, which can be invoked by running Ruby with the -rdebug option. The debugger's p command may be helpful for viewing the values of variables and data structures. The var local command prints all of the local variables at the current point of execution. The chapter "When Trouble Strikes" of The Pragmatic Programmer's Guide discusses the debugger in more detail.
  • There are no release tests for this project, so you do not need to be concerned with tokens. Note that the public tests do only very minimal testing of your program--the graded portion of the testing is done using secret tests. To thoroughly debug your program, you will need to construct test cases of your own, based on the project description. If you need help with this, please come to TA office hours.
  • Remember to save your work frequently---a power failure, network failure, or problem with a phone connection could cost many hours of lost work. For the same reason, submit your project often. You can retrieve previously-submitted versions of your program from the submit server should disaster strike.
  • Be sure you have read and understand the project grading policies in the course syllabus. Do this well in advance of the project due date.
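As a short illustration of the hint about true and false values: in Ruby only nil and false are treated as false, so 0 and the empty string both count as true. The helper name truthy? is hypothetical.

```ruby
# Reports how Ruby treats a value in a condition.
def truthy?(value)
  value ? true : false
end

puts truthy?(0)               # true
puts truthy?("")              # true
puts truthy?(nil)             # false
puts truthy?(false)           # false
puts truthy?("abc" =~ /a/)    # true  (=~ returns index 0, which is not false)
puts truthy?("abc" =~ /z/)    # false (=~ returns nil when nothing matches)
```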

Public Test Cases

Public test cases have been posted.

Notes

  • You may assume single-digit numbers in IP addresses (e.g., 192.168.1.5) will not have leading zeros. Numbers in the weblog date entry will have leading zeros if they have fewer than the maximum number of digits for that field (e.g., 25/Jul/2006:01:02:02).
  • Requests in weblog entries may contain quotation marks, if they are escaped with a backslash. In other words, "GET /class/"notes".txt" is an illegal request, but "GET /class/\"notes\".txt" is legal.
  • Both "-" and "0" are valid size entries. Zero is still a non-negative integer. You should handle both cases. Count "-" as zero for the bytes portion of the assignment.
  • You may assume that the logfile exists, its name is passed correctly on the command line, and is valid when in bytes, time, and popularity mode.
  • In bytes mode, the smallest unit is "bytes". Sizes less than 1024 (1 KB) should be reported as # of bytes (e.g., "12 bytes"). For 0, the correct output (both grammatically and for project grading) is "0 bytes", not "0 byte". For 1, the correct output (for project grading) is "1 bytes", not "1 byte" (despite it being grammatically incorrect).
  • For bytes mode, "truncated" means to drop fractional values when converting to an integer. So 9.9999999 => 9
  • You may assume that if there are 2 requests that appear the same number of times, the order you output them in the "popularity" mode is not significant. Or just assume that in project 1 testing, there will be no ties among the top 10 most popular requests for tests of "popularity" mode.

Academic Integrity

The Campus Senate has adopted a policy asking students to include the following statement on each assignment in every course: "I pledge on my honor that I have not given or received any unauthorized assistance on this assignment." Consequently, you are asked to include this pledge in a comment near the top of your program.

Please carefully read the academic honesty section of the course syllabus. Any evidence of impermissible cooperation on projects, use of disallowed materials or resources, or unauthorized use of computer accounts, will be submitted to the Student Honor Council, which could result in an XF for the course, or suspension or expulsion from the University. Be sure you understand what you are and what you are not permitted to do with regard to academic integrity when it comes to project assignments. These policies apply to all students, and the Student Honor Council does not consider lack of knowledge of the policies to be a defense for violating them. Full information is found in the course syllabus---please review it at this time.
