Project 1

Due September 24, 2007
11:59:59pm

UPDATE: Public test cases have been posted [13 Sep 2007]

Introduction

When you surf the World Wide Web, you use a web browser, such as Firefox, to get web pages from a web server, such as Apache. For example, if you click the first link above, which has the URL http://www.mozilla.com/firefox, then your web browser will connect to the web server running on the machine www.mozilla.com and ask it for the web page firefox.

There's a lot more we could say about web browsers and web servers, but for this project, there's only one other thing you need to know: Most web servers are configured to log all the requests they get to a file. For example, the following is a line from the CS department web log that resulted from a request for the CMSC 330 main web page:

209.17.153.170 - - [03/Aug/2007:11:34:36 -0400] "GET /class/fall2007/cmsc330/ HTTP/1.1" 200 2488

From left to right, this log entry means: The request came from IP address 209.17.153.170; the date was August 3, 2007 at the time shown; the request was for the page /class/fall2007/cmsc330/, and the web browser understands HTTP version 1.1; the request was successful (status code 200); and 2488 bytes were sent from the web server to the web browser.
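To make the field layout concrete, here is a minimal Ruby sketch that splits the sample entry into its parts. The names ENTRY and split_entry are hypothetical, and the pattern is deliberately loose: it only illustrates the layout and does not enforce the validation rules given in Part 1.

```ruby
# Illustrative only: ENTRY and split_entry are hypothetical names, and this
# loose pattern does not enforce the stricter rules described in Part 1.
ENTRY = /\A(\S+) (\S+) (\S+) \[([^\]]*)\] (".*") (\d+) (\d+|-)$/

# Returns a hash of fields, or nil if the line does not have the basic shape.
def split_entry(line)
  m = ENTRY.match(line)
  return nil unless m

  ip, _dash, user, date, request, status, bytes = m.captures
  { ip: ip, user: user, date: date, request: request,
    status: status.to_i, bytes: bytes }
end

line = '209.17.153.170 - - [03/Aug/2007:11:34:36 -0400] ' \
       '"GET /class/fall2007/cmsc330/ HTTP/1.1" 200 2488'
entry = split_entry(line)
puts entry[:ip]       # 209.17.153.170
puts entry[:request]  # "GET /class/fall2007/cmsc330/ HTTP/1.1"
puts entry[:status]   # 200
```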

As you can imagine, there are many reasons for the CS department, and anyone else who runs a web server, to maintain web logs. For example, we might want to know which web pages are the most popular on our web site, or what time of day our web server gets the most hits. In this project, you will write a Ruby program that parses web logs of the form shown above and reports various summary information.

What to Submit

You should submit a file weblog.rb containing your solution. You may submit other files, but they will be ignored during grading. We will run your solution by invoking

ruby weblog.rb <mode> <log-file-name>

where <mode> describes what the tool should do (see below), and <log-file-name> names the file containing the web log.
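One possible way to read the two command-line arguments is sketched below. The names MODES and parse_args are hypothetical, and returning nil for unexpected arguments is our assumption, since the project description does not say what to do in that case.

```ruby
# Hypothetical helper: the mode names come from the project description,
# but MODES, parse_args, and the nil-on-error behavior are assumptions.
MODES = %w[validate bytes time popularity].freeze

# Returns [mode, path] when the arguments look usable, nil otherwise.
def parse_args(argv)
  mode, path = argv
  return nil unless argv.length == 2 && MODES.include?(mode)

  [mode, path]
end

# In weblog.rb this would be called as parse_args(ARGV).
p parse_args(%w[bytes sample.log])  # ["bytes", "sample.log"]
p parse_args(%w[bogus sample.log])  # nil
```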

Be sure to follow the project description below exactly. Your solution will be graded automatically, and so any deviation from the specification will result in losing points. In particular, if you have any debugging output in your program, be sure to turn it off before you submit your program.

You can access the project starter files at:

http://www.cs.umd.edu/~atif/Teaching/Fall2007/project1/p1.tar.gz

Copy this file to your home directory and unpack it with

gtar -xzf p1.tar.gz

This should create a p1 directory that contains the sample log file sample.log, along with a few other project starter files.

If you do your development inside the p1 directory, you can submit all files in that directory and below with the command

java -jar submit.jar

This command runs a Java program that looks in the current directory for a .submit file to find the project number, and then uploads all files within that directory to the submit server.

Alternatively, you can use the web-based submit server interface at:

https://submit.cs.umd.edu/

Part 1: Validating log files

The first part of this project is to write a Ruby script that validates that an input file is in fact a web log and not, e.g., Aunt Bertha's secret apple pie recipe. We will select this task by passing the mode validate to your Ruby script. In particular, to test your solution to part 1, we will invoke your program with

ruby weblog.rb validate <log-file-name>

In response, your script should output exactly one line of text and then exit. That line should either contain the three letters yes followed by a newline if the log file is valid, or the two letters no followed by a newline otherwise.

A valid log file contains zero or more lines of text, each of which ends in a newline. (Hint: There are Ruby methods to read in a single line of text at a time.) Each line must contain the following fields from left-to-right, with each field separated from the previous field with a single space. The left-most field has nothing in front of it, and the right-most field is followed only by the newline that ends the line:

The first field is a numeric IP address. The address contains four numbers in the range 0-255 separated by a period. (Note that by separated we mean that there are three periods total; if the period were a terminator, then there would be four periods.)
Next is a hyphen - (this could in theory be a different symbol, but it never happens). For the purposes of this project, you only need to verify that the hyphen is present.
Next is the name of the user requesting the page, which may contain any alphabetic characters (upper or lowercase), numbers, and underscore. The user name may be - if no user has been determined.
Next is the date the web page was requested. Here is the format for the date as described in the apache documentation:
```
    [day/month/year:hour:minute:second zone]
    day = 2*digit
    month = 3*letter
    year = 4*digit
    hour = 2*digit
    minute = 2*digit
    second = 2*digit
    zone = (`+' | `-') 4*digit
```
For this project, we will require a slightly stricter set of strings: The day must be in the range 01-31 (regardless of the month); the month must be in the range Jan-Dec; the hour must be in the range 00-23, and minutes and seconds must be in the range 00-59; and the zone (which indicates the time zone) will always be -0400.
Next is the request itself, which is an arbitrary string that begins and ends with double quotes. Inside of the string any occurrence of double quote must be escaped by being prefixed with backslash.
Next is the status code, a non-negative integer.
Last is the number of bytes sent, either a non-negative integer or - if nothing was sent to the requester.

Any file whose lines do not conform to these rules is invalid. Extra whitespace between any fields (or at the beginning or end of the line) is not valid.

Hint: It might be easier to write a regular expression that accepts a line in a more general format, and then check after it's been split into pieces that the various range constraints have been satisfied.
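The two-step approach in the hint could look roughly like the sketch below: a general pattern fixes the shape of a line, and the range constraints are checked afterwards on the captured pieces. All constant and method names are hypothetical. The sketch also makes assumptions where the rules are open to interpretation: month names must be capitalized exactly as in Jan, leading zeros in IP numbers are accepted, and a backslash inside the request escapes whatever single character follows it.

```ruby
# A sketch under stated assumptions, not a reference solution.
MONTHS = %w[Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec].freeze

IP      = /(\d+)\.(\d+)\.(\d+)\.(\d+)/
USER    = /[A-Za-z0-9_]+|-/
DATE    = /\[(\d{2})\/([A-Za-z]{3})\/\d{4}:(\d{2}):(\d{2}):(\d{2}) -0400\]/
REQUEST = /"(?:[^"\\\n]|\\.)*"/   # assumption: backslash escapes one character
STATUS  = /\d+/
BYTES   = /\d+|-/

# Step 1: the general shape, with exactly one space between fields
# and a newline at the very end.
LINE = /\A#{IP} - #{USER} #{DATE} #{REQUEST} #{STATUS} #{BYTES}\n\z/

# Step 2: range checks on the captured pieces.
def valid_line?(line)
  m = LINE.match(line)
  return false unless m

  octets = m.captures[0, 4].map(&:to_i)
  day, month, hour, minute, second = m.captures[4, 5]
  octets.all? { |o| o <= 255 } &&
    (1..31).cover?(day.to_i) &&
    MONTHS.include?(month) &&
    hour.to_i <= 23 && minute.to_i <= 59 && second.to_i <= 59
end

# A log with zero lines is valid, which all? handles naturally.
def valid_log?(lines)
  lines.all? { |line| valid_line?(line) }
end

good = %(209.17.153.170 - - [03/Aug/2007:11:34:36 -0400] "GET /class/fall2007/cmsc330/ HTTP/1.1" 200 2488\n)
puts valid_line?(good)                      # true
puts valid_line?(good.sub("209.", "256."))  # false
puts(valid_log?([]) ? "yes" : "no")         # yes
```

With a real file, the lines could be supplied one at a time by something like File.foreach(path), which keeps the trailing newline on each line.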

Part 2: Gathering statistics

The next part of the project is to add additional modes to your Ruby script that summarize various aspects of the web log. You will write three new modes, as described below. You may assume that we will only use these new modes on log files that are valid.

Mode: bytes

In this mode, you should output the total number of bytes sent by the web server across all log entries. This size should be reported in the largest appropriate unit (bytes, KB, MB, GB) and truncated (rounded down) to an integer. Remember that 1024 bytes = 1 KB, and 1024 KB = 1 MB, etc. So if the total size is 1337 bytes, your program should output "1 KB". If the total size is 42000000 bytes, your program should output "40 MB". Any size larger than 1 GB should be reported in GB. There should be a single space between the number and the unit.

The output should be a single line of text containing this number with units as described above, terminated by a newline. For example, a sample run of your script might look like:

% ruby weblog.rb bytes sample.log
238 KB

Hint: Remember that the web server uses - to indicate no bytes sent by the web server.
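The unit arithmetic can be sketched as repeated integer division by 1024, stopping at GB. The names UNITS, total_bytes, and format_size are hypothetical, and the sketch assumes the byte fields have already been extracted from each line as strings.

```ruby
# Hypothetical helpers illustrating the arithmetic; not a reference solution.
UNITS = %w[bytes KB MB GB].freeze

# Sums the last field of each entry, counting "-" as zero bytes.
def total_bytes(byte_fields)
  byte_fields.sum { |field| field == "-" ? 0 : field.to_i }
end

# Divides by 1024 until the value is below 1024 or the unit is GB.
# Integer division truncates, which matches the 1337 -> "1 KB" example.
def format_size(total)
  unit = 0
  while total >= 1024 && unit < UNITS.length - 1
    total /= 1024
    unit += 1
  end
  "#{total} #{UNITS[unit]}"
end

puts format_size(1337)                       # 1 KB
puts format_size(42_000_000)                 # 40 MB
puts format_size(total_bytes(%w[2488 - 12])) # 2 KB
```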

Mode: time

In this mode, you should produce a histogram indicating the total number of requests that were served in each possible hour of the day, totaled across all requests on all days. You should ignore the time zone part of the log entry (recall that it is always -0400 for this project). Your output should consist of 24 lines listing the two-digit hour, followed by a space, followed by the number of requests (which may be zero), followed by a newline. For example, a sample run of your script might look like:

% ruby weblog.rb time sample.log
00 8
01 3
02 0
...
23 5

meaning that 8 requests were served between 00:00:00 and 00:59:59 in time zone -0400, inclusive, 3 requests were served between 01:00:00 and 01:59:59, inclusive, and so on.
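One way to build such a histogram is an array of 24 counters indexed by hour, sketched below. The method names are hypothetical, and the sketch assumes each line is already known to be a valid entry, so the first [ on the line starts the date field.

```ruby
# Hypothetical helpers; assumes every line is a valid log entry.

# Returns an array of 24 counts, one per hour of the day.
def hour_histogram(lines)
  counts = Array.new(24, 0)
  lines.each do |line|
    # The hour is the two digits after the first colon inside [...].
    hour = line[/\[[^:\]]*:(\d{2}):/, 1]
    counts[hour.to_i] += 1 if hour
  end
  counts
end

# Formats the counts as "HH N" strings, hours zero-padded to two digits.
def histogram_lines(counts)
  counts.each_with_index.map { |n, hour| format("%02d %d", hour, n) }
end

sample = [
  '1.2.3.4 - - [03/Aug/2007:11:34:36 -0400] "GET /a HTTP/1.1" 200 10',
  '1.2.3.4 - - [04/Aug/2007:00:05:00 -0400] "GET /b HTTP/1.1" 200 20',
  '1.2.3.4 - - [05/Aug/2007:11:59:59 -0400] "GET /a HTTP/1.1" 200 10'
]
puts histogram_lines(hour_histogram(sample))
```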

Mode: popularity

In this mode, you should produce a list containing the top-ten most common request strings (not web pages) received by the web server, across all entries in the log file. The output should contain at most ten lines (Hint: You need to handle the case where there are fewer than ten different requests), one line per popular request. Each line should contain the total number of times the request was received, followed by a space, followed by the request string, which should still include quotes. The lines should be sorted from most popular to least popular. For example, a sample run of your script might look like:

% ruby weblog.rb popularity access-log
5 "GET /hcil/_includes/publications-2-col.html HTTP/1.0"
3 "GET /class/fall2005/cmsc412/ HTTP/1.1"
2 "GET /class/fall2005/cmsc417/ HTTP/1.1"
...
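The counting step maps naturally onto a Hash with a default value of zero, as in the sketch below. The name top_requests is hypothetical, and the sketch assumes valid entries, so the first double quote on a line opens the request. The description does not say how to order requests with equal counts; breaking ties alphabetically by request string is purely our assumption.

```ruby
# Hypothetical helper; tie-breaking by request string is an assumption.
def top_requests(lines, limit = 10)
  counts = Hash.new(0)
  lines.each do |line|
    # Quoted request, allowing backslash-escaped characters inside.
    request = line[/"(?:[^"\\]|\\.)*"/]
    counts[request] += 1 if request
  end
  counts.sort_by { |request, n| [-n, request] }
        .first(limit)
        .map { |request, n| "#{n} #{request}" }
end

sample = [
  '1.2.3.4 - - [03/Aug/2007:11:34:36 -0400] "GET /a HTTP/1.1" 200 10',
  '1.2.3.4 - - [03/Aug/2007:11:35:00 -0400] "GET /b HTTP/1.1" 200 20',
  '5.6.7.8 - - [03/Aug/2007:12:00:00 -0400] "GET /a HTTP/1.1" 200 10'
]
puts top_requests(sample)
# 2 "GET /a HTTP/1.1"
# 1 "GET /b HTTP/1.1"
```

Calling first(limit) on a shorter list simply returns everything, which covers the case of fewer than ten distinct requests.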

Hints and Tips

This project is non-trivial, in part because you will probably be writing in Ruby for the first time, so be sure to start right away, and come to office hours if you get stuck.
Follow good program development practices: Test each part of your program as you develop it. Start developing a simplified solution and then add features as you are sure that earlier parts work. Test early and often, and re-run your tests as you add new features to be sure you didn't break anything.
Before you get too far, review the Ruby class reference, and look for classes and methods that might be helpful. For example, the Array and Hash classes will come in handy. Finding the right class might save you a lot of time and make your program easier to develop.
If you write methods that should return a true or false value, remember that a Ruby 0 is not false.
Ruby has an integrated debugger, which can be invoked by running Ruby with the -rdebug option. The debugger's p command may be helpful for viewing the values of variables and data structures. The var local command prints all of the local variables at the current point of execution. The chapter "When Trouble Strikes" of The Pragmatic Programmer's Guide discusses the debugger in more detail.
There are no release tests for this project, so you do not need to be concerned with tokens. Note that the public tests do only very minimal testing of your program; the graded portion of the testing is done using secret tests. To thoroughly debug your program, you will need to construct test cases of your own, based on the project description. If you need help with this, please come to TA office hours.
Remember to save your work frequently: a power failure, network failure, or problem with a phone connection could cost many hours of lost work. For the same reason, submit your project often. You can retrieve previously-submitted versions of your program from the submit server should disaster strike.
Be sure you have read and understand the project grading policies in the course syllabus. Do this well in advance of the project due date.
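As a quick illustration of the hint above about truth values: in Ruby only false and nil count as false in a condition, so 0 and the empty string are both treated as true. The helper name truthy? is hypothetical.

```ruby
# Hypothetical helper that reports how Ruby treats a value in a condition.
def truthy?(value)
  value ? true : false
end

puts truthy?(0)      # true
puts truthy?("")     # true
puts truthy?(nil)    # false
puts truthy?(false)  # false
```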

Academic Integrity

The Campus Senate has adopted a policy asking students to include the following statement on each assignment in every course: "I pledge on my honor that I have not given or received any unauthorized assistance on this assignment." Consequently, you are requested to include this pledge in a comment near the top of your program.

Please carefully read the academic honesty section of the course syllabus. Any evidence of impermissible cooperation on projects, use of disallowed materials or resources, or unauthorized use of computer accounts will be submitted to the Student Honor Council, which could result in an XF for the course, or suspension or expulsion from the University. Be sure you understand what you are and are not permitted to do with regard to academic integrity when it comes to project assignments. These policies apply to all students, and the Student Honor Council does not consider lack of knowledge of the policies to be a defense for violating them. Full information is found in the course syllabus; please review it at this time.

CMSC 330, Fall 2007

Organization of Programming Languages

We meet in CSI 1115 on Tuesdays and Thursdays