CMSC 330, Spring 2012
Organization of Programming Languages
Project 1 - Processing Weblog Files
When you surf the World Wide Web, you use a web browser, such as Firefox, to get web pages from a web server, such as apache. For example, if you click the first link above, which has the URL http://www.mozilla.com/firefox, then your web browser will connect to the web server running on the machine www.mozilla.org and ask it for the web page firefox.
There's a lot more we could say about web browsers and web servers, but for this project, there's only one other thing you need to know: Most web servers are configured to log all the requests they get to a file. For example, the following is a line from the CS department web log that resulted from a request for the CMSC 330 main web page:
188.8.131.52 - - [03/Aug/2007:11:34:36 -0400] "GET /class/fall2007/cmsc330/ HTTP/1.1" 200 2488From left-to-right, this log entry means: The request came from IP address 184.108.40.206; the date was August 3, 2007 at the time shown; the request was for the page /class/fall2007/cmsc330/, and the web browser understands http version 1.1; the request was successful (status code 200); and 2488 bytes were sent from the web server to the web browser.
As you can imagine, there are many reasons for the CS department, and anyone else who runs a web server, to maintain web logs. For example, we might want to know which web pages are the most popular on our web site, or what time of day our web server gets the most hits. In this project, you will write a Ruby program that parses web logs of the form shown above and reports various summary information.
Getting StartedDownload the following zip archive p1.zip. It should include the following files:
Part 1: Validating log files
The first part of this project is to write a Ruby script that validates that an input file is in fact a web log and not, e.g., Aunt Bertha's secret apple pie recipe. We will select this task by passing the mode validate to your Ruby script. In particular, to test your solution to part 1, we will invoke your program with
In response, your script should output exactly one line of text and then exit. That line should either contain the three letters yes followed by a newline if the log file is valid, or the two letters no followed by a newline otherwise.
A valid log file contains zero or more lines of text, each of which ends in a newline. (Hint: There are Ruby methods to read in a single line of text at a time.) Each line must contain the following fields from left-to-right, with each field separated from the previous field with a single space. The left-most field has nothing in front of it, and the right-most field is followed only by the newline that ends the line:
Any file whose lines do not conform to these rules is invalid. Extra whitespace between any fields (or at the beginning or end of the line) is not valid.
Hint: It might be easier to write a regular expression that accepts a line in a more general format, and then check after it's been split into pieces that the various range constraints have been satisfied.
Part 2: Gathering statistics
The next part of the project is to add additional modes to your Ruby script that summarize various aspects of the web log. You will write three new modes, as described below. You may assume that we will only use these new modes on log files that are valid.
In this mode, you should output the total number of bytes sent by the web server across all log entries. This size should be reported in the largest appropriate unit (bytes, KB, MB, GB) and truncated to an integer (e.g., 7.9 -> 7). Remember that 1024 bytes = 1 KB, and 1024 KB = 1 MB, etc. So if if the total size is 1337 bytes, your program should output "1 KB". If the total size is 42000000 bytes, your program should output "40 MB". Any size larger than 1 GB should be reported in GB. There should be a single space between the number and the unit.
The output should be a single line of text containing this number with units as described above, terminated by a newline. For example, a sample run of your script might look like:
% ruby weblog.rb bytes sample.log 238 KBHint: Remember that the web server uses - to indicate no bytes sent by the web server.
In bytes mode, the smallest unit is "bytes". Sizes less than 1024 (1 KB) should be reported as # of bytes (e.g., "12 bytes"). For 0, the correct output (both grammatically and for project grading) is "0 bytes", not "0 byte". For 1, the correct output (for project grading) is "1 bytes", not "1 byte" (despite it being grammatically incorrect).
In this mode, you should produce a histogram indicating the total number of requests that were served in each possible hour of the day, totaled across all requests on all days. You should ignore the time zone part of the log entry (recall that they're always -0400 for this project). Your output should consist of 24 lines listing the two-digit hour, followed by space, followed by the number of requests (which may be zero), followed by a newline. For example, a sample run of your script might look like:
% ruby weblog.rb time sample.log 00 8 01 3 02 0 ... 23 5meaning that 8 requests were served between 00:00:00 and 00:59:59 in time zone -0400, inclusive, 3 requests were served between 01:00:00 and 01:59:59, inclusive, and so on.
In this mode, you should produce a list containing the top-ten most common request strings (not web pages) received by the web server, across all entries in the log file. The output should contain at most ten lines (Hint: You need to handle the case where there are fewer than ten different requests), one line per popular request. Each line should contain the total number of times the request was received, followed by a space, followed by the request string, which should still include quotes. The lines should be sorted from most popular to least popular. For example, a sample run of your script might look like:
% ruby weblog.rb popularity access-log 5 "GET /hcil/_includes/publications-2-col.html HTTP/1.0" 3 "GET /class/fall2005/cmsc412/ HTTP/1.1" 2 "GET /class/fall2005/cmsc417/ HTTP/1.1" ...
SubmissionYou can submit your project in two ways:
Hints and Tips
The Campus Senate has adopted a policy asking students to include the following statement on each assignment in every course: "I pledge on my honor that I have not given or received any unauthorized assistance on this assignment." Consequently your program is requested to contain this pledge in a comment near the top.
Please carefully read the academic honesty section of the course syllabus. Any evidence of impermissible cooperation on projects, use of disallowed materials or resources, or unauthorized use of computer accounts, will be submitted to the Student Honor Council, which could result in an XF for the course, or suspension or expulsion from the University. Be sure you understand what you are and what you are not permitted to do in regards to academic integrity when it comes to project assignments. These policies apply to all students, and the Student Honor Council does not consider lack of knowledge of the policies to be a defense for violating them. Full information is found in the course syllabus---please review it at this time.