CMSC 330, Summer 2012
Organization of Programming Languages
Project 1: Ruby Text Processing
11:59pm (23:59 EDT)
5/30 20:35 EDT: Some clarifications have been posted in the "Notes" section below
When you surf the World Wide Web, you use a web browser, such as Firefox, to get web pages from a web server, such as Apache. For example, if you click the first link above, which has the URL http://www.mozilla.com/firefox, then your web browser will connect to the web server running on the machine www.mozilla.com and ask it for the web page firefox.
There's a lot more we could say about web browsers and web servers, but for this project, there's only one other thing you need to know: Most web servers are configured to log all the requests they get to a file. For example, the following is a line from the CS department web log that resulted from a request for the CMSC 330 main web page:
188.8.131.52 - - [03/Aug/2007:11:34:36 -0400] "GET /class/fall2007/cmsc330/ HTTP/1.1" 200 2488
From left to right, this log entry means: the request came from IP address 188.8.131.52; the date was August 3, 2007, at the time shown; the request was for the page /class/fall2007/cmsc330/, and the web browser understands HTTP version 1.1; the request was successful (status code 200); and 2488 bytes were sent from the web server to the web browser.
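As a minimal sketch of how such an entry could be split into its fields, the following uses a single regular expression with capture groups. The method name parse_entry and the hash keys are hypothetical, and the pattern is deliberately loose: it only mirrors the shape of the sample entry above and does not enforce the project's field rules.

```ruby
# Hypothetical helper: splits one log entry into named pieces.
# The pattern only mirrors the sample entry; it is not the official field spec.
def parse_entry(line)
  pattern = /\A(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\d+|-)\n?\z/
  m = pattern.match(line)
  return nil unless m

  {
    ip: m[1],
    timestamp: m[4],
    request: m[5],
    status: m[6].to_i,
    # "-" in the last field means no bytes were sent
    bytes: m[7] == "-" ? 0 : m[7].to_i
  }
end

sample = '188.8.131.52 - - [03/Aug/2007:11:34:36 -0400] "GET /class/fall2007/cmsc330/ HTTP/1.1" 200 2488'
entry = parse_entry(sample)
puts entry[:request]  # prints: GET /class/fall2007/cmsc330/ HTTP/1.1
puts entry[:bytes]    # prints: 2488
```

Note that \A and \z anchor the whole string; in Ruby, ^ and $ match at line boundaries instead, which is usually not what you want when checking an entire line.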
As you can imagine, there are many reasons for the CS department, and anyone else who runs a web server, to maintain web logs. For example, we might want to know which web pages are the most popular on our web site, or what time of day our web server gets the most hits. In this project, you will write a Ruby program that parses web logs of the form shown above and reports various summary information.
What to Submit
You should submit a file weblog.rb containing your solution. You may submit other files, but they will be ignored during grading. We will run your solution by invoking
ruby weblog.rb <mode> <log-file-name>
where <mode> describes what the tool should do (see below), and <log-file-name> names the file containing the web log.
Be sure to follow the project description below exactly. Your solution will be graded automatically, and so any deviation from the specification will result in losing points. In particular, if you have any debugging output in your program, be sure to turn it off before you submit your program.
You can access the project starter files at:
Copy this file to your home directory and unpack it with
This should create a p1 directory that contains the starter file weblog.rb along with the public test files.
If you do your development inside the p1 directory, you can submit all files in that directory and below with the command
This command runs a Java program that looks in the current directory for a .submit file to find the project number, and then uploads all files within that directory to the submit server.
Alternatively, you can use the web-based submit server interface at:
Part 1: Validating log files
The first part of this project is to write a Ruby script that validates that an input file is in fact a web log and not, e.g., Aunt Bertha's secret apple pie recipe. We will select this task by passing the mode validate to your Ruby script. In particular, to test your solution to part 1, we will invoke your program with
ruby weblog.rb validate <log-file-name>
In response, your script should output exactly one line of text and then exit. That line should either contain the three letters yes followed by a newline if the log file is valid, or the two letters no followed by a newline otherwise.
A valid log file contains zero or more lines of text, each of which ends in a newline. (Hint: There are Ruby methods to read in a single line of text at a time.) Each line must contain the following fields from left to right, with each field separated from the previous field by a single space. The left-most field has nothing in front of it, and the right-most field is followed only by the newline that ends the line:
Any file whose lines do not conform to these rules is invalid. Extra whitespace between any fields (or at the beginning or end of the line) is not valid.
Hint: It might be easier to write a regular expression that accepts a line in a more general format, and then check after it's been split into pieces that the various range constraints have been satisfied. Use Rubular to help test your regular expressions.
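The two-step approach in the hint could be sketched as follows. The method names valid_line? and valid_log? are hypothetical, and the specific range checks (IP octets 0 to 255, hour 0 to 23, minute and second 0 to 59) are assumptions for illustration only; substitute the actual field rules from the specification.

```ruby
# Step 1: a general pattern accepts the overall shape of a line.
# Step 2: range checks run on the captured pieces.
# Assumption: the ranges below are illustrative, not the official rules.
def valid_line?(line)
  pattern = /\A(\d+)\.(\d+)\.(\d+)\.(\d+) \S+ \S+ \[(\d{2})\/([A-Z][a-z]{2})\/(\d{4}):(\d{2}):(\d{2}):(\d{2}) -0400\] "[^"]*" \d+ (?:\d+|-)\n\z/
  m = pattern.match(line)
  return false unless m

  octets = m[1..4].map(&:to_i)
  return false unless octets.all? { |octet| octet.between?(0, 255) }

  hour, minute, second = m[8].to_i, m[9].to_i, m[10].to_i
  hour.between?(0, 23) && minute.between?(0, 59) && second.between?(0, 59)
end

# A file with zero lines is valid, which all? handles naturally.
def valid_log?(lines)
  lines.all? { |line| valid_line?(line) }
end

lines = ["188.8.131.52 - - [03/Aug/2007:11:34:36 -0400] \"GET /class/fall2007/cmsc330/ HTTP/1.1\" 200 2488\n"]
puts(valid_log?(lines) ? "yes" : "no")  # prints: yes
```

Because the pattern uses literal single spaces and ends with \n\z, extra whitespace or a missing final newline makes the match fail without any extra code.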
Part 2: Gathering statistics
The next part of the project is to add additional modes to your Ruby script that summarize various aspects of the web log. You will write three new modes, as described below. You may assume that we will only use these new modes on log files that are valid.
In the bytes mode, you should output the total number of bytes sent by the web server across all log entries. This size should be reported in the largest appropriate unit (bytes, KB, MB, GB) and truncated to an integer. Remember that 1024 bytes = 1 KB, 1024 KB = 1 MB, and so on. So if the total size is 1337 bytes, your program should output "1 KB". If the total size is 42000000 bytes, your program should output "40 MB". Any size larger than 1 GB should be reported in GB. There should be a single space between the number and the unit.
The output should be a single line of text containing this number with units as described above, terminated by a newline. For example, a sample run of your script might look like:
% ruby weblog.rb bytes sample.log
238 KB
Hint: Remember that the web server uses - to indicate no bytes sent by the web server.
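One way the unit conversion might be written is shown below. The method names total_bytes and format_size are hypothetical, and the sketch assumes the byte count is the last space-separated field of each line, as in the sample entry.

```ruby
# Sums the last field of every line, counting "-" as zero bytes.
def total_bytes(lines)
  lines.sum do |line|
    field = line.split(" ").last
    field == "-" ? 0 : field.to_i
  end
end

# Repeatedly divides by 1024 until the value fits the current unit
# or the largest unit (GB) is reached. Integer division truncates.
def format_size(total)
  units = %w[bytes KB MB GB]
  index = 0
  while total >= 1024 && index < units.length - 1
    total /= 1024
    index += 1
  end
  "#{total} #{units[index]}"
end

puts format_size(1337)        # prints: 1 KB
puts format_size(42_000_000)  # prints: 40 MB
```

Stopping at the last entry of units is what keeps sizes of 1024 GB or more in GB rather than moving to a larger unit.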
In the time mode, you should produce a histogram indicating the total number of requests that were served in each possible hour of the day, totaled across all requests on all days. You should ignore the time zone part of the log entry (recall that it is always -0400 for this project). Your output should consist of 24 lines, each listing the two-digit hour, followed by a space, followed by the number of requests (which may be zero), followed by a newline. For example, a sample run of your script might look like:
% ruby weblog.rb time sample.log
00 8
01 3
02 0
...
23 5
meaning that 8 requests were served between 00:00:00 and 00:59:59 in time zone -0400, inclusive, 3 requests were served between 01:00:00 and 01:59:59, inclusive, and so on.
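The histogram could be built roughly as follows. The method name hourly_histogram is hypothetical, and the sketch assumes the input is a valid log, so the hour is the two digits after the first colon inside the square brackets.

```ruby
# Returns 24 output lines, one per hour, in the "HH count" format.
# Assumes valid input: the captured hour is always between 00 and 23.
def hourly_histogram(lines)
  counts = Array.new(24, 0)
  lines.each do |line|
    m = line.match(/\[[^:\]]+:(\d{2}):/)
    counts[m[1].to_i] += 1 if m
  end
  counts.each_with_index.map { |count, hour| format("%02d %d", hour, count) }
end

lines = [
  "1.2.3.4 - - [03/Aug/2007:11:34:36 -0400] \"GET / HTTP/1.1\" 200 10\n",
  "1.2.3.4 - - [04/Aug/2007:11:02:00 -0400] \"GET / HTTP/1.1\" 200 10\n"
]
puts hourly_histogram(lines)
```

Starting from an array of 24 zeros means hours with no requests still appear in the output, as the specification requires.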
In the popularity mode, you should produce a list containing the ten most common request strings (not web pages) received by the web server, across all entries in the log file. The output should contain at most ten lines (Hint: You need to handle the case where there are fewer than ten different requests), one line per popular request. Each line should contain the total number of times the request was received, followed by a space, followed by the request string, which should still include its quotes. The lines should be sorted from most popular to least popular. For example, a sample run of your script might look like:
% ruby weblog.rb popularity access-log
5 "GET /hcil/_includes/publications-2-col.html HTTP/1.0"
3 "GET /class/fall2005/cmsc412/ HTTP/1.1"
2 "GET /class/fall2005/cmsc417/ HTTP/1.1"
...
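A counting hash is one natural way to tally the requests, sketched below. The method name top_requests is hypothetical. The description above does not say how to order requests with equal counts, so breaking ties alphabetically here is purely an assumption.

```ruby
# Counts each quoted request string and returns up to `limit` output lines.
# Assumption: ties are broken alphabetically; the spec does not say.
def top_requests(lines, limit = 10)
  counts = Hash.new(0)
  lines.each do |line|
    request = line[/"[^"]*"/]  # first quoted string, quotes included
    counts[request] += 1 if request
  end
  counts.sort_by { |request, count| [-count, request] }
        .first(limit)
        .map { |request, count| "#{count} #{request}" }
end

lines = [
  "1.2.3.4 - - [03/Aug/2007:11:34:36 -0400] \"GET /a HTTP/1.1\" 200 10\n",
  "1.2.3.4 - - [03/Aug/2007:11:34:37 -0400] \"GET /b HTTP/1.1\" 200 10\n",
  "1.2.3.4 - - [03/Aug/2007:11:34:38 -0400] \"GET /a HTTP/1.1\" 200 10\n"
]
puts top_requests(lines)
# prints:
# 2 "GET /a HTTP/1.1"
# 1 "GET /b HTTP/1.1"
```

Since first(limit) returns fewer elements when fewer exist, a log with under ten distinct requests needs no special case.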
Hints and Tips
Public Test Cases
Public test cases have been posted.
The Campus Senate has adopted a policy asking students to include the following statement on each assignment in every course: "I pledge on my honor that I have not given or received any unauthorized assistance on this assignment." Consequently, you are asked to include this pledge in a comment near the top of your program.
Please carefully read the academic honesty section of the course syllabus. Any evidence of impermissible cooperation on projects, use of disallowed materials or resources, or unauthorized use of computer accounts will be submitted to the Student Honor Council, which could result in an XF for the course, or suspension or expulsion from the University. Be sure you understand what you are and are not permitted to do with regard to academic integrity on project assignments. These policies apply to all students, and the Student Honor Council does not consider lack of knowledge of the policies to be a defense for violating them. Full information is found in the course syllabus; please review it at this time.