CMSC 330, Spring 2012

Organization of Programming Languages

Project 1 - Processing Weblog Files

Due 11:59pm Tue, Feb 14th, 2012


When you surf the World Wide Web, you use a web browser, such as Firefox, to get web pages from a web server, such as Apache. For example, if you click the first link above, which has the URL, then your web browser will connect to the web server running on the machine and ask it for the web page firefox.

There's a lot more we could say about web browsers and web servers, but for this project, there's only one other thing you need to know: most web servers are configured to log every request they receive to a file. For example, the following is a line from the CS department web log that resulted from a request for the CMSC 330 main web page:

- - [03/Aug/2007:11:34:36 -0400] "GET /class/fall2007/cmsc330/ HTTP/1.1" 200 2488

From left to right, this log entry means: The request came from IP address; the date was August 3, 2007 at the time shown; the request was for the page /class/fall2007/cmsc330/, and the web browser understands HTTP version 1.1; the request was successful (status code 200); and 2488 bytes were sent from the web server to the web browser.

As you can imagine, there are many reasons for the CS department, and anyone else who runs a web server, to maintain web logs. For example, we might want to know which web pages are the most popular on our web site, or what time of day our web server gets the most hits. In this project, you will write a Ruby program that parses web logs of the form shown above and reports various summary information.

Getting Started

Download the following zip archive. It should include the following files:

Part 1: Validating log files

The first part of this project is to write a Ruby script that validates that an input file is in fact a web log and not, e.g., Aunt Bertha's secret apple pie recipe. We will select this task by passing the mode validate to your Ruby script. In particular, to test your solution to part 1, we will invoke your program with

ruby weblog.rb validate log-file-name

In response, your script should output exactly one line of text and then exit. That line should either contain the three letters yes followed by a newline if the log file is valid, or the two letters no followed by a newline otherwise.
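As a rough illustration of the invocation and output described above, here is a minimal sketch. The method name validate and the per-line check passed as a block are hypothetical, and the stand-in check used at the bottom only tests for a trailing newline; a real solution would test every field.

```ruby
# Hypothetical driver: returns "yes" when every line passes the given check.
# An empty input has zero lines, so it counts as valid.
def validate(lines, &line_ok)
  lines.all?(&line_ok) ? "yes" : "no"
end

# Runs only when invoked as: ruby weblog.rb validate log-file-name
if __FILE__ == $PROGRAM_NAME && ARGV.length == 2 && ARGV[0] == "validate"
  # Stand-in check only: a complete solution would verify each field.
  puts validate(File.foreach(ARGV[1])) { |line| line.end_with?("\n") }
end
```

File.foreach reads the file one line at a time, and puts supplies the single trailing newline that the expected output requires.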

A valid log file contains zero or more lines of text, each of which ends in a newline. (Hint: There are Ruby methods to read in a single line of text at a time.) Each line must contain the following fields from left-to-right, with each field separated from the previous field with a single space. The left-most field has nothing in front of it, and the right-most field is followed only by the newline that ends the line:

  • The first field is a numeric IP address. The address contains four numbers in the range 0-255 separated by a period. (Note that by separated we mean that there are three periods total; if the period were a terminator, then there would be four periods.)
  • Next is a hyphen - (this could in theory be a different symbol, but it never happens). For the purposes of this project, you only need to verify that the hyphen is present.
  • Next is the name of the user requesting the page, which may contain any alphabetic characters (upper or lowercase), numbers, and underscore. The user name may be - if no user has been determined.
  • Next is the date the web page was requested. Here is the format for the date as described in the apache documentation:
        [day/month/year:hour:minute:second zone]
        day = 2*digit
        month = 3*letter
        year = 4*digit
        hour = 2*digit
        minute = 2*digit
        second = 2*digit
        zone = (`+' | `-') 4*digit
    For this project, we will require a slightly stricter set of strings: The day must be in the range 01-31 (regardless of the month); the month must be in the range Jan-Dec; the hour must be in the range 00-23, and minutes and seconds must be in the range 00-59; and the zone (which indicates the time zone) will always be -0400.
  • Next is the request itself, which is an arbitrary string that begins and ends with double quotes. Inside of the string any occurrence of double quote must be escaped by being prefixed with backslash.
  • Next is the status code, a non-negative integer.
  • Last is the number of bytes sent, either a non-negative integer or - if nothing was sent to the requester.

Any file whose lines do not conform to these rules is invalid. Extra whitespace between any fields (or at the beginning or end of the line) is not valid.

Hint: It might be easier to write a regular expression that accepts a line in a more general format, and then check after it's been split into pieces that the various range constraints have been satisfied.
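The two-step approach in this hint might look like the sketch below: a deliberately loose regular expression captures the fields, and the range constraints are checked afterwards. The names MONTHS, LINE_SHAPE, and valid_line? are hypothetical, and the request pattern assumes that a backslash escapes whatever character follows it, which is one possible reading of the rules above.

```ruby
MONTHS = %w[Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec].freeze

# Loose shape of one log line; numeric ranges are checked separately below.
LINE_SHAPE = /
  \A(?<ip>\d+\.\d+\.\d+\.\d+)            # four numbers, three periods
  [ ]-[ ](?:[A-Za-z0-9_]+|-)             # hyphen, then user name or a lone hyphen
  [ ]\[(?<day>\d\d)\/(?<month>[A-Za-z]{3})\/\d{4}
  :(?<hour>\d\d):(?<min>\d\d):(?<sec>\d\d)
  [ ]-0400\]                             # the zone is always the same in this project
  [ ]"(?:[^"\\]|\\.)*"                   # request, assuming a backslash escapes the next character
  [ ]\d+[ ](?:\d+|-)\n\z                 # status code, then bytes sent or a lone hyphen
/x

# Returns true when the line has the right shape and every range check passes.
def valid_line?(line)
  m = LINE_SHAPE.match(line)
  return false unless m

  m[:ip].split(".").all? { |n| n.to_i <= 255 } &&
    (1..31).cover?(m[:day].to_i) &&
    MONTHS.include?(m[:month]) &&
    m[:hour].to_i <= 23 && m[:min].to_i <= 59 && m[:sec].to_i <= 59
end

# Made-up sample line (the address 127.0.0.1 is illustrative only).
sample = %(127.0.0.1 - - [03/Aug/2007:11:34:36 -0400] "GET /class/fall2007/cmsc330/ HTTP/1.1" 200 2488\n)
puts valid_line?(sample)   # prints "true"
```

Keeping the pattern loose (for example, any three letters for the month) keeps the regular expression readable, while the stricter rules live in ordinary Ruby comparisons.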

Part 2: Gathering statistics

The next part of the project is to add additional modes to your Ruby script that summarize various aspects of the web log. You will write three new modes, as described below. You may assume that we will only use these new modes on log files that are valid.

Mode: bytes

In this mode, you should output the total number of bytes sent by the web server across all log entries. This size should be reported in the largest appropriate unit (bytes, KB, MB, GB) and truncated to an integer (e.g., 7.9 -> 7). Remember that 1024 bytes = 1 KB, and 1024 KB = 1 MB, etc. So if the total size is 1337 bytes, your program should output "1 KB". If the total size is 42000000 bytes, your program should output "40 MB". Any size larger than 1 GB should be reported in GB. There should be a single space between the number and the unit.

The output should be a single line of text containing this number with units as described above, terminated by a newline. For example, a sample run of your script might look like:

% ruby weblog.rb bytes sample.log
238 KB

Hint: Remember that the web server uses - to indicate no bytes sent by the web server.

In bytes mode, the smallest unit is "bytes". Sizes less than 1024 (1 KB) should be reported as # of bytes (e.g., "12 bytes"). For 0, the correct output (both grammatically and for project grading) is "0 bytes", not "0 byte". For 1, the correct output (for project grading) is "1 bytes", not "1 byte" (despite it being grammatically incorrect).
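The unit rules above can be sketched as follows. The names UNITS, format_size, and bytes_sent are hypothetical; the sketch relies on Ruby's integer division, which already truncates, and assumes it is only given valid log lines.

```ruby
UNITS = %w[bytes KB MB GB].freeze

# Repeatedly divides by 1024 until the value fits the unit or GB is reached.
def format_size(total)
  unit = 0
  while total >= 1024 && unit < UNITS.length - 1
    total /= 1024   # integer division drops the fractional part
    unit += 1
  end
  "#{total} #{UNITS[unit]}"
end

# Reads the last field of a valid log line, treating "-" as zero bytes.
def bytes_sent(line)
  field = line.split(" ").last
  field == "-" ? 0 : field.to_i
end

puts format_size(1337)        # prints "1 KB"
puts format_size(42_000_000)  # prints "40 MB"
puts format_size(1)           # prints "1 bytes"
```

Truncating at each step gives the same result as dividing once by the full power of 1024, so no floating-point arithmetic is needed.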

Mode: time

In this mode, you should produce a histogram indicating the total number of requests that were served in each possible hour of the day, totaled across all requests on all days. You should ignore the time zone part of the log entry (recall that they're always -0400 for this project). Your output should consist of 24 lines listing the two-digit hour, followed by space, followed by the number of requests (which may be zero), followed by a newline. For example, a sample run of your script might look like:

% ruby weblog.rb time sample.log
00 8
01 3
02 0
...
23 5

meaning that 8 requests were served between 00:00:00 and 00:59:59 in time zone -0400, inclusive, 3 requests were served between 01:00:00 and 01:59:59, inclusive, and so on.
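One possible way to build this histogram is sketched below. The method name hour_histogram is hypothetical, and the sketch assumes valid input, so the first [ on each line opens the date field.

```ruby
# Returns 24 strings of the form "HH count", one per hour of the day.
def hour_histogram(lines)
  counts = Array.new(24, 0)
  lines.each do |line|
    # The hour is the two digits right after the first colon inside [...].
    hour = line[/\[[^:\]]*:(\d\d)/, 1]
    counts[hour.to_i] += 1 if hour
  end
  counts.each_with_index.map { |count, hour| format("%02d %d", hour, count) }
end

# Possible usage: puts hour_histogram(File.foreach("sample.log"))
```

Starting from an array of 24 zeros guarantees that hours with no requests still appear in the output.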

Mode: popularity

In this mode, you should produce a list containing the top-ten most common request strings (not web pages) received by the web server, across all entries in the log file. The output should contain at most ten lines (Hint: You need to handle the case where there are fewer than ten different requests), one line per popular request. Each line should contain the total number of times the request was received, followed by a space, followed by the request string, which should still include quotes. The lines should be sorted from most popular to least popular. For example, a sample run of your script might look like:

% ruby weblog.rb popularity access-log
5 "GET /hcil/_includes/publications-2-col.html HTTP/1.0"
3 "GET /class/fall2005/cmsc412/ HTTP/1.1"
2 "GET /class/fall2005/cmsc417/ HTTP/1.1"
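A minimal sketch of this mode, assuming valid input and no ties, could count request strings in a Hash and sort by count. The method name top_requests is hypothetical, and the pattern again assumes that a backslash escapes the character after it.

```ruby
# Returns up to `limit` strings of the form: count, space, quoted request.
def top_requests(lines, limit = 10)
  counts = Hash.new(0)
  lines.each do |line|
    # In a valid line the first double quote opens the request field.
    request = line[/"(?:[^"\\]|\\.)*"/]
    counts[request] += 1 if request
  end
  counts.sort_by { |_request, count| -count }
        .first(limit)
        .map { |request, count| "#{count} #{request}" }
end

# Possible usage: puts top_requests(File.foreach("access-log"))
```

Because first(limit) returns fewer elements when the hash is small, logs with fewer than ten distinct requests need no special handling.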


  • You may assume that single-digit numbers in IP addresses will not have leading zeros. Numbers in the weblog date entry will have leading zeros if they have fewer than the maximum number of digits for that field (e.g., 25/Jul/2006:01:02:02).

  • Requests in weblog entries may contain quotation marks, if they are escaped with a backslash. In other words, "GET /class/"notes".txt" is an illegal request, but "GET /class/\"notes\".txt" is legal.

  • For bytes mode, truncated means drop fractional values when converting to an integer. So 9.9999999 => 9

  • You may assume that all test files will end in literal newline characters '\n' only.

  • You may assume that the logfile exists, its name is passed correctly on the command line, and is valid when in bytes, time, and popularity mode.

  • You may assume that if there are 2 requests that appear the same number of times, the order you output them in the "popularity" mode is not significant. Or just assume that in project 1 testing, there will be no ties among the top 10 most popular requests for tests of "popularity" mode.


    You can submit your project in two ways:
    • Submit your weblog.rb file directly to the submit server by clicking on the submit link in the column "web submission".

      Next, use the submit dialog to submit your weblog.rb file directly.

      Select your file using the "Browse" button, then press the "Submit project!" button. You do not need to put it in a Jar or Zip file.

    • Submit directly by executing a Java program on a computer with Java and network access. Included in are the following files:

      The files should be in the directory containing your project. From there you can either execute submit.rb, or type the following command directly:

      java -jar submit.jar

      The first time you submit this way you will be asked to enter your linuxlab class account and password. All files in the directory (and its subdirectories) will then be put in a jar file and submitted to the submit server. If your submission is successful you will see the message:

      Successful submission # received for project 1

    Hints and Tips

    • This project is not hard, but could consume a lot of time because you will probably be writing in Ruby for the first time. So be sure to start early so you'll have opportunities to ask questions on the class forum or come to office hours if you get stuck.
    • Follow good program development practices: Test each part of your program as you develop it. Start developing a simplified solution and then add features as you are sure that earlier parts work. Test early and often, and re-run your tests as you add new features to be sure you didn't break anything.
    • Before you get too far, review the Ruby class reference, and look for classes and methods that might be helpful. For example, the Array and Hash classes will come in handy. Finding the right class might save you a lot of time and make your program easier to develop.
    • If you write methods that should return a true or false value, remember that a Ruby 0 is not false.
    • Ruby has an integrated debugger, which can be invoked by running Ruby with the -rdebug option. The debugger's p command may be helpful for viewing the values of variables and data structures. The var local command prints all of the local variables at the current point of execution. The chapter "When Trouble Strikes" of The Pragmatic Programmer's Guide discusses the debugger in more detail.
    • To thoroughly debug your program, you will need to construct test cases of your own, based on the project description. If you need help with this, please come to TA office hours.
    • Remember to save your work frequently---a power failure, network failure, or problem with a phone connection could cost many hours of lost work. For the same reason, submit your project often. You can retrieve previously-submitted versions of your program from the submit server should disaster strike.
    • Be sure you have read and understand the project grading policies in the course syllabus. Do this well in advance of the project due date.
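To illustrate the tip above about true and false values: in Ruby only nil and false are treated as false, so 0 counts as true. The helper name truthy? below is hypothetical. This matters for regular expressions because =~ returns the index of the match, which can be 0.

```ruby
# Reports how Ruby treats a value in a condition.
def truthy?(value)
  value ? true : false
end

puts truthy?(0)              # prints "true"
puts truthy?(nil)            # prints "false"
puts truthy?("abc" =~ /a/)   # match at index 0, prints "true"
puts truthy?("abc" =~ /z/)   # no match gives nil, prints "false"
```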

    Academic Integrity

    The Campus Senate has adopted a policy asking students to include the following statement on each assignment in every course: "I pledge on my honor that I have not given or received any unauthorized assistance on this assignment." Consequently your program is requested to contain this pledge in a comment near the top.

    Please carefully read the academic honesty section of the course syllabus. Any evidence of impermissible cooperation on projects, use of disallowed materials or resources, or unauthorized use of computer accounts, will be submitted to the Student Honor Council, which could result in an XF for the course, or suspension or expulsion from the University. Be sure you understand what you are and what you are not permitted to do in regards to academic integrity when it comes to project assignments. These policies apply to all students, and the Student Honor Council does not consider lack of knowledge of the policies to be a defense for violating them. Full information is found in the course syllabus---please review it at this time.
