CMSC 330, Fall 2008
Organization of Programming Languages
As we saw in lecture and discussion section, there's a lot of interesting data available in text form. In this project, you will write a basic Ruby script that lets us answer some questions about data from the 1994 Census.
The particular data set we will be using comes as a file of comma-separated values (.csv): each line of the file represents one census record, and the different fields of the record are separated by commas. Here are two sample records, taken from the first two lines of the file:
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
Note that we have left the data exactly as we downloaded it, and not cleaned it up for you in any way. This is real stuff! (The people who posted the data to the repository did filter it a bit from the original census data.) Among other data, the first line lists a 39-year-old, unmarried male state government employee, and the second line lists a 50-year-old self-employed married male. Here is a brief description and the possible data values in each field, going from left to right:
Any of the entries may also be ?, to indicate missing data.
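As a minimal sketch of how one record might be broken into fields, the snippet below splits a line on a comma followed by a single space, which is the separator seen in the two sample records, and maps ? to nil. The helper name parse_record is hypothetical and is not part of the project specification.

```ruby
# Hypothetical helper: split one census line into its fields.
# Assumption: fields are separated by a comma and one space, as in the samples.
def parse_record(line)
  # The -1 limit keeps trailing empty fields instead of silently dropping them.
  line.chomp.split(", ", -1).map { |field| field == "?" ? nil : field }
end

record = parse_record("39, State-gov, 77516, Bachelors, 13, Never-married, " \
                      "Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, " \
                      "United-States, <=50K\n")
puts record.length  # the sample record has 15 fields
puts record[0]      # first field of the sample record
```

Mapping ? to nil is only one option; keeping the literal "?" string works just as well if the later code checks for it.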
What to submit
You should submit a file census.rb containing your solution. You may submit other files, but they will be ignored during grading. We will run your solution by invoking
where <mode> describes what the tool should do (see below), and <file-name> names the file containing the census data.
Be sure to follow the project description below exactly. Your solution will be graded automatically, and so any deviation from the specification will result in losing points. In particular, if you have any debugging output in your program, be sure to turn it off before you submit your program.
To get a project directory containing sample (real!) census data, copy the file from
into your home directory and unpack it with
This should create a p1 directory that contains the sample census data file census.csv.
If you do your development inside the p1 directory, you can submit all files in that directory and below with the command
This command looks in the current directory for a .submit file to find the project number, and then uploads all files within that directory to the submit server.
If you use this command often, you should add /afs/glue.umd.edu/class/fall2008/cmsc/330/0101/public/bin to your path.
The census file included in the p1 directory is a small slice of a much longer data file. You can find the full file at
Feel free to use this file as a sample input file; just pass that long file name as an argument to your script. (It's a good test, because it contains lots of possible input values.) Please do not copy this file to your home directory. It takes up 3.8 megabytes, and if each student copies it to their home directory, we'll be wasting a lot of disk space, and will run into problems later in the semester. (Especially don't put it in your p1 directory, since if you did that, then every time you submitted your project it would get copied to the submit server!)
Part 1: Validating the input format
The first part of this project is to write a mode for census.rb that validates that an input file contains valid census data and not, e.g., Aunt Bertha's secret apple pie recipe. We will invoke your program with
In response, your script should output exactly one line of text and then exit. That line should either contain the three letters yes followed by a newline if the file is valid, or the two letters no followed by a newline otherwise.
A valid file contains zero or more lines of text, each of which ends in a newline. (Hint: There are Ruby methods to read in a single line of text at a time.) Each line must contain a comma-separated list of values following exactly the format above. In particular, you must check the following:
We will test this part of the project by (a) making valid inputs and verifying that your program accepts them, and (b) making invalid inputs and verifying that your program rejects them.
Hint: It might be easier to write a regular expression that accepts a line in a more general format, and then check after it's been split into pieces that the various range constraints have been satisfied. Don't forget, you don't need to use regular expressions for everything; you might be able to do a lot (or all) with just methods of String.
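The two-step approach in the hint might look roughly like the sketch below: a regular expression checks the general shape of a line, and the split fields are then checked individually. Everything specific here is an assumption made for illustration: the field count of 15 is taken from the sample records, the only per-field check shown is that the first field is a run of digits or ?, and the names LINE_SHAPE, valid_line? and valid_file? are hypothetical. The real checks must follow the project's list of constraints.

```ruby
# Assumption: 15 fields per line (counted from the sample records), separated
# by ", ", with no empty fields, and a newline at the end of the line.
LINE_SHAPE = /\A[^,\s]+(, [^,\s]+){14}\n\z/

# Hypothetical check for one line: general shape first, then per-field checks.
def valid_line?(line)
  return false unless line =~ LINE_SHAPE
  fields = line.chomp.split(", ")
  age = fields[0]
  # Illustrative constraint only: the first field is all digits or "?".
  age == "?" || age.match?(/\A\d+\z/)
end

# A file with zero lines is valid, and all? returns true for an empty list.
def valid_file?(lines)
  lines.all? { |line| valid_line?(line) }
end

lines = [
  "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, " \
  "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
]
puts(valid_file?(lines) ? "yes" : "no")
```

In a real script the lines would come from the input file rather than from an in-memory array.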
Part 2: Histograms
Next, you're going to add a feature to your Ruby script to output histograms summarizing the data set. In particular, there are three kinds of histograms your script needs to be able to generate: histograms of gender distribution, education level, and age.
First, if we invoke your script with the mode gender, your script should output two lines, the first beginning with Female:, followed by one space, followed by the number of females listed in the census data; and the second beginning with Male:, followed by one space, followed by the number of males listed in the census data. For example:
% ruby census.rb gender test_input
Female: 1234
Male: 5678
Note that it is possible for the count of females or males to be zero, in which case you must still output both lines of text. Do not include entries with gender ? in the histogram.
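One way to guarantee that both lines always appear is to start each count at zero before reading any data, as in the sketch below. The field position (GENDER_INDEX = 9) is an assumption inferred from the sample records, and gender_counts is a hypothetical helper name.

```ruby
# Assumption: gender is the 10th field (index 9), inferred from the samples.
GENDER_INDEX = 9

# Hypothetical helper: count Female and Male entries in a list of census lines.
def gender_counts(lines)
  counts = { "Female" => 0, "Male" => 0 }  # both labels exist from the start
  lines.each do |line|
    gender = line.chomp.split(", ")[GENDER_INDEX]
    counts[gender] += 1 if counts.key?(gender)  # "?" is not a key, so it is skipped
  end
  counts
end

sample = [
  "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, " \
  "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n",
  "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, " \
  "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
]
gender_counts(sample).each { |label, n| puts "#{label}: #{n}" }
```

Because Ruby hashes keep insertion order, Female is printed before Male; for the two sample records the counts are 0 and 2.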
Second, if we invoke your script with the education mode, your script should output each possible education level followed by a colon, a space, and the count, with one education level per line. The education levels should be listed in alphabetical order (according to the standard rules for sorting ASCII text). Again, categories with no entries should still be listed, with a zero count. For example,
% ruby census.rb education test_input
10th: 123
11th: 456
12th: 789
1st-4th: 123
5th-6th: 456
7th-8th: 789
9th: 12
Assoc-acdm: 34
Assoc-voc: 56
Bachelors: 0
Doctorate: 123
HS-grad: 45
Masters: 6
Preschool: 0
Prof-school: 90
Some-college: 1234
Do not include entries with education level ? in the histogram.
Finally, if we invoke your script with the age mode, your script should output age ranges in the format shown below, followed by a colon, space, and the number of entries in each range. The age ranges should be in 5-year increments starting at age 15 (even though no age less than 17 appears in the data set). All age ranges should be listed (possibly with zero counts). The last category should be for age 100 and above. For example:
% ruby census.rb age test_input
15-19: 123
20-24: 456
25-29: 789
30-34: 12
35-39: 42
40-44: 42
45-49: 42
50-54: 42
55-59: 42
60-64: 42
65-69: 42
70-74: 42
75-79: 42
80-84: 42
85-89: 42
90-94: 42
95-99: 42
100-: 42
Do not include entries with age ? in the histogram.
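The age ranges can be computed rather than typed out by hand. The sketch below is one possible way to do it: integer division rounds an age down to the start of its 5-year range, and everything from 100 upward falls into the final 100- category. The names AGE_LABELS and age_range_label are hypothetical.

```ruby
# All range labels in output order: "15-19", "20-24", ..., "95-99", "100-".
AGE_LABELS = (15..95).step(5).map { |low| "#{low}-#{low + 4}" } + ["100-"]

# Hypothetical helper: map an integer age to the label of its 5-year range.
def age_range_label(age)
  return "100-" if age >= 100
  low = (age / 5) * 5  # integer division rounds down to a multiple of 5
  "#{low}-#{low + 4}"
end

puts AGE_LABELS.length    # one label per line of the age histogram
puts age_range_label(39)  # label for the first sample record
```

An age below 15 would produce a label that is not in AGE_LABELS, so the counting code needs to decide how to treat such a value.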
There's a programming challenge hidden here: How much code can you reuse for all three kinds of histograms? There's no requirement that you do so, and we won't check. So you could use copy-and-paste to duplicate your histogram code and modify it once you have it working. But it will be much more satisfying for you if you figure out a programmatic method of reuse, and it will keep your solution shorter. (Note that it's pretty hard to reuse everything, so you probably will need to do some code duplication.)
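As one possible approach to this kind of reuse, and not the only one, the counting loop can be written once and handed a block that picks the label for each record. In the sketch below, histogram is a hypothetical helper, the records are assumed to be arrays of already-split fields, and the field positions (3 for education, 9 for gender) are inferred from the sample records. The education levels are the ones listed in the example output.

```ruby
# Hypothetical helper: count records per label. The block maps a record to its
# label; labels missing from the list (such as "?") are ignored.
def histogram(records, labels)
  counts = labels.map { |label| [label, 0] }.to_h  # every label starts at zero
  records.each do |fields|
    label = yield(fields)
    counts[label] += 1 if counts.key?(label)
  end
  counts.map { |label, n| "#{label}: #{n}" }
end

# String#<=> compares byte by byte, so sort gives ASCII order.
EDUCATION_LEVELS = %w[
  Bachelors Masters Doctorate HS-grad Some-college Prof-school Preschool
  Assoc-acdm Assoc-voc 1st-4th 5th-6th 7th-8th 9th 10th 11th 12th
].sort

# First ten fields of the two sample records.
records = [
  %w[39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male],
  %w[50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male]
]

gender_lines = histogram(records, ["Female", "Male"]) { |fields| fields[9] }
education_lines = histogram(records, EDUCATION_LEVELS) { |fields| fields[3] }
puts gender_lines
puts education_lines
```

The age histogram fits the same shape if the block converts the age field to its range label first.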
Part 3: Correlation
Finally, we want to use your script to try to decide whether income levels and gender are independent variables, meaning there is no statistical relationship between them. The statistical test you will use is Pearson's chi-square test. You can read more about this test on the web; for purposes of this project, here is the exact procedure you should follow to calculate the p-value of the hypothesis that income and gender are independent:
In this last mode, which we will call indep, your script should output the p-value. For example:
% ruby census.rb indep test_input
0.05
Generally, a p-value of 0.05 or less is considered statistically significant: if gender and income were truly independent, a result at least this extreme would be expected no more than 5% of the time, so a small p-value is evidence that the two are related.
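For orientation only, here is a sketch of the textbook Pearson chi-square computation on a table of observed counts, for example gender by income bracket. It is not a substitute for the exact procedure this project prescribes, which may differ in how the p-value is obtained. Two things are assumptions here: the helper names chi_square and p_value_one_dof are hypothetical, and the p-value formula applies only to one degree of freedom (a 2x2 table), using the identity that the upper tail of that distribution equals erfc(sqrt(x / 2)).

```ruby
# Textbook Pearson statistic for a table of observed counts (array of rows).
# Assumption: every row total and column total is nonzero.
def chi_square(table)
  row_totals = table.map(&:sum)
  col_totals = table.transpose.map(&:sum)
  total = row_totals.sum.to_f
  stat = 0.0
  table.each_with_index do |row, i|
    row.each_with_index do |observed, j|
      # Count expected in this cell if rows and columns were independent.
      expected = row_totals[i] * col_totals[j] / total
      stat += (observed - expected)**2 / expected
    end
  end
  stat
end

# Upper-tail probability of a chi-square variable with ONE degree of freedom.
def p_value_one_dof(stat)
  Math.erfc(Math.sqrt(stat / 2.0))
end

observed = [[20, 30], [30, 20]]  # made-up counts, for illustration only
stat = chi_square(observed)
puts stat
puts p_value_one_dof(stat)
```

With these made-up counts every expected cell is 25, so the statistic is 4.0 and the p-value is a little under 0.05.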
Hints and Tips
The Campus Senate has adopted a policy asking students to include the following statement on each assignment in every course: "I pledge on my honor that I have not given or received any unauthorized assistance on this assignment." Consequently, we ask that your program contain this pledge in a comment near the top.
Please carefully read the academic honesty section of the course syllabus. Any evidence of impermissible cooperation on projects, use of disallowed materials or resources, or unauthorized use of computer accounts, will be submitted to the Student Honor Council, which could result in an XF for the course, or suspension or expulsion from the University. Be sure you understand what you are and are not permitted to do with regard to academic integrity when it comes to project assignments. These policies apply to all students, and the Student Honor Council does not consider lack of knowledge of the policies to be a defense for violating them. Full information is found in the course syllabus; please review it at this time.
Data Set Source
The sample data we are using for this project came from the UCI Machine Learning Repository.