Project 1

Due 11:59pm Monday, September 22, 2008

Updates

Sep 18. Minor correction in part 3: the p-value indicates the probability that the variables are independent. That is, a high p-value means gender and income are probably independent, whereas a low value means they are probably dependent.
Sep 11. For part 3, you can ignore the possibility of dividing by zero (i.e., just assume it won't happen).
Sep 10. Fixed a typo---the calculation of EFL was missing from part 3.
Sep 10. In parts 2 and 3, you may assume that the input file is in the right format.

Introduction

As we saw in lecture and discussion section, there's a lot of interesting data available in text form. In this project, you will write a basic Ruby script that lets us answer some questions about data from the 1994 Census.

The particular data set we will be using comes as a file of comma-separated values (.csv): each line of the file represents one census record, and the different fields of the record are separated by commas. Here are two sample records, taken from the first two lines of the file:

39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K

Note that we have left the data exactly as we downloaded it, and not cleaned it up for you in any way. This is real stuff! (The people who posted the data to the repository did filter it a bit from the original census data.) Among other data, the first line lists a 39-year-old, unmarried male state government employee, and the second line lists a 50-year-old self-employed married male. Here is a brief description and the possible data values in each field, going from left to right:

Age - Integer greater than or equal to 17
Work status - Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
Final weight - Non-negative integer. We're not sure what this is.
Education level - Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool
Education level (integral) - Non-negative integer. Another representation of the education level.
Marital status - Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse
Occupation - Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces
Relationship to census participant - Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
Race - White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
Gender - Female, Male
Capital gain - Non-negative integer
Capital loss - Non-negative integer
Hours per week worked - Non-negative integer (at most 168)
Native country - United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands
Income - <=50K, >50K

Any of the entries may also be ?, to indicate missing data.
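As a minimal sketch of how one record might be pulled apart in Ruby, the snippet below splits a line on the comma-and-space separator and pairs each value with a field name. The constant FIELD_NAMES and the method parse_record are hypothetical names chosen for illustration; they are not part of the assignment, and every value is left as a string.

```ruby
# Hypothetical field names, in the left-to-right order described above.
FIELD_NAMES = [
  :age, :work_status, :final_weight, :education, :education_num,
  :marital_status, :occupation, :relationship, :race, :gender,
  :capital_gain, :capital_loss, :hours_per_week, :native_country, :income
]

# Split one record on ", " and pair each value with its field name.
# The -1 limit keeps trailing empty fields instead of silently dropping them.
def parse_record(line)
  values = line.chomp.split(", ", -1)
  Hash[FIELD_NAMES.zip(values)]
end

record = parse_record("39, State-gov, 77516, Bachelors, 13, Never-married, " \
                      "Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, " \
                      "United-States, <=50K\n")
puts record[:age]     # prints 39
puts record[:gender]  # prints Male
puts record[:income]  # prints <=50K
```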

What to submit

You should submit a file census.rb containing your solution. You may submit other files, but they will be ignored during grading. We will run your solution by invoking

ruby census.rb <mode> <file-name>

where <mode> describes what the tool should do (see below), and <file-name> names the file containing the census data.
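One possible way to read the two arguments is sketched below. The method name parse_args and the constant MODES are hypothetical, and the assignment does not say what to do with an unknown mode or a wrong argument count, so returning nil in that case is only this sketch's assumption. In a real script you would pass ARGV to the method.

```ruby
# The five modes named in this project description.
MODES = %w[validate gender education age indep]

# Return [mode, file_name] when the arguments have the expected shape,
# or nil otherwise. What to do in the nil case is not specified by the
# assignment; returning nil is just this sketch's choice.
def parse_args(argv)
  mode, file_name = argv
  return nil unless argv.length == 2 && MODES.include?(mode)
  [mode, file_name]
end

p parse_args(["gender", "census.csv"])   # prints ["gender", "census.csv"]
p parse_args(["bogus", "census.csv"])    # prints nil
```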

Be sure to follow the project description below exactly. Your solution will be graded automatically, and so any deviation from the specification will result in losing points. In particular, if you have any debugging output in your program, be sure to turn it off before you submit your program.

To get a project directory containing sample (real!) census data, copy the file from

/afs/glue.umd.edu/class/fall2008/cmsc/330/0101/public/p1.tar.gz

into your home directory and unpack it with

gtar xzf p1.tar.gz

This should create a p1 directory that contains the sample census data file census.csv.

If you do your development inside the p1 directory, you can submit all files in that directory and below with the command

/afs/glue.umd.edu/class/fall2008/cmsc/330/0101/public/bin/330submit

This command looks in the current directory for a .submit file to find the project number, and then uploads all files within that directory to the submit server.

If you use this command often, you should add /afs/glue.umd.edu/class/fall2008/cmsc/330/0101/public/bin to your path.

The census file included in the p1 directory is a small slice of a much longer data file. You can find the full file at

/afs/glue.umd.edu/class/fall2008/cmsc/330/0101/public/census-full.csv

Feel free to use this file as a sample input file; just pass that long file name as an argument to your script. (It's a good test, because it contains lots of possible input values.) Please do not copy this file to your home directory. It takes up 3.8 megabytes, and if each student copies it to their home directory, we'll be wasting a lot of disk space, and will run into problems later in the semester. (Especially don't put it in your p1 directory, since if you did that, then every time you submitted your project it would get copied to the submit server!)

Part 1: Validating the input format

The first part of this project is to write a mode for census.rb that validates that an input file contains valid census data and not, e.g., Aunt Bertha's secret apple pie recipe. We will invoke your program with

ruby census.rb validate <file-name>

In response, your script should output exactly one line of text and then exit. That line should either contain the three letters yes followed by a newline if the file is valid, or the two letters no followed by a newline otherwise.

A valid file contains zero or more lines of text, each of which ends in a newline. (Hint: There are Ruby methods to read in a single line of text at a time.) Each line must contain a comma-separated list of values following exactly the format above. In particular, you must check the following:

That each line contains exactly 15 fields.
That each field except the last is followed by a comma and exactly one space. (Thus, commas are separators rather than terminators.)
That each field contains contents of the right form. For example, an age of 16 is not valid, nor is an occupation of Window-washer. (Don't forget that ? is always a valid entry for a field.)

We will test this part of the project by (a) making valid inputs and verifying that your program accepts them, and (b) making invalid inputs and verifying that your program rejects them.

Hint: It might be easier to write a regular expression that accepts a line in a more general format, and then check after it's been split into pieces that the various range constraints have been satisfied. Don't forget, you don't need to use regular expressions for everything; you might be able to do a lot (or all) with just methods of String.
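The two-stage idea in the hint could look roughly like the sketch below: first match the general shape of a line, then split it and check individual fields. This is deliberately incomplete. It checks only the age and gender fields, the names LINE_SHAPE, valid_age?, and plausible_line? are hypothetical, and the regular expression assumes that no field value contains a comma or a space (which holds for every value listed above). Whether an age written with leading zeros should be accepted is not stated in this description; the sketch accepts it.

```ruby
# General shape: 15 fields with no commas or spaces inside them, separated
# by a comma and exactly one space, with the line ending in a newline.
LINE_SHAPE = /\A[^, \n]+(?:, [^, \n]+){14}\n\z/

GENDERS = %w[Female Male]

# Age is "?" or a string of digits whose value is at least 17.
def valid_age?(field)
  return true if field == "?"
  field.match?(/\A\d+\z/) && field.to_i >= 17
end

# Check the general shape first, then the per-field constraints.
# Only age (field 1) and gender (field 10) are checked in this sketch.
def plausible_line?(line)
  return false unless line.match?(LINE_SHAPE)
  fields = line.chomp.split(", ")
  valid_age?(fields[0]) && (fields[9] == "?" || GENDERS.include?(fields[9]))
end

good = "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, " \
       "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
puts plausible_line?(good)                         # prints true
puts plausible_line?(good.sub("39", "16"))         # prints false (age below 17)
puts plausible_line?(good.sub(", Male", ",Male"))  # prints false (missing space)
```

A full validator would need a similar check for each of the remaining fields.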

Part 2: Histograms

Next, you're going to add a feature to your Ruby script to output histograms summarizing the data set. In particular, there are three kinds of histograms your script needs to be able to generate: histograms of gender distribution, education level, and age.

First, if we invoke your script with the mode gender, your script should output two lines, the first beginning with Female:, followed by one space, followed by the number of females listed in the census data; and the second beginning with Male:, followed by one space, followed by the number of males listed in the census data. For example:

% ruby census.rb gender test_input
Female: 1234
Male: 5678

Note that it is possible for the count of females or males to be zero, in which case you must still output both lines of text. Do not include entries with gender ? in the histogram.
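A minimal sketch of the gender count follows, assuming the input has already been read into an array of lines. The method names gender_counts and format_histogram are hypothetical. Starting both counts at zero is what makes both lines appear even when one gender is absent, and a hash built this way keeps its keys in insertion order in Ruby 1.9 and later.

```ruby
# Count Female/Male entries; both keys start at zero so both lines always print.
def gender_counts(lines)
  counts = { "Female" => 0, "Male" => 0 }
  lines.each do |line|
    gender = line.chomp.split(", ")[9]          # gender is the 10th field
    counts[gender] += 1 if counts.key?(gender)  # "?" is not a key, so it is skipped
  end
  counts
end

# Turn the counts into "Label: count" strings, one per output line.
def format_histogram(counts)
  counts.map { |label, n| "#{label}: #{n}" }
end

sample = [
  "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n",
  "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
]
puts format_histogram(gender_counts(sample))
# prints:
# Female: 0
# Male: 2
```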

Second, if we invoke your script with the education mode, your script should output the possible education levels, each followed by a colon, a space, and the count, with one education level per line. The education levels should be listed in alphabetical order (according to the standard rules for sorting ASCII text, so levels that begin with a digit come before those that begin with a letter). Again, categories with no entries should still be listed, with a zero count. For example,

% ruby census.rb education test_input
10th: 123
11th: 456
12th: 789
1st-4th: 123
5th-6th: 456
7th-8th: 789
9th: 12
Assoc-acdm: 34
Assoc-voc: 56
Bachelors: 0
Doctorate: 123
HS-grad: 45
Masters: 6
Preschool: 0
Prof-school: 90
Some-college: 1234

Do not include entries with education level ? in the histogram.
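Ruby's default String comparison is bytewise, which matches the ASCII ordering in the example above. The sketch below illustrates this on the list of education levels. The names EDUCATION_LEVELS and education_counts are hypothetical, and the input is assumed to be an array of education-level strings that have already been extracted from the records.

```ruby
# The education levels in the order they are listed in the field description.
EDUCATION_LEVELS = %w[
  Bachelors Some-college 11th HS-grad Prof-school Assoc-acdm Assoc-voc 9th
  7th-8th 12th Masters 1st-4th 10th Doctorate 5th-6th Preschool
]

# Start every level at zero so empty categories are still listed.
def education_counts(levels_seen)
  counts = Hash[EDUCATION_LEVELS.map { |level| [level, 0] }]
  levels_seen.each { |level| counts[level] += 1 if counts.key?(level) }
  counts
end

counts = education_counts(["Bachelors", "Bachelors", "?", "HS-grad"])

# Array#sort on strings uses bytewise order: digits sort before letters.
counts.keys.sort.each { |level| puts "#{level}: #{counts[level]}" }
```

With this input the first line printed is 10th: 0 and the last is Some-college: 0, in the same order as the example output above.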

Finally, if we invoke your script with the age mode, your script should output age ranges in the format shown below, followed by a colon, space, and the number of entries in each range. The age ranges should be in 5-year increments starting at age 15 (even though no age less than 17 appears in the data set). All age ranges should be listed (possibly with zero counts). The last category should be for age 100 and above. For example:

% ruby census.rb age test_input
15-19: 123
20-24: 456
25-29: 789
30-34: 12
35-39: 42
40-44: 42
45-49: 42
50-54: 42
55-59: 42
60-64: 42
65-69: 42
70-74: 42
75-79: 42
80-84: 42
85-89: 42
90-94: 42
95-99: 42
100-: 42

Do not include entries with age ? in the histogram.
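The age ranges can be generated rather than typed out. The following is a sketch under the assumption that ages have already been converted to integers and that ? entries were dropped beforehand; AGE_LABELS and age_label are hypothetical names.

```ruby
# Labels "15-19", "20-24", ..., "95-99", followed by the open-ended "100-".
AGE_LABELS = (15..95).step(5).map { |lo| "#{lo}-#{lo + 4}" } + ["100-"]

# Map an integer age (17 or more in valid data) to its range label.
def age_label(age)
  return "100-" if age >= 100
  lo = age - (age % 5)   # round down to the nearest multiple of 5
  "#{lo}-#{lo + 4}"
end

ages = [39, 50, 17, 104]
counts = Hash.new(0)     # missing labels read as zero
ages.each { |age| counts[age_label(age)] += 1 }
AGE_LABELS.each { |label| puts "#{label}: #{counts[label]}" }
```

Iterating over AGE_LABELS, rather than over the keys of the counts hash, is what guarantees that every range is printed, in order, even when its count is zero.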

There's a programming challenge hidden here: How much code can you reuse for all three kinds of histograms? There's no requirement that you do so, and we won't check. So you could use copy-and-paste to duplicate your histogram code and modify it once you have it working. But it will be much more satisfying for you if you figure out a programmatic method of reuse, and it will keep your solution shorter. (Note that it's pretty hard to reuse everything, so you probably will need to do some code duplication.)
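One possible shape for that reuse, offered only as a sketch, is a single method that takes the list of labels and a block that maps a record to its label. The method name histogram is hypothetical, and records are assumed to be arrays of already-split fields.

```ruby
# Generic histogram: `labels` fixes the output order and guarantees zero
# counts; the block maps one record (an array of fields) to a label.
# Records whose label is not in `labels` (for example "?") are skipped.
def histogram(records, labels)
  counts = Hash[labels.map { |label| [label, 0] }]
  records.each do |fields|
    label = yield(fields)
    counts[label] += 1 if counts.key?(label)
  end
  labels.map { |label| "#{label}: #{counts[label]}" }
end

records = [
  "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K",
  "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K"
].map { |line| line.split(", ") }

gender_lines = histogram(records, %w[Female Male]) { |fields| fields[9] }
puts gender_lines
# prints:
# Female: 0
# Male: 2
```

Under this design, each mode differs only in its label list and in the block it passes.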

Part 3: Correlation

Finally, we want to use your script to try to decide whether income levels and gender are independent variables, meaning there is no statistical relationship between them. The statistical test you will use is Pearson's chi-square test. You can read more about this test on the web; for purposes of this project, here is the exact procedure you should follow to calculate the p-value of the hypothesis that income and gender are independent:

Compute the following five values:
- M = number of males
- F = number of females
- H = number of people with income higher than $50K
- L = number of people with income less than or equal to $50K
- N = total number of people
- Note: You should exclude from the above counts (and every other count below) any entry where either the gender or income level is ?
Compute the following four values:
- OMH = number of males with income higher than $50K
- OML = number of males with income less than or equal to $50K
- OFH = number of females with income higher than $50K
- OFL = number of females with income less than or equal to $50K
- In this step, you're filling in the following diagram with observed values from the data set:
  
            >$50K   <=$50K
  Male      OMH     OML
  Female    OFH     OFL
Compute the following four values:
- EMH = (M*H)/N
- EML = (M*L)/N
- EFH = (F*H)/N
- EFL = (F*L)/N
- In this step, you're computing expected values for each of the four cells in the above table, if we assume gender and income are independent. For example, we would expect the number of high-income males to be (M/N)*H, since M/N of the population is male, and there are H total high income earners.
Compute S = [((EMH - OMH)^2)/EMH] + [((EML - OML)^2)/EML] + [((EFH - OFH)^2)/EFH] + [((EFL - OFL)^2)/EFL]
Look up the value of S in the following table. Find the largest value that S is greater than or equal to, and output the corresponding p-value, which is essentially the probability that gender and income are independent.
```
S        p-value
----------------
0.000    1
0.001    0.975
0.004    0.95
0.016    0.90
2.706    0.10
3.841    0.05
5.024    0.025
6.635    0.01
7.879    0.005
```
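The procedure above can be sketched as follows. The method names chi_square and p_value and the constant P_TABLE are hypothetical. Because entries with a ? in either field are excluded from every count, M, F, H, L, and N can all be derived from the four observed counts, which is what this sketch does. The p-values are stored as strings so that they print exactly as written in the table; whether the grader expects 0.10 or 0.1 is not stated here, so that choice is an assumption. Division by zero is not handled, in line with the Sep 11 update.

```ruby
# Thresholds and p-values copied from the table above, largest threshold first.
# The p-values are kept as strings so they print as written in the table.
P_TABLE = [
  [7.879, "0.005"], [6.635, "0.01"], [5.024, "0.025"], [3.841, "0.05"],
  [2.706, "0.10"], [0.016, "0.90"], [0.004, "0.95"], [0.001, "0.975"],
  [0.000, "1"]
]

# omh, oml, ofh, ofl are the four observed counts defined above.
def chi_square(omh, oml, ofh, ofl)
  m = omh + oml          # males
  f = ofh + ofl          # females
  h = omh + ofh          # income > $50K
  l = oml + ofl          # income <= $50K
  n = (m + f).to_f       # total, as a float to avoid integer division
  expected = [m * h / n, m * l / n, f * h / n, f * l / n]
  observed = [omh, oml, ofh, ofl]
  expected.zip(observed).map { |e, o| (e - o)**2 / e }.inject(:+)
end

# Find the largest threshold that s is greater than or equal to.
def p_value(s)
  P_TABLE.find { |threshold, _| s >= threshold }.last
end

# Made-up counts for illustration: every expected value is 25,
# so each of the four terms is (25 - 30)^2 / 25 or (25 - 20)^2 / 25 = 1.
s = chi_square(30, 20, 20, 30)
puts s            # prints 4.0
puts p_value(s)   # prints 0.05
```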

In this last mode, which we will call indep, your script should output the p-value. For example:

% ruby census.rb indep test_input
0.05

Generally, a p-value of 0.05 or less is considered to indicate statistical significance, since it would indicate at least a 95% chance that gender and income are related.

Hints and Tips

This project is non-trivial, in part because you will probably be writing in Ruby for the first time, so be sure to start right away, and come to office hours if you get stuck.
Follow good program development practices: Test each part of your program as you develop it. Start by developing a simplified solution, then add features once you are sure that earlier parts work. Test early and often, and re-run your tests as you add new features to be sure you didn't break anything.
Before you get too far, review the Ruby class reference, and look for classes and methods that might be helpful. For example, the Array and Hash classes will come in handy. Finding the right class might save you a lot of time and make your program easier to develop.
If you write methods that should return a true or false value, remember that a Ruby 0 is not false.
Ruby has an integrated debugger, which can be invoked by running Ruby with the -rdebug option. The debugger's p command may be helpful for viewing the values of variables and data structures. The var local command prints all of the local variables at the current point of execution. The chapter "When Trouble Strikes" of The Pragmatic Programmer's Guide discusses the debugger in more detail.
To thoroughly debug your program, you will need to construct test cases of your own, based on the project description. If you need help with this, please come to TA office hours.
Remember to save your work frequently---a power failure, network failure, or problem with a phone connection could cost many hours of lost work. For the same reason, submit your project often. You can retrieve previously-submitted versions of your program from the submit server should disaster strike.
Be sure you have read and understand the project grading policies in the course syllabus. Do this well in advance of the project due date.
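To illustrate the tip above about 0 not being false: in Ruby only nil and false are treated as false in a condition. This matters, for instance, when testing the result of =~, which returns the index of the match (possibly 0) or nil. The helper truthy? below is a hypothetical name used only for this demonstration.

```ruby
# Report how Ruby treats a value when it is used as a condition.
def truthy?(value)
  value ? true : false
end

puts truthy?(0)               # prints true
puts truthy?(nil)             # prints false
puts truthy?(false)           # prints false
puts truthy?("abc" =~ /a/)    # prints true  (match at index 0)
puts truthy?("abc" =~ /z/)    # prints false (no match, so nil)
```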

Academic Integrity

The Campus Senate has adopted a policy asking students to include the following statement on each assignment in every course: "I pledge on my honor that I have not given or received any unauthorized assistance on this assignment." Consequently, you are requested to include this pledge in a comment near the top of your program.

Please carefully read the academic honesty section of the course syllabus. Any evidence of impermissible cooperation on projects, use of disallowed materials or resources, or unauthorized use of computer accounts will be submitted to the Student Honor Council, which could result in an XF for the course, or suspension or expulsion from the University. Be sure you understand what you are and are not permitted to do with regard to academic integrity on project assignments. These policies apply to all students, and the Student Honor Council does not consider lack of knowledge of the policies to be a defense for violating them. Full information is found in the course syllabus---please review it at this time.

Data Set Source

The sample data we are using for this project came from the UCI Machine Learning Repository.

CMSC 330, Fall 2008

Organization of Programming Languages