Project 5: ReduceOverflow

Due: Tuesday, December 9, 11:59 PM

Project Overview

In this project you will build a tool that takes a source code file and uses the Hadoop framework for parallel data processing to search data from the website StackOverflow for posts relevant to that file. The idea is for the program to act as a basic development aid by revealing common pitfalls and how to address them. For this project, we will focus on Android-related posts. Processing is divided into two passes, implemented as separate MapReduce jobs; the output of the first is the input to the second.
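
To make the "output of the first is the input to the second" structure concrete, here is a minimal plain-Java sketch of two chained map/reduce-style passes. It does not use Hadoop at all, and the work done in each pass (counting words, then grouping words by count) is a hypothetical stand-in chosen only for illustration; it is not the processing this project asks you to implement. The class and method names (TwoPassSketch, passOne, passTwo) are likewise made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration of two chained map/reduce-style passes; no Hadoop involved. */
final class TwoPassSketch {

    /** Pass 1 (hypothetical): count how often each word occurs across all posts. */
    static Map<String, Integer> passOne(List<String> posts) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String post : posts) {
            // "map" step: emit (word, 1) for every word in the post
            for (String word : post.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    // "reduce" step: sum the emitted values per key
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Pass 2 (hypothetical): consume pass 1's output and group words by their count. */
    static Map<Integer, List<String>> passTwo(Map<String, Integer> counts) {
        Map<Integer, List<String>> byCount = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // "map" step: emit (count, word); "reduce" step: collect words per count
            byCount.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        return byCount;
    }

    public static void main(String[] args) {
        List<String> posts = Arrays.asList(
                "Activity crashes on rotate",
                "Rotate resets the activity");
        // The second pass runs on exactly what the first pass produced.
        System.out.println(passTwo(passOne(posts)));
        // prints {1=[crashes, on, resets, the], 2=[activity, rotate]}
    }
}
```

In a real Hadoop pipeline the intermediate result would be written to a directory by the first job and read from that directory by the second, rather than passed in memory as it is here.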

Getting Started

  1. Download Hadoop (the steps below assume hadoop-0.20.2).
  2. Import the P5 skeleton project into Eclipse and add the appropriate JARs by going to "Build Path" -> "Configure Build Path" and selecting the "Libraries" tab. Click "Add External JARs" and browse to the folder where you extracted hadoop-0.20.2. Include the following JAR files in your project:
  3. Download and extract the archive of Android-tagged StackOverflow posts from the class webpage. When you unzip this, it will create a directory inputs that contains 12 CSV files.
  4. Download the sample Android source file MainActivity.java.
  5. Check out the Hadoop tutorial to become familiar with the framework, if you haven't already.
  6. If you run Windows, you need a working chmod program, per the lecture notes. See the bottom of this page.

Inputs

There are three classes provided in the skeleton project. Run the program by executing the main method of the Main class (Main.java). You will need to provide the following command-line arguments to the program: <source file> <input directory> <output directory>. Each should be specified as a path relative to the project's base directory.
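
As an illustration of the three-argument convention, the following is a minimal sketch of how the arguments could be checked before a job is started. RunArgs is a hypothetical helper and is not part of the P5 skeleton; the baseDir parameter stands for the project's base directory, and the check that the output directory does not yet exist reflects the requirement that Hadoop creates it.

```java
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not in the P5 skeleton): validates the three arguments. */
final class RunArgs {
    final Path sourceFile;
    final Path inputDir;
    final Path outputDir;

    private RunArgs(Path sourceFile, Path inputDir, Path outputDir) {
        this.sourceFile = sourceFile;
        this.inputDir = inputDir;
        this.outputDir = outputDir;
    }

    /** Resolves each argument against baseDir and checks that it is usable. */
    static RunArgs parse(String[] args, Path baseDir) {
        if (args.length != 3) {
            throw new IllegalArgumentException(
                    "usage: <source file> <input directory> <output directory>");
        }
        Path source = baseDir.resolve(args[0]).normalize();
        Path input = baseDir.resolve(args[1]).normalize();
        Path output = baseDir.resolve(args[2]).normalize();

        if (!Files.isRegularFile(source)) {
            throw new IllegalArgumentException("source file not found: " + source);
        }
        if (!Files.isDirectory(input)) {
            throw new IllegalArgumentException("input directory not found: " + input);
        }
        // The output directory must not exist yet; Hadoop generates it.
        if (Files.exists(output)) {
            throw new IllegalArgumentException("output directory already exists: " + output);
        }
        return new RunArgs(source, input, output);
    }
}
```

For example, RunArgs.parse(new String[] {"MainActivity.java", "inputs", "output"}, Paths.get(".")) would accept the sample source file and the extracted inputs directory, provided no directory named output exists; the name "output" is only an assumed example.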

Before running, make sure that the output and temp directories do not exist (delete them if they do), as Hadoop will generate them automatically. (The default temp directory, specified in the static temp variable in Main.java, is in your project directory; after a run you may not see it in Eclipse until you refresh your project directory.)
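
If you would rather not delete these directories by hand before every run, a small helper along the following lines could do it. This is only a sketch: CleanDirs is a hypothetical class that is not part of the skeleton, it uses the standard java.nio.file API rather than anything from Hadoop, and it assumes Java 8 or later.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper (not in the P5 skeleton): removes leftovers from a previous run. */
final class CleanDirs {

    /** Deletes dir and everything under it; does nothing if dir does not exist. */
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order puts children before their parent directory.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }
}
```

A call such as CleanDirs.deleteRecursively(Paths.get("output")), repeated for the temp directory, would go before the jobs are started. The directory name here is an assumed example; use whatever paths your run actually writes to, and double-check them, since the deletion is not reversible.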

Classes

The sections of the provided classes that you need to implement are marked with //TODO: IMPLEMENT CODE HERE. See the Javadocs for more detail.

What to Turn In

Every file you submit should include your name and UID. To enforce academic integrity, code WILL BE CHECKED for similarity to other submissions. Each student should make his or her own individual submission.

Submit your code on the Department's Submit Server. Contact the TA if you have any trouble submitting.

Submit a .zip file containing your project files to the Submit Server. You SHOULD NOT include the temporary or output files, or any input data, with your submission.

Hadoop for Windows users

You need to have chmod.exe installed in order to run Hadoop. chmod is a Unix command for changing permissions on files and directories, and it is not native to Windows. The easiest way to obtain chmod is to

Note that after you install chmod you will need to restart Eclipse for it to notice the change in your Path variable.
