Project #8 (Threaded Web Crawler)

CMSC 132 -- Object-Oriented Programming II, Fall 2005

Due Date: Mon Dec 12, 6:00 pm

Type of Homework: Closed


Objective

The goals of this project are:

This homework is considered closed.  Make sure you read the Open/Closed policy before continuing to work on this project.

Overview

Essentially, this project is a web crawler, which automatically explores the web by following links from one web page to the next.  It performs something similar to a breadth-first search.  As it explores, it maintains a list of any pictures it encounters.  These pictures are then displayed in a window at regular intervals.

You can run your web crawler starting on any web page you want.  In the Documents section of the class web page (under Web Crawler Pages), we have posted a set of web pages that can help you implement your project.  You are not required to use these pages.

Your project will be graded as follows:

For this project:

Specifications

Primary Data Structures

The project relies on the following data structures, which will be accessed concurrently by numerous threads: linkQueue, picQueue, beenThere, and doneThat (all defined in the Crawler class, described below).

These data structures start off empty. 

Program Threads

There will be two main types of threads operating at the same time:

  1. Crawler -- The Crawler thread manages the crawling process.  It selects the next web site to visit and creates an Extractor thread for each web site to be visited.  It is through this thread that you can access information about the crawling process (e.g., link queue size, active thread count, etc.).

  2. Extractor -- Many of these will be running at the same time.  The constructor of the Extractor is passed a URL representing an HTML web page that the Extractor is supposed to explore.  By "exploring", we mean the Extractor will go through the web page looking for pictures and links to other web pages.  Any URLs representing pictures that it finds, it puts into the picQueue; any URLs representing web pages that it finds, it puts into the linkQueue.  The Extractor dies when it has scanned the entire page it was created to explore (a minimal sketch of this lifecycle appears below).
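For concreteness, below is a minimal sketch of this one-thread-per-page lifecycle.  It assumes Extractor is run on its own thread via its run method, as the description above suggests; the anonymous Runnable is only a stand-in for an Extractor, whose real constructor signature comes from the provided shell class.

    // Minimal sketch of the one-thread-per-page structure.  The anonymous
    // Runnable stands in for an Extractor.
    public class ThreadSketch {
        public static void main(String[] args) throws InterruptedException {
            Runnable worker = new Runnable() {   // stand-in for one Extractor
                public void run() {
                    System.out.println("exploring one page, then dying");
                }
            };
            Thread t = new Thread(worker);       // one thread per page to visit
            t.start();                           // runs concurrently with main
            t.join();                            // wait for the worker to die
        }
    }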

Program GUI

We have provided a GUI that allows you to monitor the crawling process and to display the images being discovered.  The GUI relies on the Crawler class you will implement.  Notice that you can implement the project without using the GUI; we have provided it to make the project more fun.  The GUI has two main components, represented by the classes CrawlerGUI and SlideshowGUI.  The first class allows you to start and monitor the crawling process; in addition, it creates a SlideshowGUI object, which displays images.  Instances of both classes will run as separate threads.

[Snapshot of the CrawlerGUI at some point during the program execution]

[Snapshot of the SlideshowGUI at some point during the program execution]

To start the GUI, just execute the main method of the CrawlerGUI class.

Classes You Need to Implement

You will need to implement the following four classes.  We have presented them in the order in which we suggest they be implemented.

  1. MyQueue class

You must implement a queue designed to be accessed by numerous threads concurrently.  You may not use Collections.synchronizedList() or something similar.  You must write code that ensures that there will be no data races or other problems when many threads each perform a single operation on the queue.  Below is a description of the methods which you must provide:

You must use an ArrayList to hold the elements of the queue.  You can rely on basic ArrayList operations to implement the methods above.  Don't be surprised if each method takes just a couple of lines of code.
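As a rough illustration only -- the method names and the null-on-empty behavior below are assumptions, and the project javadoc is authoritative -- a synchronized ArrayList-backed queue can be as simple as this:

    import java.util.ArrayList;

    // Sketch of a thread-safe queue over an ArrayList.  Each public method is
    // synchronized so that concurrent single operations cannot interleave.
    public class MyQueue<T> {
        private final ArrayList<T> elements = new ArrayList<T>();

        // synchronized so two threads cannot corrupt the list by adding at once
        public synchronized void enqueue(T item) {
            elements.add(item);
        }

        // assumed behavior: return null when the queue is empty
        public synchronized T dequeue() {
            return elements.isEmpty() ? null : elements.remove(0);
        }

        public synchronized int size() {
            return elements.size();
        }
    }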

  2. MySet class

You must implement a set designed to be accessed by numerous threads concurrently.  You may not use Collections.synchronizedSet() or something similar.  You must write code that ensures that there will be no data races or other problems when many threads each perform a single operation on the set.  Below is a description of the methods which you must provide:

You must use a HashSet to hold the elements of your set.  You can rely on basic HashSet operations to implement the methods above.  Don't be surprised if each method takes just a couple of lines of code.
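Again as a sketch under the same caveats (the method names here are assumptions; follow the project javadoc):

    import java.util.HashSet;

    // Sketch of a thread-safe set over a HashSet.  Locking is the only thing
    // added on top of the basic HashSet operations.
    public class MySet<T> {
        private final HashSet<T> elements = new HashSet<T>();

        // HashSet.add already reports whether the item was new
        public synchronized boolean add(T item) {
            return elements.add(item);
        }

        public synchronized boolean contains(T item) {
            return elements.contains(item);
        }

        public synchronized int size() {
            return elements.size();
        }
    }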

  3. Extractor class

This class goes through the contents of a web page and identifies web sites and images.  In the code distribution we have left a shell class that defines the instance variables and constants you need, along with the constructor for the class.  You will complete the run method for this class as follows.

  a. Extracting Links

Open an InputStream to the url associated with the Extractor.  Go through the entire contents of the web page you are now connected to, searching for strings that match the regular expression LINK_PATTERN, which is a static member of the Extractor class.  This pattern basically matches strings of the following form:

   href = "xxxxxxxxx"

For each instance of the pattern found, do the following:

  • Strip out the part between the quotes, and create a new URL from it.  [Hint:  Use the URL constructor that takes TWO parameters -- pass it the current url that you are searching through, followed by the substring between the quotes.  This ensures that both absolute and relative links will be made into valid URL objects.]  To strip out the part between the quotes, use the group method of the Matcher class with 1 (one) as the argument.

  • Check that your new URL is not null, and then get its protocol (a String).  If the protocol is null or anything other than "http", then we don't want it, so skip it.

You must decide when to insert elements into the different queues and sets.
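The sketch below shows the mechanics of this scanning step.  Note the assumptions: the pattern is only a stand-in for the real LINK_PATTERN from the shell class, the starting URL is hard-coded, and the queue/set bookkeeping is left as a comment because deciding where it goes is part of the assignment.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class LinkScanSketch {
        // Stand-in for the real LINK_PATTERN supplied in the Extractor shell.
        static final Pattern LINK_PATTERN =
            Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

        public static void main(String[] args) throws Exception {
            URL page = new URL("http://www.example.com/");  // the Extractor's url

            // Read the entire page into one string
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(page.openStream()));
            StringBuilder contents = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                contents.append(line).append('\n');
            }
            in.close();

            Matcher m = LINK_PATTERN.matcher(contents);
            while (m.find()) {
                // group(1) is the substring between the quotes; the two-argument
                // URL constructor resolves both absolute and relative links
                URL link = new URL(page, m.group(1));
                String protocol = link.getProtocol();
                if (protocol != null && protocol.equals("http")) {
                    System.out.println("candidate link: " + link);
                    // in the real Extractor, this is where your linkQueue,
                    // beenThere, and doneThat decisions come in
                }
            }
        }
    }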

  b. Extracting Pictures

Start over again, scanning the web page from the top.  This time you will search for strings that match the regular expression IMAGE_PATTERN, which is a static member of the Extractor class.

For each instance of the pattern that you find, do the following:

  • Strip out the part of the pattern that represents the location of the picture, and create a new URL from it.  The same hint about choosing the correct URL constructor applies -- use the one with two parameters.  To strip out the part between the quotes, use the group method of the Matcher class with 8 as the argument.

You must decide when to insert elements into the different queues and sets.
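The mechanics are the same as for links; the only new wrinkle is the larger group index.  The toy pattern below (not the real IMAGE_PATTERN) simply demonstrates how the group method selects one numbered capture group among several:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class GroupIndexDemo {
        public static void main(String[] args) {
            // Toy pattern with several groups; here the src value lands in
            // group 3.  In the real IMAGE_PATTERN, the location is group 8.
            Pattern p =
                Pattern.compile("(<\\s*)(img|IMG)[^>]*src\\s*=\\s*\"([^\"]*)\"");
            Matcher m = p.matcher("<img border=\"0\" src=\"pics/cat.jpg\">");
            if (m.find()) {
                System.out.println(m.group(3));  // prints pics/cat.jpg
            }
        }
    }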

  4. Crawler class

This class is responsible for selecting the next web site to process and for creating an Extractor thread to handle it.  The class defines the four main data structures of the system: linkQueue, picQueue, beenThere, and doneThat.  Please look at the Crawler.java file to see what is already provided.  A description of each of the methods you must implement can be found in the project javadoc.  Feel free to add any instance variables and private methods you think are necessary.

It is important to keep in mind that the project javadoc provides descriptions of the methods, but it does not specify whether a method must be synchronized.  That is something you must decide.
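As a hint about the kind of decision involved, here is a hypothetical fragment (none of these names come from the javadoc): any method that reads or writes state shared with the Extractor threads needs locking, while a method that touches only thread-local data does not.

    // Hypothetical sketch, not the required Crawler interface; it only shows
    // why some methods need synchronization.
    public class CrawlerSketch {
        private int activeExtractors = 0;  // shared with many Extractor threads

        // Must be synchronized: many Extractors may update the count at once.
        public synchronized void extractorStarted()  { activeExtractors++; }
        public synchronized void extractorFinished() { activeExtractors--; }

        // Reads of shared state also need the lock to see a consistent value.
        public synchronized int getActiveCount() { return activeExtractors; }
    }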

Requirements

General Requirements

Honor Section Requirements

No additional requirements for the honor section.

Submission

Submit your project using the submit project option associated with Eclipse. 

Academic Integrity

Please make sure you read the academic integrity section of the syllabus so you understand what is permissible in our programming projects.  We want to remind you that we check your project against other students' projects, and any case of academic dishonesty will be referred to the University's Office of Judicial Programs.