Project #8 (Threaded Web Crawler)

CMSC 132 -- Object-Oriented Programming II, Fall 2005

Due Date: Mon Dec 12, 6:00 pm

Type of Homework: Closed


Objective

The goals of this project are:

This homework is considered closed.  Make sure you read the Open/Closed policy before continuing to work on this project.

Overview

Essentially, this project is a web crawler, which automatically explores the web by following links from one web page to the next.  It performs something similar to a breadth-first search.  As it explores, it maintains a list of any pictures it encounters.  These pictures are then displayed in a window at regular intervals.

You can run your web crawler starting on any web page you want.  In the Documents section of the class web page (under Web Crawler Pages), we have posted a set of web pages that can help you implement your project.  You are not required to use these pages.

Your project will be graded as follows:

For this project:

Specifications

Primary Data Structures

The project relies on the following data structures, which will be accessed concurrently by numerous threads: linkQueue, picQueue, beenThere, and doneThat (all defined in the Crawler class, described below).

These data structures start off empty. 

Program Threads

There will be two main types of threads operating at the same time:

  1. Crawler -- The Crawler thread manages the crawling process.  It selects the next web site to visit and creates an Extractor thread for each web site to be visited.  It is through this thread that you can access information about the crawling process (e.g., link queue size, active thread count, etc.).

  2. Extractor -- Many of these will be running at the same time.  The constructor of the Extractor is passed a URL representing an HTML web page that the Extractor is supposed to explore.  By "exploring", we mean the Extractor will go through the web page looking for pictures and links to other web pages.  Any URLs representing pictures that it finds, it puts into the picQueue; any URLs representing web pages that it finds, it puts into the linkQueue.  The Extractor dies when it has scanned the entire page it was created to explore (a minimal sketch of this lifecycle appears below).
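For concreteness, below is a minimal sketch of this one-thread-per-page lifecycle.  It assumes Extractor is run on its own thread via its run method, as the description above suggests; the anonymous Runnable is only a stand-in for an Extractor, whose real constructor signature comes from the provided shell class.

    // Minimal sketch of the one-thread-per-page structure.  The anonymous
    // Runnable stands in for an Extractor.
    public class ThreadSketch {
        public static void main(String[] args) throws InterruptedException {
            Runnable worker = new Runnable() {   // stand-in for one Extractor
                public void run() {
                    System.out.println("exploring one page, then dying");
                }
            };
            Thread t = new Thread(worker);       // one thread per page to visit
            t.start();                           // runs concurrently with main
            t.join();                            // wait for the worker to die
        }
    }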

Program GUI

We have provided a GUI that allows you to monitor the crawling process and to display the images being discovered.  The GUI relies on the Crawler class you will implement.  Notice that you can implement the project without using the GUI; we have provided it to make the project more fun.  The GUI has two main components, represented by the classes CrawlerGUI and SlideshowGUI.  The first class allows you to start and monitor the crawling process; in addition, it creates a SlideshowGUI object, which displays images.  Instances of both classes will run as separate threads.

[Snapshot of the CrawlerGUI at some point during the program execution]

[Snapshot of the SlideshowGUI at some point during the program execution]

To start the GUI, just execute the main method of the CrawlerGUI class.

Classes You Need to Implement

You will need to implement the following four classes.  We have presented them in the order in which we suggest they be implemented.

  1. MyQueue class

You must implement a queue designed to be accessed by numerous threads concurrently.  You may not use Collections.synchronizedList() or something similar.  You must write code that ensures that there will be no data races or other problems when many threads each perform a single operation on the queue.  Below is a description of the methods which you must provide:

You must use an ArrayList to hold the elements of the queue.  You can rely on basic ArrayList operations to implement the methods above.  Don't be surprised if each method takes just a couple of lines of code.
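As a rough illustration only -- the method names and the null-on-empty behavior below are assumptions, and the project javadoc is authoritative -- a synchronized ArrayList-backed queue can be as simple as this:

    import java.util.ArrayList;

    // Sketch of a thread-safe queue over an ArrayList.  Each public method is
    // synchronized so that concurrent single operations cannot interleave.
    public class MyQueue<T> {
        private final ArrayList<T> elements = new ArrayList<T>();

        // synchronized so two threads cannot corrupt the list by adding at once
        public synchronized void enqueue(T item) {
            elements.add(item);
        }

        // assumed behavior: return null when the queue is empty
        public synchronized T dequeue() {
            return elements.isEmpty() ? null : elements.remove(0);
        }

        public synchronized int size() {
            return elements.size();
        }
    }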

  2. MySet class

You must implement a set designed to be accessed by numerous threads concurrently.  You may not use Collections.synchronizedSet() or something similar.  You must write code that ensures that there will be no data races or other problems when many threads each perform a single operation on the set.  Below is a description of the methods which you must provide:

You must use a HashSet to hold the elements of your set.  You can rely on basic HashSet operations to implement the methods above.  Don't be surprised if each method takes just a couple of lines of code.
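Again as a sketch under the same caveats (the method names here are assumptions; follow the project javadoc):

    import java.util.HashSet;

    // Sketch of a thread-safe set over a HashSet.  Locking is the only thing
    // added on top of the basic HashSet operations.
    public class MySet<T> {
        private final HashSet<T> elements = new HashSet<T>();

        // HashSet.add already reports whether the item was new
        public synchronized boolean add(T item) {
            return elements.add(item);
        }

        public synchronized boolean contains(T item) {
            return elements.contains(item);
        }

        public synchronized int size() {
            return elements.size();
        }
    }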

  3. Extractor class

This class goes through the contents of a web page and identifies web sites and images.  In the code distribution we have left a shell class that defines the instance variables and constants you need, along with the constructor for the class.  You will complete the run method for this class as follows.

  a. Extracting Links

Open an InputStream to the url associated with the Extractor.  Go through the entire contents of the web page you are now connected to, searching for strings that match the regular expression LINK_PATTERN, which is a static member of the Extractor class.  This pattern basically matches strings of the following form:

   href = "xxxxxxxxx"

For each instance of the pattern found, do the following:

  • Strip out the part between the quotes, and create a new URL from it.  [Hint:  Use the URL constructor that takes TWO parameters -- pass it the current url that you are searching through, followed by the substring between the quotes.  This ensures that both absolute and relative links will be made into valid URL objects.]  To strip out the part between the quotes, use the group method of the Matcher class with 1 (one) as the argument.

  • Check that your new URL is not null, and then get its protocol (a String).  If the protocol is null or anything other than "http", then we don't want it, so skip it.

You must decide when to insert elements into the different queues and sets.
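The sketch below shows the mechanics of this scanning step.  Note the assumptions: the pattern is only a stand-in for the real LINK_PATTERN from the shell class, the starting URL is hard-coded, and the queue/set bookkeeping is left as a comment because deciding where it goes is part of the assignment.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class LinkScanSketch {
        // Stand-in for the real LINK_PATTERN supplied in the Extractor shell.
        static final Pattern LINK_PATTERN =
            Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

        public static void main(String[] args) throws Exception {
            URL page = new URL("http://www.example.com/");  // the Extractor's url

            // Read the entire page into one string
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(page.openStream()));
            StringBuilder contents = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                contents.append(line).append('\n');
            }
            in.close();

            Matcher m = LINK_PATTERN.matcher(contents);
            while (m.find()) {
                // group(1) is the substring between the quotes; the two-argument
                // URL constructor resolves both absolute and relative links
                URL link = new URL(page, m.group(1));
                String protocol = link.getProtocol();
                if (protocol != null && protocol.equals("http")) {
                    System.out.println("candidate link: " + link);
                    // in the real Extractor, this is where your linkQueue,
                    // beenThere, and doneThat decisions come in
                }
            }
        }
    }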

  b. Extracting Pictures

Start over again, scanning the web page from the top.  This time you will search for strings that match the regular expression IMAGE_PATTERN, which is a static member of the Extractor class.

For each instance of the pattern that you find, do the following:

  • Strip out the part of the pattern that represents the location of the picture, and create a new URL from it.  The same hint about choosing the correct URL constructor applies -- use the one with two parameters.  To strip out the part between the quotes, use the group method of the Matcher class with 8 as the argument.

You must decide when to insert elements into the different queues and sets.
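The mechanics are the same as for links; the only new wrinkle is the larger group index.  The toy pattern below (not the real IMAGE_PATTERN) simply demonstrates how the group method selects one numbered capture group among several:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class GroupIndexDemo {
        public static void main(String[] args) {
            // Toy pattern with several groups; here the src value lands in
            // group 3.  In the real IMAGE_PATTERN, the location is group 8.
            Pattern p =
                Pattern.compile("(<\\s*)(img|IMG)[^>]*src\\s*=\\s*\"([^\"]*)\"");
            Matcher m = p.matcher("<img border=\"0\" src=\"pics/cat.jpg\">");
            if (m.find()) {
                System.out.println(m.group(3));  // prints pics/cat.jpg
            }
        }
    }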

  4. Crawler class

This class is responsible for selecting the next web site to process and for creating an Extractor thread to handle it.  The class defines the four main data structures of the system: linkQueue, picQueue, beenThere, and doneThat.  Please look at the Crawler.java file to see what is already provided.  A description of each of the methods you must implement can be found in the project javadoc.  Feel free to add any instance variables and private methods you think are necessary.

It is important to keep in mind that the project javadoc provides descriptions of the methods, but it does not specify whether a method must be synchronized.  That is something you must decide.
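As a hint about the kind of decision involved, here is a hypothetical fragment (none of these names come from the javadoc): any method that reads or writes state shared with the Extractor threads needs locking, while a method that touches only thread-local data does not.

    // Hypothetical sketch, not the required Crawler interface; it only shows
    // why some methods need synchronization.
    public class CrawlerSketch {
        private int activeExtractors = 0;  // shared with many Extractor threads

        // Must be synchronized: many Extractors may update the count at once.
        public synchronized void extractorStarted()  { activeExtractors++; }
        public synchronized void extractorFinished() { activeExtractors--; }

        // Reads of shared state also need the lock to see a consistent value.
        public synchronized int getActiveCount() { return activeExtractors; }
    }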

Requirements

General Requirements

Honor Section Requirements

No additional requirements for the honor section.

Submission

Submit your project using the submit project option associated with Eclipse. 

Academic Integrity

Please make sure you read the academic integrity section of the syllabus so you understand what is permissible in our programming projects.  We want to remind you that we check your project against other students' projects, and any case of academic dishonesty will be referred to the University's Office of Judicial Programs.