CMSC 132 -- Project 8

Threaded Web Crawler / Slideshow Thingamajig

Project Due:  Thursday 5/12/05 at 11:00 PM

 

Closed Policy

This is a closed project.  You are expected to do all of the work on this project without consulting with anyone other than the CMSC 132 instructors and TAs.  Treat this project as though it were a take home exam.  If you have any questions about whether or not a certain topic is okay to discuss with classmates or friends, please ask your instructor.

 

Goals

The goals of this project are:

 

Overview

Essentially, this project is a web crawler, which automatically explores the web by following links from one web page to another, and so on.  It performs something similar to a breadth-first search.  As it explores, it maintains a list of any pictures that it encounters.  These pictures are displayed in a window, one at a time, at regular intervals.

The project maintains two primary data structures, which will be accessed concurrently by numerous threads:

These data structures start off empty. 

There will be four types of threads all operating at the same time:

  1. SlideshowGUI -- a window that will constantly pull pictures out of the picQueue and display them.  Here is an image representing the SlideshowGUI at some point during a typical run of the program:

  2. ExtractorThread -- Many of these will be running at the same time.  (There is a fixed-size array of ExtractorThreads, with a default size of 5.  The array starts off with 5 null values, but the Crawler will soon begin creating ExtractorThreads and putting them into the array.)  The constructor of the ExtractorThread is passed a URL representing an HTML webpage that the extractor is supposed to explore.  By "exploring", we mean the extractor will go through the web page looking for pictures and links to other web pages.  Any URLs representing pictures that it finds it puts into the picQueue; any URLs representing web pages that it finds it puts into the linkQueue.  The ExtractorThread dies when it has scanned the entire page it was created to explore.

  3. Crawler -- The crawler begins by starting up the SlideshowGUI (pictured above) and the CrawlerGUI (pictured below).  Then the crawler constantly cycles through the array of ExtractorThreads, one-by-one.  If the current array entry refers to an ExtractorThread that is still alive, the Crawler moves on to the next element of the array.  If the current array entry is null or refers to an ExtractorThread that has died, then the Crawler pulls a webpage out of the linkQueue, starts a new ExtractorThread that will explore this webpage, and puts this new ExtractorThread into the array in the current position.  Then the crawler moves on to the next position in the array.  Once it has gone through the whole array, it starts again from the beginning.

  4. CrawlerGUI -- the CrawlerGUI appears below.  It serves two purposes.  First, it allows the program to get started by having the user put the very first website into the linkQueue.  (That's what happens when you press the "GO" button.)  Also, it allows you to monitor the progress of the whole system.  Below is an image of the CrawlerGUI at some typical point during the program execution:  

There are two other data structures being maintained, both sets:

These two sets are maintained so that we can ensure that we never put duplicate copies of webpages or pictures into the queues.  (That would be boring!)

The fields in the GUI labelled "Links Discovered", "Pics Discovered", "Links in Queue", and "Pics in Queue" represent the sizes of the four data structures beenThere, doneThat, linkQueue, and picQueue, respectively.

Pressing the "GO" button activates a method that waits for any existing ExtractorThreads to die off, empties the four data structures (linkQueue, picQueue, beenThere, and doneThat), and puts the URL that the user has typed into both linkQueue and beenThere.

 

What you will Implement

Some of the project is written for you, but you will write most of it.  Details follow.

MyQueue class

You must implement a queue designed to be accessed by numerous threads concurrently.  You may not use Collections.synchronizedList() or something similar.  You must write code that ensures that there will be no data races or other problems when many threads each perform a single operation on the queue.  Below is a description of the methods which you must provide:
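To give a feel for what "no data races" means in practice, here is a minimal sketch of a thread-safe queue built with synchronized methods and wait/notifyAll.  The method names (enqueue, dequeue, size) are illustrative assumptions; use the exact method list given in the handout.

```java
import java.util.LinkedList;

// Hypothetical sketch of a thread-safe queue.  Method names are
// assumptions; follow the handout's actual method list.
class MyQueue<T> {
    private final LinkedList<T> items = new LinkedList<T>();

    // Add an element and wake any thread blocked in dequeue().
    public synchronized void enqueue(T item) {
        items.addLast(item);
        notifyAll();
    }

    // Block until an element is available, then remove and return it.
    public synchronized T dequeue() {
        while (items.isEmpty()) {
            try {
                wait();
            } catch (InterruptedException e) {
                // the handout allows empty catch blocks for checked exceptions
            }
        }
        return items.removeFirst();
    }

    public synchronized int size() {
        return items.size();
    }
}
```

Note the `while` (not `if`) around `wait()`: a woken thread must re-check the condition, because another thread may have emptied the queue first.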

MySet class

You must implement a set designed to be accessed by numerous threads concurrently.  You may not use Collections.synchronizedSet() or something similar.  You must write code that ensures that there will be no data races or other problems when many threads each perform a single operation on the set.  Below is a description of the methods which you must provide:
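As with the queue, here is a minimal hedged sketch of a thread-safe set.  The method names (add, contains, size) are assumptions for illustration; the real required methods are listed in the handout.  Note that an add method returning a boolean is handy for the "no duplicates" checks described above.

```java
import java.util.HashSet;

// Hypothetical sketch of a thread-safe set.  Method names are
// assumptions; follow the handout's actual method list.
class MySet<T> {
    private final HashSet<T> items = new HashSet<T>();

    // Returns true if the element was new (not already present),
    // which is exactly the duplicate test the crawler needs.
    public synchronized boolean add(T item) {
        return items.add(item);
    }

    public synchronized boolean contains(T item) {
        return items.contains(item);
    }

    public synchronized int size() {
        return items.size();
    }
}
```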

Crawler class

Please look at the .java file to see what is already provided.  You must fill in the rest of the main() method.  The method must continually cycle through all elements of the extractors array.  When you have processed the whole array, have the thread sleep for a short period of time to allow the CPU to process other threads.  [I used Thread.sleep(500).]  After the short wait, start over and process the entire array again!  This should go on indefinitely.  For each element in the array, you should perform the following steps:
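One pass of the loop described above can be sketched roughly as follows.  This is a simplified stand-in, not the real Crawler: a plain Runnable substitutes for ExtractorThread, and the helper method and its return value are illustrative assumptions.

```java
import java.util.Queue;

// Simplified sketch of one pass over the extractors array.  The real
// main() wraps this logic in an infinite loop with Thread.sleep(500)
// between passes.
class CrawlerLoopSketch {

    // Returns how many slots were refilled on this pass (for illustration).
    static int onePass(Thread[] extractors, Queue<String> linkQueue) {
        int started = 0;
        for (int i = 0; i < extractors.length; i++) {
            if (extractors[i] != null && extractors[i].isAlive()) {
                continue;                  // still exploring its page; move on
            }
            final String url = linkQueue.poll();   // pull a webpage URL
            if (url == null) {
                continue;                  // nothing to explore yet
            }
            Thread t = new Thread(new Runnable() { // stand-in for ExtractorThread
                public void run() {
                    // the real ExtractorThread would explore `url` here
                }
            });
            extractors[i] = t;             // fill the dead/null slot
            t.start();
            started++;
        }
        return started;
    }
}
```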

ExtractorThread class

This portion of the project is similar to part of your project 3, so you may find it useful to recycle some of the code you wrote for project 3!

You will complete the run method for the Thread, as follows.  Be careful that the code you write will not cause any data races!

1.  Extracting Links

Open an InputStream to the url that is associated with the ExtractorThread, just as you did in project 3.  As in project 3, you may find it useful to wrap the InputStream so as to obtain a BufferedReader.

Go through the entire contents of the webpage you are now connected to, searching for strings that match the Regular Expression LINK_PATTERN, which is a static member of the ExtractorThread class.  This pattern basically matches strings of the following form:

   href = "xxxxxxxxx"

For each instance of the pattern found, do the following:
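The scanning step can be sketched with java.util.regex as below.  The regular expression shown is a hypothetical stand-in written to match the `href = "xxxxxxxxx"` form above; in your code, use the LINK_PATTERN constant that is already provided in ExtractorThread rather than writing your own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkScanSketch {
    // Stand-in for the real LINK_PATTERN static member of
    // ExtractorThread, which is provided for you.
    static final Pattern LINK_PATTERN =
        Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Scan a chunk of page text and collect every href target found.
    static List<String> findLinks(String pageText) {
        List<String> links = new ArrayList<String>();
        Matcher m = LINK_PATTERN.matcher(pageText);
        while (m.find()) {
            links.add(m.group(1));  // the xxxxxxxxx between the quotes
        }
        return links;
    }
}
```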

2.  Extracting Pictures

Start over again scanning the webpage from the top.  This time you will search for strings that match the Regular Expression IMAGE_PATTERN, which is a static member of the ExtractorThread class.  This pattern matches pictures as described in project 3.  (Don't you wish you had that when you were writing Project 3?)

For each instance of the pattern that you find, do the following:

CrawlerGUI class

You'll write most of the GUI from scratch!  You will need to occasionally consult the online API to read about some of the methods that are available for the GUI-related objects that you will be using.  There will also be a few simple GUI examples posted on the class wiki that you may want to look over.

Your GUI must have all of the widgets you see in the example below, positioned in the same places:

Here's a quick description of the widgets that you can see:

You can set up the whole thing using one big GridLayout that contains ordinary widgets and other GridLayouts.  (To use a GridLayout inside of another GridLayout, instantiate a JPanel, then set that panel's layout to be the GridLayout you want to use.  Put things into the panel, and when you are done, insert this panel into the outer panel just as you would any other widget.)
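The nesting technique in the parenthetical above can be sketched as follows.  The specific widgets, labels, and grid dimensions here are made up for illustration; your layout must match the example image in the handout.

```java
import java.awt.GridLayout;
import javax.swing.JButton;
import javax.swing.JLabel;
import javax.swing.JPanel;
import javax.swing.JTextField;

// Hypothetical layout sketch: a panel nested inside another panel,
// each using its own GridLayout.
class NestedLayoutSketch {

    static JPanel buildStatusRow() {
        JPanel row = new JPanel();
        row.setLayout(new GridLayout(1, 2));   // inner grid: label + value
        row.add(new JLabel("Links in Queue:"));
        row.add(new JTextField("0"));
        return row;
    }

    static JPanel buildMainPanel() {
        JPanel main = new JPanel();
        main.setLayout(new GridLayout(3, 1));  // outer grid: three rows
        main.add(new JTextField());            // URL entry field
        main.add(buildStatusRow());            // nested panel acts like any widget
        main.add(new JButton("GO"));
        return main;
    }
}
```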

Button Listener

Add an ActionListener to the Button.  When the button is pressed, the ActionListener's actionPerformed method should do the following:
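Here is a sketch of the wiring itself.  The counter in the listener body is a placeholder: your real actionPerformed must do the GO-button work described earlier in the handout (wait for existing ExtractorThreads to die, empty the four data structures, and seed linkQueue and beenThere with the typed URL).

```java
import java.awt.event.ActionEvent;
import java.awt.event.ActionListener;
import javax.swing.JButton;

// Hypothetical sketch of attaching an ActionListener to the GO button.
class ButtonListenerSketch {
    static int presses = 0;  // placeholder for the real GO-button work

    static JButton buildGoButton() {
        JButton go = new JButton("GO");
        go.addActionListener(new ActionListener() {
            public void actionPerformed(ActionEvent e) {
                presses++;   // real handler: reset and seed the data structures
            }
        });
        return go;
    }
}
```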

Setting a Timer to Update the Fields on the GUI Periodically

Create a reference to an ActionListener object by writing an anonymous inner class.  The actionPerformed method should do the following:

Now instantiate a javax.swing.Timer object using the constructor:  javax.swing.Timer(UPDATE_DELAY_MS, timerListener), where UPDATE_DELAY_MS is a constant of your choosing (I used 777), and timerListener is the ActionListener defined above.  (This is the Observer design pattern.)  Don't forget to start the timer by calling its start method!  From this point on, the listener will be activated every 777 milliseconds to perform its task.
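Putting the anonymous inner class and the Timer together looks roughly like this.  The empty listener body is a stand-in for the field-refreshing work described above.

```java
import java.awt.event.ActionEvent;
import java.awt.event.ActionListener;
import javax.swing.Timer;

// Sketch of the periodic GUI updater using javax.swing.Timer.
class TimerSketch {
    static final int UPDATE_DELAY_MS = 777;  // the handout's example delay

    // Builds (but does not start) the timer; the caller calls start().
    static Timer buildTimer() {
        ActionListener timerListener = new ActionListener() {
            public void actionPerformed(ActionEvent e) {
                // refresh "Links Discovered", "Pics Discovered",
                // "Links in Queue", and "Pics in Queue" here
            }
        };
        return new Timer(UPDATE_DELAY_MS, timerListener);
    }
}
```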

 

Exception Handling

Of course you must catch "checked exceptions" (it's never optional), but you may leave your catch blocks empty.  You do not need to catch any unchecked exceptions.

 

Grading

Your grade on this project will be calculated as follows.  THIS IS SUBJECT TO CHANGE IF WE ENCOUNTER PROBLEMS WITH THE AUTOMATED TESTING SYSTEM.