CMSC 132 -- Project 8
Threaded Web Crawler / Slideshow Thingamajig
Project Due: Thursday 5/12/05 at 11:00 PM
This is a closed project. You are expected to do all of the work on this project without consulting with anyone other than the CMSC 132 instructors and TAs. Treat this project as though it were a take home exam. If you have any questions about whether or not a certain topic is okay to discuss with classmates or friends, please ask your instructor.
The goals of this project are:
Practice Threads and concurrency
Practice creating a GUI
Gain more experience looking up stuff in the online java API
Essentially, this project is a web-crawler, which automatically explores the web by following links from one web page to another to another to another, etc. It performs something similar to a breadth-first-search. As it explores, it maintains a list of any pictures that it encounters. These pictures are slowly displayed in a window at regular intervals.
The project maintains two primary data structures, which will be accessed concurrently by numerous threads:
linkQueue -- A queue of URL's that represent web-pages waiting to be explored
picQueue -- A queue of URL's that represent pictures waiting to be displayed
These data structures start off empty.
There will be four types of threads all operating at the same time:
SlideshowGUI -- a window that will constantly pull
pictures out of the picQueue and display them. Here is an image
representing the SlideshowGUI at some point during a typical run of the
program:
ExtractorThread -- Many of these will be running at the same time. (There is a fixed size array of ExtractorThreads, with a default size of 5. The array starts off with 5 null values, but the Crawler will soon begin creating ExtractorThreads and putting them into the array.) The constructor of the ExtractorThread is passed a URL representing an html webpage that the extractor is supposed to explore. By "exploring", we mean the extractor will go through the web page looking for pictures and links to other web pages. Any URL's representing pictures that it finds it puts into the picQueue; any URL's representing web pages that it finds it puts into the linkQueue. The ExtractorThread dies when it has scanned the entire page it was created to explore.
Crawler -- The crawler begins by starting up the SlideshowGUI (pictured above) and the CrawlerGUI (pictured below). Then the crawler constantly cycles through the array of ExtractorThreads, one-by-one. If the current array entry refers to an ExtractorThread that is still alive, the Crawler moves on to the next element of the array. If the current array entry is null or refers to an ExtractorThread that has died, then the Crawler pulls a webpage out of the linkQueue, starts a new ExtractorThread that will explore this webpage, and puts this new ExtractorThread into the array in the current position. Then the crawler moves on to the next position in the array. Once it has gone through the whole array, it starts again from the beginning.
CrawlerGUI -- the CrawlerGUI appears below. It
serves two purposes. First, it allows the program to get started by having
the user put the very first website into the linkQueue. (That's what
happens when you press the "GO" button.) Also, it allows you to monitor
the progress of the whole system. Below is an image of the CrawlerGUI at
some typical point during the program execution:

There are two other data structures being maintained, both sets:
beenThere -- a set containing every webpage URL that has ever been put into the linkQueue
doneThat -- a set containing every picture URL that has ever been put into the picQueue
These two sets are maintained so that we can ensure that we never put duplicate copies of webpages or pictures into the queues. (That would be boring!)
The fields in the GUI labelled "Links Discovered", "Pics Discovered", "Links in Queue", and "Pics in Queue" represent the sizes of the four data structures beenThere, doneThat, linkQueue, and picQueue, respectively.
Pressing the "GO" button activates a method that waits for any existing ExtractorThreads to die off, empties the four data structures (linkQueue, picQueue, beenThere, and doneThat), and puts the URL that the user has typed into both linkQueue and beenThere.
Some of the project is written for you, but you will write most of it. Details follow.
You must implement a queue designed to be accessed by numerous threads concurrently. You may not use Collections.synchronizedList() or something similar. You must write code that ensures that there will be no data races or other problems when many threads each perform a single operation on the queue. Below is a description of the methods which you must provide:
public int size() -- returns the number of items in the queue
public void clear() -- removes all items from the queue
public void enqueue(Object o) -- adds the object to one end of the queue
public Object dequeue() -- removes an object from the end opposite of that to which things are added. If the queue is empty, the thread will wait here until an item becomes available
You must implement a set designed to be accessed by numerous threads concurrently. You may not use Collections.synchronizedSet() or something similar. You must write code that ensures that there will be no data races or other problems when many threads each perform a single operation on the set. Below is a description of the methods which you must provide:
public int size() -- returns the number of items in the set
public void clear() -- removes all items from the set
public boolean remove(Object o) -- removes the object from the set
public boolean add(Object o) -- adds the object to the set
public boolean contains(Object o) -- returns true if the set contains the object, false otherwise
Please look at the .java file to see what is already provided. You must fill in the rest of the main() method. The method must continually cycle through all elements of the extractors array. When you have processed the whole array, have the thread sleep for a short period of time to allow the CPU to process other threads. [I used Thread.sleep(500).] After the short wait, start over and process the entire array again! This should go in indefinitely. For each element in the array, you should perform the following steps:
If the entry in the array is an ExtractorThread that is "alive", then you should do nothing and just move on to the next element.
Once you have found an entry that is null or contains a dead ExtractorThread, get a lock on the linkQueue and check if it is empty. If it is empty, wait on the linkQueue. (Your thread will be notified when an object is added to the linkQueue, assuming you have written the linkQueue correctly!) If you are curious as to why we are doing this step, try removing it and running the program. Can you figure out why it is behaving badly without this step?
Release the lock on the linkQueue.
Acquire a lock on the extractors array.
Take a URL out of the linkQueue. Open a URLConnection object on the URL. Get the "content type" of the URLConnection (this is a String). If it is null or doesn't begin with the sub-string "text/html", then this URL is not a good one, so take another URL out of the linkQueue and repeat this step.
Once you've obtained a good URL, instantiate an ExtractorThread for it, and start the thread.
Release your lock on the extractors array.
This portion of the project is similar to part of your project 3, so you may find it useful to recycle some of the code you wrote for project 3!
You will complete the run method for the Thread, as follows. Be careful that the code you write will not cause any data races!
1. Extracting Links
Open an InputStream to the url that is associated with the ExtractorThread, just as you did in project 3. As in project 3, you may find it useful to wrap the InputStream so as to obtain a BufferedReader.
Go through the entire contents of the webpage you are now connected to, searching for strings that match the Regular Expression LINK_PATTERN, which is a static member of the ExtractorThread class. This pattern basically matches strings of the following form:
href = "xxxxxxxxx"
For each instance of the pattern found, do the following:
Strip away the part between the quotes, and create a new URL from it. [Hint: Use the URL constructor that takes TWO parameters -- pass it the current url that you are searching through, followed by the substring between the quotes. This ensures that both absolute and relativized links will be made into valid URL objects.]
Check that your new URL is not null, and then get it's protocol (a String). If the protocol is null or anything other than "http" or "file", then we don't want it, so skip it. If the protocol looks okay, then check whether or not this URL is already in the set beenThere. If it isn't in the set yet, then add this URL to the linkQueue and also to the beenThere set.
2. Extracting Pictures
Start over again scanning the webpage from the top. This time you will search for strings that match the Regular Expression IMAGE_PATTERN, which is a static member of the ExtractorThread class. This pattern matches pictures as described in project 3. (Don't you wish you had that when you were writing Project 3?)
For each instance of the pattern that you find, do the following:
Just as in project 3, strip away the part of the pattern that represents the location of the picture, and create a new URL from it. The same hint applies choosing the correct URL constructor -- use the one with two parameters.
Check if this URL is already contained in the doneThat set. If not, then add it to the doneThat set and also to the picQueue.
You'll write most of the GUI from scratch! You will need to occasionally consult the online API to read about some of the methods that are available for the GUI-related objects that you will be using. There will also be a few simple GUI examples posted on the class wiki that you may want to look over.
Your GUI must have all of the widgets you see in the example below, positioned in the same places:

Here's a quick description of the widgets that you can see:
The box where the user can type a URL is a JTextField
The button is a JButton
There are four JLabels such as "Links Discovered: 0"
There is a JLabel that says "The following URLS are currently being scanned by Extractor Threads:"
There is one JLabel for each element of the extractors array. In the example above, they say "Unused".
You can set up the whole thing using one big GridLayout that contains ordinary widgets and other GridLayouts. (To use a GridLayout inside of another GridLayout, instantiate a JPanel, then set the panel's Layout to be the GridLayout you want to use. Then put things into the Panel, and when you are done insert this panel into another panel just as you would for any other widget.)
Add an ActionListener to the Button. When the button is pressed, the ActionListener's actionPerformed method should do the following:
Get a lock on the extractors array. (This prevents the crawler from spawning any new Threads until we release the lock, assuming you have followed instructions while writing the Crawler).
Go through the array of extractors. For each one that is not null, "join" the Thread. (This makes the current thread wait for the ExtractorThread to terminate before the current thread moves on.)
After all of the existing ExtractorThreads are dead, clear all four data structures (linkQueue, picQueue, beenThere, doneThat).
Set all elements of the extractor array to null.
Retrieve the contents of the text field from the JTextField where the user is supposed to have typed in a URL (this will be a String), and make a URL out of it. Add the URL to the beenThere set and also to the linkQueue.
Release your lock on the array.
Create a reference to an ActionListener object by writing an anonymous inner class. The actionPerformed method should do the following:
Update each of the four JLabels that report the sizes of the data structures. For example, set the text of the "Links Discovered..." JLabel to say "Links Discovered: x", where x is the current size of the beenThere collection.
For each element of the extractors array, check if the ExtractorThread is non-null and alive. If it is, then set the corresonding JLabel's text to the URL associated with that ExtractorThread. If the element of the array is null or no longer alive, then set the JLabel's text to "Unused".
Now instantiate and a javax.swing.Timer object using the constructor: javax.Swing.Timer(UPDATE_DELAY_MS, timerListener), where UPDATE_DELAY_MS is a constant of your choosing (I used 777), and timerListener is the ActionListener defined above. (This is the Observer design pattern). Don't forget to start the timer, by calling it's start method! From this point on, the listener will be activated every 777 milliseconds to perform it's task.
Of course you must catch "checked exceptions" (it's never optional), but you may leave your catch blocks empty. You do not need to catch any unchecked exceptions.
Your grade on this project will be calculated as follows. THIS IS SUBJECT TO CHANGE IF WE ENCOUNTER PROBLEMS WITH THE AUTOMATED TESTING SYSTEM.