Project #8 (Threaded Web Crawler)
Due Date: Mon Dec 12, 6:00 pm
Object-Oriented Programming II
Type of Homework: Closed
The goals of this project are:
Practice Threads and concurrency
Gain experience using the Java API
This homework is considered a closed homework. Make sure you read the Open/Closed policy before continuing work on this project.
Essentially, this project is a web crawler, which automatically explores the web by following links from one web page to the next. It performs something similar to a breadth-first search. As it explores, it maintains a list of any pictures that it encounters. These pictures are then displayed in a window at regular intervals.
You can run your web-crawler starting on any web page you want. In the Documents section (under Web Crawler Pages) of the class web page, we have left a set of web pages that can help you implement your project. You are not required to use these pages.
Your project will be graded as follows:
Tests (90 %)
Public JUnit tests (60 %)
Release JUnit tests (30 %)
Style (10 %)
For this project:
Even if you are not planning to work on the project for a while, make sure that you can submit your project using the "Submit Project" option in Eclipse, and verify that your submission has been received by the submit server. It is important that you submit your project using the "Submit Project" option rather than uploading a jar file to the submit server. If you are having problems with this submission process, contact your TA or instructor immediately.
Primary Data Structures
The project utilizes the following data structures, which will be accessed concurrently by numerous threads:
linkQueue -- A queue of URLs that represent web pages waiting to be explored (visited)
picQueue -- A queue of URLs that represent pictures waiting to be displayed
beenThere -- A set of web-pages that have been explored (visited)
doneThat -- A set of images that have already been found. This set allows us to avoid placing images in the picQueue that we have already scheduled to display.
These data structures start off empty.
There will be two main types of threads operating at the same time:
Crawler -- The Crawler thread manages the crawling process. It selects the next web site to visit and creates an Extractor thread for it. It is through this thread that you can access information about the crawling process (e.g., link queue size, active thread count, etc.)
Extractor -- Many of these will be running at the same time. The constructor of the Extractor is passed a URL representing an HTML web page that the extractor is supposed to explore. By "exploring", we mean the extractor will go through the web page looking for pictures and links to other web pages. Any URLs representing pictures that it finds, it puts into the picQueue; any URLs representing web pages that it finds, it puts into the linkQueue. The Extractor dies when it has scanned the entire page it was created to explore.
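The division of labor above can be sketched as follows. This is only an illustration of the threading idea, assuming invented names (DispatchSketch, startWorker); the real Crawler and Extractor classes come from the code distribution and the project javadoc.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of how the Crawler might hand each page to its own
// worker thread; class and method names are invented for illustration.
class DispatchSketch {
    // each Extractor-like worker handles exactly one page, then dies
    static Thread startWorker(String pageUrl, List<String> log) {
        Thread t = new Thread(() -> {
            // a real Extractor would scan the page for links and images here
            synchronized (log) {
                log.add("explored " + pageUrl);
            }
        });
        t.start();
        return t;
    }
}
```

The key point is that each page gets its own short-lived thread, while the Crawler thread itself keeps running and deciding which page to visit next.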
We have provided a GUI that allows you to monitor the crawling process and to display the images being discovered. The GUI relies on the Crawler class you will implement. Notice that you can implement the project without using the GUI; we have provided it to make the project more fun. The GUI has two main components, represented by the classes CrawlerGUI and SlideshowGUI. The first class allows you to start and monitor the crawling process. In addition, it creates a SlideshowGUI object which displays images. Instances of both classes will run as separate threads. The following is a snapshot of the CrawlerGUI at some point during the program execution:
The following is a snapshot of the SlideshowGUI at some point during the program execution:
To start the GUI just execute the main method of the CrawlerGUI class.
You will need to implement the following four classes. We have presented them in the order in which we suggest they be implemented.
You must implement a queue designed to be accessed by numerous threads concurrently. You may not use Collections.synchronizedList() or something similar. You must write code that ensures that there will be no data races or other problems when many threads each perform a single operation on the queue. Below is a description of the methods which you must provide:
public int size() -- returns the number of items in the queue
public void clear() -- removes all items from the queue
public void enqueue(Object o) -- adds the object to one end of the queue
public Object dequeue() -- removes an object from the end opposite the one to which items are added. If the queue is empty, the calling thread will wait here until an item becomes available
You must use an ArrayList to hold the elements of the queue. You can rely on basic ArrayList operations in order to implement the above methods. Don't be surprised if each method just takes a couple of lines of code.
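The queue described above might be sketched as follows. The class name SafeQueue is an assumption (the handout does not name the class here); the important parts are the synchronized methods and the wait/notifyAll pairing in dequeue and enqueue.

```java
import java.util.ArrayList;

// Minimal sketch of the required thread-safe queue, backed by an ArrayList.
class SafeQueue {
    private final ArrayList<Object> items = new ArrayList<>();

    public synchronized int size() {
        return items.size();
    }

    public synchronized void clear() {
        items.clear();
    }

    public synchronized void enqueue(Object o) {
        items.add(o);       // add at one end (the tail)
        notifyAll();        // wake any threads blocked in dequeue()
    }

    public synchronized Object dequeue() {
        // wait in a loop to guard against spurious wakeups and against
        // another thread taking the item first
        while (items.isEmpty()) {
            try {
                wait();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return items.remove(0);  // remove from the opposite end (the head)
    }
}
```

Note that the emptiness check is a while loop, not an if: a woken thread must re-check the condition before removing, since another thread may have dequeued the item in between.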
You must implement a set designed to be accessed by numerous threads concurrently. You may not use Collections.synchronizedSet() or something similar. You must write code that ensures that there will be no data races or other problems when many threads each perform a single operation on the set. Below is a description of the methods which you must provide:
public int size() -- returns the number of items in the set
public void clear() -- removes all items from the set
public boolean remove(Object o) -- removes the object from the set
public boolean add(Object o) -- adds the object to the set
public boolean contains(Object o) -- returns true if the set contains the object, false otherwise
public Object getElements() -- returns an array with references to the elements in the set
You must use a HashSet to hold the elements of your set. You can rely on basic HashSet operations in order to implement the above methods. Don't be surprised if each method just takes a couple of lines of code.
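The set might be sketched as follows. As with the queue, the class name SafeSet is an assumption; each method simply guards the underlying HashSet operation with the object's lock.

```java
import java.util.HashSet;

// Minimal sketch of the required thread-safe set, backed by a HashSet.
class SafeSet {
    private final HashSet<Object> items = new HashSet<>();

    public synchronized int size() {
        return items.size();
    }

    public synchronized void clear() {
        items.clear();
    }

    public synchronized boolean remove(Object o) {
        return items.remove(o);
    }

    public synchronized boolean add(Object o) {
        return items.add(o);
    }

    public synchronized boolean contains(Object o) {
        return items.contains(o);
    }

    public synchronized Object getElements() {
        // return a snapshot array so callers can iterate without
        // holding the set's lock
        return items.toArray();
    }
}
```

Returning a snapshot from getElements matters: handing out the live HashSet (or an iterator over it) would let callers read it while other threads modify it.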
This class goes through the contents of a web page and identifies web sites and images. In the code distribution we have left a shell class that defines the instance variables and constants you need, along with the constructor for the class. You will complete the run method for this class as follows.
Open an InputStream to the URL that is associated with the Extractor. Go through the entire contents of the web page you are now connected to, searching for strings that match the Regular Expression LINK_PATTERN, which is a static member of the Extractor class. This pattern basically matches strings of the following form:
href = "xxxxxxxxx"
For each instance of the pattern found, do the following:
Strip away the part between the quotes, and create a new URL from it. [Hint: Use the URL constructor that takes TWO parameters -- pass it the current url that you are searching through, followed by the substring between the quotes. This ensures that both absolute and relativized links will be made into valid URL objects.] To strip away the part between the quotes, use the group method of the Matcher class with 1 (one) as the argument.
Check that your new URL is not null, and then get its protocol (a String). If the protocol is null or anything other than "http" then we don't want it, so skip it.
You must define when to insert elements in the different queues and sets.
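The matching-and-resolving steps above can be sketched as follows. The real LINK_PATTERN is supplied in the Extractor class; the simplified pattern below is only a stand-in whose group 1 plays the role the handout assigns to group 1 of LINK_PATTERN, and the class and method names are invented for illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the link-matching steps; LinkScanSketch is a made-up name.
class LinkScanSketch {
    // stand-in for the provided LINK_PATTERN: group 1 is the quoted part
    static final Pattern LINK_STANDIN =
            Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Resolve each href against the page's own URL; keep only "http" links.
    static List<URL> extractLinks(URL page, String html) {
        List<URL> links = new ArrayList<>();
        Matcher m = LINK_STANDIN.matcher(html);
        while (m.find()) {
            try {
                // the two-argument constructor resolves relative links
                // against the page currently being searched
                URL link = new URL(page, m.group(1));
                if ("http".equals(link.getProtocol())) {
                    links.add(link);
                }
            } catch (MalformedURLException e) {
                // skip anything that cannot become a valid URL
            }
        }
        return links;
    }
}
```

The two-argument URL constructor is what makes relative links like "page2.html" resolve correctly against the page being scanned, while absolute links pass through unchanged.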
Start over again scanning the webpage from the top. This time you will search for strings that match the Regular Expression IMAGE_PATTERN, which is a static member of the Extractor class.
For each instance of the pattern that you find, do the following:
Strip away the part of the pattern that represents the location of the picture, and create a new URL from it. The same hint applies for choosing the correct URL constructor -- use the one with two parameters. To strip away the part between the quotes, use the group method of the Matcher class with 8 as the argument.
You must define when to insert elements in the different queues and sets.
This class is responsible for selecting the next web site to process, and creating an Extractor thread to handle it. The class defines the four main data structures of the system: linkQueue, picQueue, beenThere, doneThat. Please look at the Crawler.java file to see what is already provided. A description of each of the methods you must implement can be found in the project javadoc. Feel free to add any instance variables and private methods you think are necessary.
It is important to keep in mind that the project javadoc provides descriptions of the methods, but it does not specify whether a method must be synchronized. That is something you must decide.
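One way to reason about this decision: even when every individual data structure is thread-safe, a check-then-act that spans more than one operation still needs its own synchronization. The sketch below is a hypothetical illustration; the method and field names are invented, not taken from Crawler.java.

```java
import java.util.ArrayList;
import java.util.HashSet;

// Hypothetical illustration of the synchronization decision.
class SyncDecisionSketch {
    private final ArrayList<Object> queue = new ArrayList<>();
    private final HashSet<Object> seen = new HashSet<>();

    // A check-then-act spanning two structures must be one atomic step:
    // without the lock, another thread could add the same URL between
    // the contains() check and the add().
    public synchronized boolean addIfUnseen(Object url) {
        if (seen.contains(url)) {
            return false;
        }
        seen.add(url);
        queue.add(url);
        return true;
    }
}
```

By contrast, a method that only delegates a single call to an already thread-safe structure may not need any synchronization of its own.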
Honor Section Requirements
No additional requirements for the honor section.
Submit your project using the "Submit Project" option in Eclipse.
Please make sure you read the academic integrity section of the syllabus so you understand what is permissible in our programming projects. We want to remind you that we check your project against other students' projects, and any case of academic dishonesty will be referred to the University's Office of Judicial Programs.