The project is due before midnight on Tuesday (3/7). Submissions received within two days after that (by 11:59p EST on March 9) will receive a 10-point (of 100) penalty. Submissions will be accepted until MONDAY night (11:59p on March 13) at a loss of 25 points. Thereafter they will not be accepted.
NOTE: The TA is not able to give extensions or waive these rules in any way under any circumstances. Anyone with a problem must contact Prof. Hendler prior to the deadline in question.
For this project you will write a web crawler (a.k.a. spider or robot). To do this, you will need to write and read HTTP 1.0 headers and respond appropriately to what comes back (details below). Your crawler will be "polite" in that it will send the right headers to identify itself and will respect the robots exclusion protocol (the good news is that we will make this easy to do; see below). Details:
You must use HTTP 1.0 (not 1.* or 1.1) in your headers.
You will be allowed to use java.net.URL to parse URLs and extract their components. However, you may not use its connect or openConnection methods. Instead, you will be given a function that takes as input a hostname, port, and text string (the web request) and returns the response it receives from the web server (details will be made available with the code). You will be responsible for formatting the request and parsing the response.
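As a rough illustration of the request-formatting step, the sketch below uses java.net.URL only for parsing and assembles an HTTP 1.0 GET request by hand. The class name RequestSketch, the helper names, and the choice of Host, User-Agent, and From headers are assumptions made for this example, not part of the provided course code; the header values passed in are placeholders.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, not part of the course-provided code.
final class RequestSketch {
    /** Parses a URL string, turning the checked exception into an unchecked one. */
    static URL parse(String s) {
        try {
            return new URL(s);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("bad URL: " + s, e);
        }
    }

    /** Port to connect to: the explicit one, or the protocol default (80 for http). */
    static int portOf(URL url) {
        return url.getPort() != -1 ? url.getPort() : url.getDefaultPort();
    }

    /** Builds an HTTP/1.0 GET request. Every line ends in CRLF and a blank line ends the headers. */
    static String buildGet(URL url, String userAgent, String from) {
        String path = url.getFile();        // path plus query string; empty for a bare host
        if (path.isEmpty()) {
            path = "/";
        }
        return "GET " + path + " HTTP/1.0\r\n"
             + "Host: " + url.getHost() + "\r\n"      // optional in 1.0; assumed useful for virtual hosts
             + "User-Agent: " + userAgent + "\r\n"    // assumed identification header
             + "From: " + from + "\r\n"               // assumed contact header
             + "\r\n";
    }
}
```

The resulting string, together with url.getHost() and portOf(url), is the kind of input the provided send function is described as taking.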
If you prefer not to use this code, you may use the java socket package directly. The key point is that you may not use a library that creates the HTTP requests for you or parses the responses. You must handle parsing the responses yourself. (Standard parsing tools like the regexp package are allowed of course.)
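Parsing the response by hand might look something like the following sketch. It splits the raw response text into status code, headers, and body, and exposes the Location header for redirects. ResponseSketch and its members are hypothetical names chosen for this example, and the handling of bare-LF line endings is an assumption about what some servers may send.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical response parser, not part of the course-provided code.
final class ResponseSketch {
    final int status;
    final Map<String, String> headers = new LinkedHashMap<>();
    final String body;

    ResponseSketch(String raw) {
        // Headers end at the first blank line; fall back to bare LF if CRLF is absent.
        int end = raw.indexOf("\r\n\r\n");
        int gap = 4;
        if (end < 0) {
            end = raw.indexOf("\n\n");
            gap = 2;
        }
        String head = end < 0 ? raw : raw.substring(0, end);
        body = end < 0 ? "" : raw.substring(end + gap);

        String[] lines = head.split("\r?\n");
        // Status line looks like "HTTP/1.0 200 OK"; the code is the second token.
        String[] parts = lines[0].trim().split("\\s+", 3);
        if (parts.length < 2 || !parts[0].startsWith("HTTP/")) {
            throw new IllegalArgumentException("not an HTTP status line: " + lines[0]);
        }
        status = Integer.parseInt(parts[1]);

        // Remaining lines are "Name: value"; names are stored lowercased for lookup.
        for (int i = 1; i < lines.length; i++) {
            int colon = lines[i].indexOf(':');
            if (colon > 0) {
                String name = lines[i].substring(0, colon).trim().toLowerCase(Locale.ROOT);
                headers.put(name, lines[i].substring(colon + 1).trim());
            }
        }
    }

    /** Returns the Location header for 3xx responses, or null otherwise. */
    String redirectTarget() {
        return (status >= 300 && status < 400) ? headers.get("location") : null;
    }
}
```

A crawler built this way would check status first, follow redirectTarget() when it is non-null, and scan body for links only on a successful response.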
You will also be given the .jar file from "websphinx." All you may use out of this package is the RobotsExclusion class. Note that any other use of websphinx is strictly prohibited! Before downloading any URI you must use this class to check whether the site has a robots.txt that forbids you from doing so.
Your robot should be polite. As well as not requesting pages that are forbidden to it, it must include appropriate "politeness" headers. These include:
Pointers to the code libraries being made available, the root URI to use, and details of submission of the project will be made available on the Forum page for CMSC 498w (starting on Friday, 2/24).
You will submit your code, as well as a text file containing all the URIs visited and their response codes. Each line of the output file should be formatted as code, space, URI, space, comment (or redirect target), CRLF. For example:
200 http://foo.example.com/ OK
301 http://foo.example.com/bar http://foo.example.com/bar2
200 http://foo.example.com/bar2 OK
999 mailto:hendler@cs.umd.edu mailto
400 http://foo.example.com/baz NG
999 http://foo.example.com/private/ Excluded
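One way to produce records in the layout described above is sketched here. LogLineSketch and format are hypothetical names for this example; the only thing taken from the assignment is the field order and the CRLF terminator.

```java
// Hypothetical formatter, not part of the course-provided code.
final class LogLineSketch {
    /**
     * Formats one output record: code, space, URI, space, comment, CRLF.
     * For redirects, pass the target URI as the comment.
     */
    static String format(int code, String uri, String comment) {
        return code + " " + uri + " " + comment + "\r\n";
    }
}
```

Because the terminator is written explicitly, the records should be emitted with write or print rather than println, which would append a second, platform-dependent line ending.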
You may visit the pages in any search order you prefer.
There is a small chance that our server will not handle the complete load of this class. If we have problems, we will ask you to put a time delay into your robot and will explain how in class (this will also be noted in the forum). We do not expect this to be the case.