Exposé User's Guide

Introduction

This short guide provides information on how to configure and use Exposé, the SHOE web-crawler. Exposé searches for web pages with SHOE mark-up, reads the knowledge from them, and loads it into a knowledge base. This knowledge can then be queried using any interfaces provided by the knowledge base.

This version of Exposé uses Parka as its knowledge base and will not work without it. It is possible to design variants of Exposé that work with other types of knowledge bases by creating different implementations of the abstract class ShoeKb.KBInterface.

Exposé is provided under the terms of the GNU General Public License. See the file License.html for details.

Configuration

Before you can start Exposé, you must create a configuration file that specifies a number of parameters. This file should be named init.dat and should have the following format:

FROM_USER=your_email_address
KB_HOST=parka_host_name
KB_PORT=parka_port_number
KBNAME=parka_kb_name
START_URL=url
ALLOW_URL=url
PROHIB_URL=url
MAX_PAGES=integer
MAX_COST=integer
REQUEST_INTERVAL=integer
IMPLICIT_CLAIMS=TRUE_or_FALSE
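
As an illustration of this format, the following is a minimal sketch of how a KEY=value file of this kind could be read. It is not Exposé's own parsing code: the class InitDatSketch and its methods are hypothetical names, and the handling of blank or malformed lines is an assumption. Each key maps to a list of values, so fields that may appear more than once (ALLOW_URL and PROHIB_URL) keep all of their entries.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Exposé: reads KEY=value lines into a
// map of lists so that repeated keys such as ALLOW_URL keep every value.
final class InitDatSketch {

    static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            // Split at the first '=' only, since URLs may contain '=' themselves.
            int eq = line.indexOf('=');
            if (line.isEmpty() || eq <= 0) {
                continue; // assumption: blank or malformed lines are skipped
            }
            String key = line.substring(0, eq).trim();
            String value = line.substring(eq + 1).trim();
            fields.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        }
        return fields;
    }

    // Returns the first value of a field, or fails if the field is missing.
    static String required(Map<String, List<String>> fields, String key) {
        List<String> values = fields.get(key);
        if (values == null || values.isEmpty()) {
            throw new IllegalArgumentException("Missing field: " + key);
        }
        return values.get(0);
    }
}
```

The lines of init.dat could be obtained with java.nio.file.Files.readAllLines and passed to parse; numeric fields such as MAX_PAGES would then be converted with Integer.parseInt.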

The definitions of these fields are as follows:

FROM_USER
This must be an e-mail address where you can be contacted. The purpose of this field is so that server administrators will know who is crawling their servers and may contact you if there is a problem.
KB_HOST
This is the name of the host machine on which you have a Parka server running.
KB_PORT
This is the port number on which the Parka server is listening for client requests.
KBNAME
This is the name of the Parka KB that will be created, or in which the information from the web crawl will be stored.
START_URL
The URL from which the web-crawl will begin.
ALLOW_URL
Specifies a prefix that a URL must begin with in order to be searched. Multiple ALLOW_URL fields may be specified. This can be used to restrict the search to specific directories or servers.
PROHIB_URL
Specifies a prefix such that any URL beginning with it will not be searched. A PROHIB_URL prefix takes precedence over any ALLOW_URL prefix. Multiple PROHIB_URL fields may be specified.
MAX_PAGES
The maximum number of pages that Exposé will visit in a single run.
MAX_COST
Exposé uses a costing mechanism based on the path from the start page to the current page to determine how it will visit pages linked from the current page. This parameter controls how far from the start page the web-crawler will travel by limiting the length of the paths. A path length is computed by adding 1 for each SHOE page in the path and 5 for each non-SHOE page.
REQUEST_INTERVAL
The time (in milliseconds) to wait between successive page requests. A wait of 30 seconds between requests to the same server is considered good web crawler etiquette, so this should normally be set to 30000. However, if you are only crawling web servers that you own, then you may decide to reduce this value.
IMPLICIT_CLAIMS
Set this to TRUE if you want Exposé to make additional claims that the instances used in relation arguments must be members of the required types. Otherwise, set to FALSE.
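
The ALLOW_URL, PROHIB_URL and MAX_COST rules described above can be sketched as follows. This is an illustration under stated assumptions rather than Exposé's actual code: the class CrawlPolicySketch and its method names are hypothetical, an empty ALLOW_URL list is assumed to mean "no restriction", the path is assumed to include every page passed in, and MAX_COST is assumed to be an inclusive limit.

```java
import java.util.List;

// Hypothetical sketch of the URL prefix rules and the path cost rule.
final class CrawlPolicySketch {
    static final int SHOE_PAGE_COST = 1;     // cost of a SHOE page in a path
    static final int NON_SHOE_PAGE_COST = 5; // cost of a non-SHOE page in a path

    // A URL may be visited if no PROHIB_URL prefix matches it and
    // at least one ALLOW_URL prefix does. Prohibitions are checked first
    // because they take precedence over allowances.
    static boolean mayVisit(String url, List<String> allow, List<String> prohibit) {
        for (String prefix : prohibit) {
            if (url.startsWith(prefix)) {
                return false;
            }
        }
        if (allow.isEmpty()) {
            return true; // assumption: no ALLOW_URL fields means no restriction
        }
        for (String prefix : allow) {
            if (url.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    // Sums the cost of a path; pageHasShoe[i] is true if page i has SHOE mark-up.
    static int pathCost(boolean[] pageHasShoe) {
        int cost = 0;
        for (boolean hasShoe : pageHasShoe) {
            cost += hasShoe ? SHOE_PAGE_COST : NON_SHOE_PAGE_COST;
        }
        return cost;
    }

    // Assumption: a path is still acceptable when its cost equals MAX_COST.
    static boolean withinBudget(int cost, int maxCost) {
        return cost <= maxCost;
    }
}
```

For example, a path consisting of a SHOE page, a non-SHOE page and another SHOE page has a cost of 1 + 5 + 1 = 7, so under these assumptions it would be followed with MAX_COST=7 but not with MAX_COST=6.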

Running Exposé

To run Exposé, switch to the directory that contains your init.dat and (assuming the Expose directory is accessible via your CLASSPATH) type:

java Expose.ExposeApp

The Exposé window should appear. This interface provides means for dynamically editing some of the parameters that were specified in the init.dat file, particularly START_URL (the starting URL), KBNAME (KB Name), ALLOW_URL (Visit URL Prefixes), and PROHIB_URL (Avoid URL Prefixes).

To create a new KB, verify the parameters above and press New KB. Exposé will begin to crawl the web pages and will log the progress of the crawl in the window just above the button bar. When the search is done, files named kbname.kb and kbname.pred will be created. These are your Parka assertion and predicate files, and can be used with Parka's ncreateKB command to create a new KB.

The Update KB button will go through the list of sites visited in the previous use of the web-crawler and will check to see if any have changed since the last visit. If so, it will dynamically update the KB with the new information.
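
This guide does not describe how Update KB detects a change, so the following is only a sketch of one plausible approach: compare each page's modification time with the time of the previous visit. The class UpdateCheckSketch, its method name and the choice to re-read pages whose modification time is unknown are all assumptions. The lookup is passed in as a function; in practice it could be backed by java.net.URLConnection.getLastModified(), which returns 0 when the time is not known.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.ToLongFunction;

// Hypothetical sketch: selects previously visited URLs that need re-reading.
final class UpdateCheckSketch {

    // lastVisit maps each URL to the time of the previous visit
    // (milliseconds since the epoch). lastModified returns a page's
    // modification time in the same units, or 0 if it is unknown.
    static List<String> changedSince(Map<String, Long> lastVisit,
                                     ToLongFunction<String> lastModified) {
        List<String> changed = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastVisit.entrySet()) {
            long modified = lastModified.applyAsLong(entry.getKey());
            // Assumption: a page with an unknown modification time is
            // treated as changed, so that it is read again to be safe.
            if (modified == 0L || modified > entry.getValue()) {
                changed.add(entry.getKey());
            }
        }
        return changed;
    }
}
```

Only the URLs returned by changedSince would then need to be fetched again and their knowledge reloaded into the KB.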

The Stop button will pause a crawl. It can then be continued using the Resume button.

The Exit button closes Exposé.