Introduction to Web Databases

"This shouldn't be hard", you think to yourself. You want to buy a computer, so you figure the Internet is the best place to look. You get on Alta Vista, and type "computers" in the text box. It comes back and tells you that it has found about 1 million pages matching your query terms. You try another tactic - you type in "I want to buy a computer". Only 400,000 pages this time - and glancing at the descriptions for the first 10 pages, you see places to buy Liberal Bumper Stickers, Foreign Currency, even a Volkswagen, but nothing about buying computers. Frustrated, you try one more time - you type in "computer shopping". Alta Vista happily tells you there are 700,000 pages matching your request. The first listing is indeed about shopping for computers - and the rest on the page refer to on-line malls.

Does this sound familiar? If you've done any amount of Web searching, you've undoubtedly run into this same problem of too much information. This tutorial will help you by showing you how these Web Databases work, how you can search them more efficiently, and how to select the best Web Database using the AskScott selection tool.

What is a Web Database?

A web database is an organized listing of web pages. It's like the card catalog that you might find in the library. The database holds a "surrogate" (selected pieces such as the title and the headings) for each web page. The creation of these surrogates is called "indexing", and each web database does it in a different way. Web databases hold surrogates for anywhere from 1 million to 30 million web pages. Each web database also has a search interface, which is the box you type words into (like in Alta Vista or Lycos) or the lists of directories you pick from (like in Yahoo). Thus, each web database has a different indexing method and a different search interface.

Methods of Indexing

There are three methods of indexing used in web database creation - full-text, keyword, and human.

Full-Text Indexing

As its name implies, full-text indexing is where every word on the page is put into a database for searching. Alta Vista and Open Text are examples of full-text databases. Full-text indexing will help you find every reference to a specific name or term. However, a general topic search will not be very useful in these databases, and you will have to dig through a lot of "false drops" (returned pages that have nothing to do with your search).
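
To make the idea concrete, here is a minimal sketch of a full-text index in Python. It is only an illustration under simple assumptions, not how Alta Vista or Open Text actually work: the sample pages, the example URLs, and the function names (tokenize, build_full_text_index, search) are all made up for this tutorial.

```python
import re
from collections import defaultdict

# Hypothetical sample pages: URL -> page text (made up for illustration).
PAGES = {
    "shop.example/computers": "Buy a computer online. Computer shopping made easy.",
    "mall.example/stickers": "Bumper stickers for sale. Buy online today.",
    "cars.example/vw": "Volkswagen for sale. Not a computer in sight.",
}

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_full_text_index(pages):
    """Map every word on every page to the set of pages containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

def search(index, query):
    """Return the pages that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

index = build_full_text_index(PAGES)
print(sorted(search(index, "computer")))
```

Notice that a search for "computer" also returns the Volkswagen page, because that page happens to mention the word once. That is a false drop in miniature.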

Keyword Indexing

In keyword indexing, only the "important" words and phrases are put into the database. Lycos and Excite are keyword indexed. This allows a searcher to search on more general subjects and have more accurate results. However, if a name is only mentioned once or twice on a page, it won't be included in the database.
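
One simple way to decide which words are "important" is to count how often each word appears and keep only the frequent ones. The sketch below does exactly that. The stop-word list, the threshold of three occurrences, and the name extract_keywords are assumptions chosen for illustration; Lycos and Excite used their own rules, which this tutorial does not describe.

```python
import re
from collections import Counter

# Assumed stop-word list; real services used their own rules.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "is", "was", "by"}

def extract_keywords(text, min_count=3):
    """Return words that occur at least min_count times, ignoring stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return {word for word, n in counts.items() if n >= min_count}

# Hypothetical page text, made up for illustration.
page = (
    "Computer prices and computer reviews. Compare every computer "
    "in the store. Reviews by Smith."
)
print(sorted(extract_keywords(page)))
```

Only "computer" survives here. The name "Smith" appears just once, so a search for it would never find this page in a keyword-indexed database.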

Human Indexing

Yahoo and parts of Magellan are two of the few examples of human indexing. In the two methods above, all of the work is done by a computer program called a "spider" or a "robot". In human indexing, a person examines the page and chooses a very few key phrases that describe it. This gives the user a good starting set of works on a topic, assuming that the human indexer picked that topic as something that describes the page. This is how the directory-based web databases are developed.

Spiders, Robots, or People

How do the web databases select which pages are indexed? As there is no centralized Internet computer, there's no one place where these services can learn about new pages. Thus, many services use automated programs called "spiders" or "robots" that travel from site to site, looking for new WWW pages. Some spiders only go to the "What's New" or the "What's Hot" pages and use those for indexing the "popular" sites. Others methodically examine every link leading from a page, and every link leading from that page, and so on... In some cases, people examine the pages brought back from these programs, and don't index the pages that don't meet certain criteria. So, these tools create three classes of web databases - those that look at all WWW pages, those that examine popular WWW pages, and those that examine quality web pages.
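
The "every link, and every link from that page" approach can be sketched as a breadth-first walk over a link graph. To keep the example self-contained, the Web is replaced here by a small made-up dictionary of pages and links; the page names and the crawl function are hypothetical, and a real spider would fetch pages over the network instead.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to (stands in for the Web).
LINKS = {
    "home": ["news", "shop"],
    "news": ["home", "archive"],
    "shop": ["cart"],
    "archive": [],
    "cart": ["home"],
    "orphan": ["home"],  # nothing links here, so the spider never finds it
}

def crawl(start, links):
    """Visit every page reachable from start, breadth first, once each."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order

print(crawl("home", LINKS))
```

The "orphan" page is never visited because no other page links to it, which is one reason a spider-built database can miss pages that exist.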

Search Engines versus Pick Lists

Now that the web database has a group of pages indexed, how does the user access it? This is through one of two methods - a search engine or a directory (otherwise known as a pick list). A search engine allows the user to type in any terms he wishes, and will search the database to find those web pages that match the terms entered. A directory structure has pages organized by subject (like the Yellow Pages), and is navigated by selecting entries from the directory. The directory structure usually allows a good starting point for a search, assuming that the topic you desire has been selected as a directory entry.

One thing not to get confused about - Yahoo has both a search engine and a directory tree. Instead of searching the pages, however, the search engine just looks through the directory at Yahoo. It can be used as a quick way to find the area of the directory with the information you desire.
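
The difference between picking from a directory and searching a directory can be sketched with a small nested structure. Everything in this example is made up for illustration: the categories, the page titles, and the functions browse and search_directory. It assumes the directory search looks only at category names and listed titles, never at the text of the pages themselves.

```python
# Hypothetical directory tree: category -> subcategories, or a list of page titles.
DIRECTORY = {
    "Computers": {
        "Hardware": ["PC Vendor Guide", "Laptop Reviews"],
        "Software": ["Shareware Archive"],
    },
    "Recreation": {
        "Automotive": ["Volkswagen Club"],
    },
}

def browse(tree, path):
    """Follow a list of category names down the tree, like clicking a pick list."""
    node = tree
    for name in path:
        node = node[name]
    return node

def search_directory(tree, term, path=()):
    """Find categories and listed titles containing term (page text is not searched)."""
    term = term.lower()
    hits = []
    for name, child in tree.items():
        here = path + (name,)
        if term in name.lower():
            hits.append(" > ".join(here))
        if isinstance(child, dict):
            hits.extend(search_directory(child, term, here))
        else:
            for title in child:
                if term in title.lower():
                    hits.append(" > ".join(here + (title,)))
    return hits

print(browse(DIRECTORY, ["Computers", "Hardware"]))
print(search_directory(DIRECTORY, "volkswagen"))
```

Searching the directory for "volkswagen" jumps straight to the right branch, but a word that appears only inside a page, and not in any category or title, would not be found at all.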

Presentation of Results

You've entered your search terms, the computer has matched them to the indexed database, and you are given a list of results. There are two important concepts here - relevancy ranking and abstracts.

Relevancy Ranking

The documents are almost always listed in order by relevance. Based upon your search request, the computer ranks all of the documents that contain your search term, and lists the ones that it thinks are most relevant first. That is why you really shouldn't worry about the fact that there are 17 gajillion pages matching your query term. All you care about are the first 20 - 40. The better your search terms, the better ranked the pages will be (and the less work you will have to do).
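
As a rough illustration of relevancy ranking, the sketch below scores each page by how many times the query words appear in it and lists the highest scores first. This is only one simple scoring rule, assumed here for the example; each web database uses its own ranking method. The sample pages and the functions score and rank are hypothetical.

```python
import re

def score(text, query):
    """Count how many times the query words occur in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    return sum(1 for w in words if w in terms)

def rank(pages, query):
    """Return (url, score) pairs for matching pages, best score first."""
    scored = [(url, score(text, query)) for url, text in pages.items()]
    matches = [(url, s) for url, s in scored if s > 0]
    # Sort by descending score, then by URL so ties come out in a stable order.
    return sorted(matches, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical sample pages, made up for illustration.
pages = {
    "a.example": "Computer shopping guide: compare computer prices.",
    "b.example": "Online mall with shopping for everyone.",
    "c.example": "Foreign currency rates.",
}
print(rank(pages, "computer shopping"))
```

The page that mentions both search words several times comes first, the on-line mall that mentions only "shopping" comes second, and the currency page is left out entirely.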


Abstracts

If there are pages listed that say nothing in the listing about your search topic, you may wonder how they got there. The second important concept to realize is that the abstract presented to you is usually not the same as the database entry used by the computer to search. The abstract is much shorter than the database entry, and this can lead to frustration because you then have to load the actual page to see why the search program considers it relevant to your search.


Each web database is different, and searching one that is not suited to your search can be very frustrating. That is where AskScott helps. By telling AskScott whether you have a general or specific search topic, whether you want to type in your request or pick from a directory structure, and whether you are looking for all pages, popular pages, or quality pages, AskScott will tell you which database is best for you!

Note: This page is modified from Introduction to Web Databases.