With a research heritage extending back to Luhn's original work, the content-based filtering paradigm is the better developed of the two. In content-based filtering, each user is assumed to operate independently. As a result, document representations in content-based filtering systems can exploit only information that can be derived from document contents. Yan implemented a simple content-based text filter for Internet News articles in a system he called SIFT. Profiles for SIFT are constructed manually by specifying words to prefer or avoid, and must be updated manually if the user desires to change them. For each profile, twenty articles are made available each day in a ranked output format. Articles can be selected interactively using a World Wide Web browser. For users lacking interactive access, clippings (the first few lines of each article) can instead be sent by electronic mail. In that case, selections must be made without user interaction, so users are offered the option of defining a profile for an exact match text selection technique.
SIFT offers two facilities to assist users with profile construction. Users are initially offered an opportunity to apply candidate profiles against the present day's articles to determine whether appropriate sets of articles are accepted and rejected. If a substantial amount of information on that interest is present on Internet News that day, iterative refinement allows the user to construct a profile which will move the appropriate articles to the top of the list. To facilitate maintenance of profiles over time, words which contributed to the position of each article in the ranked list are highlighted (a technique known as ``Keyword in Context'' or ``KWIC'') when using a World Wide Web browser to access the articles. By examining the context of words which occur with meanings that were unforeseen at the time the profile was constructed, users can select additional words which appear in the same context to add to the list of words to be avoided.
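The KWIC display described above can be sketched briefly. The function below is purely illustrative (SIFT's actual implementation is not described in enough detail to reproduce); it shows each occurrence of a profile word with a few words of surrounding context, so a user can spot unforeseen senses of that word:

```python
def kwic(text, keyword, width=3):
    """Keyword-in-Context sketch: show each occurrence of `keyword` with
    up to `width` words of context on either side, bracketing the keyword
    so unforeseen word senses stand out."""
    words = text.split()
    lines = []
    for i, w in enumerate(words):
        if w.lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left} [{w}] {right}")
    return lines

for line in kwic("visit the java island to taste java coffee", "java", 2):
    print(line)
```

A user who intended the programming-language sense of a word would see at a glance that these contexts concern travel, and could add the surrounding words to the list of words to be avoided.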
Yan developed SIFT to study efficient algorithms for information filtering. In his implementation, large collections of profiles are compared to every article arriving on Internet News by a central server. Efficiencies are obtained by grouping profiles in ways that permit parts of the filtering process to be performed on groups of profiles rather than individually. SIFT makes no distinction among the words appearing in an article, so words appearing in the newsgroup name (i.e., the specific conference), the author's electronic mail address, the article title, the body of the article, included text, or the ``signature'' information that is routinely added to every document by some users are all equally likely to result in a high rank for a document.
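One simple way to realize the grouping Yan describes (a sketch only; SIFT's actual algorithms are more sophisticated) is an inverted index from words to the profiles containing them, so that each arriving article is scanned once regardless of how many profiles are registered:

```python
from collections import defaultdict

def build_index(profiles):
    """profiles: dict mapping profile id -> set of preferred words.
    Returns an inverted index mapping each word to the profiles using it."""
    index = defaultdict(set)
    for pid, words in profiles.items():
        for w in words:
            index[w].add(pid)
    return index

def score_article(article_words, index):
    """Count, per profile, how many of its words appear in the article.
    The article is scanned once; only matching profiles accrue scores."""
    scores = defaultdict(int)
    for w in set(article_words):
        for pid in index.get(w, ()):
            scores[pid] += 1
    return scores

profiles = {"p1": {"filtering", "retrieval"}, "p2": {"news", "filtering"}}
index = build_index(profiles)
print(dict(score_article(["filtering", "of", "news"], index)))
```

Note that, as the text observes, nothing here distinguishes where in the article a word appeared; a match in a signature counts the same as a match in the title.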
Stevens developed a system called InfoScope which used automatic profile learning to minimize the complexity of exploiting information about the context in which words were used. Also designed to filter Internet News, InfoScope deduced exact-match rules and offered them for approval (possibly with modifications) by the user. These suggestions were based on simple observable actions such as the time spent reading a newsgroup or whether an individual message was saved for future reference. By avoiding the requirement for explicit user feedback about individual articles, InfoScope was designed to minimize the cognitive load of managing the information filtering system.
While SIFT treats Internet News as a monolithic collection of articles, InfoScope was able to make fine-grained distinctions between newsgroups, subjects, and even individual authors. Implementation of such extensive deconstruction led Stevens to introduce a facility to reconstruct levels of abstraction in a way that was meaningful to the user. InfoScope implemented this abstraction at the newsgroup level, suggesting that related sets of newsgroups regularly examined by the user be combined into a single ``virtual newsgroup.'' By defining filters for virtual newsgroups with possibly overlapping sources, users were thus provided with a powerful facility to reorganize the information space in accordance with their personal cognitive model of the interesting parts of the discussions they wished to observe.
InfoScope was not without its limitations, however. The experimental system Stevens developed was able to process only information in the header of each article (e.g., subject, author, or newsgroup), a restriction imposed by the limited personal computer processing power available in 1991. In addition, his goal of exploring the potential for synergy between user and machine for profile management led him to choose a rule-based exact match text selection technique. Since users are often able to verbalize the selection rules they apply, Stevens reasoned that users would have less difficulty visualizing the effect of changing rules than the effect of changing the types of profiles commonly found in ranked output systems. InfoScope's key contributions (machine-assisted profile learning, user-controlled levels of abstraction, and implicit feedback) make it an excellent example of a complete content-based information filtering system intended for interactive use.
Because of their low cost, large volume, and ease of recognizing new information, Internet News and electronic mail have been popular domains for information filtering research. Unfortunately, these domains are poorly suited to formal experiments because reproducible results are difficult to obtain. For this reason, very little is known about the effectiveness of either SIFT or InfoScope. Stevens reported that eight of ten experienced Internet News readers preferred InfoScope to their prior software in his initial study, and that all five users in the second evaluation reported that fewer uninteresting articles were presented and more interesting articles were read in the second half of a 10-week evaluation than in the first. Because SIFT was developed to study efficiency rather than effectiveness issues, even less information is available about its effectiveness. Yan does report, however, that in early 1995 SIFT routinely processed over 13,000 profiles and was adding approximately 1,400 profiles each month. Even though one user may create several profiles, this level of user acceptance makes a powerful statement about the utility of even the simple approach used by SIFT.
Learning more about the effectiveness of a text filtering technique requires that the technique be evaluated under controlled experimental conditions. And because the performance of text filtering techniques varies markedly when different information needs and document collections are used, comparison of results across systems is facilitated when those factors are held constant. The TREC evaluation has provided an unprecedented venue for exactly this type of performance evaluation. Conducted annually since 1992, the most recent conference (TREC-4) attracted participation from 24 universities and 12 corporations.
NIST provides each participant with fifty topics and a large set (typically thousands) of training documents and relevance assessments on those documents for each information need. Participants train their text filtering systems, using this data as if it represented explicit feedback on the utility of each training document to a user, and then must register their profiles with NIST before receiving the evaluation documents. The profiles are then used by the text filtering systems which generated them to rank order a previously unseen set of evaluation documents, and the top several thousand documents are submitted to NIST for evaluation.
In order to achieve reproducible results, it is necessary to make some very strong assumptions about the nature of the information filtering task. In TREC it is assumed that human judgments about whether an information need is satisfied by a document are binary valued (i.e., a document is relevant to an information need or it is not) and constant (i.e., it does not matter who makes that judgment or when they make it). Relevance, the fundamental concept on which this methodology is based, actually fails to satisfy both of those assumptions. Human relevance judgments exhibit significant variability across evaluators, and for the same evaluator across time. Furthermore, evaluators sometimes find it difficult to render a binary relevance judgment on a specific combination of a document and an information need. Nevertheless, performance measures based on a common set of relevance judgments provide a principled basis for comparing the relative performance of different text filtering techniques.
The TREC filtering evaluation is based on effectiveness measures that are commonly used for text retrieval systems. The effectiveness of exact match text retrieval systems is typically characterized by three statistics: ``precision,'' ``recall,'' and ``fallout.'' Precision is the fraction of the selected documents which are actually relevant to the user's information need, while recall is the fraction of the actual set of relevant documents that are correctly classified as relevant by the text filtering system. When used together, precision and recall measure selection effectiveness. Because both precision and recall are insensitive to the total size of the collection, fallout (the fraction of the non-relevant documents that are selected) is used to measure rejection effectiveness. Table 2 illustrates these relationships.
Table 2: Measures of text selection effectiveness.
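These three measures follow directly from their definitions. The sketch below assumes documents are identified by simple ids and that the selected and relevant sets are known; the function name is illustrative:

```python
def effectiveness(selected, relevant, collection_size):
    """Precision, recall, and fallout for a text selection.
    selected, relevant: sets of document ids; collection_size: total
    number of documents in the collection."""
    true_pos = len(selected & relevant)
    # precision: fraction of selected documents that are relevant
    precision = true_pos / len(selected) if selected else 0.0
    # recall: fraction of relevant documents that were selected
    recall = true_pos / len(relevant) if relevant else 0.0
    # fallout: fraction of non-relevant documents that were selected
    non_relevant = collection_size - len(relevant)
    fallout = len(selected - relevant) / non_relevant if non_relevant else 0.0
    return precision, recall, fallout

# Toy collection of 10 documents, 4 relevant; the system selects 5,
# of which 3 are actually relevant.
p, r, f = effectiveness({1, 2, 3, 4, 5}, {1, 2, 3, 8}, 10)
print(p, r, f)
```

Because fallout's denominator is the number of non-relevant documents, it (unlike precision and recall) changes when the collection grows, which is why it is the measure used to characterize rejection effectiveness.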
In TREC, almost all of the filtering systems produce ranked output. Accordingly, precision and fallout at several values of recall are reported, and ``average precision'' (the area under the precision-recall curve) is reported for use when a single measure of effectiveness is needed. Average precision is computed by choosing successively larger sets of documents from the top of the ranked list that result in evenly spaced values of recall between zero and one. Precision is then computed for each set, and the mean of those values is reported as the average precision for an individual information need. The process is repeated for several information needs, and the mean of the values obtained is reported as the average precision for the system on that test collection. Clearly, larger values of average precision are better.
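The per-topic computation can be sketched as follows. This follows the interpolated form commonly used in TREC-style evaluation (taking, at each recall level, the best precision achieved at that recall or higher); the exact number of recall levels is a parameter, with eleven evenly spaced levels being a conventional choice:

```python
def average_precision(ranked, relevant, levels=11):
    """Interpolated average precision for one information need.
    ranked: document ids in rank order; relevant: set of relevant ids."""
    # (recall, precision) after each successively larger top-ranked set
    hits, points = 0, []
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / i))
    # interpolated precision at recall level r: the best precision
    # achieved at any recall of at least r
    precisions = []
    for k in range(levels):
        r = k / (levels - 1)
        precisions.append(max((p for rec, p in points if rec >= r),
                              default=0.0))
    return sum(precisions) / levels

print(average_precision(["A", "B", "C", "D"], {"A", "C"}))
```

A system-level score would then be the mean of this value over all fifty topics, matching the two-level averaging described above.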
Only the selected documents must be scored to evaluate precision, but it would be impractical to evaluate recall and fallout by scoring every document in the TREC collection. The solution is to estimate recall and fallout by scoring a sample of the document collection. The approach chosen for TREC, known as ``pooled relevance evaluation,'' is to evaluate every document chosen by any participating system and then assume that all unchosen documents are not relevant. Since documents are chosen using a wide variety of text filtering and retrieval techniques in TREC, it is felt that the pooled relevance methodology produces a fairly tight upper bound on recall and an extremely tight lower bound on fallout.
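The pooling idea itself is simple and can be sketched directly; the toy judgment function below stands in for the human assessors:

```python
def pooled_judgments(runs, assess):
    """Sketch of pooled relevance evaluation.
    runs: list of sets of document ids returned by participating systems;
    assess: callable doc id -> bool, standing in for a human assessor.
    Only documents in the pool (the union of all runs) are judged; every
    document outside the pool is assumed non-relevant."""
    pool = set().union(*runs)
    return {doc for doc in pool if assess(doc)}

runs = [{1, 2, 3}, {2, 4}, {3, 5}]
relevant = pooled_judgments(runs, lambda d: d % 2 == 1)  # toy judgment
print(sorted(relevant))
```

Any relevant document missed by every participating system is counted as non-relevant, which is why the resulting recall estimate is an upper bound: the true relevant set can only be larger than the judged one.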
Although TREC investigates only the performance of the selection module, and that evaluation is necessarily based on a somewhat artificial set of assumptions, the resulting data provides a useful basis for choosing between alternative selection techniques. In the TREC-3 evaluation, for example, 25 text filtering systems were evaluated and average precision was observed to vary between 0.25 and 0.41.