With the growth of the Internet and other sources of networked information, research on the automatic mediation of access to networked information has exploded in recent years. This report reviews existing work on text filtering, a type of ``information seeking.'' Here we use ``information seeking'' as an overarching term to describe any process by which users seek to obtain information from automated information systems. Table 1 shows common types of information seeking processes. In the ``information filtering'' process the user is assumed to be seeking information that addresses a specific long-term interest. In this report we describe general approaches to the information filtering problem and specific techniques tailored for ``text filtering,'' the case in which the information sought is in text form.
Table 1: Examples of information seeking processes.
Information filtering systems are typically designed to sort through large volumes of dynamically generated information and present the user with sources of information that are likely to satisfy his or her information requirement. By ``information sources'' we mean entities which contain information in a form that can be interpreted by a user. We commonly refer to information sources which contain text as ``documents,'' but in other contexts these sources may be audio, still or moving images, or even people. The information filtering system may either provide these entities directly (which is practical when the entities are easily replicated), or it may provide the user with references to the entities.
This description of information filtering leads immediately to three subtasks: collecting the information sources, selecting the information sources, and displaying the information sources. Figure 1 depicts this subdivision, which is applicable to a wide variety of information seeking processes. The same three tasks are also fundamental to a process commonly referred to as ``information retrieval,'' in which the system is presented with a query by the user and is expected to produce information sources that the user finds useful. ``Text retrieval,'' the specialization of information retrieval to text, has an extensive research heritage. In one of the classic works on information filtering, this observation led Belkin and Croft to suggest that the information filtering process would be an attractive application for techniques that had already been developed for information retrieval systems.
Figure 1: Information seeking task diagram.
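As a rough illustration of the three subtasks named above, the following minimal Python sketch chains a collection step, a selection step, and a display step. The function name `information_seeking` and its three parameters are hypothetical names chosen for this sketch; they do not correspond to any particular system described in this report.

```python
from typing import Callable, Iterable, List


def information_seeking(
    collect: Callable[[], Iterable[str]],
    select: Callable[[str], bool],
    display: Callable[[str], str],
) -> List[str]:
    """Chain the three subtasks: collect sources, select some, display them.

    All three callables are placeholders (assumptions of this sketch) for
    whatever a concrete system would supply.
    """
    sources = collect()                                # collection
    chosen = [src for src in sources if select(src)]   # selection
    return [display(src) for src in chosen]            # display


if __name__ == "__main__":
    shown = information_seeking(
        collect=lambda: ["report on text filtering", "cooking recipes"],
        select=lambda src: "filtering" in src,
        display=lambda src: src.upper(),
    )
    print(shown)
```

Keeping the three steps as separate callables mirrors the point that the same decomposition applies to several information seeking processes: only the selection step needs to differ between, say, a retrieval system and a filtering system.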
The distinction between process and system is fundamental to understanding the difference between information filtering and information retrieval. By ``process'' we mean an activity conducted by humans, perhaps with the assistance of a machine. When we refer to a type of ``system'' we mean an automated system (i.e., a machine) that is designed to support humans who are engaged in that process. So an information filtering system is a system that is intended by its designers to support an information filtering process. Much of the confusion that arises on this issue can be traced back to creative applications of techniques that were designed originally to support one type of information seeking process (e.g., information retrieval) to another (e.g., information filtering).
Any information seeking process begins with the users' goals. The distinguishing features of the information filtering process are that the users' information needs (or ``interests'') are relatively specific (a point we shall come back to when we define browsing), and that those interests change relatively slowly with respect to the rate at which information sources become available. Although the information retrieval process is also restricted to specific information needs, historically information retrieval research has sought to develop systems which use relatively stable information sources to respond to collections of (possibly) unrelated queries. So a traditional information retrieval system can be used to perform an information filtering process by repeatedly accumulating newly arrived documents for a short period, issuing an unchanging query against those documents, and then flushing the unselected documents. But the information filtering process is distinguished from the information retrieval process by the nature of the user's goal. Figure 2 depicts this distinction graphically. While the grand challenge for information seeking systems is to match rapidly changing information with highly variable interests, information retrieval and information filtering both explore important areas of this problem space for which a number of practical applications exist.
Figure 2: Information seeking processes for relatively specific information needs.
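The use of a traditional retrieval system to carry out a filtering process, as described above, can be sketched as a simple loop: accumulate newly arrived documents, issue the same query against the accumulated batch, pass on the selected documents, and flush the rest. The sketch below is illustrative only. The names `keyword_query` and `filter_by_retrieval` are hypothetical, the ``short period'' is approximated by a fixed batch size, and a simple keyword match stands in for whatever query mechanism a real retrieval system would provide.

```python
from typing import Callable, Iterable, Iterator, List


def keyword_query(terms: Iterable[str]) -> Callable[[str], bool]:
    """Hypothetical stand-in for a retrieval query.

    A document matches if it contains any of the given terms as a
    whitespace-delimited word (case-insensitive).
    """
    wanted = {term.lower() for term in terms}

    def matches(document: str) -> bool:
        return bool(wanted & set(document.lower().split()))

    return matches


def filter_by_retrieval(
    stream: Iterable[str],
    query: Callable[[str], bool],
    batch_size: int,
) -> Iterator[List[str]]:
    """Run one unchanging query against successive batches of new documents.

    batch_size approximates the "short period" of accumulation
    (an assumption of this sketch).
    """
    batch: List[str] = []
    for document in stream:
        batch.append(document)              # accumulate newly arrived documents
        if len(batch) == batch_size:
            yield [d for d in batch if query(d)]  # issue the unchanging query
            batch = []                      # flush the unselected documents
    if batch:                               # handle a final partial batch
        yield [d for d in batch if query(d)]


if __name__ == "__main__":
    arrivals = [
        "Stock markets rally",
        "New filtering system released",
        "Weather today",
        "Text filtering survey",
        "Sports results",
    ]
    interest = keyword_query(["filtering"])
    for selected in filter_by_retrieval(arrivals, interest, batch_size=2):
        print(selected)
```

Note that the query never changes between batches, which reflects the assumption that the user's interest is stable relative to the arrival rate of documents. The sketch makes no use of evidence about that stable interest beyond the fixed query.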
It is useful to highlight the distinction between information filtering and information retrieval because systems designed to support the information filtering process can exploit evidence about relatively stable interests to develop sophisticated models of the users' information needs. Information filtering can thus be viewed as an application of user modeling techniques to facilitate information seeking in dynamic environments. In summary, the design of information filtering systems can be based on two established lines of research: information retrieval and user modeling.