As Belkin and Croft observed, content-based text selection techniques have been extensively evaluated in the context of information retrieval . Every approach to text selection has four basic components:
In each of the text filtering systems we describe in this report, the selection module assigns one or more values to each document, and the display module then uses those values to organize the display. Figure 3 illustrates the representation and comparison process implemented by those systems. The domain of the profile acquisition function p is I, the collection of possible information needs and its range is R, the unified space of profile and document representations. The domain of the document representation function d is D, the collection of documents, and its range is also R. The domain of the comparison function c is and its range is , the set of n-tuples of real numbers between zero and one. In an ideal text filtering system,
where represents the user's judgement of some relationships between an interest and a document, measured on n ordinal scales (e.g., topical similarity or degree of constraint satisfaction).
Figure 3: Text filtering system model.
As we saw in section 4, the representation can exploit information derived from the content of the document, annotations made by others, or some combination of the two. Although syntactic and semantic analysis of documents is possible, content-based text filtering systems typically use representations based on the frequency with which terms occur in each document. One reason for this choice is that it lends itself to efficient implementation. But a more compelling reason is that because no domain-specific information is needed to form the representation, a demonstration of acceptable performance in one application is easily translated into similar performance in another.
Although content-based text filtering systems typically start with this term-frequency representation, they generally apply some type of transformation to that representation before invoking the comparison function c in figure 3 The nature of the transformation depends strongly on which characteristics of that representation the comparison function c is designed to exploit, however. For this reason, we describe the transformations together with thir associated comparison functions in the following paragraphs.
For an exact match text filtering system the range of the comparison function c is restricted to be either zero or one, and it is interpreted as a binary judgement about whether a document satisfies the profile. In this case, a step function that detects term presence is applied to the term-frequency representation when that representation is constructed so that the resulting boolean vector can be easily compared to the boolean expression specified by the profile. Exact match text filtering systems typically provide an unranked set of documents which will (hopefully) satisfy the information need. The exact match approach is well suited to autonomous systems which must take actions (such as storage decisions) without user interaction.
Two common approaches to ranked output generation are the vector space method and the probabilistic method, although variations abound. In the vector space method the range of c is [0,1], and the value is interpreted as the degree to which the content of two documents is similar. Both the profile and the documents are represented as vectors in a vector space, and a comparison technique based on the assumption that documents whose representations are similar to the profile will be likely to satisfy the associated information need is used. The angle between two vectors has been found to be a useful measure of content similarity, so the square of the cosine of that angle (easily computed as the normalized inner product of the two vectors) is used used to rank order the documents.
Experience has shown that the vector space method's effectiveness can be improved substantially by transforming the raw term-frequency vector in ways which amplify the influence of words which occur often in a document but relatively rarely in the whole collection. One common scheme, known as ``term-frequency---inverse document frequency'' weighting, assigns term i in document k a value computed as:
In a text filtering system, advance knowledge of the inverse document frequency portion of that equation is clearly not possible. Estimates of that information based on sampling earlier documents can, however, produce useful inverse document frequency values for domains in which term usage patterns are relatively stable.
Rather than estimate similarity, the probabilistic method seeks to estimate the probability that a document satisfies the information need represented by the profile. The probabilistic method is thus a generalization of the exact match technique in which we seek to rank order documents by the probability that they satisfy the information need rather than by making a sharp decision. To develop this probability, term frequency information (weighted to emphasize within document frequency and to deemphasize across-document frequency) is treated as an observation, and the distribution of the binary event ``document matches profile'' conditioned by that observation is computed. Bayesian inference networks have proven to be a useful technique for computing this conditional probability . Since it is possible to construct a Bayesian inference net which computes the cosine of the angle between two vectors, the vector space method can be interpreted as a special case of the probabilistic method .
Since the comparison function can produce a multiple-valued result, the display module can be designed to exploit the results of both exact match and ranked output techniques. For example, an electronic mail system could reject documents sent by specific users and then rank the remaining documents in order of decreasing content similarity to a prototype document provided by the user. The combination of the profile and the comparison technique in a ranked output text filtering system can be though of as specifying a point of view in the document space. Multiple rank orderings can be combined to produce richer displays that combine multiple points of view, a research area often referred to as ``document visualization'' or ``visual information retrieval interfaces.''
Although only the vector space method actually uses vector operations such as the inner product, all three of these approaches exploit ``feature vectors'' in which the features are based on the frequency with which terms appear within documents and across the collection. The annotations provided by social filtering techniques are an additional source of features that can be exploited by a comparison function. Because annotations can be used even when useful content-based features are difficult to construct, information retrieval systems designed for information that is not in text form have explored matching techniques for feature vectors composed of annotations.
One such application which appears to have reached the critical mass necessary for effective use of annotations is a home video recommendation service developed by Hill and his colleagues at Bellcore in which users' tastes in movies were matched using techniques similar to those implemented in GroupLens . Populated with a large and relatively stable set of movie titles, stable interests could be matched against that database for some time before exhausting the set of movies that might be of interest to a user. This is an interesting case in which the unlabeled corner of the graph in figure 1 is worth exploring.
Hill's system allowed users to provide numeric evaluations (on a scale of one to ten) for movies they had already seen, and then matched those ratings with evaluations of the same movies that had previously been provided by other users. Movies were sorted by category (e.g., drama or comedy), and within a category correlation coefficients between the feature vectors were computed. A set of users with the largest correlations was then selected and regression was performed based on evaluations from those users to predict scores for unseen movies in each category. In this case the profile was the set of annotations provided by the user, the ``document'' features were the annotations provided by others, and the comparison function was a two-step process of feature selection followed by regression.
In addition to showing how annotations can be viewed as features, this example illustrates an important limitation of the information retrieval techniques we have described. In information filtering applications, profiles based on multiple documents (such as the multi-movie evaluation within a category used in Hill's system) are common. But information retrieval research has explored only relatively simple ways of combining this information to form profiles. Relevance feedback, an information retrieval technique in which feature vectors are formed from the content of multiple documents, has shown good results. But the ``one query at a time'' model which underlies much information retrieval research precludes consideration of techniques such as the regression used by Hill and his colleagues.