As Belkin and Croft observed, content-based text selection techniques have been extensively evaluated in the context of information retrieval [2]. Every approach to text selection has four basic components:

- Some technique for representing the documents
- Some technique for representing the information need (i.e., profile construction)
- Some way of comparing the profiles with the document representations
- Some way of using the results of that comparison

In each of the text filtering systems we describe in this report, the
selection module assigns one or more values to each document, and the
display module then uses those values to organize the display.
Figure 3 illustrates the representation and comparison
process implemented by those systems. The domain of the profile
acquisition function **p** is **I**, the collection of possible
information needs and its range is **R**, the unified space of profile
and document representations. The domain of the document
representation function **d** is **D**, the collection of documents, and
its range is also **R**. The domain of the comparison function **c** is
and its range is , the set of **n**-tuples of
real numbers between zero and one. In an ideal text filtering system,

where represents the user's
judgement of some relationships between an interest and a document,
measured on **n** ordinal scales (e.g., topical similarity or degree of
constraint satisfaction).

**Figure 3:** Text filtering system model.

As we saw in section 4, the representation can exploit information derived from the content of the document, annotations made by others, or some combination of the two. Although syntactic and semantic analysis of documents is possible, content-based text filtering systems typically use representations based on the frequency with which terms occur in each document. One reason for this choice is that it lends itself to efficient implementation. But a more compelling reason is that because no domain-specific information is needed to form the representation, a demonstration of acceptable performance in one application is easily translated into similar performance in another.

Although content-based text filtering systems typically start with
this term-frequency representation, they generally apply some type of
transformation to that representation before invoking the comparison
function **c** in figure 3 The nature of the
transformation depends strongly on which characteristics of that
representation the comparison function **c** is designed to exploit,
however. For this reason, we describe the transformations together
with thir associated comparison functions in the following paragraphs.

For an exact match text filtering system the range of the comparison
function **c** is restricted to be either zero or one, and it is
interpreted as a * binary judgement* about whether a document *
satisfies* the profile. In this case, a step function that detects
term presence is applied to the term-frequency representation when
that representation is constructed so that the resulting boolean
vector can be easily compared to the boolean expression specified by
the profile. Exact match text filtering systems typically provide an
unranked set of documents which will (hopefully) satisfy the
information need. The exact match approach is well suited to
autonomous systems which must take actions (such as storage decisions)
without user interaction.

Two common approaches to ranked output generation are the vector space
method and the probabilistic method, although variations abound. In
the vector space method the range of **c** is [0,1], and the value is
interpreted as the degree to which the content of two documents is
similar. Both the profile and the documents are represented as
vectors in a vector space, and a comparison technique based on the
assumption that documents whose representations are similar to the
profile will be likely to satisfy the associated information need is
used. The angle between two vectors has been found to be a useful
measure of content similarity, so the square of the cosine of that
angle (easily computed as the normalized inner product of the two
vectors) is used used to rank order the documents.

Experience has shown that the vector space method's effectiveness can
be improved substantially by transforming the raw term-frequency
vector in ways which amplify the influence of words which occur often
in a document but relatively rarely in the whole collection. One
common scheme, known as ``term-frequency---inverse document
frequency'' weighting, assigns term **i** in document **k** a value
computed as:

In a text filtering system, advance knowledge of the inverse document frequency portion of that equation is clearly not possible. Estimates of that information based on sampling earlier documents can, however, produce useful inverse document frequency values for domains in which term usage patterns are relatively stable.

Rather than estimate similarity, the probabilistic method seeks to
estimate the * probability* that a document * satisfies* the
information need represented by the profile. The probabilistic method
is thus a generalization of the exact match technique in which we seek
to rank order documents by the probability that they satisfy the
information need rather than by making a sharp decision. To develop
this probability, term frequency information (weighted to emphasize
within document frequency and to deemphasize across-document
frequency) is treated as an observation, and the distribution of the
binary event ``document matches profile'' conditioned by that
observation is computed. Bayesian inference networks have proven to be
a useful technique for computing this conditional
probability [41]. Since it is possible to construct a
Bayesian inference net which computes the cosine of the angle between
two vectors, the vector space method can be interpreted as a special
case of the probabilistic method [42].

Since the comparison function can produce a multiple-valued result,
the display module can be designed to exploit the results of both
exact match and ranked output techniques. For example, an electronic
mail system could reject documents sent by specific users and then
rank the remaining documents in order of decreasing content similarity
to a prototype document provided by the user. The combination of the
profile and the comparison technique in a ranked output text filtering
system can be though of as specifying a * point of view* in the
document space. Multiple rank orderings can be combined to produce
richer displays that combine multiple points of view, a research area
often referred to as ``document visualization'' or ``visual
information retrieval interfaces.''

Although only the vector space method actually uses vector operations such as the inner product, all three of these approaches exploit ``feature vectors'' in which the features are based on the frequency with which terms appear within documents and across the collection. The annotations provided by social filtering techniques are an additional source of features that can be exploited by a comparison function. Because annotations can be used even when useful content-based features are difficult to construct, information retrieval systems designed for information that is not in text form have explored matching techniques for feature vectors composed of annotations.

One such application which appears to have reached the critical mass necessary for effective use of annotations is a home video recommendation service developed by Hill and his colleagues at Bellcore in which users' tastes in movies were matched using techniques similar to those implemented in GroupLens [16]. Populated with a large and relatively stable set of movie titles, stable interests could be matched against that database for some time before exhausting the set of movies that might be of interest to a user. This is an interesting case in which the unlabeled corner of the graph in figure 1 is worth exploring.

Hill's system allowed users to provide numeric evaluations (on a scale of one to ten) for movies they had already seen, and then matched those ratings with evaluations of the same movies that had previously been provided by other users. Movies were sorted by category (e.g., drama or comedy), and within a category correlation coefficients between the feature vectors were computed. A set of users with the largest correlations was then selected and regression was performed based on evaluations from those users to predict scores for unseen movies in each category. In this case the profile was the set of annotations provided by the user, the ``document'' features were the annotations provided by others, and the comparison function was a two-step process of feature selection followed by regression.

In addition to showing how annotations can be viewed as features, this example illustrates an important limitation of the information retrieval techniques we have described. In information filtering applications, profiles based on multiple documents (such as the multi-movie evaluation within a category used in Hill's system) are common. But information retrieval research has explored only relatively simple ways of combining this information to form profiles. Relevance feedback, an information retrieval technique in which feature vectors are formed from the content of multiple documents, has shown good results. But the ``one query at a time'' model which underlies much information retrieval research precludes consideration of techniques such as the regression used by Hill and his colleagues.

Sun Apr 27 13:18:52 EDT 1997