For documents that the user has already read, complete feature vectors describing both the document and the user's reaction to it can be constructed by adjoining the features that represent the document (e.g., term frequency values) with the vector that represents the user's reaction to it (e.g., explicit feedback). For new documents, only the features that represent the document will be known, and it would clearly be useful to be able to estimate the missing information (the user's anticipated reaction to the document). In the field known as ``machine learning'' this is known as the ``supervised learning'' problem.
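By way of illustration, the construction can be sketched in a few lines of Python; the three-term vocabulary and the numeric feedback encoding are assumptions made for illustration, not details of any system discussed here.

\begin{verbatim}
import numpy as np

# Adjoin a document's features with the user's reaction to form a
# complete training instance. Vocabulary size and the +1/-1 feedback
# encoding are illustrative assumptions.
def complete_vector(term_frequencies, feedback):
    return np.concatenate([term_frequencies, [feedback]])

doc_features = np.array([2.0, 0.0, 1.0])                 # term frequencies
training_instance = complete_vector(doc_features, +1.0)  # user liked it
# For a new document only doc_features are known; the final element
# (the user's anticipated reaction) is what must be estimated.
\end{verbatim}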
In the canonical supervised learning problem, the machine is presented
with a sequence of feature vectors (training instances), and then it
is required to predict one or more missing elements in another set of
feature vectors. Predicting these missing values is an
induction process, so induction forms the basis for machine
learning. No induction technique can be justified without reference to domain knowledge, however: in the absence of some bias in the induction technique, any set of observations could be explained after the fact, and thus any values could reasonably be predicted.
Langley identifies three ways in which this necessary bias can be introduced in a machine learning system: in the representation, in the search technique, and as explicit domain knowledge [22]. The vector space method, in which profiles are represented as a single vector and documents are ranked based on the angular similarity of their representation with that vector, combines both representation bias and search bias. InfoScope's learning heuristics (e.g., suggest filters for newsgroups that are read in at least 2 of the most recent 6 sessions) are an example of domain knowledge bias.
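As a concrete (and hedged) sketch of such domain knowledge bias, the InfoScope heuristic mentioned above might be implemented as follows; the session history representation and the threshold parameters are assumptions for illustration only.

\begin{verbatim}
# Suggest filters for newsgroups read in at least 2 of the most
# recent 6 sessions (both thresholds are parameters of the heuristic).
def suggest_filters(session_history, min_sessions=2, window=6):
    # session_history: list of sets of newsgroups read, oldest first
    counts = {}
    for session in session_history[-window:]:
        for group in session:
            counts[group] = counts.get(group, 0) + 1
    return [g for g, n in counts.items() if n >= min_sessions]

history = [{"rec.sewing"}, {"comp.ai"}, {"rec.sewing", "comp.ai"},
           {"sci.space"}, {"rec.sewing"}, {"comp.ai"}]
print(suggest_filters(history))  # ['rec.sewing', 'comp.ai']
\end{verbatim}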
Supervised learning is particularly well suited to exact match filtering systems which use explicit binary feedback, because in that case the training data contains exactly the same information (whether or not to select a document) that must be estimated for newly arrived documents. This is a special case of the ``classification'' problem, in which we wish to sort newly arrived documents into two or more categories (in this case, retained and rejected). Supervised learning can also be applied in ranked output filtering systems that use explicit feedback, assigning as a score for each document the system's estimate of the score that the user would assign. When implicit feedback is used, the ranking can be based on the predicted value of some observed parameter (e.g., reading duration). Alternatively, a manually constructed user model can be used to combine several observed parameters to produce an estimate of utility and then that estimate can be used to augment the training data.
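A minimal sketch of filtering as two-class classification appears below; the nearest-centroid rule and the toy data are illustrative assumptions rather than a method drawn from any of the systems discussed.

\begin{verbatim}
import numpy as np

# Classify documents as retained (+1) or rejected (-1) using the
# centroid of each class in the training data; explicit binary
# feedback provides the labels.
def train_centroids(X, y):
    return {label: X[y == label].mean(axis=0) for label in (+1, -1)}

def classify(centroids, doc):
    # Retain the document if it is nearer the "retained" centroid.
    return min(centroids, key=lambda lbl: np.linalg.norm(doc - centroids[lbl]))

X = np.array([[3., 0., 1.], [2., 1., 0.], [0., 2., 3.], [0., 3., 2.]])
y = np.array([+1, +1, -1, -1])   # explicit binary feedback
print(classify(train_centroids(X, y), np.array([2., 0., 2.])))  # 1
\end{verbatim}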
Six classic machine learning approaches have been applied to text filtering: rule induction, instance based learning, statistical classification, regression, neural networks, and genetic algorithms. Stevens' work on InfoScope is an example of rule induction. InfoScope's filter suggestions were implemented as a decision list of parameters (newsgroup, field and word) which, if present in an article, would result in either selection or rejection of that article. These rules (e.g., select if newsgroup is rec.sewing and ``bobbin'' appears in the subject field) are learned using heuristics which can be modified by the user.
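A hedged sketch of how such a decision list might be realized is given below; the rule format and the article representation are assumptions for illustration, not Stevens' implementation.

\begin{verbatim}
# Each rule tests (newsgroup, field, word); the first matching rule
# determines whether the article is selected or rejected.
RULES = [
    ("rec.sewing", "subject", "bobbin", "select"),
    ("rec.sewing", None, None, "reject"),
]

def apply_decision_list(article, rules=RULES, default="reject"):
    for newsgroup, field, word, action in rules:
        if newsgroup and article["newsgroup"] != newsgroup:
            continue
        if field and word and word not in article.get(field, "").lower():
            continue
        return action
    return default

article = {"newsgroup": "rec.sewing", "subject": "New bobbin winder tips"}
print(apply_decision_list(article))  # 'select'
\end{verbatim}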
Foltz applied an instance based learning technique to the selection of Internet News articles [9]. He retained representations of about 100 articles from a training collection that the user had designated as interesting, and then ranked new articles by the cosine between their representation and the nearest retained representation. In other words, articles were ranked most highly if they were most similar (using the cosine measure) to some positive example. In a small (four user) study, he found that this technique produced an average precision of 0.55 (43% above that achieved by random selection), and that a further improvement to 0.61 (another 11%) could be achieved using a dimensionality reduction technique known as Latent Semantic Indexing (LSI).
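The scoring rule can be sketched directly; the toy vectors below are illustrative, and only the nearest-positive-example cosine score follows Foltz's description.

\begin{verbatim}
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Score a new article by its cosine similarity to the nearest
# retained positive example, then rank articles by that score.
def score(article, positives):
    return max(cosine(article, p) for p in positives)

positives = [np.array([1., 2., 0.]), np.array([0., 1., 3.])]
articles = [np.array([1., 1., 0.]), np.array([0., 0., 2.])]
ranked = sorted(articles, key=lambda a: score(a, positives), reverse=True)
\end{verbatim}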
The LSI dimensionality reduction used by Foltz is an example of ``feature selection.''
Feature selection can be an important issue when applying machine
learning techniques to vector representations. Langley has observed
that ``many algorithms scale poorly to domains with large numbers of
irrelevant features,'' [22] and it is not uncommon to
have thousands of terms in the vocabulary of a text filtering system.
Schütze and others at Xerox PARC applied two rank reduction
techniques, one using the best 200 terms found with a
measure of dependence between terms and relevant documents, and the
other using a variation of the LSI dimension-reduction technique used
by Foltz [35]. For each of these feature selection techniques they evaluated four machine learning techniques: linear discriminant analysis (a statistical decision theory technique), logistic regression, a two-layer (linear) neural network, and a three-layer (nonlinear) neural network, using training and evaluation collections from TREC.
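The LSI-style reduction that both Foltz and the Xerox PARC group used can be sketched with a truncated singular value decomposition; the matrix dimensions and the choice of k below are illustrative assumptions.

\begin{verbatim}
import numpy as np

# Project documents (columns of the term-document matrix) into a
# k-dimensional latent space using a truncated SVD.
def lsi_project(term_doc, k):
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return np.diag(s[:k]) @ Vt[:k, :]   # k x (number of documents)

A = np.random.rand(1000, 50)         # 1000 terms, 50 documents (toy data)
docs_reduced = lsi_project(A, k=10)  # each document now has 10 features
\end{verbatim}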
Schütze and his colleagues found that using only the LSI feature vectors provided the best filtering effectiveness with linear discriminant analysis and with logistic regression, and that their implementation of linear discriminant analysis was the better of the two techniques. They also found that both the linear and nonlinear networks were able to equal the effectiveness of linear discriminant analysis on the LSI feature vectors, but that both types of networks performed slightly (but not statistically significantly) better when presented with both sets of selected features simultaneously. Finally, they found that a nonlinear neural network resulted in no improvement over their simpler linear network.
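As an illustration of one of the learners they evaluated, a logistic regression over LSI feature vectors can be fit by simple gradient ascent; this sketch is not their implementation, and the toy data and learning parameters are assumptions.

\begin{verbatim}
import numpy as np

# Fit logistic regression by gradient ascent on the log-likelihood;
# documents can then be ranked by their predicted relevance.
def fit_logistic(X, y, lr=0.1, steps=500):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))  # predicted probabilities
        w += lr * X.T @ (y - p) / len(y)    # gradient of log-likelihood
    return w

X = np.random.rand(40, 10)          # 40 documents in a 10-dim LSI space
y = (X[:, 0] > 0.5).astype(float)   # toy relevance labels
scores = 1.0 / (1.0 + np.exp(-(X @ fit_logistic(X, y))))
\end{verbatim}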
Exploring another machine learning technique, Sheth implemented a
genetic algorithm to filter Internet News in a system called ``Newt.''
A genetic algorithm uses algorithmic analogues to the genetic
crossover and mutation operations to generate candidate profiles that
inherit useful features from their ancestors, and uses competition to
identify and retain the best ones. Candidate profiles in Newt were
vectors of term weights. Relevance feedback based on explicit binary evaluations of articles was used to improve candidate profiles, moving them closer in the vector space to the representations of desirable articles and further from the representations of undesirable ones. In machine learning this approach is referred to as ``hill climbing.''
The crossover operator was periodically applied to combine segments of
two candidate profiles which were among those that had produced the
highest ranks (using a cosine similarity measure) for articles that
the user later identified as desirable. A mutation operator was
sometimes applied to the newsgroup name to explore whether existing
candidate profiles would perform well on newsgroups with similar
names. All of the candidate profiles contributed to the ranking of
the documents shown to the user, although those which consistently
performed well contributed more strongly to the ranking. Hence, the
profile itself was determined by the population of candidate
profiles, rather than by any individual candidate.
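The genetic operations can be sketched as follows; the fitness function, population size, and mutation parameters here are illustrative assumptions rather than Newt's actual settings.

\begin{verbatim}
import random
import numpy as np

def crossover(p1, p2):
    cut = random.randrange(1, len(p1))          # single-point crossover
    return np.concatenate([p1[:cut], p2[cut:]])

def mutate(profile, rate=0.1, scale=0.2):
    mask = np.random.rand(len(profile)) < rate
    return profile + mask * np.random.randn(len(profile)) * scale

# Keep the best candidates, breed replacements from them, and mutate.
def evolve(population, fitness, n_keep=4):
    best = sorted(population, key=fitness, reverse=True)[:n_keep]
    children = [mutate(crossover(*random.sample(best, 2)))
                for _ in range(len(population) - n_keep)]
    return best + children

# Toy fitness: cosine similarity to a hidden target interest vector.
target = np.random.rand(8)
fit = lambda p: float(p @ target / (np.linalg.norm(p) * np.linalg.norm(target)))
population = [np.random.rand(8) for _ in range(10)]
for _ in range(20):
    population = evolve(population, fit)
\end{verbatim}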
Sheth evaluated Newt using a technique referred to in machine learning as a ``synthetic user.'' By generating (rather than assessing) user preferences, the synthetic user technique allows specific aspects of a machine learning algorithm's performance (e.g., learning rate) to be assessed. Sheth created synthetic users whose interests were deemed to be satisfied whenever at least one word from a list associated with that simulated user appeared in an article. Using this technique he found that although individual candidate profiles were able to learn to satisfy a simulated user quickly, when the simulated user's interest shifted abruptly (simulated by changing the list of words associated with the simulated user) individual candidate profiles were slower to adapt. When evaluating complete profiles made up of populations of individual candidates, Sheth demonstrated the ability to control the adaptation rate by adjusting parameters of the genetic algorithm. Simulated users lack the sophistication of human relevance judgements, but the technique is both economical and reproducible, so it is useful for certain types of evaluations.
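The synthetic user itself is simple to sketch; the interest word lists below are illustrative.

\begin{verbatim}
# A synthetic user is satisfied by any article containing at least
# one word from its interest list; an abrupt interest shift is
# simulated by replacing the list.
def make_synthetic_user(interest_words):
    words = set(interest_words)
    return lambda text: bool(words & set(text.lower().split()))

judge = make_synthetic_user(["bobbin", "thread", "needle"])
print(judge("A new bobbin winder design"))  # True
print(judge("Orbital mechanics primer"))    # False
judge = make_synthetic_user(["orbit", "telescope"])  # interest shift
\end{verbatim}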