Complete feature vectors describing both the document and the user's reaction to it can be constructed for documents which have been read, by adjoining the features that represent the document (e.g., term frequency values) with the vector that represents the user's reaction to it (e.g., explicit feedback). For new documents, only the features that represent the document will be known, and it would clearly be useful to be able to estimate the missing information (the user's anticipated reaction to the document). In machine learning, this is known as the ``supervised learning'' problem.
In the canonical supervised learning problem, the machine is presented with a sequence of feature vectors (training instances), and then it is required to predict one or more missing elements in another set of feature vectors. Predicting these missing values is an induction process, so induction forms the basis for machine learning. No induction technique can be justified without reference to domain knowledge, however: in the absence of some bias in the induction technique, any set of observations could be explained after the fact, and so any values could reasonably be predicted. Langley identifies three ways in which this necessary bias can be introduced in a machine learning system: in the representation, in the search technique, and as explicit domain knowledge.  The vector space method, in which profiles are represented as a single vector and documents are ranked based on the angular similarity of their representation with that vector, combines both representation bias and search bias. InfoScope's learning heuristics (e.g., suggest filters for newsgroups that are read in at least 2 of the most recent 6 sessions) are an example of domain knowledge bias.
Supervised learning is particularly well suited to exact match filtering systems which use explicit binary feedback, because in that case the training data contains exactly the same information (whether or not to select a document) that must be estimated for newly arrived documents. This is a special case of the ``classification'' problem, in which we wish to sort newly arrived documents into two or more categories (in this case, retained and rejected). Supervised learning can also be applied in ranked output filtering systems that use explicit feedback, assigning as a score for each document the system's estimate of the score that the user would assign. When implicit feedback is used, the ranking can be based on the predicted value of some observed parameter (e.g., reading duration). Alternatively, a manually constructed user model can be used to combine several observed parameters to produce an estimate of utility and then that estimate can be used to augment the training data.
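The binary classification case can be sketched concretely. The following is a minimal illustration (not drawn from any of the systems surveyed here, and all names and data are assumptions): documents are sparse term-frequency vectors, and a new document is assigned to whichever class centroid, retained or rejected, it is more similar to under the cosine measure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency dicts."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(docs):
    """Average the term-frequency vectors of a set of training documents."""
    c = {}
    for d in docs:
        for t, w in d.items():
            c[t] = c.get(t, 0.0) + w / len(docs)
    return c

def classify(doc, retained, rejected):
    """Assign a new document to the class whose centroid it is closer to."""
    sim_retain = cosine(doc, centroid(retained))
    sim_reject = cosine(doc, centroid(rejected))
    return "retain" if sim_retain >= sim_reject else "reject"
```

Here the explicit binary feedback supplies the class labels for the training documents, exactly as described above; a ranked-output variant would simply return the similarity score rather than a hard decision.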
Six classic machine learning approaches have been applied to text filtering: rule induction, instance based learning, statistical classification, regression, neural networks, and genetic algorithms. Stevens' work on InfoScope is an example of rule induction. InfoScope's filter suggestions were implemented as a decision list of parameters (newsgroup, field and word) which, if present in an article, would result in either selection or rejection of that article. These rules (e.g., select if newsgroup is rec.sewing and ``bobbin'' appears in the subject field) are learned using heuristics which can be modified by the user.
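InfoScope's actual rule language is not reproduced in the literature surveyed here, but the decision-list idea can be sketched as follows; the rule tuples, article format, and function names are illustrative assumptions.

```python
# Each rule names a newsgroup and optionally a (field, word) pair; the first
# matching rule in the list decides whether the article is selected.
RULES = [
    # (newsgroup, field, word, action); field=None matches any article in the group
    ("rec.sewing", "subject", "bobbin", "select"),
    ("rec.sewing", None, None, "reject"),
]

def apply_rules(article, rules, default="reject"):
    """Return the action of the first rule that matches the article."""
    for newsgroup, field, word, action in rules:
        if article["newsgroup"] != newsgroup:
            continue
        if field is not None and word not in article.get(field, "").split():
            continue
        return action
    return default
```

A learning component in this style would add, reorder, or delete rules in response to observed behavior, subject to user modification of the heuristics.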
Foltz applied an instance based learning technique to selection of Internet News articles. He retained representations of about 100 articles from a training collection which the user designated as interesting, and then ranked new articles by the cosine between their representation and the nearest retained representation. In other words, articles were ranked most highly if they were the most similar (using the cosine measure) to some positive example. In a small (four user) study, he found that this technique produced an average precision of 0.55 (43% above that achieved by random selection), and that a further improvement to 0.61 (11%) could be achieved using a dimensionality reduction technique known as Latent Semantic Indexing (LSI).
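Foltz's ranking criterion can be sketched directly: score each new article by its cosine to the closest retained positive example, then sort best-first. The vector representation here is a simplified sketch (sparse term-frequency dicts rather than LSI vectors).

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency dicts."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_by_nearest_example(new_docs, positive_examples):
    """Instance-based ranking: score each article by its similarity to the
    nearest retained positive example, and return the articles best-first."""
    scored = [(max(cosine(d, p) for p in positive_examples), d)
              for d in new_docs]
    return [d for score, d in sorted(scored, key=lambda s: -s[0])]
```

Because only positive examples are retained, an article needs to resemble just one interesting training article to be ranked highly, which is the distinguishing property of this instance-based approach.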
This dimensionality reduction is an example of ``feature selection.'' Feature selection can be an important issue when applying machine learning techniques to vector representations. Langley has observed that ``many algorithms scale poorly to domains with large numbers of irrelevant features,''  and it is not uncommon to have thousands of terms in the vocabulary of a text filtering system. Schütze and others at Xerox PARC applied two rank reduction techniques, one using the best 200 terms found with a measure of dependence between terms and relevant documents, and the other using a variation of the LSI dimension-reduction technique used by Foltz. For each of these feature selection techniques they evaluated four machine learning techniques: linear discriminant analysis (a statistical decision theory technique), logistic regression, a two-layer (linear) neural network, and a three-layer (nonlinear) neural network, using training and evaluation collections from TREC.
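The term-selection idea can be sketched with a deliberately simple dependence score; this stand-in (comparing a term's document frequency in relevant versus non-relevant training documents) is an assumption for illustration and is not the specific measure the Xerox PARC group used.

```python
def select_terms(relevant, nonrelevant, k):
    """Keep the k terms whose document frequency differs most between
    relevant and non-relevant training documents (documents are word sets).
    A simple stand-in for a term/relevance dependence measure."""
    def df(term, docs):
        # fraction of documents in which the term appears
        return sum(1 for d in docs if term in d) / len(docs)
    vocab = {t for d in relevant + nonrelevant for t in d}
    scored = sorted(vocab, key=lambda t: -abs(df(t, relevant) - df(t, nonrelevant)))
    return scored[:k]
```

Restricting the vocabulary this way (to the best 200 terms, in the PARC experiments) directly addresses Langley's observation about irrelevant features, since terms distributed similarly across relevant and non-relevant documents carry little information for the classifier.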
Schütze and his colleagues found that using only the LSI feature vectors provided the best filtering effectiveness with linear discriminant analysis and with logistic regression, and that their implementation of linear discriminant analysis was the better of the two techniques. They also found that both the linear and nonlinear networks were able to equal the effectiveness of linear discriminant analysis on the LSI feature vectors, but that both types of networks performed slightly (but not statistically significantly) better when presented with both sets of selected features simultaneously. Finally, they found that a nonlinear neural network resulted in no improvement over their simpler linear network.
Exploring another machine learning technique, Sheth implemented a genetic algorithm to filter Internet News in a system called ``Newt.'' A genetic algorithm uses algorithmic analogues to the genetic crossover and mutation operations to generate candidate profiles that inherit useful features from their ancestors, and uses competition to identify and retain the best ones. Candidate profiles in Newt were vectors of term weights. Relevance feedback based on explicit binary evaluations of articles was used to improve candidate profiles, moving them closer in the vector space to the representation of desirable articles and further from the representation of undesirable ones. In machine learning this approach is referred to as ``hill climbing.'' The crossover operator was periodically applied to combine segments of two candidate profiles which were among those that had produced the highest ranks (using a cosine similarity measure) for articles that the user later identified as desirable. A mutation operator was sometimes applied to the newsgroup name to explore whether existing candidate profiles would perform well on newsgroups with similar names. All of the candidate profiles contributed to the ranking of the documents shown to the user, although those which consistently performed well contributed more strongly to the ranking. Hence, the profile itself was determined by the population of candidate profiles, rather than by any individual candidate.
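The population-based mechanism can be sketched as follows. This is a generic genetic-algorithm skeleton under simplifying assumptions (a tiny fixed vocabulary, a toy dot-product fitness in place of Newt's rank-based fitness, uniform crossover rather than segment crossover, and no newsgroup mutation); the data and parameters are illustrative.

```python
import random

random.seed(42)  # make the illustrative run reproducible

TERMS = ["bobbin", "thread", "football", "stitch"]
LIKED = [{"bobbin": 2, "stitch": 1}, {"thread": 1}]  # hypothetical positive feedback

def fitness(profile):
    """Toy fitness: total weight the profile places on liked-article terms."""
    return sum(profile[t] * d.get(t, 0) for d in LIKED for t in TERMS)

def crossover(a, b):
    """Inherit each term weight from one of the two parent profiles."""
    return {t: random.choice((a[t], b[t])) for t in TERMS}

def mutate(p, rate=0.1):
    """Occasionally perturb a weight to keep exploring the space."""
    return {t: w + random.uniform(-1, 1) if random.random() < rate else w
            for t, w in p.items()}

def evolve(generations=20, size=8):
    """Competition retains the fittest candidates; crossover and mutation
    generate their replacements."""
    pop = [{t: random.uniform(0, 1) for t in TERMS} for _ in range(size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: size // 2]          # the fittest half survives
        children = [mutate(crossover(*random.sample(parents, 2)))
                    for _ in range(size - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)
```

In Newt, the entire final population (not just the single fittest candidate returned here) would contribute to the document ranking, weighted by past performance.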
Sheth evaluated Newt using a technique referred to in machine learning as a ``synthetic user.'' By generating (rather than assessing) user preferences, the synthetic user technique allows specific aspects of a machine learning algorithm's performance (e.g., learning rate) to be assessed. Sheth created synthetic users whose interests were deemed to be satisfied whenever at least one word from a list associated with that simulated user appeared in an article. Using this technique he found that although individual candidate profiles were able to learn to satisfy a simulated user quickly, when the simulated user's interest shifted abruptly (simulated by changing the list of words associated with the simulated user) individual candidate profiles were slower to adapt. When evaluating complete profiles made up of populations of individual candidates, Sheth demonstrated the ability to control the adaptation rate by adjusting parameters of the genetic algorithm. Simulated users lack the sophistication of human relevance judgements, but the technique is both economical and reproducible, so it is useful for certain types of evaluations.
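Sheth's synthetic-user scheme is simple enough to state in code: a simulated user is satisfied by any article containing at least one word from an associated interest list, and precision can then be computed mechanically against generated judgments. The function names are illustrative.

```python
def make_synthetic_user(interest_words):
    """A simulated user who is satisfied by any article containing at least
    one word from the interest list; shifting interests are simulated by
    building a new user with a different word list."""
    def satisfied(article_words):
        return any(w in article_words for w in interest_words)
    return satisfied

def precision(selected_articles, user):
    """Fraction of the selected articles that satisfy the synthetic user."""
    hits = sum(1 for a in selected_articles if user(a))
    return hits / len(selected_articles)
```

Because the judgments are generated rather than elicited, runs with such a user are cheap and exactly reproducible, which is what makes the technique attractive for measuring properties like learning rate.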