next up previous
Next: Conclusion Up: CAR-TR-830 CLIS-TR-96-02 CS-TR-3643 Previous: Privacy

Observations on the State of the Art

Early information filtering systems (then known as SDI) were developed to exploit the availability of information in electronic form to manage the process of disseminating scientific information. When the printed page was the dominant paradigm for text transmission, high production costs led to the development of extensive social structures (e.g., the peer review process) for selecting information worthy of publication. As long as this situation persisted, the dissemination process managed admirably, and SDI improved its performance. With the introduction of personal computing and ubiquitous networking, each participant can now be both a consumer and a producer of information. The drastic reduction in publishing costs has greatly increased the importance of filtering the resulting flood of information, but the resulting variability in document quality has also made that filtering task more difficult. Automatic techniques are needed to make this wealth of information accessible, since information that cannot be found is no better than information that does not exist.

Rather than simply removing unwanted information, information filtering actually gives consumers the ability to reorganize the information space [38]. For economic reasons, information spaces have traditionally been organized by producers and, in some cases, reorganized by intermediaries. In book publishing, for example, authors and publishers work together to assign titles to books and to announce their availability. Intermediaries such as libraries, book clubs and book stores obtain those announcements, select items which are likely to be of interest to their customers, and organize information about their selections in ways that serve the needs of those customers. Because such intermediaries typically serve substantial numbers of customers, economic factors usually limit them to providing a few perspectives on the information space (sometimes only one).

Information filtering is essentially a personal intermediation service. Like a library, a text filtering system can collect information from multiple sources and produce an organization that is useful to its patrons. But by automating the process of organizing the information space it becomes economically feasible to personalize this organization. Of course, automating this intermediation process eliminates the value that could be added by human intermediaries who can apply their judgement to improve the organization of the information space.

Social filtering offers a way of integrating human and automated intermediation. Human intermediaries have traditionally organized the information space through selection and annotation. Selection, however, is simply a special type of annotation (i.e., a document is marked as ``selected by the intermediary''). As with price annotations, the user may find it useful to assign expert annotations an a priori degree of confidence because they come from a source with well understood characteristics. Tapestry's profile specification language provides an example of how such functionality could be incorporated.
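The idea of assigning expert annotations an a priori degree of confidence can be made concrete with a small sketch. The code below is purely illustrative and is not Tapestry's actual profile specification language; the `Annotation` class, the `SOURCE_CONFIDENCE` table, and the `annotation_score` function are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    source: str      # who annotated (e.g., an expert intermediary)
    label: str       # e.g., "selected" marks selection-as-annotation
    value: float     # strength of the annotation, in [0, 1]

# A priori confidence the user assigns to each annotation source
# (hypothetical sources and values, chosen for illustration).
SOURCE_CONFIDENCE = {"peer_reviewer": 0.9, "mailing_list": 0.3}

def annotation_score(annotations):
    """Combine expert annotations, discounting each one by the
    user's a priori confidence in its source (default 0.5 for
    sources with no known characteristics)."""
    weighted = [SOURCE_CONFIDENCE.get(a.source, 0.5) * a.value
                for a in annotations]
    return sum(weighted) / len(weighted) if weighted else 0.0
```

A document "selected" by a trusted reviewer thus contributes more to its score than the same mark from a source with poorly understood characteristics.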

Social filtering alone is unlikely to provide a complete solution to users' information filtering needs. Expert annotations require effort and have economic value, so the marketplace will undoubtedly assign them a price. With continued reductions in the cost of computing and communications resources, content-based filtering will offer a competitive source of information on which to base selections. Furthermore, because humans and machines base their evaluations on different features, systems which incorporate both social and content-based filtering will likely be more effective than those which use either technique in isolation. In this light, the work of Schütze and his colleagues suggests that machine learning techniques which effectively exploit multiple sources of evidence can be found [35].
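One simple way such multiple sources of evidence might be combined is a linear blend of a content-based score and a social score, with the mixing weight fit to explicit user ratings. This is a minimal sketch, not the machine learning method of Schütze and his colleagues; the function names and the grid-search fitting procedure are assumptions made for illustration.

```python
def combined_score(content, social, alpha):
    """Blend a content-based score and a social score, both in [0, 1],
    using mixing weight alpha."""
    return alpha * content + (1.0 - alpha) * social

def fit_alpha(examples):
    """Choose the alpha in [0, 1] that minimizes squared error against
    explicit ratings. `examples` is a list of (content, social, rating)
    triples; a coarse grid search suffices for one parameter."""
    best_alpha, best_err = 0.0, float("inf")
    for step in range(101):
        a = step / 100.0
        err = sum((combined_score(c, s, a) - r) ** 2
                  for c, s, r in examples)
        if err < best_err:
            best_alpha, best_err = a, err
    return best_alpha
```

When the two kinds of evidence disagree, the fitted weight records which source has better predicted this user's judgements in the past.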

Content-based and social filtering will almost certainly prove to be complementary in other, less easily measured ways as well. A perfect content-based technique would never find anything novel, limiting the range of applications for which it would be useful. Social filtering techniques excel at identifying novelty (because they are guided by humans), but only when the humans who guide them are not overloaded with information. Content-based systems can help to reduce this volume of information to manageable levels. Thus content-based and collaborative filtering each contribute to the other's effectiveness, allowing an integrated system to achieve both reliability and serendipity.

Social filtering has yet to realize this potential, however. The difficulty of achieving a critical mass of participants makes social filtering experiments expensive. One clear disincentive in present experiments is the additional cognitive load imposed on the user by the requirement to provide explicit feedback. We are not aware of any research in which implicit feedback has been applied to social filtering, but there is some evidence that such an approach could be successful. Hill and his colleagues have reported that readers find it useful to know which portions of a document receive the most attention from other readers. In an analogy to the tendency of well-used paper documents to acquire characteristics which convey similar information, they call this concept ``read wear'' [15]. Coarser measurements such as Morita and Shinoda's reading time metric, or the save and reply decisions explored by Stevens, may also prove to be useful bases for social filtering in some applications. If useful annotations can be acquired without requiring explicit feedback, lesser inducements (such as the improvement that could result from application of a simple content-based filtering technique) may be sufficient to assemble the critical mass of users needed to evaluate social filtering techniques.
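A reading-time metric of the kind Morita and Shinoda studied might be sketched as follows. The thresholds and the linear interpolation below are assumptions made for illustration, not their published parameters: the intuition is simply that reading much slower than normal suggests engagement, while skimming suggests rejection.

```python
def implicit_rating(seconds_displayed, word_count,
                    min_rate=0.5, max_rate=5.0):
    """Map observed reading time to an implicit rating in [0, 1].

    Reading slower than `min_rate` words/second suggests interest;
    skimming faster than `max_rate` words/second suggests rejection.
    Both thresholds are illustrative assumptions.
    """
    if word_count == 0 or seconds_displayed == 0:
        return 0.0
    rate = word_count / seconds_displayed   # words per second
    if rate >= max_rate:        # skimmed: probably uninteresting
        return 0.0
    if rate <= min_rate:        # lingered: probably interesting
        return 1.0
    # interpolate linearly between the two thresholds
    return (max_rate - rate) / (max_rate - min_rate)
```

Ratings inferred this way could feed a social filtering system without imposing any additional cognitive load on the reader.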

Another serious impediment to the large scale evaluation of social filtering techniques is the difficulty of constructing suitable measures of effectiveness. Recall, precision and fallout are of some use when comparing content-based filtering techniques, but their reliance on normative judgements of document relevance suppresses exactly the individual variations that social filtering seeks to exploit. One feasible evaluation technique would be to apply simulated users like those used by Sheth to investigate specific aspects of collaborative behavior. Important issues such as the learning rates and variability in learning behavior across large heterogeneous populations could be investigated with large collections of simulated users whose design was tailored to explore those issues.
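The simulated-user idea can be illustrated in miniature: a simulated user with a fixed hidden interest level provides noisy ratings, and we record how quickly a simple learning rule converges toward that hidden value. The user model, the Gaussian noise, and the update rule here are assumptions chosen for illustration, not Sheth's actual design.

```python
import random

def simulate(true_profile, n_rounds, lr=0.2, seed=0):
    """Return per-round absolute error of a filter that learns a
    one-dimensional interest weight from a simulated user's noisy
    feedback. Larger populations of such users, with varied
    `true_profile` and noise settings, could probe learning rates
    across a heterogeneous population."""
    rng = random.Random(seed)         # seeded for reproducibility
    estimate = 0.5                    # uninformed initial estimate
    errors = []
    for _ in range(n_rounds):
        feedback = true_profile + rng.gauss(0, 0.05)  # noisy rating
        estimate += lr * (feedback - estimate)        # simple update
        errors.append(abs(true_profile - estimate))
    return errors
```

Because every aspect of such users is under experimental control, learning-rate questions that would require enormous human populations can be explored cheaply, at the cost of realism.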

Another alternative is to study situated users (i.e., human users performing self-directed tasks), attempt to provide them with desirable documents, and then measure something related to their satisfaction. Those ``dependent variables'' could certainly be the sort of explicit feedback commonly required in present social filtering experiments, but insisting on explicit feedback increases the difficulty of assembling a sufficiently large user population. If suitable sources of implicit feedback can be identified, those same measures would be a far better choice for the set of dependent variables. Such an experiment design requires that separate training and evaluation document collections be used, a feature easily introduced by withholding implicit feedback from the filtering algorithm during the evaluation period. This approach can be used to evaluate both content-based and social filtering systems, so it would be a natural choice when evaluating systems which apply both types of techniques. It can only be applied, however, after suitable sources of implicit feedback are found. Since implicit feedback has the potential for a high payoff in performance evaluation, filtering effectiveness, and user satisfaction, research on that topic should be accorded a high priority.
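The withholding design described above can be sketched as follows. During the evaluation period, implicit feedback is still recorded (so that satisfaction can be measured) but is withheld from the filter's learning component. All class and method names here are hypothetical, introduced only for this sketch.

```python
class FilterEvaluation:
    """Wrap a filtering model so implicit feedback reaches its
    learning component only during the training period."""

    def __init__(self, filter_model):
        self.model = filter_model     # must expose learn(doc, feedback)
        self.evaluating = False
        self.held_out = []            # recorded but not learned from

    def start_evaluation(self):
        """End training: subsequent feedback is withheld from learning."""
        self.evaluating = True

    def observe(self, document, implicit_feedback):
        if self.evaluating:
            self.held_out.append((document, implicit_feedback))
        else:
            self.model.learn(document, implicit_feedback)

    def mean_satisfaction(self):
        """Average withheld feedback: the dependent variable."""
        if not self.held_out:
            return 0.0
        return sum(f for _, f in self.held_out) / len(self.held_out)
```

Because the wrapper is agnostic about what the model does with feedback, the same protocol serves content-based, social, and integrated systems.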


Douglas W. Oard
Sun Apr 27 13:18:52 EDT 1997
