Knowledge Discovery:
Mining and Searching Data

Synopsis of the Data Mining and Search Session
at the workshop on
The Interface of Three Areas of Computer Science with the Mathematical Sciences





In this synopsis, we summarize the presentations and the discussion of open problems.

THE NATURE OF DATA MINING AND SEARCH

The World-Wide Web (the Web) is a remarkable source of information, and this session, moderated by Dianne O'Leary (University of Maryland), focussed on gathering information from it, validating that information, and using the information to draw conclusions and make predictions.

The term ``data mining" is quite old. Jon Kleinberg (Cornell University) noted that up to 1992, it occurred mostly in the statistics literature and mostly in a pejorative context. Today, however, it refers to retrieving useful information from multiple sources, often on the Web, and making useful inferences from it.

There are a remarkable number of approaches to designing search engines for the Web. Google, for example, uses a link-analysis technique (PageRank) related in spirit to the principle of hubs and authorities. Examining the directed graph formed by links between Web pages, if we can identify a tightly linked bipartite subgraph, then pages with high in-degree are denoted authorities, since they are cited by many other pages, while those with high out-degree are hubs, or reference sites for specialized subject areas. The identification can be done automatically through eigenvector computations on matrices derived from the link graph. In contrast, search engines such as Alta Vista characterize pages by a series of keywords that are weighted by factors such as frequency and size of type and then matched to a user's query, perhaps using techniques such as latent semantic indexing and low-rank matrix approximations. The Alexa search engine uses the recent history of users' Web browsing to tailor query responses. And engines like Yahoo rely on humans to search the Web and organize the pages into a directed acyclic graph taxonomy.
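
To make the hubs-and-authorities computation concrete, here is a minimal sketch in Python (using NumPy) on a tiny, invented link graph; the iteration converges to the principal eigenvectors of A^T A (authorities) and A A^T (hubs), where A is the adjacency matrix of the links.

    import numpy as np

    # Toy directed link graph: entry (i, j) = 1 means page i links to page j.
    # The graph itself is invented purely for illustration.
    A = np.array([
        [0, 1, 1, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 0],
        [0, 1, 1, 0],
    ], dtype=float)

    hubs = np.ones(A.shape[0])
    auths = np.ones(A.shape[0])
    for _ in range(50):                    # power iteration
        auths = A.T @ hubs                 # a page cited by good hubs is a good authority
        hubs = A @ auths                   # a page citing good authorities is a good hub
        auths /= np.linalg.norm(auths)     # normalize so the scores stay bounded
        hubs /= np.linalg.norm(hubs)

    print("authority scores:", np.round(auths, 3))
    print("hub scores:      ", np.round(hubs, 3))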

The challenge is to develop a new level of capability in automated search and mining. For example, instead of merely being able to retrieve Web pages that concern the safety records of automobiles, we would like to be able to satisfy requests such as, "Compile a report with all available information on the safety of the 1998 Ford Taurus." This requires the ability to retrieve information stored in tables or sentences on various Web pages, judge its relevance and reliability, and present the results in an easily understood document.

Clearly, mathematics, especially graph theory, statistics, and linear algebra, has already played a critical role in data mining and searching, and this session explored the current state and future opportunities for interaction.

PROBLEMS IN DATA MINING

Chid Apte (IBM) discussed extracting "actionable insights" from data. He reminded the participants that data is not ordinarily presented with ease of mining in mind. For instance, some time series data may be raw, while other data have been "back-fitted," revised in light of supplementary data. This makes the use of statistical methods particularly difficult because of the hidden correlations.

Problem 1: How can we verify the integrity of data gathered from multiple sources? In particular, we need to distinguish signal from noise.

Problem 2: Are there automated methods to select the best data transformations, dropping to an appropriate subspace? In other words, how can we identify important features and search for alternate models within the space in order to find a near-optimal one?
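
As one hedged illustration of what "dropping to an appropriate subspace" can mean in practice, the sketch below projects a synthetic data matrix onto the subspace spanned by its top singular vectors; this is the computation that underlies principal component analysis and latent semantic indexing, and both the data set and the choice of three components are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))          # synthetic data: 100 samples, 20 raw features
    X -= X.mean(axis=0)                     # center each feature

    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = 3                                   # keep a 3-dimensional subspace (arbitrary choice)
    X_reduced = X @ Vt[:k].T                # coordinates of each sample in the reduced subspace

    explained = (s[:k] ** 2).sum() / (s ** 2).sum()
    print("reduced data shape:", X_reduced.shape)
    print(f"{k} components capture {explained:.1%} of the total variance")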

Problem 3: Can we extract from data robust models that are simultaneously comprehensible to end-users?

Apte predicts that breakthroughs will come through interdisciplinary approaches, drawing on expertise in statistics, machine learning, and data management. In addition, results from computational learning theory, human-computer interaction, the understanding of high-dimensional systems, visualization, and optimization involving variational problems will be essential.

ENHANCING THE POWER OF AUTOMATED KNOWLEDGE DISCOVERY

Raul Valdes-Perez (Carnegie Mellon) focussed on the problems of knowledge discovery, gathering and analyzing data in order to determine underlying rules and defining properties.

He demonstrated some current technology and presented several challenges. The current capabilities of discovery are quite limited: we cannot easily extract information from multiple sources with multiple formats and reason about it. In addition, the presentation of results to users is quite primitive, for example, as a decision tree or a cluster diagram.

Problem 4: Enlarge the range of automated discovery: for instance, form conjectures or predictions.

Problem 5: Make data mining so automatic that even non-experts in mining can ask questions and get useful information back that is easy to interpret.

WEB SOCIETY AND SEARCH

Jon Kleinberg focussed on the information content and social structure of the Web and on the problems of Web searching. He noted that searching is made difficult by two opposing characteristics. On the one hand, only a few sites (if any) contain the answer to a user's query, and it is a natural language processing challenge to infer that the answer is contained within a given Web page. On the other hand, most queries bring back far too many responses, overwhelming the person who formulated the query. Thus effective filtering of responses is essential. We would like to give priority to Web pages that are authorities, in the sense discussed above.

Problem 6: Find good visualization models for the Web that present the information in a usable way. Thus far, the best tool is the graphical browser.

Kleinberg is fascinated by the problem of discovering cyber communities through the hub-and-authority structure.

Problem 7: The 200 million pages have been reduced to 500,000 overlapping communities (but not a cover). What is the structure of this aggregation and what properties should we search for? Multiresolution approaches (as proposed for the networking traffic session) might be useful here, as might the study of self-similarity plus perturbation.

Problem 8: Construct a simple random process that yields a graph with the properties of the Web: a power-law degree distribution, many dense bipartite subgraphs, and a small diameter, since the average distance between any two Web pages is 15-20 links.
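
One simple candidate process, sketched below in Python with invented parameters, is preferential attachment ("rich get richer"): each new page cites an existing page with probability proportional to that page's current popularity, which is known to yield a power-law in-degree distribution. Whether such a process also reproduces the Web's dense bipartite subgraphs and small diameter is part of the open problem.

    import random
    from collections import Counter

    random.seed(1)
    targets = [0]                     # endpoints seen so far; popular pages appear more often
    in_degree = Counter({0: 0})
    for new_page in range(1, 10000):
        cited = random.choice(targets)      # cite an existing page in proportion to its popularity
        in_degree[cited] += 1
        in_degree.setdefault(new_page, 0)
        targets.extend([cited, new_page])   # both endpoints become candidates for future citation

    # Tally how many pages have each in-degree; a heavy tail is the power-law signature.
    histogram = Counter(in_degree.values())
    for d in sorted(histogram)[:8]:
        print(f"in-degree {d}: {histogram[d]} pages")
    print("largest in-degree:", max(in_degree.values()))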

"Clickstream data," the path that a user takes in surfing the Web, defines a dynamic graph. If a large Internet Service Provider such as AOL made such data available to researchers, it would reveal a lot about communities, enable people to restructure Websites for easier access, and also have great value to advertisers. Such release is problematic, however, because of privacy concerns and the economic value of the data.

Problem 9: Classify Web pages by topic. This problem is analogous to image segmentation, so the challenge is to use vision methods in this context.

In problems such as this one, computational learning theory is a useful tool in trying to deduce function from observations. This has led to interesting interactions between computer science and the mathematical sciences. Valiant's model of PAC learning is connected to Vapnik's ideas about classification and pattern recognition; the computational algorithms minimize a cost function plus a term to restrict the search space, just as in regularization methods for ill-posed problems; and a technique called boosting adds classifiers to take care of difficult examples in a stable way and is analogous to column generation in mathematical programming.
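
As a concrete (and hedged) illustration of boosting, the following sketch runs an AdaBoost-style loop on an invented two-dimensional data set, using simple threshold "stumps" as weak classifiers; the point to notice is how the examples the current stump misclassifies are up-weighted so that later stumps concentrate on the difficult cases.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(200, 2))          # synthetic two-dimensional examples
    y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)     # labels in {-1, +1}

    def best_stump(X, y, w):
        """Return (error, feature, threshold, sign) of the stump with least weighted error."""
        best = None
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = np.where(X[:, j] > t, sign, -sign)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, t, sign)
        return best

    w = np.full(len(y), 1 / len(y))                # start with uniform example weights
    ensemble = []
    for _ in range(20):                            # 20 boosting rounds (arbitrary)
        err, j, t, sign = best_stump(X, y, w)
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        pred = np.where(X[:, j] > t, sign, -sign)
        w *= np.exp(-alpha * y * pred)             # up-weight the examples this stump got wrong
        w /= w.sum()
        ensemble.append((alpha, j, t, sign))

    scores = sum(a * np.where(X[:, j] > t, s, -s) for a, j, t, s in ensemble)
    print("training accuracy of the boosted ensemble:", np.mean(np.sign(scores) == y))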

FURTHER PROBLEMS

A critical issue is to distinguish between trusted authorities and unreliable ones. The Weighted Majority Algorithm is one approach that can be applied: by combining the experts' predictions, it makes at most O(k + log n) mistakes, where n is the number of experts and k is the number of mistakes made by the most reliable one.
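
A minimal sketch of the deterministic Weighted Majority Algorithm on synthetic experts follows: the master predicts by weighted vote, and every expert that errs has its weight multiplied by a constant beta, which is what drives the O(k + log n) mistake bound quoted above.

    import random

    random.seed(0)
    n_experts, n_rounds, beta = 10, 200, 0.5
    weights = [1.0] * n_experts
    # Synthetic experts: expert 0 is right 95% of the time, the rest 60%.
    reliability = [0.95] + [0.60] * (n_experts - 1)
    mistakes = 0

    for _ in range(n_rounds):
        truth = random.choice([0, 1])
        advice = [truth if random.random() < reliability[i] else 1 - truth
                  for i in range(n_experts)]
        vote_for_1 = sum(w for w, a in zip(weights, advice) if a == 1)
        prediction = 1 if vote_for_1 >= sum(weights) / 2 else 0
        mistakes += (prediction != truth)
        # Penalize every expert that was wrong by shrinking its weight.
        weights = [w * (beta if a != truth else 1.0) for w, a in zip(weights, advice)]

    print(f"mistakes of the combined predictor over {n_rounds} rounds: {mistakes}")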

Susan Gauch (University of Kansas) posed a fundamental problem in designing search engines:

Problem 10: How can the search engine solicit feedback from the user about how well the retrieved documents satisfy the query, and how can the engine make use of this information? In addition, how can the retrieval be tailored to an individual user, making intelligent guesses, for instance, about whether a request for "Michael Jordan" refers to the computer scientist, the jazz musician, or the basketball player?

Problem 11: The issue of image retrieval is not well understood, unless there is accompanying text or an index. How can we retrieve images related to a given one by color profile (e.g., mostly red), category (e.g., landscape), or subject (e.g., Isaac Newton)?

Laszlo Lovasz (Microsoft) noted that two interesting graph decompositions have emerged in recent decades. Robertson-Seymour graph minor theory, for low-density graphs, considers graphs G that do not contain K_m as a minor; such graphs are, roughly, small perturbations of 2-dimensional graphs glued together. The Szemeredi regularity lemma says that graphs in which the number of edges is proportional to n^2, where n is the number of vertices, can be decomposed into a bounded number of pieces such that the bipartite graphs between most pairs of pieces look random (with various densities).

Problem 12: Is there a randomized way of simulating the Internet as a graph?

These connections highlight the importance of understanding basic linear algebra constructions such as the graph Laplacian.
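
As a small illustration, the sketch below builds the combinatorial Laplacian L = D - A for an invented six-node graph (two triangles joined by a single edge) and computes its spectrum; the eigenvector belonging to the second-smallest eigenvalue (the Fiedler vector) changes sign between the two loosely connected triangles, which is one way Laplacian eigenvectors expose community structure.

    import numpy as np

    # Adjacency matrix of two triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
    A = np.array([
        [0, 1, 1, 0, 0, 0],
        [1, 0, 1, 0, 0, 0],
        [1, 1, 0, 1, 0, 0],
        [0, 0, 1, 0, 1, 1],
        [0, 0, 0, 1, 0, 1],
        [0, 0, 0, 1, 1, 0],
    ], dtype=float)

    L = np.diag(A.sum(axis=1)) - A           # Laplacian: degree matrix minus adjacency
    eigvals, eigvecs = np.linalg.eigh(L)     # L is symmetric, so eigh is appropriate

    print("Laplacian eigenvalues:", np.round(eigvals, 3))
    print("Fiedler vector:       ", np.round(eigvecs[:, 1], 3))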

Lovasz noted that discrete vs. continuous is not the right dichotomy, since all of these graph concepts have analogs in the continuous setting, and the problems are closely related to finding good imbeddings of finite metric spaces into each other with small distortion.

It is essential that algorithms work on huge problems, and thus they must run in time close to linear.

He quoted Ron Coifman as saying that in representing data, the most economical representation brings us halfway to the discovery of the underlying structure, and this must be a basic principle in data mining and search.


Copyright 2000, Dianne P. O'Leary oleary@cs.umd.edu
I'm grateful for the comments of several participants that improved a draft of this document, but all remaining errors are my fault alone.
Last updated: June 23, 2000.
