
LINQS
STATISTICAL RELATIONAL LEARNING GROUP @ UMD
Entity Resolution
Many databases contain imprecise references to real-world entities. For example, a social network database records the names of real people, but several people can share a name, and different names can refer to the same person. In general, there may be many references to the same real-world entity. The goal of entity resolution is to discover the unobserved entities and cluster the database references according to the entities they refer to. Traditionally, entities are resolved on the basis of the attributes of individual references. However, in many domains, such as social networks and academic circles, the underlying entities exhibit strong ties to each other, and as a result their references often co-occur in the data. We focus on using such co-occurrence relationships between references for collective entity resolution, in which the entities for related references are determined jointly.
We explore different techniques for solving the collective entity resolution problem. We have designed a relational clustering algorithm in which references are iteratively clustered into entities, taking into account the clusters of co-occurring references. We show that this approach locally minimizes a cut-based clustering cost that considers the co-occurrence relations in addition to the similarity between references. In addition, we have proposed a probabilistic generative model for co-occurring references that uses Latent Dirichlet Allocation to find hidden group structures among the domain entities as evidence for resolving entities. We have developed an efficient unsupervised inference algorithm for this model using Gibbs sampling techniques. We show that both of these approaches improve performance over attribute baselines on multiple real-world and synthetic datasets. Our algorithm also ranked among the top entries in the government-sponsored KDD Challenge Competition that was organized by IBM EAS in August 2005 and involved participants from companies and other top-tier universities.
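The idea of scoring a pair of clusters on both attribute and relational evidence can be sketched as follows. This is a minimal illustration, not the group's implementation: the linear combination with weight `alpha`, the character-bigram name similarity, the Jaccard overlap of neighboring clusters, and the dictionary layout of a cluster are all assumptions chosen for brevity.

```python
def bigrams(name):
    """Character bigrams of a lowercased name (placeholder attribute feature)."""
    s = name.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def attribute_similarity(names_a, names_b):
    """Best bigram overlap between any pair of names from the two clusters."""
    return max(jaccard(bigrams(x), bigrams(y)) for x in names_a for y in names_b)


def relational_similarity(neighbors_a, neighbors_b):
    """Overlap between the sets of clusters each cluster co-occurs with."""
    return jaccard(neighbors_a, neighbors_b)


def combined_similarity(cluster_a, cluster_b, alpha=0.5):
    """Weighted mix of attribute and relational similarity (assumed linear form).

    alpha = 0 uses names only; alpha = 1 uses co-occurrence only.
    """
    sim_attr = attribute_similarity(cluster_a["names"], cluster_b["names"])
    sim_rel = relational_similarity(cluster_a["neighbors"], cluster_b["neighbors"])
    return (1 - alpha) * sim_attr + alpha * sim_rel


# Hypothetical clusters: same name, partially overlapping co-author clusters.
c1 = {"names": {"J. Smith"}, "neighbors": {"c7", "c9"}}
c2 = {"names": {"J. Smith"}, "neighbors": {"c7"}}
score = combined_similarity(c1, c2, alpha=0.5)
```

Because the relational term depends on the current clustering of the neighbors, the score for a pair changes as other clusters merge, which is what makes the resolution collective.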
In addition to collective resolution over an entire database, we have investigated the problem of query-centric entity resolution. We have shown that queries can be collectively resolved by recursively exploring and resolving related references. However, collective resolution at query-time is computationally challenging since this recursive approach can span a very large number of references. We have proposed an unsupervised algorithm for adaptively selecting the most informative of the related references for a query. Using this adaptive strategy, queries that otherwise take several minutes to resolve can be answered in seconds, while still preserving the accuracy of collective resolution.
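The recursive exploration described above can be sketched as a depth-limited expansion that alternates between co-occurrence and name matching. This is only an illustration under stated assumptions: the `(ref_id, name, paper_id)` tuple layout, the `similar` predicate, and the `depth` parameter are hypothetical, and the adaptive selection of the most informative references is not shown.

```python
from collections import defaultdict


def expand_query(query_name, references, similar, depth=2):
    """Collect references relevant to a query by alternating two expansions.

    references: list of (ref_id, name, paper_id) tuples (assumed layout).
    similar:    predicate deciding whether two names may match (assumed).
    """
    name_of = {r: n for r, n, _ in references}
    paper_of = {r: p for r, _, p in references}
    refs_in_paper = defaultdict(set)
    for r, p in paper_of.items():
        refs_in_paper[p].add(r)

    # Level 0: references whose name resembles the query name.
    frontier = {r for r, n in name_of.items() if similar(n, query_name)}
    relevant = set(frontier)

    for _ in range(depth):
        # Co-occurrence expansion: references sharing a paper with the frontier.
        co_occurring = set()
        for r in frontier:
            co_occurring |= refs_in_paper[paper_of[r]]
        co_occurring -= relevant
        relevant |= co_occurring

        # Name expansion: other references that resemble the new co-occurring names.
        new_names = {name_of[r] for r in co_occurring}
        frontier = {r for r, n in name_of.items()
                    if r not in relevant and any(similar(n, m) for m in new_names)}
        relevant |= frontier
        if not frontier:
            break
    return relevant


# Hypothetical data: exact string equality stands in for a real name matcher.
refs = [(1, "J. Smith", "p1"), (2, "A. Jones", "p1"),
        (3, "A. Jones", "p2"), (4, "B. Lee", "p2"),
        (5, "C. Wu", "p3")]
exact = lambda a, b: a == b
level1 = expand_query("J. Smith", refs, exact, depth=1)
level2 = expand_query("J. Smith", refs, exact, depth=2)
```

Each extra level can pull in many more references, which is why unrestricted expansion becomes expensive and some form of selective expansion is needed at query time.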
Publications
- Collective Entity Resolution in Relational Data, Indrajit Bhattacharya and Lise Getoor, ACM Transactions on Knowledge Discovery from Data (ACM-TKDD), 2007.
- Query-Time Entity Resolution, Indrajit Bhattacharya, Louis Licamele and Lise Getoor, The 12th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Philadelphia, USA, August 2006.
- Relational Clustering for Entity Resolution Queries, Indrajit Bhattacharya, Louis Licamele and Lise Getoor, ICML 2006 Workshop on Statistical Relational Learning (SRL), Pittsburgh, USA, June 2006.
- Collective Entity Resolution in Relational Data, Indrajit Bhattacharya and Lise Getoor, IEEE Data Engineering Bulletin, Special Issue on Data Quality, June 2006.
- A Latent Dirichlet Model for Unsupervised Entity Resolution, Indrajit Bhattacharya and Lise Getoor, The 6th SIAM Conference on Data Mining (SIAM SDM-06) (Best Research Paper Award).
- Latent Dirichlet Allocation Model for Entity Resolution, Indrajit Bhattacharya and Lise Getoor, University of Maryland Technical Report CS-TR-4740, August 2005.
- Entity Resolution in Graphs, Indrajit Bhattacharya and Lise Getoor, Chapter in Mining Graph Data, Lawrence B. Holder and Diane J. Cook, Editors, Wiley, 2006.
- Entity Resolution in Graph Data, Indrajit Bhattacharya and Lise Getoor, University of Maryland Technical Report CS-TR-4758, October 2005.
- Relational Clustering for Multi-type Entity Resolution, Indrajit Bhattacharya and Lise Getoor, The 11th ACM SIGKDD Workshop on Multi Relational Data Mining (MRDM-05).
- Deduplication and Group Detection using Links, Indrajit Bhattacharya and Lise Getoor, The 10th ACM SIGKDD Workshop on Link Analysis and Group Detection (LinkKDD-04).
- Iterative Record Linkage for Cleaning and Integration, Indrajit Bhattacharya and Lise Getoor, The 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD-04).
Datasets
- CiteSeer: The CiteSeer dataset contains 1,504 machine learning documents with 2,892 author references to 1,165 author entities. The only attribute information available for authors is the name. The full last name is always given; the first and middle names are sometimes given in full and sometimes only as initials. The dataset was originally created by Giles et al., and the version we use includes the author entity ground truth provided by Aron Culotta and Andrew McCallum, University of Massachusetts, Amherst. We have performed further cleaning on it.
DataSet
Format
- arXiv: The arXiv dataset describes high energy physics publications. It was originally used in KDD Cup 2003. It contains 29,555 papers with 58,515 references to 9,200 authors. The attribute information available for this dataset is also just the author name, with the same variations in form as described above. The author entity ground truth for this dataset was provided by David Jensen, University of Massachusetts, Amherst. We have performed further cleaning on it, extracted the relevant information for entity resolution and put it in the same format as the CiteSeer data.
DataSet
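Since names are the only attribute in both datasets, and given names may appear in full or as initials, a name-compatibility check is a natural first filter. The sketch below is illustrative only: it assumes non-empty names written as "First Middle Last" strings, which is not necessarily how the dataset files store them, and the function names are hypothetical.

```python
def parse_name(name):
    """Split a 'First Middle Last' string into (given names, last name), lowercased."""
    parts = name.replace(".", " ").split()
    return [p.lower() for p in parts[:-1]], parts[-1].lower()


def given_compatible(a, b):
    """An initial matches any given name with the same first letter."""
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b


def names_compatible(x, y):
    """True if two author names could plausibly refer to the same person.

    Last names must match exactly; given names are compared position by
    position, and a missing middle name is treated as compatible.
    """
    given_x, last_x = parse_name(x)
    given_y, last_y = parse_name(y)
    if last_x != last_y:
        return False
    return all(given_compatible(a, b) for a, b in zip(given_x, given_y))


same_person_possible = names_compatible("J. Smith", "John A. Smith")
different_first_names = names_compatible("John Smith", "Jane Smith")
```

A check like this only says that two references might match. Deciding whether "J. Smith" is John or Jane is exactly where co-occurrence evidence becomes useful.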
Code
- Synthetic Data Generator: We designed a generator for noisy references with co-occurrence relationships between them. The generator lets the user control several characteristics of the data in a systematic and flexible way, including the degree of collaboration between the underlying entities, the size of the co-occurrence relationships, and the ambiguity of entity attributes and relationships. Experiments on synthetic data enabled us to reason beyond specific datasets, understand the impact of different structural properties of the data on collective resolution, and empirically verify our performance analysis for relational clustering in general. The generated data is in the same format as CiteSeer.
Description
Code
- Relational Clustering: The relational clustering code currently reads in reference data in the CiteSeer format described above, performs 'blocking' to identify potential duplicate references, initializes reference clusters using bootstrapping, and then iteratively merges clusters, considering both attribute and relational similarity, until the similarity of the closest pair drops below a threshold. All parameters, such as the termination threshold, the attribute and relational similarity measures to be used, and the combination weight, can be specified as command-line arguments to the executable. This is a pre-alpha version of the code. Watch this page for updates!
Code
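The pipeline described for the relational clustering code (blocking, bootstrapping, then greedy merging until the best pair falls below a threshold) can be sketched as below. This is not the released code: the `(ref_id, name, paper_id)` input layout, the exact-name bootstrapping rule, the Jaccard relational measure, the linear weight `alpha`, and every function name are assumptions made for illustration, and reference ids are assumed to be mutually comparable.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def resolve(references, attr_sim, block_key, alpha=0.5, threshold=0.4):
    """Greedy relational clustering over (ref_id, name, paper_id) tuples."""
    name_of = {r: n for r, n, _ in references}
    paper_of = {r: p for r, _, p in references}
    refs_in_paper = defaultdict(set)
    for r, p in paper_of.items():
        refs_in_paper[p].add(r)

    # Blocking: only references sharing a block key are ever compared.
    blocks = defaultdict(list)
    for r, n in name_of.items():
        blocks[block_key(n)].append(r)
    candidate_refs = [pair for members in blocks.values()
                      for pair in combinations(members, 2)]

    # Bootstrapping (simplified assumption): identical name strings start together.
    first_with_name, label = {}, {}
    for r, n in name_of.items():
        label[r] = first_with_name.setdefault(n, r)
    clusters = defaultdict(set)
    for r, c in label.items():
        clusters[c].add(r)

    def neighbours(c):
        # Clusters whose references co-occur with some member of cluster c.
        co = set()
        for r in clusters[c]:
            co |= {label[x] for x in refs_in_paper[paper_of[r]]}
        return co - {c}

    def similarity(a, b):
        sim_attr = max(attr_sim(name_of[x], name_of[y])
                       for x in clusters[a] for y in clusters[b])
        sim_rel = jaccard(neighbours(a), neighbours(b))
        return (1 - alpha) * sim_attr + alpha * sim_rel

    while True:
        pairs = {tuple(sorted((label[x], label[y])))
                 for x, y in candidate_refs if label[x] != label[y]}
        if not pairs:
            break
        score, a, b = max((similarity(a, b), a, b) for a, b in pairs)
        if score < threshold:
            break  # closest pair is no longer similar enough
        for r in clusters.pop(b):  # merge cluster b into cluster a
            label[r] = a
            clusters[a].add(r)
    return list(clusters.values())


# Hypothetical toy data and placeholder similarity measures.
last = lambda n: n.split()[-1].lower()
toy_sim = lambda x, y: 1.0 if x == y else (0.5 if last(x) == last(y) else 0.0)
refs = [(1, "J. Smith", "p1"), (2, "A. Jones", "p1"),
        (3, "John Smith", "p2"), (4, "A. Jones", "p2"),
        (5, "J. Smith", "p3"), (6, "B. Lee", "p3"),
        (7, "Jim Smith", "p4"), (8, "C. Wu", "p4")]
entities = resolve(refs, toy_sim, block_key=last)
```

In this toy run "John Smith" joins the "J. Smith" cluster because both co-occur with "A. Jones", while "Jim Smith" stays separate; setting `alpha=0` removes the relational evidence and merges all four Smith references.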
Entity Resolution Resources on the Web
- RIDDLE, maintained by Misha Bilenko, is an excellent web directory listing people, papers and datasets.