Entity Resolution
Our Approaches
Many databases contain imprecise references to real-world
entities. For example, a social network database records
names of real people. But multiple people can go by the
same name and there may be different names which refer to
the same person as well. In general, there may be many
references to the same real-world entity. The goal of the
entity resolution problem is to discover the unobserved
entities and cluster the database references according to
their entities. Traditionally, entities are resolved on
the basis of the attributes of individual
references. However, in many domains, such as social
networks and academic circles, the underlying entities
exhibit strong ties to each other, and as a result, their
references often co-occur in the data. We focus on the use
of such co-occurrence relationships between references for
collective entity resolution, in which the
entities for related references are determined jointly.
We explore different techniques for solving the collective
entity resolution problem. We have designed a
relational clustering algorithm, where references are
iteratively clustered into entities taking into account
the clusters of co-occurring references. We show that this
approach locally minimizes a cut-based clustering cost
that considers the co-occurrence relations in addition to
the similarity between references. In addition, we have
proposed a probabilistic generative model for
co-occurring references that uses Latent Dirichlet
Allocation to find hidden group structures among the
domain entities as evidence for resolving entities. We have
developed an efficient unsupervised inference algorithm
for this model using Gibbs Sampling techniques. We
show that both of these approaches improve performance
over attribute-based baselines on multiple real-world and
synthetic datasets. Our algorithm also ranked among the top
entries in the government-sponsored KDD Challenge
Competition, which was organized by IBM EAS in August 2005
and involved participants from companies and other
top-tier universities.
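To make the relational clustering idea concrete, here is a minimal Python sketch. It is an illustration under our own assumptions, not the released implementation: the blocking key (last name plus first initial), the bootstrapping rule (identical name strings start in one cluster), the use of `difflib.SequenceMatcher` for attribute similarity, Jaccard overlap of neighboring clusters for relational similarity, and the names and defaults of `alpha` and `threshold` are all hypothetical choices.

```python
from difflib import SequenceMatcher
from itertools import combinations


def block_key(name):
    """Hypothetical blocking key: last name plus first initial, lowercased."""
    parts = name.lower().replace(".", "").split()
    return (parts[-1], parts[0][0])


def resolve(records, alpha=0.5, threshold=0.6):
    """Cluster name references; returns one cluster label per reference."""
    # Flatten records (lists of co-occurring names) into numbered references.
    names, record_of, refs_in_record = [], [], []
    for rec_idx, record in enumerate(records):
        ids = []
        for name in record:
            ids.append(len(names))
            names.append(name)
            record_of.append(rec_idx)
        refs_in_record.append(ids)

    # Bootstrapping (simplified assumption): identical strings share a cluster.
    first_seen = {}
    cluster_of = [first_seen.setdefault(n, i) for i, n in enumerate(names)]
    members = {}
    for ref, label in enumerate(cluster_of):
        members.setdefault(label, set()).add(ref)

    def neighbors(label):
        """Labels of clusters that co-occur with this cluster's references."""
        out = set()
        for ref in members[label]:
            for other in refs_in_record[record_of[ref]]:
                out.add(cluster_of[other])
        out.discard(label)
        return out

    def similarity(a, b):
        """Weighted mix of name similarity and shared-neighbor overlap."""
        attr = max(
            SequenceMatcher(None, names[x].lower(), names[y].lower()).ratio()
            for x in members[a] for y in members[b]
        )
        na, nb = neighbors(a) - {b}, neighbors(b) - {a}
        union = na | nb
        rel = len(na & nb) / len(union) if union else 0.0
        return (1 - alpha) * attr + alpha * rel

    while True:
        best_score, best_pair = threshold, None
        for a, b in combinations(sorted(members), 2):
            # Blocking: only compare clusters whose names share a key.
            if block_key(names[a]) != block_key(names[b]):
                continue
            score = similarity(a, b)
            if score >= best_score:
                best_score, best_pair = score, (a, b)
        if best_pair is None:
            break  # the closest pair is below the threshold
        keep, drop = best_pair
        for ref in members[drop]:
            cluster_of[ref] = keep
        members[keep] |= members.pop(drop)
    return cluster_of


records = [
    ["John Smith", "Anna Lee"],
    ["J. Smith", "Anna Lee"],
    ["Jane Smith", "Carl Diaz"],
]
print(resolve(records))  # [0, 1, 0, 1, 4, 5]
```

In this toy input, the shared co-author cluster is what pulls "John Smith" and "J. Smith" together, while "Jane Smith", who has no co-author in common, stays separate. Because neighbors are compared by cluster label rather than by raw name, each merge can change the relational similarity of other pairs, which is why the loop recomputes similarities after every merge.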
In addition to collective resolution over an entire
database, we have investigated the problem of
query-centric entity resolution. We have shown that
queries can be collectively resolved by recursively
exploring and resolving related references. However,
collective resolution at query-time is computationally
challenging since this recursive approach can span a very
large number of references. We have proposed an
unsupervised algorithm for adaptively selecting the most
informative of the related references for a query. Using
this adaptive strategy, queries that otherwise take
several minutes to resolve can be answered in seconds,
while still preserving the accuracy of collective
resolution.
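The recursive exploration of related references can be sketched as alternating two expansions from the query: references that co-occur with those already collected, and references that share a name with them. The Python sketch below is an assumption-laden illustration, not the published algorithm: exact name equality stands in for name similarity, and the `budget` cut that keeps references with the rarest names is only a placeholder for the adaptive selection strategy. The function name `expand_query` and its parameters are hypothetical.

```python
from collections import defaultdict


def expand_query(query_name, references, max_depth=2, budget=None):
    """Collect references relevant to a query by alternating expansions.

    references: iterable of (ref_id, name, record_id) tuples.
    budget: optional cap on how many new references are kept per level.
    """
    by_name, by_record, info = defaultdict(set), defaultdict(set), {}
    for ref_id, name, record_id in references:
        by_name[name].add(ref_id)
        by_record[record_id].add(ref_id)
        info[ref_id] = (name, record_id)

    selected = set(by_name.get(query_name, ()))
    frontier = set(selected)
    for _ in range(max_depth):
        if not frontier:
            break
        candidates = set()
        for ref in frontier:
            # Co-occurrence expansion: references in the same record.
            for co_ref in by_record[info[ref][1]]:
                candidates.add(co_ref)
                # Name expansion: other references carrying that name.
                candidates |= by_name[info[co_ref][0]]
        candidates -= selected
        if budget is not None:
            # Placeholder heuristic: keep references with the rarest names.
            ranked = sorted(
                candidates, key=lambda r: (len(by_name[info[r][0]]), r)
            )
            candidates = set(ranked[:budget])
        selected |= candidates
        frontier = candidates
    return selected
```

Without a budget, the collected set can grow with every level, which is the cost problem described above; capping each level trades some of that coverage for a bounded amount of work per query.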
Publications
- "Collective Entity Resolution in Relational Data",
Indrajit Bhattacharya and Lise Getoor, ACM Transactions on Knowledge Discovery
from Data (ACM-TKDD), 2007 (to appear)
- "Query-Time Entity
Resolution", Indrajit Bhattacharya, Louis Licamele
and Lise Getoor, The 12th ACM
International Conference on Knowledge Discovery and Data
Mining (SIGKDD), Philadelphia, USA, August
2006.
- "Relational Clustering
for Entity Resolution Queries", Indrajit
Bhattacharya, Louis Licamele and Lise Getoor, ICML 2006 Workshop on Statistical
Relational Learning (SRL), Pittsburgh, USA,
June 2006.
- "Collective Entity Resolution in Relational Data",
Indrajit Bhattacharya and Lise Getoor, IEEE
Data Engineering Bulletin, Special Issue on Data
Quality, June 2006
- "A Latent Dirichlet
Model for Unsupervised Entity Resolution", Indrajit
Bhattacharya and Lise Getoor, The
6th SIAM Conference on Data Mining (SIAM SDM-06)
(Best Research Paper Award)
- "Entity Resolution in Graphs", Indrajit Bhattacharya
and Lise Getoor, Chapter in Mining
Graph Data, Lawrence B. Holder and Diane J. Cook,
Editors, Wiley, 2006.
- "Entity Resolution
in Graph Data", Indrajit Bhattacharya and Lise
Getoor, University of Maryland
Technical Report CS-TR-4758, October 2005.
- "Relational Clustering
for Multi-type Entity Resolution", Indrajit
Bhattacharya and Lise Getoor, The
11th ACM SIGKDD Workshop on Multi Relational Data Mining
(MRDM-05).
- "Deduplication and
Group Detection using Links", Indrajit Bhattacharya
and Lise Getoor, The 10th ACM SIGKDD
Workshop on Link Analysis and Group Detection
(LinkKDD-04).
- "Iterative Record Linkage
for Cleaning and Integration", Indrajit
Bhattacharya and Lise Getoor, The
9th ACM SIGMOD Workshop on Research Issues in Data
Mining and Knowledge Discovery (DMKD-04).
Datasets
- CiteSeer: The CiteSeer dataset contains
1,504 machine learning documents with 2,892 author
references to 1,165 author entities. For this
dataset, the only attribute information available
for authors is the name. The full last name is
always given; the first and middle names appear
sometimes in full and sometimes only as initials.
The dataset was
originally created by Giles et al. and the version
which we use includes the author entity ground truth
provided by Aron Culotta and Andrew McCallum,
University of Massachusetts, Amherst. We have
performed further cleaning on it.
DataSet Format
- arXiv: The arXiv dataset describes high
energy physics publications. It was originally used
in KDD
Cup 2003. It contains 29,555 papers with 58,515
references to 9,200 authors. The attribute
information available for this dataset is also just
the author name, with the same variations in form as
described above. The author entity ground truth for
this data set was provided by David Jensen,
University of Massachusetts, Amherst. We have
performed further cleaning on it, extracted the
relevant information for entity resolution and put
it in the same format as the CiteSeer data.
DataSet
Code
- Synthetic Data Generator: We designed a
generator for noisy references with co-occurrence
relationships between them. This generator allows
the user to control several characteristics of the
data, such as the degree of collaboration between the
underlying entities, the size of the co-occurrence
relationships, and the ambiguity of entity attributes
and relationships, in a systematic and
flexible way. Experiments on synthetic data enabled
us to reason beyond specific datasets, to understand
the impact of different structural properties of the
data on collective resolution, and to
empirically verify our performance analysis for
relational clustering in general. The generated data
is in the same format as CiteSeer.
Description Code
- Relational Clustering: The relational
clustering code currently reads in reference data in
the CiteSeer format described above, performs
'blocking' to identify potential duplicate
references, initializes reference clusters using
bootstrapping, and then iteratively merges clusters
considering both attribute and relational similarity
until the similarity of the closest pair drops below
a threshold. All parameters, such as the termination
threshold, the attribute and relational similarity
measures to be used, and the combination weight, can
be specified as command-line arguments to the
executable. This is a pre-alpha version of the
code. Watch this page for updates!
Code
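As an illustration of the kind of controls the synthetic data generator in the list above exposes, the following Python sketch produces noisy, co-occurring name references with ground-truth entity ids. It is a toy under stated assumptions and not the released generator: the parameter names (`num_groups`, `record_size`, `ambiguity`), the fixed name pools, and the in-memory output are all hypothetical, and the output is not in the CiteSeer format. Here the number of groups is a crude proxy for the degree of collaboration, `record_size` sets the size of each co-occurrence relationship, and `ambiguity` is the probability that a first name is reduced to an initial.

```python
import random


def generate(num_entities=20, num_groups=4, num_records=50,
             record_size=3, ambiguity=0.5, seed=0):
    """Return (entities, records); each record is a list of (name, entity_id)."""
    rng = random.Random(seed)
    first_names = ["Alice", "Alan", "Bob", "Beth", "Carol", "Carl"]
    last_names = ["Smith", "Shah", "Lee", "Lopez"]

    # Entities may share names, so attributes alone cannot separate them.
    entities = [
        (rng.choice(first_names), rng.choice(last_names))
        for _ in range(num_entities)
    ]

    # Assign entities to groups round-robin; records are drawn within a group.
    groups = [[] for _ in range(num_groups)]
    for entity_id in range(num_entities):
        groups[entity_id % num_groups].append(entity_id)

    records = []
    for _ in range(num_records):
        group = rng.choice(groups)
        chosen = rng.sample(group, min(record_size, len(group)))
        record = []
        for entity_id in chosen:
            first, last = entities[entity_id]
            # With probability `ambiguity`, keep only the first initial.
            if rng.random() < ambiguity:
                name = f"{first[0]}. {last}"
            else:
                name = f"{first} {last}"
            record.append((name, entity_id))
        records.append(record)
    return entities, records
```

Fixing `seed` makes a run repeatable, so the same data can be regenerated while one parameter at a time is varied.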
Entity Resolution Resources on the Web
- RIDDLE,
maintained by Misha
Bilenko, is an excellent web directory listing
people, papers and datasets.