CMSC828L Spring 2012: Link Mining

Course Description

There has been a recent surge of interest in the analysis of data describing all forms of networks, including communications networks, biological networks, social networks, financial transaction networks and more. Despite the diversity of domains, common difficulties and challenges include noisy and incomplete data, dynamic and streaming data, issues of scalability and statistical issues such as identifiability, stationarity, and so on. There are a number of different research communities working on network analysis including social scientists, statisticians, physicists and computer scientists; each comes with their own view on the problem of network analysis, their own set of tools and their own style of analysis.

In this seminar, we will focus on the issue of entity resolution in network data. Entity resolution is the problem of determining the mapping from data entries to real-world entities. Doing so often requires, either implicitly or explicitly, determining the set of real-world entities, and many algorithms rely on some form of clustering. Entity resolution is closely connected to topics such as identity management, anonymization and privacy in network data.
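To make the definition concrete, here is a minimal sketch of entity resolution as clustering. It is an illustration only, not an algorithm covered in the course: it assumes records are plain name strings, uses Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure, and groups records by transitive closure over pairs above a threshold. The function names `similarity` and `resolve`, the threshold of 0.8, and the sample names are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def resolve(records: List[str], threshold: float = 0.8) -> List[int]:
    """Return one cluster label per record.

    Records with the same label are treated as the same real-world entity.
    The threshold is an assumed value chosen for illustration.
    """
    # Union-find structure: each record starts in its own cluster.
    parent = list(range(len(records)))

    def find(i: int) -> int:
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the clusters of every pair that is similar enough.
    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j]) >= threshold:
            parent[find(j)] = find(i)

    return [find(i) for i in range(len(records))]


if __name__ == "__main__":
    names = ["John Smith", "Jon Smith", "Mary Jones"]  # made-up records
    print(list(zip(names, resolve(names))))
```

In this sketch the first two names end up with the same label and the third gets its own. Comparing all pairs is quadratic in the number of records, which is one reason scaling is treated as a separate topic later in the schedule.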

Entity resolution (aka deduplication, record linkage, coreference resolution) is a widely studied topic in databases, statistics and other areas of computer science; however, it is an understudied topic in the area of network analysis. In this class, we will survey the foundations for entity resolution, recent state-of-the-art research in entity resolution, work on entity resolution in network data and theoretical foundations for entity resolution.

The seminar will be very interactive and collaborative. The topics covered and the depth of coverage will depend on the participants' input and interests.

The goal of the course is to give you an in-depth look at an important topic in data and network analysis. The algorithms are applicable in many areas such as bioinformatics, computer vision, computational linguistics, cybersecurity, databases, program analysis, communication networks and more. We hope to provide opportunities for hands-on experience with entity resolution algorithms, and the opportunity to develop new algorithms and theory for ER. Along the way, you will pick up some practical experience in reading and presenting research papers, synthesizing research across disparate areas, using existing tools, and doing a course project that ideally will lead to a publishable paper.

In tandem with the course, throughout the semester there will be several invited speakers presenting current work in network analysis. Some of these will be during the scheduled course time, while others, due to schedule constraints, will be outside the regular course time. Students are highly encouraged to attend the invited talks and meet with the speakers.

Prerequisites: Mathematical maturity and a basic course in probability are required. Background in algorithms, databases, machine learning, and graphical models is suggested.

Course Format

This is a seminar course. Each class will consist of presentations and discussion. Students will be required to do a class project for the course (40%). A significant portion of the grade will be based on class participation, which includes paper presentations, contributions to the wiki, and demonstrations (60%).

Because of the interactive nature of the course, and space limitations, auditing is discouraged. Auditors who agree to participate and present papers will be considered.

Course Credit

This course does not count as a PhD Core or MS Comps course. This course can be used toward PhD coursework as part of the non-core classes required or towards MS coursework (but it is not an MS qualifying course).

Course Information

Time: Wed 10:00am - 12:30pm in AVW 4172 - important note*
Professor: Lise Getoor - getoor AT cs.umd.edu
Co-Instructor: Bert Huang - bert AT cs.umd.edu
Office hours: TBA
Web site: http://www.cs.umd.edu/class/spring2012/cmsc828l/

*note: The class time will be adjusted in some cases to accommodate external speakers at the CLIP colloquium (Wed 11-noon), and to accommodate the schedule of the new Yahoo ML seminar (which will take place biweekly Wed 1:30-2:30, starting 2/15).   Students will be encouraged to keep their Wed midday schedule open to accommodate this.  We will also be polling folks the first day to see how feasible this will be.

Course Wiki

The class wiki is for students enrolled in the course to share material and discuss content.

Course Mailing List

The class mailing list is for announcements relevant to the class. If you are enrolled in the class, please sign up here: http://mailman.cs.umd.edu/mailman/listinfo/cmsc828L-spr2012

 

Schedule / Syllabus (Subject to Change)

Date Topic Notes
Wed 1/25 Introduction
  • Students should add their introductions to the class wiki
  • Link Mining: A Survey, Lise Getoor, Christopher Diehl, SIGKDD Explorations Special Issue on Link Mining, Volume 7, Number 2, December 2005
Wed 2/1 Entity Resolution - Classics I
split - 3258, 4172
Wed 2/8 Entity Resolution - Classics II - Record Linkage
Guest Lecture: William Winkler, Census Bureau
4172
Wed 2/15 Entity Resolution - Classics III - String Similarity
Wed 2/22 Entity Resolution - Probabilistic Models
Wed 2/29* Entity Resolution - Relational & Collective Entity Resolution
Wed 3/7 Entity Resolution - Advanced Probabilistic Models
Wed 3/14(*) Advanced Probabilistic Models, cont.  

Wed 3/21 Spring break
Wed 3/28 Summary Discussion
Wed 4/4 Hands-on I; Scaling ER
Wed 4/11 Scaling ER  
Wed 4/18 Scaling Collective ER & Evaluating ER
Wed 4/25 Hands-on II  
Wed 5/2 Summary & Advanced Topics: ER UIs, Identity, Privacy
Wed 5/9 Project Presentations and Poster Session  
Backup Poster Session, May 11 4pm-7pm

Supplemental Papers