CMSC828E: Probabilistic and Graph Data Management; Large-scale Analytics

Prof. Amol Deshpande;    CSIC 2107;    Mon-Wed 3:30pm-4:45pm



Office Hours:

AVW 3221; By appointment.

Approach:

This is a seminar course and will be based on reading and discussing papers from recent conferences.

The course counts as a PhD and MS qualifying course in Databases. The course is not valid for MS Comps.

Grading will be based on class participation and paper critiques (20%), two exams, likely take-home (40%), and a class project (40%).

Course Description:

This course covers large-scale data management and analysis, organized around three broad, loosely connected topics:
  • Graph-structured Data: The main focus of the class will be graph data management. Some of the key emerging application domains must deal with very large volumes of dynamic, rapidly changing graph-structured data. These include the Web, social networks, sensor networks, biological networks, and traffic networks, to name a few. Database management systems are not very good at managing or querying such data, especially when we consider node and edge attributes, time-varying characteristics (e.g., edges being valid for a duration of time), and uncertainty (see the next topic). Further, the types of analysis and queries that we typically want to perform (e.g., relationship analytics, ranking, proximity searching) are quite different from standard database queries.
  • Data Uncertainty: Real-world data also often exhibits large amounts of uncertainty of various types. There has been much work on uncertain data management in recent years; however, many challenges remain. Most of the prior work has made simplifying assumptions about the types of uncertainty that can be modeled, and there has been almost no work on managing uncertain or probabilistic graph data. We will focus on the approach developed by us at UMD, which aims to integrate probabilistic graphical models (e.g., Bayesian networks) and statistical relational models (e.g., Probabilistic Relational Models, Markov Logic Networks) into relational databases, with a focus on data management, querying, and scalable inference.
  • Large-scale Analytics: Technologies like MapReduce and Hadoop have made it possible to analyze very large volumes of data using a large number of distributed machines. Here we will focus on declarative abstractions for large-scale statistical analysis, machine learning, and graph analytics on Hadoop. Graph analytics in particular is challenging in such a framework, because graph algorithms are inherently sequential.
See schedule for a tentative, evolving reading list.

Class forum:

We will use a forum for general announcements and for asking and answering questions about the projects, assignments, etc. You are required to read the forum on a regular basis. The forum should also be the first resort for any (non-private) questions. You will need to register to use the forum.
[[ FORUM LINK COMING ]]