AVW 3221; By appointment.
This is a seminar course based on reading and discussing
papers from recent conferences.
The course counts as a PhD and MS qualifying course in Databases. It is not valid
for MS Comps.
Grading will be based on class participation and paper critiques (20%), two exams (likely take-home; 40%), and a class project (40%).
This course will focus on large-scale data management and analysis, covering the three broad
areas described below. See the schedule for a tentative, evolving reading list.
- Graph-structured Data: The main focus of the class will be graph data management. Some of
the key emerging application domains have to deal with very large
volumes of dynamic, rapidly changing graph-structured data. These include the Web, social networks, sensor networks, biological networks, and traffic
networks, to name a few. Database management systems are not very good at managing such data, or
querying over it, especially when we consider node and edge attributes, time-varying characteristics
(e.g., edges being valid for a duration of time), and uncertainty (see next).
Further, the types of analysis and queries that we typically want to perform (e.g., relationship
analytics, ranking, proximity searching) are quite different from standard database queries.
- Data Uncertainty: Real-world data also often exhibits large amounts of uncertainty of various types.
There has been much work on uncertain data management in recent years; however, many
challenges remain. Most of the prior work has made simplifying assumptions about the types
of uncertainty that can be modeled. There has been almost no work on managing uncertain or probabilistic
graph data. We will focus on the approach developed by us at UMD, which
aims to integrate probabilistic graphical models (e.g., Bayesian networks) and statistical relational
models (e.g., Probabilistic Relational Models, Markov Logic Networks) into relational databases, with
a focus on data management, querying, and scalable inference.
- Large-scale Analytics: Technologies like MapReduce and Hadoop have made it possible to analyze very
large volumes of data, using a large number of distributed machines. Here we will focus on declarative abstractions
for large-scale statistical analysis, machine learning, and graph analytics on Hadoop. Graph analytics
in particular are challenging in such a framework, because graph algorithms are naturally sequential.
We will use a forum for general announcements and for asking and answering questions about the
projects, assignments, etc. You are required to read the forum on a regular
basis. The forum should also be the first resort for any (non-private)
questions. You will need to register to use the forum.
[[ FORUM LINK COMING ]]