The following is a tentative schedule of the topics we plan to cover and what the assignments will focus on. More details will be added as the course progresses.
Note about assignments: One goal of this class is to make you comfortable with a wide variety of tools (most of which are listed below). You are NOT expected to learn these tools on your own; we will provide step-by-step guidance on getting started with them, and the actual assignments will be simple.
Note about readings: The links to the two textbooks work only when you are on the UMD network, because access goes through UMD's subscription to the Safari Online Books service.
If you are enrolled in CMSC828, click here to see the assigned readings.
Date Lecture Topics and Materials Assignments
Introduction: What is data science. Major tools used by data scientists. Class overview.
Lecture Notes.
Lab 0: Basic usage of GitHub, VirtualBox, IPython Notebook (Due 9/12)
Basic Statistics: statistical tests, samples, fallacies.
Lecture Notes.
Basic Statistics: linear regression, classification, clustering.
Lecture Notes.
Lab 1: Python basic stats and plotting (Due 9/19)
Data Models: Overview, Why modeling is essential, Commonly used models (Relational, JSON, Protocol Buffers)
Lecture Notes.
Relational Databases, SQL
Lecture Notes.
Lab 2: Basic SQL; Python Pandas and Dataframes; Avro (Due 10/3)
(cntd)
(cntd)
Data scraping and wrangling, Unix tools, GUIs
Lecture Notes.
Lab 3: Advanced SQL and Pandas (Due 10/10)
(cntd)
Data Integration: Overview, Schema mapping, Entity Resolution
(Lecture Notes Continued)
Readings for 828:
  • Querying Heterogeneous Information Sources Using Source Descriptions; Levy et al.; VLDB 1996 [PDF]
  • Swoosh: a generic approach to entity resolution
  • CrowdER: Crowdsourcing Entity Resolution; Wang, Kraska, Franklin, Feng; VLDB 2012
Lab 4: Data cleaning using Unix tools, Data Wrangler (Due 10/17)
(cntd)
Information Extraction: Overview, Key Techniques
(Lecture Notes Continued)
Readings for 828:
  • Information Extraction; Sarawagi; Foundations and Trends; (Only Chapters 1 and 2) [PDF]
  • WebTables: Exploring the Power of Tables on the Web
Lab 5: Entity Resolution and Information Extraction (Due 10/28)
Implementation of Relational Databases
Lecture Notes.
(cntd)
Distributed programming frameworks: Parallel Databases, MapReduce, Apache Spark, Hadoop Ecosystem
Lecture Notes.
Lab 6: Hadoop, Spark (Due: 11/7)
MIDTERM
(Cntd Distributed Programming Frameworks)
(cntd)
(cntd)
Lab 7: Cassandra and MongoDB (Due: 11/17)
(cntd)
Key-value stores: Basics, Differences from Relational Databases, Consistency/Replication issues
Lecture Notes.
Lab 8: Spark Streaming, Storm (Due: 11/26)
(cntd)
Visualization: D3.js (see Lab 10 for notes)
Data streaming/Real-time analytics: Data streams in relational databases, Spark Streaming, Storm
Lecture Notes.
Lab 9: Neo4j, GraphX (Due: 12/8)
(cntd)
Graph Databases and Graph Analytics
Lecture Notes.
Lab 10: D3 (Due: 12/11)
(cntd)
Cloud computing: Overview, Virtualization, Data centers, Platform/Infrastructure-as-a-Service
Lecture Notes.
(cntd)

Critiques (only for students taking CMSC 828O)

For each of the assigned papers, you should submit a critique (on Piazza) by 10am on the day of the class. The critiques should show evidence of independent thinking, and there are many ways you could structure them. Here are two suggestions: