CMSC641 – Fall 2018

Principles of Data Science

Data Science!?

Instructor: John P. Dickerson
TA: Duncan McElfresh (1/2 TA)
Lectures: Wednesday, 7:00–9:30 PM, CSIC 3120

Description of Course

Data science encapsulates the interdisciplinary activities required to create data-centric products and applications that address specific scientific, socio-political or business questions. It has drawn tremendous attention from both academia and industry and is making deep inroads in industry, government, health and journalism—just ask Nate Silver!

This course focuses on (i) data management systems, (ii) exploratory and statistical data analysis, (iii) data and information visualization, and (iv) the presentation and communication of analysis results. It will be centered around case studies drawing extensively from applications, and will yield a publicly available final project that will strengthen course participants' data science portfolios.

This course will consist primarily of sets of self-contained lectures and assignments that leverage real-world data science platforms when needed; as such, there is no assigned textbook. Each lecture will come with links to required reading, which should be done before that lecture, and (when appropriate) a list of links to other resources on the web.

Requirements

Students enrolled in the course should be comfortable with programming and be reasonably mathematically mature. I understand that the class makeup will be diverse in terms of both academic and work experience—please talk to me about any worries or issues! The course itself will make heavy use of the Python scripting language by way of Jupyter Notebooks, leaning on the Anaconda package manager; we'll give some Python-for-data-science primer lectures early on, so don't worry if you haven't used Python before. Later lectures will delve into statistics and a bit of machine learning (although the bulk of that will be covered in CMSC643) and may make use of basic calculus and basic linear algebra; light mathematical maturity is preferred at roughly the level of a junior CS/Math student.

There will not be a final examination; rather, in the interest of building students' public portfolios, and in the spirit of "learning by doing", students will create a self-contained online tutorial to be posted publicly. This tutorial can be created individually or in a small group. As described here (subject to change!), the tutorial will be a publicly accessible website that provides an end-to-end walkthrough of identifying and scraping a specific data source, performing some exploratory analysis, and providing some sort of managerial or operational insight from that data.

Final grades will be calculated as:

You can earn full credit for class participation in three ways:

  1. Lecture participation, asking questions and answering your peers' questions;
  2. Piazza participation, asking and answering questions on Piazza; and
  3. Regular attendance at office hours.
To earn full credit, you should aim to ask or answer a question at least once every two weeks in lecture or on Piazza, or attend office hours at least once a month (this can include just going to my office hours to chat about computer science, data science, software engineering, etc.).

This course should be accessible to any student of life with some degree of mathematical and statistical maturity, reasonable experience with programming, and an interest in the topic area. If in doubt, e-mail me: john@cs.umd.edu!

Office Hours & Communication

For course-related questions, please use Piazza to communicate with your fellow students, the TAs, and the course instructors. For private correspondence or special situations (e.g., excused absences, DDS accommodations, etc.), please email John with [CMSC641] in the email subject line.

Office Hours
Human | Time | Location
John Dickerson | 6–7pm on Wednesdays (i.e., right before lecture). Also by appointment; please email John with [CMSC641] in the email subject line. | AVW 3217

University Policies and Resources

Policies relevant to Graduate Courses are found here: https://gradschool.umd.edu/policies, while those relevant to Undergraduate Courses are found here: http://ugst.umd.edu/courserelatedpolicies.html. Topics that are addressed in these various policies include academic integrity, student and instructor conduct, accessibility and accommodations, attendance and excused absences, grades and appeals, copyright and intellectual property.

Course evaluations

Course evaluations are important and the department and faculty take student feedback seriously. Near the end of the semester, students can go to http://www.courseevalum.umd.edu to complete their evaluations.


Schedule

(Schedule subject to change as the semester progresses!)
# | Date | Topic | Reading | Slides | Lecturer | Notes
1 | 8/29 | Introduction | What the Fox Knows. | pdf, pptx | Dickerson | Sign up on Piazza!
2 | 9/5 | Scraping Data with Python | Anaconda's Test Drive. | pdf, pptx | Dickerson | PDF download script from class: link
3 | 9/12 | NumPy, SciPy, & DataFrames | Introduction to pandas. | pdf, pptx | Dickerson |
4 | 9/19 | Data Wrangling I & II: Pandas, Tidy Data, & SQL | Hadley Wickham. "Tidy Data."; Derman & Wilmott's "Financial Modelers' Manifesto." | pdf, pptx | Dickerson | Hould's Tidy Data for Python; SQLite: link; pandasql library: link
5 | 9/26 | Missing Data | Pandas tutorial on working with missing data. | pdf, pptx | Dickerson | Scikit-learn's imputation functionality: link
6 | 10/3 | Data Wrangling Wrap-Up: Data Integration, Data Warehousing, Entity Resolution; & EDA I: Summary Statistics, Transformations | Data Cleaning: Problems and Current Approaches (Note: this is a reference piece; please don't read the whole thing!); John W. Tukey: His Life and Professional Contributions. | pdf, pptx | Dickerson | Wikipedia article on outliers
7 | 10/10 | Graphs | | pdf, pptx | Dickerson | GraphQL language: link
8 | 10/17 | Graphs, & Natural Language | NLTK Book. | pdf, pptx | Dickerson |
9 | 10/24 | Natural Language, & Visualization | NLTK Book. | pdf, pptx | Dickerson | Python Natural Language Toolkit (NLTK): link; Criticisms of the Turing Test: link
10 | 10/31 | Visualization | Edward R. Tufte. The Visual Display of Quantitative Information (examples.) | pdf, pptx | Dickerson | Seaborn visualization library for Python: link
11 | 11/7 | Hypothesis Testing; Data Science Ethics & Best Practices I | | pdf, pptx | Dickerson | Got through slide 43 ...
12 | 11/14 | Data Science Ethics & Best Practices II | The Atlantic. "Everything We Know About Facebook's Secret Mood Manipulation Experiment" | pdf, pptx | Dickerson | Got through slide 110 ...; SIGCOMM paper that passed IRB review but is widely seen as unethical: link
 | 11/21 | Thanksgiving Break | | | |
13 | 11/28 | Data Science Ethics & Best Practices III | Apple's brief overview of differential privacy | pdf, pptx | Dickerson |
14 | 12/5 | Class Presentations & Wrap-Up | | pdf, pptx | Everyone! |
Final | 12/14 | Final Exam Date | | | | Final versions of tutorials must be posted by 11:59PM. Instructions & rubric: link

Mini-Projects né Homework

In addition to the tutorial to be posted publicly at the end of the semester, there will be four "mini-projects" assigned over the course of the semester (plus one simple setup assignment that will walk you through using git, Docker, and Jupyter). The best way to learn is by doing, so these will largely be applied assignments that provide hands-on experience with the basic skills a data scientist needs in industry.

Posting solutions publicly online without the staff's express consent is a direct violation of our academic integrity policy. Late assignments will not be accepted.

(Assignments will appear over the course of the semester.)
# | Description | Date Released | Date Due | Project Link
0 | Setting Things Up | August 27 | September 7 | link
1 | Fly Me To The Moon | September 12 | October 3 | link
2 | Moneyball | October 3 | October 24 | link
3 | Baltimore Crime | October 28 | November 21 | link