Introduction to Data Science

Instructor: John P. Dickerson
TAs: Denis Peskov (lead), Anant Dalela, Neil Alberg
Lectures: Tuesday and Thursday, 3:30–4:45 PM, CSIC 3117

Description of Course

Data science encapsulates the interdisciplinary activities required to create data-centric products and applications that address specific scientific, socio-political or business questions. It has drawn tremendous attention from both academia and industry and is making deep inroads in industry, government, health and journalism—just ask Nate Silver!

This course focuses on (i) data management systems, (ii) exploratory and statistical data analysis, (iii) data and information visualization, and (iv) the presentation and communication of analysis results. It will be centered around case studies drawing extensively from applications, and will yield a publicly available final project that will strengthen course participants' data science portfolios.

This course will consist primarily of sets of self-contained lectures and assignments that leverage real-world data science platforms when needed; as such, there is no assigned textbook. Each lecture will come with links to required reading, which should be done before that lecture, and (when appropriate) a list of links to other resources on the web.

Requirements

Students enrolled in the course should be comfortable with programming (for those at UMD, having passed CMSC216 will be good enough!) and be reasonably mathematically mature. The course itself will make heavy use of the Python scripting language by way of the Anaconda distribution and its conda package management system; we'll give some Python-for-data-science primer lectures early on, so don't worry if you haven't used Python before. Later lectures will delve into statistics and machine learning and may make use of basic calculus and basic linear algebra; light mathematical maturity is preferred at roughly the level of a junior CS student.

There will be one written, in-class midterm examination. There will not be a final examination; rather, in the interest of building students' public portfolios, and in the spirit of "learning by doing", students will create a self-contained online tutorial to be posted publicly. This tutorial can be created individually or in a small group. As described here, the tutorial will be a publicly-accessible website that provides an end-to-end walkthrough of identifying and scraping a specific data source, performing some exploratory analysis, and providing some sort of managerial or operational insight from that data.

Final grades will be calculated as:

20% midterm
40% mini-project assignments
35% final tutorial to be posted publicly online (instructions)
5% course, Piazza, and office hours participation
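
As a rough illustration of how the weights above combine, here is a minimal Python sketch. The weights come from the breakdown above; the component names, the 0–100 score scale, and the example student are assumptions made for illustration, not official grading code.

```python
# Weights taken from the grade breakdown above.
# The dictionary keys are illustrative names, not official identifiers.
WEIGHTS = {
    "midterm": 0.20,
    "mini_projects": 0.40,
    "final_tutorial": 0.35,
    "participation": 0.05,
}


def final_grade(scores):
    """Return the weighted final grade for component scores on a 0-100 scale."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical student, for illustration only.
example = {
    "midterm": 80,
    "mini_projects": 90,
    "final_tutorial": 85,
    "participation": 100,
}
print(round(final_grade(example), 2))
```

For this hypothetical student the components contribute 16 + 36 + 29.75 + 5 points, for a weighted grade of 86.75.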

You can earn full credit for class participation in three ways:

Lecture participation, asking questions and answering your peers' questions;
Piazza participation, asking and answering questions on Piazza; and
Regular attendance at office hours.

To earn full credit, aim to ask or answer a question at least once every two weeks in lecture or on Piazza, or attend office hours at least once a month (this can include just coming to my office hours to chat about computer science, data science, software engineering, etc.).

This course is aimed at junior- and senior-level Computer Science majors, but should be accessible to any student of life with some degree of mathematical and statistical maturity, reasonable experience with programming, and an interest in the topic area. If in doubt, e-mail me: john@cs.umd.edu!

Office Hours & Communication

For course-related questions, please use Piazza to communicate with your fellow students, the TAs, and the course instructor. For private correspondence or special situations (e.g., excused absences, DDS accommodations, etc.), please email John with [CMSC320] in the email subject line.

Office Hours
Human	Email	Time	Location
Neil Alberg	`nalberg@umd.edu`	Wednesdays, 2PM–3PM	AVW 1120
Anant Dalela	`adalela@umd.edu`	Fridays, 11AM–12PM	AVW 1112
John Dickerson	`john@cs.umd.edu`	Fridays, 1PM–2PM. Also by appointment; please email me with `[CMSC320]` in the email subject line.	AVW 3217
Denis Peskov	`dpeskov@cs.umd.edu`	Mondays, 12PM–1PM & Thursdays, 12:30PM–1:30PM	AVW 1112

University Policies and Resources

Policies relevant to undergraduate courses are found here: http://ugst.umd.edu/courserelatedpolicies.html. Topics addressed in these policies include academic integrity, student and instructor conduct, accessibility and accommodations, attendance and excused absences, grades and appeals, and copyright and intellectual property.

Course evaluations

Course evaluations are important and the department and faculty take student feedback seriously. Near the end of the semester, students can go to http://www.courseevalum.umd.edu to complete their evaluations.

Schedule

(Schedule subject to change as the semester progresses!)
#	Date	Topic	Reading	Slides	Lecturer	Notes
1	1/26	Introduction	What the Fox Knows.	pdf, pptx	Dickerson	Sign up on Piazza!
Part I: Data Collection, Storage, & Management
2	1/31	Scraping Data with Python	Anaconda's Test Drive.	pdf, pptx	Dickerson	PDF download script from class: link
3	2/2	NumPy, SciPy, & DataFrames	Introduction to pandas.	pdf, pptx	Dickerson	Pandas tutorials: link
4	2/7	Jupyter notebook lab	—	—	Denis, Anant, & Neil	Collection of Jupyter notebooks: link; Titanic dataset used in class: link
5	2/9	Best Practices for Data Science Projects	Derman & Wilmott's "Financial Modelers' Manifesto."	pdf, pptx	Dickerson	—
6	2/14	Version Control, & Data Wrangling I: Tidy Data	Hadley Wickham. "Tidy Data."	pdf, pptx	Dickerson	Workflows for git: link; Hould's Tidy Data for Python
7	2/16	Data Wrangling II: SQL	—	pdf, pptx	Dickerson	SQLite: link; pandasql library: link
8	2/21	Missing Data	Pandas tutorial on working with missing data.	pdf, pptx	Dickerson	Scikit-learn's imputation functionality: link
9	2/23	Exploratory Data Analysis: Summary Statistics, Transformations, & Visualization	John W. Tukey: His Life and Professional Contributions.	pdf, pptx	Dickerson	Seaborn visualization library for Python: link
10	2/28	Visualization	Edward R. Tufte. The Visual Display of Quantitative Information (examples.)	pdf, pptx	Dickerson	/r/dataisbeautiful: link; (for both good and bad examples of visualization ...)
11	3/2	Graph & Network Processing	—	pdf, pptx	Dickerson	NetworkX: link
12	3/7	Free Text & NLP	NLTK Book.	pdf, pptx	Dickerson	Python Natural Language Toolkit (NLTK): link; Criticisms of the Turing Test: link
Part II: Statistical Modeling, Analysis, & Machine Learning
13	3/9	Introduction to Machine Learning, & Linear Regression	Hal Daumé III. A Course in Machine Learning.	pdf, pptx	Dickerson	Scikit-learn cheat sheet: link
14	3/14	Snow Day!	—	—	—	—
15	3/16	Gradient Descent & Linear Classification	—	pdf, pptx	Dickerson	Tensorflow: link
—	3/21	Spring Break	—	—	—	—
—	3/23	Spring Break	—	—	—	—
16	3/28	Decision Trees & Basic Model Evaluation	Russell & Norvig's Chapter 18 lecture slides.	pdf, pptx	Dickerson	Scikit-learn's basic decision tree functionality: link; Bart Selman's CS4700: link
17	3/30	Decision Trees & Basic Model Evaluation	—	pdf, pptx	Dickerson	xkcd on overfitting: link
18	4/4	Random Forests, K-NN, SVMs	The Curse of Dimensionality.	pdf, pptx	Dickerson	Extremely Randomized Trees: link
19	4/6	Hypothesis Testing	Regina Nuzzo. "Scientific method: Statistical errors."	pdf, pptx	Dickerson	p-hacking: link; Goodhart's Law: link
20	4/11	Midterm Review & Regularization	NLPers blog post on overfitting.	pdf, pptx	Dickerson	Passover; please email John if this causes problems
Intermission: Staring into the Abyss
21	4/13	Midterm	—	—	—	—
Part III: Additional & Advanced Topics
22	4/18	Regularization & Dimensionality Reduction I	—	pdf, pptx	Dickerson	Interactive k-means: link
23	4/20	Homework 3 Day	—	—	Dickerson	—
24	4/25	Dimensionality Reduction II & Collaborative Filtering	Tutorial on PCA.	pdf, pptx	Dickerson	Netflix prize rules: link
25	4/27	Scaling Up: Stochastic Gradient Descent, & Big Data & MapReduce I	—	pdf, pptx	Dickerson	Apache Hadoop: link
26	5/2	Debugging Data Science	—	pdf, pptx	Dickerson	—
27	5/4	Poll: Deep Learning or Organ Allocation	—	pdf, pptx	Dickerson	—
28	5/9	Course wrap-up & data science in the real world	—	—	Denis Peskov	AAMAS
29	5/11	Up in the air for now				AAMAS
Final	5/19	Final Exam Date	Final versions of tutorials must be posted by 10:30AM, the exam time.			Instructions & rubric: link

Mini-Projects né Homework

In addition to the tutorial to be posted publicly at the end of the semester, there will be four "mini-projects" assigned over the course of the semester. The best way to learn is by doing, so these will largely be applied assignments that provide hands-on experience with the basic skills a data scientist needs in industry.

Posting solutions publicly online without the staff's express consent is a direct violation of our academic integrity policy. Late assignments will not be accepted.

(Assignments will appear over the course of the semester.)
#	Description	Date Released	Date Due	Project Link
1	Fly Me To The Moon	February 25	March 11	link
2	Moneyball	March 11	April 2	link
3	Fact Tank	April 4	April 23	Part I, Part II
4	Baltimore Crime	April 24	May 4	link

CMSC320 – Spring 2017