Instructor
Amol Deshpande
Email: amol at cs dot umd dot edu
Office hours: TBA, or by appointment.
Office AVW 3221, Tel. 301-405-2703
Teaching Assistants
Theodoros Rekatsinas, Abdul Quamar
Office hours: TBA.
Prerequisite
Minimum grade of C- in CMSC351 and CMSC330, and permission of the CMNS-Computer Science department; or must be in the Computer Science (Doctoral) or Computer Science (Master's) program.
Goals and Learning Objectives
Data Science, or Data Analytics, is the practice of creating and applying
data-centric software to extract actionable knowledge and insights from a
collection of heterogeneous data sources in order to answer specific scientific,
socio-political, or business questions. Data science has been called "the
sexiest job of the 21st century", and the demand for data scientists is expected
to far exceed the available supply in the near future. Data science incorporates
practices from a variety of fields including statistics, machine learning,
databases, distributed systems, algorithms, data warehousing, high-performance
computing, and visualization. Thus, at a minimum, today's data scientist needs
to have familiarity with: data processing and management tools like relational
databases and Hadoop for processing large volumes of data; scripting languages
like Python for quickly writing programs to clean and transform messy raw data;
basic machine learning and data mining algorithms for analyzing the data;
statistical computing environments like R and Matlab for writing analysis
scripts; and visualization tools for presentation and communication of analysis
results.
This is the first of two courses that cover the practice of data science. This
course focuses on the acquisition, cleaning, and integration of data, data
management and processing, and data modeling. The course will cover a broad
variety of tools and techniques, and will focus on breadth rather than depth.
Students will learn how to model and reason about data, and how to process and manipulate it in various ways.
They will be introduced to a variety of data management tools, and will learn the appropriate usage scenario for each tool.
They will learn the key steps in acquiring and integrating data, including how to do data cleaning, entity resolution, information extraction,
and data integration.
The course
will be heavily assignment-based with bi-weekly assignments, and will draw
extensively from applications.
The second course will be offered by
Prof. Hector Corrada Bravo in Spring 2015,
and will focus on exploratory and statistical data analysis, machine learning
and data mining methods, data and information visualization, and the effective
presentation and communication of insights obtained by these methods.
List of Topics to be Covered
- Typical workflow of a data scientist/life-cycle of data
- Data modeling: Relational, E/R, XML, JSON data models; Dremel/Protocol Buffers nested data
model; Normalization basics
- Data preparation: Data scraping/wrangling; Data integration; Information extraction; Data
cleaning; Workflows
- Data management platforms: Relational databases and SQL; Parallel databases; Data cubes; Map-reduce,
Hadoop ecosystem; Data streaming systems; Key-value stores; Graph databases
Course Grading
The course will be heavily assignment-based, with bi-weekly assignments focused on learning how to use different tools.
You should be familiar with Java, be comfortable with using Unix/Linux, and also be comfortable with downloading and installing
packages from the Web (we will only use widely used packages that have extensive documentation). Although we will use other
languages like Python in some cases, sufficient guidance will be provided and prior familiarity with those languages is not expected.
Some examples of tools that we will learn to use include: (1) Amazon EC2 Cloud Computing, (2) Data Wrangler (a data cleaning tool),
(3) PostgreSQL (a relational database), (4) Hadoop/Map-Reduce, (5) MongoDB/HBase/Cassandra.
There will also be in-class exams and a final. Details to be announced later.
Class forum
This semester, we are using Piazza for class discussion (link at top). The system is designed
to get you help quickly and efficiently from classmates, TAs, and instructors. Rather than
emailing questions to the teaching staff, we encourage you to post your questions on Piazza. You and
other students can answer a question and edit the answer, with the teaching staff chiming in as
appropriate. Use Piazza to ask anything, from questions about assignments to when the next quiz is.
Textbook
There is no textbook for the course. A list of online resources will be provided later.
Late Submission Policy
TBA.
Excused Absences Due To Illness
A student claiming an excused absence must apply in writing and furnish
documentary support (such as from a health care professional who treated
the student) for any assertion that the absence qualifies as an excused
absence. The support should explicitly indicate the dates or times the
student was incapacitated due to illness. Self-documentation of illness
is not itself sufficient support to excuse the absence. This instructor
is not under obligation to offer a substitute assignment or to give a
student a make-up assessment unless the failure to perform was due to
an excused absence.
Academic Integrity
The University of Maryland, College Park has a nationally recognized
Code of Academic Integrity, administered by the Student Honor Council.
This Code sets standards for academic integrity at Maryland for all
undergraduate and graduate students. As a student you are responsible
for upholding these standards for this course. It is very important for
you to be aware of the consequences of cheating, fabrication,
facilitation, and plagiarism. For more information on the Code of
Academic Integrity or the Student Honor Council, please visit
http://www.shc.umd.edu.
To further exhibit your commitment to academic integrity, remember to
sign the Honor Pledge on all examinations and assignments: "I pledge on
my honor that I have not given or received any unauthorized assistance
on this examination (assignment)."