Instructor

Amol Deshpande
Email: amol at cs dot umd dot edu
Office hours: TBA, or by appointment.
Office AVW 3221, Tel. 301-405-2703

Teaching Assistants

Theodoros Rekatsinas, Abdul Quamar
Office hours: TBA.

Prerequisite

Minimum grade of C- in CMSC351 and CMSC330, and permission of the CMNS-Computer Science department; or enrollment in the Computer Science (Doctoral) or Computer Science (Master's) program.

Goals and Learning Objectives

Data Science, or Data Analytics, is the practice of creating and applying data-centric software to extract actionable knowledge and insights from a collection of heterogeneous data sources in order to answer specific scientific, socio-political, or business questions. Data science has been called "the sexiest job of the 21st century", and the demand for data scientists is expected to far exceed the available supply in the near future. Data science incorporates practices from a variety of fields including statistics, machine learning, databases, distributed systems, algorithms, data warehousing, high-performance computing, and visualization. Thus, at a minimum, today's data scientist needs to be familiar with: data processing and management tools like relational databases and Hadoop for processing large volumes of data; scripting languages like Python for quickly writing programs to clean and transform messy raw data; basic machine learning and data mining algorithms for analyzing the data; statistical computing environments like R and Matlab for writing analysis scripts; and visualization tools for presentation and communication of analysis results.

This is the first of two courses that cover the practice of data science. This course focuses on the acquisition, cleaning, and integration of data, data management and processing, and data modeling. The course will cover a broad variety of tools and techniques, and will focus on breadth rather than depth. Students will learn how to model and reason about data, and how to process and manipulate it in various ways. They will be introduced to a variety of data management tools, and will learn the appropriate usage scenario for each tool. They will learn the key steps in acquiring and integrating data, including how to do data cleaning, entity resolution, information extraction, and data integration. The course will be heavily assignment-based with bi-weekly assignments, and will draw extensively from applications.

The second course will be offered by Prof. Hector Corrada Bravo in Spring 2015, and will focus on exploratory and statistical data analysis, machine learning and data mining methods, data and information visualization, and the effective presentation and communication of insights obtained by these methods.

List of Topics to be Covered

Course Grading

The course will be heavily assignment-based, with bi-weekly assignments focused on learning how to use different tools. You should be familiar with Java, be comfortable with using Unix/Linux, and also be comfortable with downloading and installing packages from the Web (we will only use widely-used packages that have extensive documentation). Although we will use other languages like Python in some cases, sufficient guidance will be provided and prior familiarity with those languages is not expected.

Some examples of tools that we will learn to use include: (1) Amazon EC2 Cloud Computing, (2) Data Wrangler (a data cleaning tool), (3) PostgreSQL (a relational database), (4) Hadoop/Map-Reduce, (5) MongoDB/HBase/Cassandra.

There will also be in-class exams and a final. Details to be announced later.

Class forum

This semester, we are using Piazza for class discussion (link at top). The system is designed to get you help quickly and efficiently from classmates, TAs, and instructors. Rather than emailing questions to the teaching staff, we encourage you to post your questions on Piazza. You and other students can answer a question and edit the answer, with the teaching staff chiming in as appropriate. Use Piazza to ask anything, from questions about assignments to when the next quiz is.

Textbook

There is no textbook for the course. A list of online resources will be provided later.

Late Submission Policy

TBA.

Excused Absences Due To Illness

A student claiming an excused absence must apply in writing and furnish documentary support (such as from a health care professional who treated the student) for any assertion that the absence qualifies as an excused absence. The support should explicitly indicate the dates or times the student was incapacitated due to illness. Self-documentation of illness is not itself sufficient support to excuse the absence. The instructor is not under obligation to offer a substitute assignment or to give a student a make-up assessment unless the failure to perform was due to an excused absence.

Academic Integrity

The University of Maryland, College Park has a nationally recognized Code of Academic Integrity, administered by the Student Honor Council. This Code sets standards for academic integrity at Maryland for all undergraduate and graduate students. As a student, you are responsible for upholding these standards for this course. It is very important for you to be aware of the consequences of cheating, fabrication, facilitation, and plagiarism. For more information on the Code of Academic Integrity or the Student Honor Council, please visit http://www.shc.umd.edu. To further exhibit your commitment to academic integrity, remember to sign the Honor Pledge on all examinations and assignments: "I pledge on my honor that I have not given or received any unauthorized assistance on this examination (assignment)."