Syllabus - CMSC642/498K (Spring 2018)

Rationale and overall theme of the course

Data Science is largely driven by the unprecedented amounts of data being generated at a very rapid rate, thanks to the rise of the Internet and social media; the ubiquity of smartphones and miniature sensing devices; the increasing effectiveness of scientific instruments at collecting data such as genomic, astronomical, and medical data; and the increasing use of computational tools, especially simulations, in the social and physical sciences. Effectively managing and analyzing such "big data" is key to the overall data science process. Traditional data processing applications like relational database systems are highly functional, but are often unable to handle such "big data" for several reasons. The scale of the data is often much larger than traditional data management tools are equipped to handle (volume). The data tends to come from disparate sources and is highly heterogeneous (variety). The data is often generated at a very high rate (velocity). Finally, the data is often noisy, and its veracity can be hard to evaluate.

The course will focus on the diverse set of techniques, tools, and systems commonly used for performing data science on large volumes of data. It covers both relational database systems, still a mainstay of data management, and the so-called "NoSQL" systems. The goals of the course are to provide a broad overview of data management systems, with an emphasis on foundations and on understanding the strengths and limitations of the different systems. It will also cover some of the principles of cloud computing and data centers.

Prerequisites

For CMSC642 → Must be in the Professional Graduate Certificate Program;
For CMSC498K → Minimum grade of C- in CMSC330 and CMSC351

Staff

Name	Office Number	Role
Alan Sussman	AVW 4121	Instructor
Konstantinos Xirogiannopoulos		TA

Textbooks

There is no required textbook for the course; however, the course will draw on material from the following sources.

Title	Authors	Note
Database System Concepts, 6th edition	A. Silberschatz, H.F. Korth, S. Sudarshan	Any other undergraduate database textbook will suffice
Mining Massive Datasets	J. Leskovec, A. Rajaraman, J.D. Ullman	Available for free download

Course Topics (Subject to Change)

Data models and importance of modeling, Normal Forms
Relational databases + SQL
Parallel Databases
Cloud computing and data centers
Map-Reduce Framework
    Fundamentals
    Writing different algorithms in MR
    Hadoop + Spark
NoSQL Systems
    Basics, how and where to use them
    Key-value stores: Cassandra, HBase
    Document stores: MongoDB, Couchbase
    Graph Databases: Neo4j, OrientDB, AllegroGraph
Batch graph analytics systems
    GraphX, Giraph
Data streaming systems
    Storm, Spark Streaming

Grading

50%	Projects
20%	Midterm
30%	Final Exam
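
As an illustration of how these weights combine, here is a minimal sketch. It assumes each component is scored as a percentage from 0 to 100 and that the course score is a plain weighted sum; the syllabus does not state the exact scoring method or whether any curve is applied. The function name `course_score` and the sample scores are hypothetical.

```python
# Weights come from the grading table above.
# Assumption: each component is a percentage (0-100) and the
# course score is a straight weighted sum with no curve.
WEIGHTS = {"projects": 0.50, "midterm": 0.20, "final_exam": 0.30}


def course_score(scores):
    """Return the weighted course score (0-100) from per-component percentages."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Made-up example scores, for illustration only.
example = {"projects": 90.0, "midterm": 80.0, "final_exam": 70.0}
print(course_score(example))
```

With these sample scores the weighted sum works out to 45 + 16 + 21 = 82.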

Projects

Deadlines - All projects are due at 8 pm on the day specified in the project description. We will not make projects due on class days (Tuesday).
Submitting Projects - TBD
Closed Projects - All programming assignments in this course are to be written individually (unless explicitly indicated otherwise). Students may discuss projects (and homework) in groups. However, each student must write and/or program solutions independently.

Regarding Posting of Project Implementations

Do not post your assignment implementations online (e.g., GitHub, Pastebin) where they can be seen by others. Making your code accessible to others can lead to academic integrity violations.
Posting your projects in a private repository where only selected people (e.g., potential employers) can see them is OK. Just make sure it is not a public site.
Even once the course is over, do not make your code publicly available to others.

TA Room/Office Hours

Office hours tend to get busy the day before a project deadline. Therefore, do not wait to start your projects.

Once you have been helped by your instructor or TA, please leave the room. We have a large number of students in CS classes and the TA room can be very crowded.

Backups

You need to keep backups of your projects as you develop them. No extensions will be granted because you accidentally erased your project. Tools like git can help with that. Do not post code in any online system that is accessible to others (e.g., GitHub).

Piazza

We will be using Piazza for class communication. You cannot register for the class Piazza page yourself. I have already registered you using the e-mail address you have in the UMD system. If you are not registered, email the instructor.

Excused Absence and Academic Accommodations

See the section titled "Attendance, Absences, or Missed Assignments" available at Course Related Policies.

Disability Support Accommodations

See the section titled "Accessibility" available at Course Related Policies.

Academic Integrity

Note that academic dishonesty includes not only cheating, fabrication, and plagiarism, but also helping other students commit acts of academic dishonesty by allowing them to obtain copies of your work. In short, all submitted work must be your own. Cases of academic dishonesty will be pursued to the fullest extent possible as stipulated by the Office of Student Conduct.

It is very important for you to be aware of the consequences of cheating, fabrication, facilitation, and plagiarism. For more information on the Code of Academic Integrity or the Student Honor Council, please visit http://www.shc.umd.edu.

The CS Department takes academic integrity seriously. Information on how the CS Department views and handles academic integrity matters can be found at Academic Integrity.

The following are examples of academic integrity violations:

Hardcoding of results in a project assignment. Hardcoding refers to attempting to make a program appear as if it works correctly (e.g., printing expected results for a test).
Using any code available on the internet/web or any other source. For example, using code from GitHub.
Hiring any online service to complete an assignment for you.
Sharing your code or your tests with any student.
Using online forums (other than Piazza) in order to ask for help regarding assignments.

Miscellaneous

Please bring your laptop to lecture. If you don't have a laptop, talk with the instructor.
At the end of the semester, visit www.courseevalum.umd.edu to complete your course evaluations.
If you are experiencing difficulties in keeping up with the academic demands of this course, you may contact the Learning Assistance Service located at 2202 Shoemaker Building.
UMD course-related policies can be found at http://www.ugst.umd.edu/courserelatedpolicies.html

All course materials are copyright UMCP, Department of Computer Science © 2018. All rights reserved. Students are permitted to use course materials for their own personal use only. Course materials may not be distributed publicly or provided to others (excepting other students in the course), in any way or format.