Data Science is largely driven by the unprecedented amounts of data being generated at a very rapid rate, thanks to the rise of the Internet and social media; the ubiquity of smart phones and miniature sensing devices; the increasing effectiveness of scientific instruments at collecting data such as genomic, astronomical, and medical data; and the increasing use of computational tools, especially simulations, in the social and physical sciences. Effectively managing and analyzing such "big data" is key to the overall data science process. Traditional data processing applications like relational database systems are highly functional, but are often unable to handle such "big data" for several reasons. The scale of the data is often much larger than traditional data management tools are equipped to handle. The data tends to come from a variety of disparate sources and exhibits high heterogeneity and variety. The data is often generated at a very high rate (velocity). Finally, the data is often noisy, and its veracity can be hard to evaluate.
The course will focus on the diverse set of techniques, tools, and systems commonly used for performing data science on large volumes of data. It covers both relational database systems, still a mainstay of data management, and the so-called "NoSQL" systems. The goals of the course are to provide a broad overview of data management systems, with an emphasis on foundations and on understanding the strengths and limitations of the different systems. It will also cover some of the principles of cloud computing and data centers.
For CMSC642 → Must be in the Professional Graduate
For CMSC498K → Minimum grade of C- in CMSC330 and CMSC351
There is no required textbook for the course; however, the course will draw on material from the following sources.
|Database System Concepts, 6th edition||A. Silberschatz, H.F. Korth, S. Sudarshan||Any other undergraduate database textbook will suffice|
|Mining Massive Datasets||J. Leskovec, A. Rajaraman, J.D. Ullman||Available for free download|
Office hours tend to get busy the day before a project deadline, so do not wait to start your projects.
You need to keep backups of your projects as you develop them. No extensions will be granted because you accidentally erased your project. Tools like git can help with that. Do not post code in any online system that is accessible to others (e.g., GitHub).
We will be using Piazza for class communication. You cannot register for the class Piazza page yourself. I have already registered you using the e-mail you have in the UMD system. If you are not registered, email the instructor.
See the section titled "Attendance, Absences, or Missed Assignments" available at Course Related Policies.
See the section titled "Accessibility" available at Course Related Policies.
Note that academic dishonesty includes not only cheating, fabrication, and plagiarism, but also helping other students commit acts of academic dishonesty by allowing them to obtain copies of your work. In short, all submitted work must be your own. Cases of academic dishonesty will be pursued to the fullest extent possible as stipulated by the Office of Student Conduct.
It is very important for you to be aware of the consequences of cheating, fabrication, facilitation, and plagiarism. For more information on the Code of Academic Integrity or the Student Honor Council, please visit http://www.shc.umd.edu.
The CS Department takes academic integrity seriously. Information on how the CS Department views and handles academic integrity matters can be found at Academic Integrity.
The following are examples of academic integrity violations:
All course materials are copyright UMCP, Department of Computer Science © 2018. All rights reserved. Students are permitted to use course materials for their own personal use only. Course materials may not be distributed publicly or provided to others (excepting other students in the course), in any way or format.