**Class Time:** Mon 7:00pm--9:30pm. CSI 3120.

**Overview:** As large amounts of data are
being created it is important to understand how to analyse
the data to extract interesting trends and patterns. Since
the volume of data is large, it may not be feasible to make
more than a single pass over the data. Stream processing
methods provide effective ways to extract useful information
from large data sets by making very few passes on the data.
Surprisingly, a lot of information can be gleaned by making
a single pass over the data, or a small number of passes
over the data. The first part of the course will cover
random sampling and stream processing methods. We will also
consider privacy issues in data bases and how these should
be handled.

**Course Work:** Course work will consist of homeworks and
two exams. The relative weights of these will be 30% for the homeworks,
30% for the midterm and 40% for the final exam.

**Prerequisites:** CMSC 351. I expect
familiarity with basic algorithms. This course is an
algorithmic oriented course, with proofs of correctness etc.

**Office Hours:** Mon 5:30pm-6:45pm. If you cannot make
these hours, please make an appointment to see me at a
different time.

**Office Hours:** TuTh 4:00pm-5:00pm. If you cannot make
these hours, please make an appointment to see me at a
different time.

*I will update this page every week during the semester. I
will place all homeworks as well as solutions to homeworks
here. If you have any trouble accessing them, please let me
know.
*

- Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, Jeff Ullman. We will be mainly using this book

- Fundations of Data Science by Avrim Blum, John Hopcroft, and Ravindran Kannan.
- Data Streaming by S. Muthukrishnan.

- Lecture 1 (Jan 29): Overview, data analysis, algorithms, Sampling
- Lecture 2 (Feb 5): Bonferroni Principle, Sampling, Hash function [Chapter 4]
- Lecture 3 (Feb 12): Streaming, Frequency Estimation, Distinct Element Estimation, Finding Frequent Elements [Chapter 4]
- Lecture 4 (Feb 19): K-center, and guest speaker Prof. David Mount covered Coresets. [Chapter 5]
- Lecture 5 (Feb 26): Linear Programming, Incremental K-center
- Lecture 6 (March 5): Guest lecture by Dr. Jessica on Linear Programming and Gurobi. Files to be used
- Lecture 7 (March 12): K-shingles, min-hash [Chapter 3]
- Lecture 8 (March 26): LSH, Link Analysis [Chapter 3, 5]
- Lecture 9 (April 2): Information Visualization (Ben Shneiderman), midterm
- Lecture 10 (April 9): Social Network Analysis [Chapter 10?]
- Lecture 11 (April 16): Guest lecture by Dr. Jessica, topic TBD
- Lecture 12 (April 23): TBD
- Lecture 13 (April 30): TBD
- Lecture 14 (May 7): Final exam

Here is the schudule of some previous term. This gives some ideas about the course, but the material for this term will **NOT** be the same!

- Lecture 1 (Jan 29): Overview of the course. Finding Min/Max with (3/2) n comparisons. Discussion of lower bounds. Discussion of Divide and Conquer based Selection. READ: Intro to Algorithms, by Cormen, Leiserson, Rivest and Stein (Chap 9).
- Lecture 2 (Jan 31): Streaming algorithm for selection by Munro and Paterson.
- Lecture 3 (Feb 5): More on selection (Munro-Paterson). See handout as well. Simple sampling based selection algorithm also covered.
- Lecture 4 (Feb 7): Start on the Manku-Rajagopalan-Lindsay paper for approximate selection in a single pass. Will distribute for other material as well.
- Lecture 5 (Feb 12): Random Sampling and Reservoir Sampling.
- Lecture 6 (Feb 14): More on Sampling (see notes), especially Markov, and Chernoff bounds and sample size required to estimate number of blue balls in a bin and Bloom Filters.
- Lecture 7 (Feb 19): Finding heavy hitters (see handout).
- Lecture 8 (Feb 21): Finish proof from last time and how do we prove correctness of algorithms?
- Lecture 9 (Feb 26): Finding Frequent elements (2 pass scheme) and amortized analysis.
- Lecture 10 (Feb 28): Hash functions
- Lecture 11 (Mar 4): Triangle counting in streams.
- Lecture 12 (Mar 6): Counting distinct elements. NOTES
- Lecture 13 (Mar 11): Review?
- Lecture 14 (Mar 13): MIDTERM (Closed Notes/Closed Book)
- Lecture 15 (Mar 25): Clustering of Streaming Data (Lecture by Matt McCutchen)
- Lecture 16 (Mar 27): Data streams and histograms.
- Lecture 17 (Apr 1): Mainly discussed possible class presentations and some stuff on Locally Sensitive Hashing and Voronoi Diagrams
- Lecture 18 (Apr 3): Data Layout Problems
- Lecture 19 (Apr 8): Support Vector Machines and Principal Component Analysis (Guest lecture by Dr. Rezarta Islamaj)
- Lecture 20 (Apr 10): Data Visualization (Guest lecture by Prof. Varshney)
- Lecture 21 (Apr 15): Data Mining (kNN and Classification)
- Lecture 22 (Apr 17): Data Mining (More on kNN approaches)
- Lecture 23 (Apr 22): Noah Easterly (Frequent Items/Assoc Rules) Slides Zaki paper Wang paper Matt McCutchen (Clustering with Outliers) Clustering paper
- Lecture 24 (Apr 24): Sharath Srinivas (Top K monitoring) Slides Aaron Cordova (Page Rank and Map Reduce)
- Lecture 25 (Apr 29): Dave Malhotra (Decision Trees) Slides (DM) Rosa Cowan (Predictive Classifiers Chapter 10) Slides (RC)
- Lecture 26 (May 1): I/O Eff. Algs. Nick Kuilema (Chap 12) & Antoni Gmurczyk (Chap 10) Slides (NK) Slides (AG)
- Lecture 27 (May 6): Marianna Martindale (Statistical Machine Translation 1) (Statistical Machine Translation 2) (Slides (MM)) Hyoungtae Cho (Collaborative Filtering 1) (Collaborative Filtering 2) Slides (HC)
- Lecture 28 (May 8): Martin Paraskevov (Locally Sensitive Hashing) Rustin Rockstroh (Fraud Detection)
- Lecture 29 (May 13): Tim Creech & Charles Hawkins (Karp's random streaming model)