(Algorithms for Analyzing Massive Data Sets and Data Mining)

**Instructor:**
Samir Khuller Office: AVW 3369. Office phone: (301) 405--6765.
E-mail: samir@cs.umd.edu.

**Office Hours:** TuTh 11-12
If you cannot make these hours, please make an appointment to see
me at a different time. On Wed (Apr 2) I will office hours in the morning
as well 10-12. Please stop by to talk about your class presentation next week.

*
I will update this page every week during the
semester. I will place all homeworks as well
as solutions to homeworks here. If you have any trouble
accessing them, please let me know. These are in postscript.
*

**Class Time:** Tu and Th 9:30--10:45am. CSI 3118.

**Course Overview:**

As large amounts of data are being created it is important to understand how to analyse the data to extract interesting trends and patterns. This is called "Data mining". Since the volume of data is large, it may not be feasible to make more than a single pass over the data. Stream processing methods provide effective ways to extract useful information from large data sets by making very few passes on the data. Surprisingly, a lot of information can be gleaned by making a single pass over the data, or a small number of passes over the data. The first part of the course will cover random sampling and stream processing methods. We will also consider privacy issues in data bases and how these should be handled. The second part of the course will cover more traditional data mining topics such as clustering, statistical inference and association rule learning. Data warehousing and online analytical processing is another topic we will cover.

**Useful Readings:**

*Principles of Data Mining*by Hand, Mannila and Smyth (MIT press).-
*Data Streaming*by S. Muthukrishnan. Available on his web page.

*Data Mining: Concepts and Techniques* by Han and Kamber
(Morgan Kaufmann).

**Course Work:**
Course work will consist of homeworks, an in-class presentation
and two exams.
The relative weights of these will be 20% for the homeworks,
10% for the in-class
presentation, 30% for the midterm and 40% for the final exam.

**Homeworks:**

** Homework 1 (due 2/21)**
** Solution
1 (Matt) Average: 67**

** Homework 2
(due 3/6)**
** Solution
2 (Matt)**
** Solution
2 (Noah) Average: 68**
Comment: Read both solutions and learn from them.

Problem (3): The Chernoff bound I gave out in class mistakenly had a 2 which should have been a 3. Chernoff bounds are different when we bound the probability that the variable is much larger than the mean from the bound when the variable is much smaller than the mean. The chance that the variable is much larger than the mean has a "3" and the chance that the variable is much smaller than the mean has a "2". So when we make the term 2 e^{-mu delta^2/X} its safe to put "3" in place of the X. For Problem (4) its simpler to derive the chance that a node is chosen simply based on its degree. In fact this leads to Turan's Theorem that shows that every graph has a large Indep. Set. For problem (5) there is a simpler algorithm based on Breadth First Search to check if the graph is bipartite or not.

** Homework 4
(due April 1)
Solution 4
(Matt) Average: 66
**

** Homework 5
(due May 6)
Solution 5
Average: 74
**

** Midterm Average : 80**

Our final is scheduled for Fri, May 16th 8-10am (show up by 8:30am!).

**Prerequisites:**
CMSC 351. I expect familiarity with basic algorithms. This course
is an algorithmic oriented course, with proofs of correctness etc.

**Lectures**

- Lecture 1 (Jan 29): Overview of the course. Finding Min/Max with (3/2) n comparisons. Discussion of lower bounds. Discussion of Divide and Conquer based Selection. READ: Intro to Algorithms, by Cormen, Leiserson, Rivest and Stein (Chap 9).
- Lecture 2 (Jan 31): Streaming algorithm for selection by Munro and Paterson.
- Lecture 3 (Feb 5): More on selection (Munro-Paterson). See handout as well. Simple sampling based selection algorithm also covered.
- Lecture 4 (Feb 7): Start on the Manku-Rajagopalan-Lindsay paper for approximate selection in a single pass. Will distribute for other material as well.
- Lecture 5 (Feb 12): Random Sampling and Reservoir Sampling.
- Lecture 6 (Feb 14): More on Sampling (see notes), especially Markov, and Chernoff bounds and sample size required to estimate number of blue balls in a bin and Bloom Filters.
- Lecture 7 (Feb 19): Finding heavy hitters (see handout).
- Lecture 8 (Feb 21): Finish proof from last time and how do we prove correctness of algorithms?
- Lecture 9 (Feb 26): Finding Frequent elements (2 pass scheme) and amortized analysis.
- Lecture 10 (Feb 28): Hash functions
- Lecture 11 (Mar 4): Triangle counting in streams.
- Lecture 12 (Mar 6): Counting distinct elements. NOTES
- Lecture 13 (Mar 11): Review?
- Lecture 14 (Mar 13): MIDTERM (Closed Notes/Closed Book)
- Lecture 15 (Mar 25): Clustering of Streaming Data (Lecture by Matt McCutchen)
- Lecture 16 (Mar 27): Data streams and histograms.
- Lecture 17 (Apr 1): Mainly discussed possible class presentations and some stuff on Locally Sensitive Hashing and Voronoi Diagrams
- Lecture 18 (Apr 3): Data Layout Problems
- Lecture 19 (Apr 8): Support Vector Machines and Principal Component Analysis (Guest lecture by Dr. Rezarta Islamaj)
- Lecture 20 (Apr 10): Data Visualization (Guest lecture by Prof. Varshney)
- Lecture 21 (Apr 15): Data Mining (kNN and Classification)
- Lecture 22 (Apr 17): Data Mining (More on kNN approaches)
- Lecture 23 (Apr 22): Noah Easterly (Frequent Items/Assoc Rules) Slides Zaki paper Wang paper Matt McCutchen (Clustering with Outliers) Clustering paper
- Lecture 24 (Apr 24): Sharath Srinivas (Top K monitoring) Slides Aaron Cordova (Page Rank and Map Reduce)
- Lecture 25 (Apr 29): Dave Malhotra (Decision Trees) Slides (DM) Rosa Cowan (Predictive Classifiers Chapter 10) Slides (RC)
- Lecture 26 (May 1): I/O Eff. Algs. Nick Kuilema (Chap 12) & Antoni Gmurczyk (Chap 10) Slides (NK) Slides (AG)
- Lecture 27 (May 6): Marianna Martindale (Statistical Machine Translation 1) (Statistical Machine Translation 2) (Slides (MM)) Hyoungtae Cho (Collaborative Filtering 1) (Collaborative Filtering 2) Slides (HC)
- Lecture 28 (May 8): Martin Paraskevov (Locally Sensitive Hashing) Rustin Rockstroh (Fraud Detection)
- Lecture 29 (May 13): Tim Creech & Charles Hawkins (Karp's random streaming model)