Home Page for CMSC 498K
(Algorithms for Analyzing Massive Data Sets and Data Mining)

Instructor: Samir Khuller Office: AVW 3369. Office phone: (301) 405--6765. E-mail: samir@cs.umd.edu.

Office Hours: TuTh 11-12 If you cannot make these hours, please make an appointment to see me at a different time. On Wed (Apr 2) I will office hours in the morning as well 10-12. Please stop by to talk about your class presentation next week.

I will update this page every week during the semester. I will place all homeworks as well as solutions to homeworks here. If you have any trouble accessing them, please let me know. These are in postscript.

Class Time: Tu and Th 9:30--10:45am. CSI 3118.

Course Overview:

As large amounts of data are being created it is important to understand how to analyse the data to extract interesting trends and patterns. This is called "Data mining". Since the volume of data is large, it may not be feasible to make more than a single pass over the data. Stream processing methods provide effective ways to extract useful information from large data sets by making very few passes on the data. Surprisingly, a lot of information can be gleaned by making a single pass over the data, or a small number of passes over the data. The first part of the course will cover random sampling and stream processing methods. We will also consider privacy issues in data bases and how these should be handled. The second part of the course will cover more traditional data mining topics such as clustering, statistical inference and association rule learning. Data warehousing and online analytical processing is another topic we will cover.

Readings

Useful Readings:

Additional Readings:

Data Mining: Concepts and Techniques by Han and Kamber (Morgan Kaufmann).

Course Work: Course work will consist of homeworks, an in-class presentation and two exams. The relative weights of these will be 20% for the homeworks, 10% for the in-class presentation, 30% for the midterm and 40% for the final exam.

Homeworks:

Homework 1 (due 2/21) Solution 1 (Matt) Average: 67

Homework 2 (due 3/6) Solution 2 (Matt) Solution 2 (Noah) Average: 68 Comment: Read both solutions and learn from them.

Problem (3): The Chernoff bound I gave out in class mistakenly had a 2 which should have been a 3. Chernoff bounds are different when we bound the probability that the variable is much larger than the mean from the bound when the variable is much smaller than the mean. The chance that the variable is much larger than the mean has a "3" and the chance that the variable is much smaller than the mean has a "2". So when we make the term 2 e^{-mu delta^2/X} its safe to put "3" in place of the X. For Problem (4) its simpler to derive the chance that a node is chosen simply based on its degree. In fact this leads to Turan's Theorem that shows that every graph has a large Indep. Set. For problem (5) there is a simpler algorithm based on Breadth First Search to check if the graph is bipartite or not.

Homework 4 (due April 1) Solution 4 (Matt) Average: 66

Homework 5 (due May 6) Solution 5 Average: 74

HANDOUT

Midterm Average : 80

Our final is scheduled for Fri, May 16th 8-10am (show up by 8:30am!).

Prerequisites: CMSC 351. I expect familiarity with basic algorithms. This course is an algorithmic oriented course, with proofs of correctness etc.

Lectures