CMSC 828M Spring 2008 Hanan Samet Tu Th 2-3:15PM Title: Foundations of Multidimensional & Metric Data Structures Abstract: The representation of multidimensional data is an important issue in diverse fields including database management systems, computer graphics, game programming, computer vision, geographic information systems (GIS), pattern recognition, similarity retrieval, solid modeling, computer-aided design, robotics, image processing, computational geometry, and numerous others. In this course we will discuss representations and algorithms for points, objects, intervals (including rectangles), high dimensional data, and metric data (i.e., objects for which we only know their separation or distance from other objects and the fact that these distances satisfy the triangle inequality). The object representations are discussed on the basis of whether they are designed to respond to the query of where an object is located or whether they are designed to yield the set of objects associated with a given location. The multidimensional point objects are discussed on the basis of whether they are tree-based or trie-based. In addition, a number of bucket methods are discussed which are designed for processing large volumes of data which are too large to fit in main memory. The discussion of high dimensional data is in the context of how it supports similarity queries which in essence are nearest neighbor queries. The depth-first and best-first nearest neighbor finding strategies are discussed in great detail for both conventional data of high dimensions and metric data. We also address the observation that for high-dimensional data, the use of multidimensional indexing methods often fails to yield any improvement in performance over a simple sequential scan of the data. This is in part due to the cost of computing distance in high dimensions, the nature of the data distribution, and the relatively low volume of available data vis-a-vis its dimensionality (i.e., the curse of dimensionality). One response is to take note of the observation that the ``inherent dimensionality'' of a data set is often much lower than that of the underlying space. In particular, this has led to the development of techniques to reduce the dimensionality of the data using methods such as singular value decomposition (SVD). An alternative solution which we discuss is to make use of contractive embedding methods which embed the data in a lower dimensional space, and then make use of a filter-and-refine algorithm in conjunction with multidimensional indexing methods to prune irrelevant candidates from consideration. We discuss a number of embedding methods including FastMap, SparseMap, MetricMap, and Lipschitz embeddings. These techniques are useful in a number of database applications including multimedia and bioinformatics. Workload: occasional quizzes, final, some homework exercises, and a course project which can either be a small research topic or a review of some papers on a particular subject. Prerequisites: CMSC 420 Graduate Program Requirements Satisfaction: * PhD qualifying course in DB * MS qualifying course in DB * MS Comp: Final in DB Outline: A. Point Data 1. Tree-based 2. Trie-based 3. Bucket methods 4. Instantiation methods B. Object-based and Image-based Data Object Data 1. Interior-based representations 2. Boundary-based representations 3. Difference-based compaction methods C. Intervals and Small Rectangle Data 1. Plane-sweep methods 2. Point-based methods 3. Area-based methods D. High-dimensional and Metric Data 1. Best-first nearest neighbor finding 2. Depth-first k nearest neighbor finding 3. Skyline queries 4. Nearest neighbor finding along a network 5. Approximate nearest neighbor finding 6. Multidimensional indexing methods 7. Distance-based indexing methods 8. Nonconventional distance metrics a. Hausdorff distance b. Earth mover's distance 9. SASH (searching in the absence of a metric) 10. Dimension-reduction methods 11. Embedding methods 12. Locality sensitive hashing 13. Geometric hashing E. Peer-to-Peer Data Structures 1. Chord Method a. MX-CIF quadtree adaptation b. Loose quadtree c. M-Chord for M-trees 2. CAN Method F. Position-Independent Indexing and Applications in Astrometry G. Advanced Computer Architectures for Spatial and Metric Data Structures 1. GPUs 2. Cloud computing like Map Reduce and Hadoop These topics will be discussed in varying amount of detail depending on the interests and background of the students. Text: H. Samet, Foundations of Multidimensional and Metric Data Structures, Morgan-Kaufmann, San Francisco, 2006. ISBN-10: 0123694469 ISBN-13: 978-0123694461