Database Group
url: www.cs.umd.edu/areas/Databases/
In recent years, we have seen a tremendous increase in the data available in digital format, the World Wide Web being a prominent example, and this trend is expected to accelerate with the proliferation of devices, ranging from genome sequencing machines to microscopic biomedical sensors, that are capable of capturing even the minutest details of our everyday world. The database group at the University of Maryland, College Park carries out a diverse research agenda addressing the full spectrum of data management challenges in today's information age. Some of the most important focus areas over the last few years include life sciences and biological databases, sensor networks, probabilistic databases, mobile databases, P2P networks, and unstructured text databases. At the same time, the database group has continued innovating in traditional data management topics such as managing and querying data warehouses, spatial databases, query processing and optimization, data streams, approximate query processing, and data mining. Aside from the synergies in research interests within the group in areas such as data integration, probabilistic databases, and sensor networks, the database group also collaborates extensively with other groups in the department, such as computer vision, theory, and the CLIP group.
The UMD Database group has a long and illustrious history that dates back more than 25 years to the pioneering work by Jack Minker on deductive databases. In 1981, Nick Roussopoulos joined the group and led research on database systems, data warehousing, and spatial indexing. Around the same time, Hanan Samet turned his attention to Geographical Information Systems and Spatial Databases, and led several projects in these areas. The group was at its largest during the period 1985-2000 with the additions of Christos Faloutsos¹ (1985), Leo Mark² (1985), Timos Sellis³ (1986), Louiqa Raschid (1987), Ken Salem⁴ (1988), V.S. Subrahmanian (1989), Mike Franklin⁵ (1993), and Sudarshan Chawathe⁶ (1998). The group recently welcomed Lise Getoor in 2002 and Amol Deshpande in 2005.
The Database group has consistently been ranked among the best in the world. Alumni from the group have gone on to faculty and industrial positions around the world, including AT&T Research, Arizona State University, Brown University, Georgia Tech, Google, IBM Almaden Research Center, IBM Haifa Labs, National Technical University of Athens, Purdue University, RPI, Seoul Technical University, UC Irvine, University of Athens, University of British Columbia, University of Pittsburgh, and University of Waterloo, to name a few.
Brief descriptions of the individual faculty research interests follow. More information about the members of the group and the projects can be found at the database group website: http://www.cs.umd.edu/areas/Databases/.
Hanan Samet's research is a continuing effort to investigate the applicability of hierarchical spatial data structures to geographic information systems, computer graphics, image processing, image databases, and visualization. This work has culminated in the book "Foundations of Multidimensional and Metric Data Structures", published in 2006 by Morgan Kaufmann, an imprint of Elsevier. This work also addresses algorithmic issues arising in applications such as the display of point cloud data, finding nearest neighbors in spatial networks like road maps, similarity searching in medical image databases like breast cancer images, and position-independent indexing for use in pictorially-specified queries on symbolic image databases. Hanan's research on the integration of spatial and nonspatial data into a DBMS has resulted in the development of two systems: QUILT GIS, a working geographic information system, and SAND, a home-grown database system that allows spatial queries to be specified graphically. He has also been developing the "STEWARD" system, a spatio-textual document search engine that is being deployed on the web site of the Department of Housing and Urban Development.
Nick Roussopoulos is interested in data storage reduction techniques using data aggregation (OLAP techniques), data correlation, and distributed data acquisition. This research has resulted in a data store called "Dwarf" for aggregating high-dimensional data with deep hierarchies. The Dwarf store algorithms discover and eliminate prefix and suffix redundancies and fuse an exponential number of aggregates into very compressed Dwarf data cubes. Published theoretical bounds on time and space complexity show a very low polynomial order, although for most real-life data sets the complexity is linear in the number of records or dimensions. The Dwarf technology was recently patented (US 7,133,876). Nick Roussopoulos is also working on adaptive search algorithms in Peer-to-Peer networks, where data discovery improves with usage and groups are formed dynamically based on common usage, and on automatic replication and collection of data and services from and to mobile devices and sensors.
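To make the "exponential number of aggregates" concrete, here is a minimal sketch of a naive full data cube over a tiny, made-up fact table. It is not the Dwarf algorithm: it materializes every cell separately and does none of the prefix or suffix coalescing described above. The function name `full_cube`, the `ALL` marker, and the sample data are assumptions introduced only for illustration.

```python
from collections import defaultdict
from itertools import combinations

ALL = "*"  # marker for a dimension that has been aggregated away


def full_cube(rows, num_dims):
    """Naive full data cube: SUM of the measure for every subset of dimensions.

    rows: list of tuples (d_0, ..., d_{n-1}, measure).
    Returns a dict mapping a cell key (ALL for rolled-up dimensions) to its sum.
    """
    cube = defaultdict(int)
    # One group-by per subset of kept dimensions: 2**num_dims group-bys in total.
    for k in range(num_dims + 1):
        for kept in combinations(range(num_dims), k):
            for row in rows:
                key = tuple(row[i] if i in kept else ALL for i in range(num_dims))
                cube[key] += row[num_dims]
    return dict(cube)


# Hypothetical fact table: (store, product, sales)
facts = [("S1", "P1", 10), ("S1", "P2", 20), ("S2", "P1", 30)]
cube = full_cube(facts, num_dims=2)

for cell, total in sorted(cube.items()):
    print(cell, total)
```

In this toy data, store S2 sells a single product, so the cells (S2, P1) and (S2, *) carry the same value. Redundancy of that kind is what a compressed cube structure can store once instead of twice.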
V.S. Subrahmanian is looking at the problems of extracting interesting information from large, unstructured, multilingual text collections. His group has developed a system called TREX ("The RDF EXtractor") that extracts RDF ontologies from text sources. TREX can take a set of URLs and a topic schema as input, and produce as output an RDF instance of the schema. For example, an application may wish to instantiate information about companies (e.g. number of employees, main plants, number of employees for each of these plants, and so on). This structural information is viewed as a schema. TREX currently processes about 50,000 URLs per day, and has been used to provide the US Army's 10th Mountain Division with critical information prior to their deployment in Afghanistan. V.S. Subrahmanian and his group have also developed a system called OASYS ("Opinion Analysis System") that takes a topic as input and tries to identify the intensity of opinion on that topic in document collections. OASYS, a winner of Computerworld Magazine's 2006 Horizon Awards for most innovative pre-commercial software, currently identifies opinions on arbitrary topics in 8 languages, from 16 countries, and has processed over 4 million news articles to date.
Louiqa Raschid investigates data management, data integration and performance issues for applications in the life sciences, health information systems, humanitarian IT applications and Grid computing. Her current projects include "Biofast", a project that applies mediation-based technology, cost modeling and optimization to bioinformatics workflows, and "Lslinks", a methodology to enrich links between entries in bioinformatics resources with semantics; this semantic information is used to rank answers for navigational queries. She is also developing a framework for Web and Grid resource monitoring, called "ProMo", that allows expressing client data delivery requirements and server capabilities using notification rules. The semantics of these rules can be exploited to reason about the timeline of events, and to determine an appropriate schedule both to pull data from servers and to push notifications to clients. Finally, the "DisDM" project aims to support the diverse data management that occurs in conjunction with disasters. The challenges include discovering and modeling relevant data sources, determining their capabilities, content and quality, and performing on-the-fly integration. This research is in conjunction with the Sahana FOSS project for disaster management.
Lise Getoor's research interests are in machine learning and probabilistic reasoning applied to structured data (including relational data, semistructured data and graph data). Her group develops theory and algorithms for statistical relational learning and link mining. Her group has developed algorithms for link-based classification, collective entity resolution and link prediction. The techniques developed are useful for many important database and information management problems including fundamental problems such as representation of uncertainty in databases, entity resolution, schema and ontology integration, information extraction and selectivity estimation. The domains in which their techniques have been applied include social networks, citation networks, email collections, geospatial data and biological sequences and networks.
Amol Deshpande's research is motivated by the challenges in managing and processing real-world data generated by distributed measurement infrastructures like wide area sensor networks. The data generated by such infrastructures is typically incomplete, imprecise, and erroneous, and hence rarely useful as is. Further, the data is typically generated continuously at very high rates, and needs to be processed in real time. To solve these problems, Amol and his group are developing a system called "MauveDB" that aims to make it easy to process and to reason about streaming data through the use of statistical modeling tools. By choosing an appropriate statistical model to be applied to the data, MauveDB can be used to remove noise and errors from noisy data, to extrapolate and fill in gaps in incomplete data, and to predict unobserved variables or future states, in real time. In addition, Amol is developing models and tools for representing and querying uncertain, probabilistic data inside a relational database system. He is also developing adaptive techniques for processing high-rate data streams robustly, and is looking at ways to exploit parallelism for efficient query execution. Finally, Amol is interested in efficient extraction of data from sensor networks; he is currently working on theoretical bounds and practical algorithms for distributed compression in sensor networks.
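As a rough illustration of the idea of applying a statistical model to fill gaps in incomplete sensor data, the sketch below fits a straight line to the observed readings by ordinary least squares and uses it to predict the missing ones. This is not MauveDB's interface or implementation: the choice of a linear model, the function names `fit_line` and `fill_gaps`, and the sample readings are all assumptions made for the example.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = a + b * t to a list of (t, y) pairs."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    cov = sum((t - mean_t) * (y - mean_y) for t, y in points)
    b = cov / var_t          # slope
    a = mean_y - b * mean_t  # intercept
    return a, b


def fill_gaps(readings):
    """Replace missing (None) readings with the fitted model's prediction."""
    observed = [(t, y) for t, y in readings if y is not None]
    a, b = fit_line(observed)
    return [(t, y if y is not None else a + b * t) for t, y in readings]


# Hypothetical temperature readings as (time, value); None marks a dropped sample.
readings = [(0, 20.0), (1, 21.0), (2, None), (3, 23.0), (4, None)]
filled = fill_gaps(readings)

for t, y in filled:
    print(t, round(y, 2))
```

The same pattern covers both interpolation (the gap at time 2) and extrapolation (the gap at time 4); a real deployment would pick a model suited to the sensor and would also want an estimate of the prediction's uncertainty.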
¹ Currently at CMU, ² Georgia Tech, ³ NTUA, ⁴ University of Waterloo, ⁵ UC Berkeley, ⁶ University of Maine.

