You are granted permission for the non-commercial reproduction, distribution, display, and performance of this technical report in any format. However, this permission is only for a period of 45 (forty-five) days from the most recent time that you verified that this technical report is still available from the Department of Computer Science of the University of Maryland at College Park under terms that include this permission. All other rights are reserved by the author(s).
Designing Access Methods for Bitemporal Databases. Anil Kumar. Vassilis J. Tsotras. Christos Faloutsos. March 1997.
By supporting the valid and transaction time dimensions, bitemporal databases represent reality more accurately than conventional databases. In this paper we examine the issues involved in designing efficient access methods for bitemporal databases and propose the partial-persistence and the double-tree methodologies. The partial- persistence methodology reduces bitemporal queries to partial persistence problems for which an efficient access method is then designed. The double-tree methodology "sees" each bitemporal data object as consisting of two intervals (a valid-time and a transaction- time interval), and divides objects into two categories according to whether the right endpoint of the transaction time interval is already known. A common characteristic of both methodologies is that they take into account the properties of each time dimension. Their performance is compared with a straightforward approach that "sees" the intervals associated with a bitemporal object as composing one rectangle which is stored in a single multidimensional access method. Given that some limited additional space is available, our experimental results show that the partial- persistence methodology provides the best overall performance, especially for transaction timeslice queries. For those applications that require ready, off-the-shelf, access methods the double-tree methodology is a good alternative. (Also cross-referenced as UMIACS-TR-97-24) University of Maryland Institute for Advanced Computer Studies, Dept. of Computer Science, Univ. of Maryland,
Quantifiable Data Mining Using Principal Component Analysis. Flip Korn. Alexandros Labrinidis. Yannis Kotidis. Christos Faloutsos. Alex Kaplunovich. Dejan Perkovic. February 1997.
Association Rule Mining algorithms operate on a data matrix (e.g., customers x products) to derive rules. We propose a single-pass algorithm for mining linear rules in such a matrix based on Principal Component Analysis. PCA detects correlated columns of the matrix, which correspond to, e.g., products that sell together. The first contribution of this work is that we propose to quantify the ``goodness'' of a set of discovered rules. We define the ``guessing error'': the root-mean-square error of the reconstructed values of the cells of the given matrix, when we pretend that they are unknown. The second contribution is a novel method to guess missing/hidden values from the linear rules that our method derives. For example, if somebody bought $10 of milk and $3 of bread, our rules can ``guess'' the amount spent on, say, butter. Thus, we can perform a variety of important tasks such as forecasting, `what-if' scenarios, outlier detection, and visualization. Moreover, we show that we can compute the principal components with a single pass over the dataset. Experiments on real datasets (e.g., NBA statistics) demonstrate that the proposed method consistently achieves a ``guessing error'' of up to 5 times lower than the straightforward competitor. (Also cross-referenced as UMIACS-TR-97-13) University of Maryland Institute for Advanced Computer Studies, Dept. of Computer Science, Univ. of Maryland,
Fast Nearest Neighbor Search in Medical Image Databases. Flip Korn. Nikolaos Sidiropoulos. Christos Faloutsos. Eliot Siegel. Zenon Protopapas. March 1996.
We examine the problem of finding similar tumor shapes. Starting from a natural similarity function (the so-called `max morpholog- ical distance'), we showed how to lower-bound it and how to search for nearest neighbors in large collections of tumor-like shapes. Specifically, we used state-of-the-art concepts from morphology, namely the `pattern spectrum' of a shape, to map each shape to a point in $n$-dimensional space. Following \cite{Faloutsos94Fast,Jagadish91Retrieval}, we organized the $n$-d points in an R-tree. We showed that the $L_infty$ (= max) norm in the $n$-d space lower-bounds the actual distance. This guarantees no false dismissals for range queries. In addition, we developed a nearest-neighbor algorithm that also guarantees no false dismissals. Finally, we implemented the method, and we tested it against a testbed of realistic tumor shapes, using an established tumor- growth model of Murray Eden \cite{Eden:61}. The experiments showed that our method is up to 27 times faster than straightfor- ward sequential scanning. (Also cross-referenced as UMIACS-TR-96-17) University of Maryland Institute for Advanced Computer Studies, Dept. of Computer Science, Univ. of Maryland,
Analysis of the Clustering Properties of Hilbert Space-filling Curve. March 1996.
Bongki Moon. H.V. Jagadish. Christos Faloutsos. Joel Saltz. Several schemes for linear mapping of multidimensional space have been proposed for many applications such as access methods for spatio-temporal databases, image compression and so on. In all these applications, one of the most desired properties from such linear mappings is clustering, which means the locality between objects in the multidimensional space is preserved in the linear space. It is widely believed that Hilbert space-filling curve achieves the best clustering. In this paper we provide closed-form formulas of the number of clusters required by a given query region of an arbitrary shape (e.g., polygons and polyhedra) for Hilbert space-filling curve. Both the asymptotic solution for a general case and the exact solution for a special case generalize the previous work, and they agree with the empirical results that the number of clusters depends on the hyper-surface area of the query region and not on its hyper-volume. We have also shown that Hilbert curve achieves better clustering than z-curve. From the practical point of view, the formulas given in this paper provide a simple measure which can be used to predict the required disk access behaviors and hence the total access time. (Also cross-referenced as UMIACS-TR-96-20) University of Maryland Institute for Advanced Computer Studies, Dept. of Computer Science, Univ. of Maryland,
A Survey of Information Retrieval and Filtering Methods. Christos Faloutsos. Douglas W. Oard. August 1995.
We survey the major techniques for information retrieval. In the first part, we provide an overview of the traditional ones (full text scanning, inversion, signature files and clustering). In the second part we discuss attempts to include semantic information (natural language processing, latent semantic indexing and neural networks). Dept. of Computer Science, Univ. of Maryland,
Estimating the Selectivity of Spatial Queries Using the `Correlation'. Alberto Belussi. Christos Faloutsos. February 1995.
We examine the estimation of selectivities for range and spatial join queries in real spatial databases. As we have shown earlier, real point sets: (a) violate consistently the "uniformity" and "independence" assumptions, (b) can often be described as "fractals", with non-integer (fractal) dimension. In this paper we show that, among the infinite family of fractal dimensions, the so called "Correlation Dimension" D2 is the one that we need to predict the selectivity of spatial join. The main contribution is that, for all the real and synthetic point-sets we tried, the average number of neighbors for a given point of the point-set follows a power law, with D2 as the exponent. This immediately solves the selectivity estimation for spatial joins, as well as for "biased" range queries (i.e., queries whose centers prefer areas of high point density). We present the formulas to estimate the selectivity for the biased queries, including an integration constant (Kshape) for each query shape. Finally, we show results on real and synthetic point sets, where our formulas achieve very low relative errors (typically about 10%, versus 40%-100% of the uniform assumption). University of Maryland Institute for Advanced Computer Studies, Dept. of Computer Science, Univ. of Maryland,
Similarity Searching in Large Image Databases. Euripides G.M. Petrakis. Christos Faloutsos. December 1994.
We propose a method to handle approximate searching by image content in large image databases. Image content is represented by attributed relational graphs holding features of objects and relationships between objects. The method relies on the assumption that a fixed number of ``labeled'' or ``expected'' objects (e.g., ``heart'', ``lungs'' etc.) are common in all images of a given application domain in addition to a variable number of ``unexpected'' or ``unlabeled'' objects (e.g., ``tumor'', ``hematoma'' etc.). The method can answer queries by example such as ``{\em find all X-rays that are similar to Smith's X-ray}''. The stored images are mapped to points in a multidimensional space and are indexed using state-of-the-art database methods (R-trees). The proposed method has several desirable properties: (a) Database search is approximate so that all images up to a pre-specified degree of similarity (tolerance) are retrieved, (b) it has no ``false dismissals'' (i.e., all images qualifying query selection criteria are retrieved) and (c) it scales-up well as the database grows. We implemented the method and ran experiments on a database of synthetic (but realistic) medical images. The experiments showed that our method significantly outperforms sequential scanning by up to an order of magnitude. (Also cross-referenced as UMIACS-TR-94-134) University of Maryland Institute for Advanced Computer Studies, Dept. of Computer Science, Univ. of Maryland,
FastMap: A Fast Algorithm for Indexing, Data-Mining and. Christos Faloutsos. King-Ip (David) Lin. January 1995.
A very promising idea for fast searching in traditional and multimedia databases is to map objects into points in k-d space, using k feature-extraction functions, provided by a domain expert rJag91]. Thus. we can subsequently use highly fine-tuned spatia l access methods (SAMs), to answer several types of queries, including the 'Query By Example' type (which translates to a range query); the 'all pairs' query (which translates to a spatial join [BKSS94]); the nearest-neighbor or best-match query, etc. However, designing feature extraction functions can be hard. It is relatively easier for a domain expert to assess the similarity/distance of two objects. Given only the distance information though, it is not obvious how to map objects into points. This is exactly the topic of this paper. We describe a fast algorithm to map objects into points in some k-dimensional space (k is user-defined), such that the dissimilarities are preserved. There are two benefits from this mapping: (a) efficient retriev al, in conjunction with a SAM, as discussed before and (b) visualization and data-mining: the objects can now be plotted as points in 2-d or Sd space, revealing potential clusters, correlations among attributes and other regularities that data-mining is l ooking for. We introduce an older method from pattern recognition, namely, Multi-Dimcnsional Scaling (MDS) [Tor52]; although unsuitable for indexing, we use it as yardstick for our method. Then, we propose a much faster algorithm to solve the problem in hand, while in addition it allows for indexing. Experiments on real and synthetic data indeed show that the proposed algorithm is significantly faster than MDS, (being linear, as opposed to quadratic, on the database size N), while it manages to preserve distances an d the overall structure of the data-set. (Also cross-referenced as UMIACS-TR-94-132) University of Maryland Institute for Advanced Computer Studies, Dept. of Computer Science, Univ. of Maryland,
Analysis of the n-dimensional quadtree decomposition for arbitrary. Christos Faloutsos. H.V. Jagadish. Yannis Manolopoulos. December 1994.
We give a closed-form expression for the average number of $n$-dimensional quadtree nodes (`pieces' or `blocks') required by an $n$-dimensional hyper-rectangle aligned with the axes. Our formula includes as special cases the formulae of previous efforts for 2-dimensional spaces \cite{Faloutsos92Analytical}. It also agrees with theoretical and empirical results that the number of blocks depends on the hyper-surface of the hyper-rectangle and not on its hyper-volume. The practical use of the derived formula is that it allows the estimation of the space requirements of the $n$-dimensional quadtree decomposition. Quadtrees are used extensively in 2-dimensional spaces (geographic information systems and spatial databases in general), as well in higher dimensionality spaces (as oct-trees for 3-dimensional spaces, e.g. in graphics, robotics and 3-dimensional medical images [Arya et al., 1994]. Our formula permits the estimation of the space requirements for data hyper-rectangles when stored in an index structure like a ($n$-dimensional) quadtree, as well as the estimation of the search time for query hyper-rectangles. A theoretical contribution of the paper is the observation that the number of blocks is a piece-wise linear function of the sides of the hyper-rectangle. (Also cross-referenced as UMIACS-TR-94-130) University of Maryland Institute for Advanced Computer Studies, Dept. of Computer Science, Univ. of Maryland,
Beyond Uniformity and Independence:. February 1994.
Christos Faloutsos. Ibrahim Kamel. We propose the concept of fractal dimension of a set of points, in order to quantify the deviation from the uniformity distribution. Using measurements on real data sets (road intersections of U.S. counties, star coordinates from NASA's Infrared-Ultraviolet Explorer etc.) we provide evidence that real data indeed are skewed, and, moreover, we show that they behave as mathematical fractals, with a measurable, non-integer fractal dimension. Armed with this tool, we then show its practical use in predicting the performance of spatial access methods, and specifically of the R-trees. We provide the {\em first} analysis of R-trees for skewed distributions of points: We develop a formula that estimates the number of disk accesses for range queries, given only the fractal dimension of the point set, and its count. Experiments on real data sets show that the formula is very accurate: the relative error is usually below 5\%, and it rarely exceeds 10\%. We believe that the fractal dimension will help replace the uniformity and independence assumptions, allowing more accurate analysis for {\em any} spatial access method, as well as better estimates for query optimization on multi-attribute queries. NOTE - Appeared in PODS 1994. Christos Faloutsos and Ibrahim Kamel. "Beyond Uniformity and Independence: Analysis of R-Trees Using the Concept of Fractal Dimension", Proc. ACM SIGACT-SIGMOD-SIGART PODS. Minneapolis, MN (May 1994), pp. 4-13. (Also cross-referenced as UMIACS-TR-93-130) University of Maryland Institute for Advanced Computer Studies, Dept. of Computer Science, Univ. of Maryland,
Christos Faloutsos. M. Ranganathan. Yannis Manolopoulos. December 1993.
Fast Subsequence Matching in Time-Series Databases. We present an efficient indexing method to locate 1-dimensional subsequences within a collection of sequences, such that the subsequences match a given (query) pattern within a specified tolerance. The idea is to map each data sequence into a small set of multidimensional rectangles in feature space. Then, these rectangles can be readily indexed using traditional spatial access methods, like the R*-tree \cite{Beckmann90R}. In more detail, we use a sliding window over the data sequence and extract its features; the result is a trail in feature space. We propose an efficient and effective algorithm to divide such trails into sub-trails, which are subsequently represented by their Minimum Bounding Rectangles (MBRs). We also examine queries of varying lengths, and we show how to handle each case efficiently. We implemented our method and carried out experiments on synthetic and real data (stock price movements). We compared the method to sequential scanning, which is the only obvious competitor. The results were excellent: our method accelerated the search time from 3 times up to 100 times. Appeared in ACM SIGMOD 1994, pp 419-429. Given "Best Paper award" (Also cross-referenced as UMIACS-TR-93-131) Dept. of Computer Science, Univ. of Maryland,
Ibrahim Kamel. Christos Faloutsos. Hilbert R-tree: An improved Rtree using fractals. February 1993.
We propose a new R-tree structure that outperforms all the older ones. The heart of the idea is to pack rectangles into the Entree nodes according to a linear ordering as opposed to the more complex heuristics used in the R*Ñtree. This ordering has to be 'good', in the sense that it should cluster 'similar' data rectangles together, to minimize the area and perimeter of the resulting minimum bounding rectangles (MBRs). Among the orderings we tried, the '2bc' method, the one that uses the (2d) Filbert value of the center of the rectangles, gives the best results. For a static database, the proposed ordering achieves superior packing, outperforming older packing methods [25], and the best dynamic method (R*-trees [3]). The savings are as high as 36% on real data. Moreover, we introduce a dynamic variation, the Eilbc7t R-tree: Given the ordering, every node has a well-defined set of sibling nodes; thus, we can use splitting algorithms that are similar to the deferred splitting of the B*-trees. By adjusting the split policy, the Filbert R-tree can achieve as high utilization as desired. To the contrary, the R*-tree has no control over the utilization, typically achieving up to 70%. We designed the manipulation algorithms in detail, and we did a full implementation of the Filbert R-tree. Our experiments show that a '2-to-3' split policy achieves good results, consistently outperforming the best competitor (R*-trees), with up to 28% savings on real data. (Also cross-referenced as UMIACS-TR-93-12) University of Maryland Institute for Advanced Computer Studies, Dept. of Computer Science, Univ. of Maryland,
Christos Faloutsos. Ibrahim Kamel. Packed R-trees Using Fractals. December 1992.
We propose a new packing technique for R-trees for static databases. Given a collection of rectangles, we sort them and we build the Rtree bottom-up. There are several ways to sort the rectangles; the innovation of this work is the use of fractals, and specifically the hilbert curve, to achieve better ordering of the rectangles and eventually better packing. We proposed and implemented several variations and performed experiments on synthetic, as well as real data (a TIGER file from the U.S. Bureau of Census). The winning variation ('2D-c') was the one that sorts the rectangles according to the hilbert value of the center. This variation consistently outperforms the R*-trees [3], Guttman's R-trees [13], as well as the packing method of Roussopoulos and Leifker [24]. The performance gain of the our method seems to increase with the skeweness of the data distribution; specifically, on the (highly skewed) TIGER dataset, it achieves up to 38% improvement in response time over the best known R-tree variant and up to 58% over the older packing algorithm. (Also cross-referenced as UMIACS-TR-92-133) University of Maryland Institute for Advanced Computer Studies, Dept. of Computer Science, Univ. of Maryland,
Ibrahim Kamel. Christos Faloutsos. Parallel R-trees. January 6, 1992.
We consider the problem of exploiting parallelism to accelerate the performance of spatial access methods and specifically, R-trees [14]. Our goal is to design a server for spatial data, so that to maximize the throughput of range queries. This can be achieved by (a) maximizing parallelism for large range queries, and (b) by engaging as few disks as possible on point queries [26]. We propose a simple hardware architecture consisting of one processor with several disks attached to it. On this architecture, we propose to distribute the nodes of a traditional R-tree, with crossdisk pointers ('Multiplexed' R-tree). The R-tree code is identical to the one for a single-disk R-tree, with the only addition that we have to decide which disk a newly created R-tree node should be stored in. We propose and examine several criteria to choose a disk for a new node. The most successful one, termed 'proximity index' or PI, estimates the similarity of the new node with the other Rtree nodes already on a disk, and chooses the disk with the lowest similarity. Experimental results show that our scheme consistently outperforms all the other heuristics for node-to-disk assignments, achieving up to 55% gains over the Round Robin one. Experiments also indicate that the multiplexed Rtree with PI heuristic gives better response time than the disk-stripping (="Super-node") approach, and imposes lighter load on the I/O sub-system. The speed up of our method is close to linear speed up, increasing with the size of the queries. (Also cross-referenced as UMIACS-TR-92-1) University of Maryland Institute for Advanced Computer Studies, Dept. of Computer Science, Univ. of Maryland,
Christos Faloutsos. Raymond Lee. Catherine Plaisant. Ben Shneiderman. June 1989.
Incorporating String Search in a Hypertext System:User. Hypertext systems provide an appealing mechanism for informally browsing databases by traversing selectable links. However, in many fact finding situations string search is an effective complement to browsing. This paper describes the application of the signature file method to achieve rapid and convenient string search in small personal computer hypertext environments. The method has been implemented in a prototype, as well as in a commercial product. Performance data for search times and storage space are presented from a commercial hypertext database. User interface issues are then discussed. Experience with the string search interface indicates that it was used sucessfully by novice users. (Also cross-referenced as CAR-TR-448) Human Computer Interaction Laboratory, Center for Automation Research, Dept. of Computer Science, Univ. of Maryland, Institute for Systems Research,
Christos Faloutsos. Shari Roseman. Fractals for Secondary Key Retrieval. May 1989.
In this paper we propose the use of fractals and especially the Hilbert curve, in order to design good distance-preserving mappings. Such mappings improve the performance of secondary-key- and spatial- access methods, where multi-dimensional points have to be stored on an 1-dimensional medium (e.g., disk). Good clustering reduces the number of disk accesses on retrieval, improving the response time. Our experiments on range queries and nearest neighbor queries showed that the proposed Hilbert curve achieves better clustering than older methods ("bit-shuffling", or Peano curve), for every situation we tried. (Also cross-referenced as UMIACS-TR-89-47) University of Maryland Institute for Advanced Computer Studies, Dept. of Computer Science, Univ. of Maryland,
Last Generated Fri Aug 11 04:01:01 EDT 2000