Efficiently Supporting Ad Hoc Queries in Large Datasets of Time Sequences

Flip Korn


Ad hoc querying is difficult on very large datasets, since it is usually not possible to have the entire dataset on disk. While compression can be used to decrease the size of the dataset, compressed data is notoriously difficult to index or access.

In this paper we consider a very large dataset comprising multi- ple distinct time sequences. Each point in the sequence is a nu- merical value. We show how to compress such a dataset into a format that supports ad hoc querying, provided that a small error can be tolerated when the data is uncompressed. Experiments on large, real world datasets (AT&T customer calling patterns) show that the proposed method achieves an average of less than 5% er- ror in any data value after compressing to a mere 2.5% of the original space (i.e., a 40:1 compression ratio), with these numbers not very sensitive to dataset size. Experiments on ag- gregate queries achieved a 0.5% reconstruction error with under 2% space requirement.

Joint work with H.V. Jagadish, Christos Faloutsos

Back to the Spring 1997 dbchat index