Time sequences of real numbers constitute a significant portion of real-world data in many application domains, from finance to multimedia. In this thesis, we examine efficient methods for two important and closely related problems in large-scale databases of time sequences: (a) similarity-based search and (b) quantitative data mining.
We first investigate the similarity-based search problem. Unlike traditional, simple data types, time sequences should be compared by similarity rather than by exact matching. The choice of a particular similarity measure depends on the application. Hence, a time sequence database management system must support a diverse class of similarity models. In this thesis, we examine the "time warping" distance as our similarity model and develop efficient indexing/retrieval techniques based on it.
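As a rough illustration of the similarity model named above, the sketch below computes a time warping distance with the classic dynamic-programming recurrence. The abstract does not give the thesis's exact definition, so the base cost (absolute difference) and the function name `time_warping_distance` are assumptions, and the indexing techniques themselves are not shown.

```python
def time_warping_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic-programming time warping distance.

    Assumption: the per-element cost is |a[i] - b[j]|; the thesis may use
    a different base distance.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # d[i][j] holds the cheapest alignment cost of a[:i] with b[:j].
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Either sequence may be stretched, or both advance together.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

Under this definition, sequences of different lengths can still be at distance zero if one is a locally stretched copy of the other, which is what exact matching cannot express.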
Next, we examine Lp norm-based similarity models that include the popular Euclidean distance as a special case. While it is possible to support each Lp norm separately, our goal is to develop a single indexing scheme for all Lp norms simultaneously.
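For readers unfamiliar with the Lp family, the following minimal sketch evaluates the Lp distance between two equal-length sequences, with p = 2 giving the Euclidean distance. The function name `lp_distance` and the handling of p = infinity are illustrative choices, and the single indexing scheme developed in the thesis is not reproduced here.

```python
import math


def lp_distance(x, y, p):
    """Lp distance between equal-length sequences, for p >= 1 or p = inf."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    if p < 1:
        raise ValueError("p must be at least 1")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        # The limit p -> infinity is the largest coordinate difference.
        return max(diffs, default=0.0)
    return sum(d ** p for d in diffs) ** (1.0 / p)
```

Calling it with p = 1, p = 2, and `math.inf` gives the Manhattan, Euclidean, and maximum distances respectively.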
Finally, we investigate scalable methods for data mining in a large set of co-evolving time sequences. In many applications, the data of interest consists of multiple sequences that evolve together over time. Examples include currency exchange rates, network traffic data, and demographic data on multiple variables. We develop a fast method to analyze such co-evolving time sequences jointly, to allow (1) estimation/forecasting of missing, delayed, or future values, (2) quantitative data mining, discovering correlations (with or without lag) among the given sequences, and (3) outlier detection.
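To make "correlations with or without lag" concrete, here is a brute-force sketch that measures the Pearson correlation between one sequence and a delayed copy of another, then picks the lag with the strongest correlation. This is not the fast method of the thesis, which the abstract does not describe; the names `pearson`, `lagged_correlation`, and `best_lag` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def lagged_correlation(x, y, lag):
    """Correlation between x[t] and y[t + lag] for a non-negative lag."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    n = min(len(x), len(y) - lag)
    if n < 2:
        raise ValueError("not enough overlapping samples for this lag")
    return pearson(x[:n], y[lag:lag + n])


def best_lag(x, y, max_lag):
    """Lag in 0..max_lag with the largest absolute correlation."""
    return max(range(max_lag + 1), key=lambda k: abs(lagged_correlation(x, y, k)))
```

A lag of zero corresponds to ordinary correlation; a positive best lag suggests that the second sequence follows the first with a delay, which is the kind of relationship that could support forecasting delayed values.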