You are granted permission for the non-commercial reproduction, distribution, display, and performance of this technical report in any format. However, this permission is only for a period of 45 (forty-five) days from the most recent time that you verified that this technical report is still available from the Department of Computer Science of the University of Maryland at College Park under terms that include this permission. All other rights are reserved by the author(s).
Andreas Mueller. Fast Sequential and Parallel Algorithms for Association Rule Mining:. August 1995.
The field of knowledge discovery in databases, or "Data Mining", has received increasing attention during recent years as large organizations have begun to realize the potential value of the information that is stored implicitly in their databases. One specific data mining task is the mining of Association Rules, particularly from retail data. The task is to determine patterns (or rules) that characterize the shopping behavior of customers from a large database of previous consumer transactions. The rules can then be used to focus marketing efforts such as product placement and sales promotions. Because early algorithms required an unpredictably large number of IO operations, reducing IO cost has been the primary target of the algorithms presented in the literature. One of the most recent proposed algorithms, called PARTITION, uses a new TID-list data representation and a new partitioning technique. The partitioning technique reduces IO cost to a constant amount by processing one database portion at a time in memory. We implemented an algorithm called SPTID that incorporates both TID-lists and partitioning to study their benefits. For comparison, a non-partitioning algorithm called SEAR, which is based on a new prefix-tree data structure, is used. Our experiments with SPTID and SEAR indicate that TID-lists have inherent inefficiencies; furthermore, because all of the algorithms tested tend to be CPU-boundn trading CPU-overhead against I/O operations by partitioning did not lead to better performance. In order to scale mining algorithms to the huge databases (e.g., multiple Terabytes) that large organizations will manage in the near future, we implemented parallel versions of SEAR and SPEAR (its partitioned counterpart). The performance results show that, while both algorithms parallelize easily and obtain good speedup and scale-up results, the parallel SEAR version performs better than parallel SPEAR, despite the fact that it uses more communication. Dept. of Computer Science, Univ. of Maryland,
Last Generated Fri Aug 11 04:01:01 EDT 2000