SH-tree: Disk-based Space Partitioning Multidimensional Indexing Structure for High Dimensional Rectangular Datasets
This page is under heavy construction.
Currently, SH-tree ver.2.1 and KDB-tree ver.1.0 is available.
|
My classification of multidimensional indexing trees is slightly different from
that of Hybrid tree paper. The criterion of this classification is based on the
data structures of internal nodes.
Space Partitioning Method:
o The internal node of space partitioning methods is represented by a single split
dimension and one or two split positions.
Data Partitioning Method:
o The internal node of data partitioning methods is represented by a list of
bounding boxes of child nodes.
KDB-tree
o Space partitioning method
o One of the earliest space partitioning method
o No overlapping partitions
o Pros:
- Dimension-independent fan-out
o Cons:
- Downward cascading split problems due to non-overlapping partitions
- No guaranteed utilization
- Point data only (Or, transformation method is needed for non-point data)
Download KDB-tree source library ver 1.0
My source code is NOT based on R-trees unlike some other implementations.
R-tree style rectangular representation of internal nodes makes the number
of fan-outs dimension dependent. But the internal node of KDB-tree must be
represented as a binary KD-tree to be dimension independent.
If you find any bug, please let me know it.
Spatial KD-tree (SKD-tree)
o Space partitioning method
o One of the earliest variants of binary KD-tree
o Overlapping partitions allowed only when data objects intersect more
than one partition.
o Pros:
- Non-point data can be indexed.
o Cons:
- Unbalanced binary tree (Not a disk based index structure)
R-tree
o Data partitioning method
o Overlapping Partitions allowed
o Pros:
- Non-point data
- No downward cascading splits
- Guaranteed utilization
o Cons:
- Dimension dependent fan-out
- Overlapping regions => search performance problem.
- R*-tree: reinsertion to reduce overlapping region
- X-tree: increase the size of a node instead of creating highly
overlapping two nodes.
R+-tree
o Data partitioning method
o Overlapping partitions are not allowed
o Worse variant of KDB-tree
o Same with KDB-tree for point data
- But internal nodes maintain the list of child bounding boxes, instead
of split dimension & position => dimension dependent fan-out (waste of space).
o For rectangle data, R+-tree does NOT work.
- If there is a hot spot that overlaps every child bounding boxes and the node
is going to split, R+-tree crashes due to infinite recursive calls. Object
duplication methods can't handle non-point data.
- R+tree paper says R+tree can handle non-zero size spatial objects, but it's not
true.
o R+-tree is KDB-tree. (# of fan-outs in R+-tree is dimension
dependent while KDB-tree is not though.)
X-tree
o Data partitioning method
o If highly overlapping partitions are not avoidable, linear scan is faster.
- Instead of highly overlapping split, increase the capacity of the
overlapping node. (Supernode)
o By maintaining split history, find overlap-minimal split.
o Pros:
- Fast index search for point data and high-dimensional non-point data.
o Cons:
- overlap-minimal split based on split history does not help indexing
non-point data objects.
- Supernode is a major performance problem for data insertion.
The insertion cost is even much higher than R*-tree.
Hybrid-tree
o Space partitioning method
o Hybrid approach of KDB-tree and R-tree
o Overlapping partitions allowed only when downward cascading splits can't be avoided.
o Pros:
- Fast index creation
- Fast index search for point data
- Guaranteed utilization
- Dimension independent fan-out
o Cons:
- Point data only
SH-tree (Spatial Hybrid-tree)
Download SH-tree library source ver. 2.1
o Space partitioning method
o Overlapping partitions allowed both when downward cascading splits are not
avoidable and data objects intersect more than one partitions by updating
internal split positions dynamically.
o Pros:
- Fast index insertion & deletion
- Fast index search for point & non-point data
- Guaranteed utilization
- Dimension independent fan-out
o Cons:
- Cascading overlap problem, which is not too serious problem especially
with chunked rectangular datasets. :)
SH-tree References
[1] "A Comparative Study of Spatial Indexing Techniques
for Multidimensional Scientific Datasets" [pdf]
SSDBM 2004 Presentation [ppt]
Technical Report CS-TR-4556 and UMIACS-TR-2004-03,
University of Maryland, Jan 2004.
[2] "Indexing Cached Multidimensional Objects in Large Main
Memory Systems" [pdf]
Technical Report CS-TR-4795 and UMIACS-TR-2006-16,
University of Maryland, Jan 2006
FAQ
Q1. Why R*-tree is slower than R-tree for insertion in your paper?
A. R*-tree paper insists that the insertion to R*-tree is faster than R-tree
due to its better structure. But they used cache to keep recent path in
memory. By this way, they could avoid expensive cost of forced reinsertion.
In our experiment, we didn't exploit this cache effect.
Q2. Why is SH-tree faster than the others?
A. One of the reason why SH-tree is faster than some other trees
is that we experimented with chunked rectangles which does not overlap
or overlap a little bit.
If there are not much overlap between the rectangles, SH-tree does not
create large overlap region, which makes SH-tree performs good.
Another reason is that SH-tree has dimension independent number of
fan-outs, which makes the tree height short in high dimensions.
While SH-tree needs 2 split positions and 1 split dimension,
(2*sizeof(float)+sizeof(int)) for each child, R-tree based structures
need 2*dimension*sizeof(float) space for each child. Therefore
even in 2 dimensions, SH-tree has more number of fan-outs. In our
implementation, we made all trees have the same number of fan-outs
in 2 dimension.
Q3. It looks like the index performance graph in your paper is not linear.
A. The X axis in the graph is log scale. This is because we partitioned
fixed size datasets into different size chunks. Thus the number of data
we obtained is log scale.