Generalized Search Trees for Database Systems
(Extended Abstract)
Joseph M. Hellerstein
University of Wisconsin, Madison
jmh@cs.berkeley.edu
Jeffrey F. Naughton *
University of Wisconsin, Madison
naughton@cs.wisc.edu
Avi Pfeffer
University of California, Berkeley
avi@cs.berkeley.edu
Abstract
This paper introduces the Generalized Search Tree (GiST), an index structure supporting an extensible set of queries and data types. The GiST allows new data types to be indexed in a manner supporting queries natural to the types; this is in contrast to previous work on tree extensibility which only supported the traditional set of equality and range predicates. In a single data structure, the GiST provides all the basic search tree logic required by a database system, thereby unifying disparate structures such as B+-trees and R-trees in a single piece of code, and opening the application of search trees to general extensibility.
To illustrate the flexibility of the GiST, we provide simple method implementations that allow it to behave like a B+-tree, an R-tree, and an RD-tree, a new index for data with set-valued attributes. We also present a preliminary performance analysis of RD-trees, which leads to discussion on the nature of tree indices and how they behave for various datasets.
1 Introduction
An efficient implementation of search trees is crucial for any database system. In traditional relational systems, B+-trees [Com79] were sufficient for the sorts of queries posed on the usual set of alphanumeric data types. Today, database systems are increasingly being deployed to support new applications such as geographic information systems, multimedia systems, CAD tools, document libraries, sequence databases, fingerprint identification systems, biochemical databases, etc. To support the growing set of applications, search trees must be extended for maximum flexibility. This requirement has motivated two major research approaches in extending search tree technology:
1. Specialized Search Trees: A large variety of search trees has been developed to solve specific problems. Among the best known of these trees are spatial search trees such as R-trees [Gut84]. While some of this work has had significant impact in particular domains, the approach of
* Hellerstein and Naughton were supported by NSF grant IRI-9157357.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 21st VLDB Conference
Zurich, Switzerland, 1995
developing domain-specific search trees is problematic. The effort required to implement and maintain such data structures is high. As new applications need to be supported, new tree structures have to be developed from scratch, requiring new implementations of the usual tree facilities for search, maintenance, concurrency control and recovery.
2. Search Trees For Extensible Data Types: As an alternative to developing new data structures, existing data structures such as B+-trees and R-trees can be made extensible in the data types they support [Sto86]. For example, B+-trees can be used to index any data with a linear ordering, supporting equality or linear range queries over that data. While this provides extensibility in the data that can be indexed, it does not extend the set of queries which can be supported by the tree. Regardless of the type of data stored in a B+-tree, the only queries that can benefit from the tree are those containing equality and linear range predicates. Similarly in an R-tree, the only queries that can use the tree are those containing equality, overlap and containment predicates. This inflexibility presents significant problems for new applications, since traditional queries on linear orderings and spatial location are unlikely to be apropos for new data types.
In this paper we present a third direction for extending search tree technology. We introduce a new data structure called the Generalized Search Tree (GiST), which is easily extensible both in the data types it can index and in the queries it can support. Extensibility of queries is particularly important, since it allows new data types to be indexed in a manner that supports the queries natural to the types. In addition to providing extensibility for new data types, the GiST unifies previously disparate structures used for currently common data types. For example, both B+-trees and R-trees can be implemented as extensions of the GiST, resulting in a single code base for indexing multiple dissimilar applications.
The GiST is easy to configure: adapting the tree for different uses only requires registering six methods with the database system, which encapsulate the structure and behavior of the object class used for keys in the tree. As an illustration of this flexibility, we provide method implementations that allow the GiST to be used as a B+-tree, an R-tree, and an RD-tree, a new index for data with set-valued attributes. The GiST can be adapted to work like a variety of other known search tree structures, e.g. partial sum trees
[WE80], k-D-B-trees [Rob81], Ch-trees [KKD89], Exodus large objects [CDG+90], hB-trees [LS90], V-trees [MCD94], TV-trees [LJF94], etc. Implementing a new set of methods for the GiST is a significantly easier task than implementing a new tree package from scratch: for example, the POSTGRES [Gro94] and SHORE [CDF+94] implementations of R-trees and B+-trees are on the order of 3000 lines of C or C++ code each, while our method implementations for the GiST are on the order of 500 lines of C code each.
In addition to providing a unified, highly extensible data structure, our general treatment of search trees sheds some initial light on a more fundamental question: if any dataset can be indexed with a GiST, does the resulting tree always provide efficient lookup? The answer to this question is "no", and in our discussion we illustrate some issues that can affect the efficiency of a search tree. This leads to the interesting question of how and when one can build an efficient search tree for queries over non-standard domains, a question that can now be further explored by experimenting with the GiST.
1.1 Structure of the Paper
In Section 2, we illustrate and generalize the basic nature of database search trees. Section 3 introduces the Generalized Search Tree object, with its structure, properties, and behavior. In Section 4 we provide GiST implementations of three different sorts of search trees. Section 5 presents some performance results that explore the issues involved in building an effective search tree. Section 6 examines some details that need to be considered when implementing GiSTs in a full-fledged DBMS. Section 7 concludes with a discussion of the significance of the work, and directions for further research.
1.2 Related Work
A good survey of search trees is provided by Knuth [Knu73], though B-trees and their variants are covered in more detail by Comer [Com79]. There are a variety of multidimensional search trees, such as R-trees [Gut84] and their variants: R*-trees [BKSS90] and R+-trees [SRF87]. Other multidimensional search trees include quad-trees [FB74], k-D-B-trees [Rob81], and hB-trees [LS90]. Multidimensional data can also be transformed into unidimensional data using a space-filling curve [Jag90]; after transformation, a B+-tree can be used to index the resulting unidimensional data.
Extensible-key indices were introduced in POSTGRES [Sto86, Aok91], and are included in Illustra [Ill94], both of which have distinct extensible B+-tree and R-tree implementations. These extensible indices allow many types of data to be indexed, but only support a fixed set of query predicates. For example, POSTGRES B+-trees support the usual ordering predicates (<, ≤, =, ≥, >), while POSTGRES R-trees support only the predicates Left, Right, OverLeft, Overlap, OverRight, Contains, Contained and Equal [Gro94].
Extensible R-trees actually provide a sizable subset of the GiST's functionality. To our knowledge this paper represents the first demonstration that R-trees can index data that has
Figure 1: Sketch of a database search tree. [Figure labels: internal nodes form the directory; leaf nodes (key1, key2, ...) form a linked list.]
not been mapped into a spatial domain. However, besides their limited extensibility R-trees lack a number of other features supported by the GiST. R-trees provide only one sort of key predicate (Contains), they do not allow user specification of the PickSplit and Penalty algorithms described below, and they lack optimizations for data from linearly ordered domains. Despite these limitations, extensible R-trees are close enough to GiSTs to allow for the initial method implementations and performance experiments we describe in Section 5.
Analyses of R-tree performance have appeared in [FK94] and [PSTW93]. This work is dependent on the spatial nature of typical R-tree data, and thus is not generally applicable to the GiST. However, similar ideas may prove relevant to our questions of when and how one can build efficient indices in arbitrary domains.
2 The Gist of Database Search Trees
As an introduction to GiSTs, it is instructive to review search trees in a simplified manner. Most people with database experience have an intuitive notion of how search trees work, so our discussion here is purposely vague: the goal is simply to illustrate that this notion leaves many details unspecified. After highlighting the unspecified details, we can proceed to describe a structure that leaves the details open for user specification.
The canonical rough picture of a database search tree appears in Figure 1. It is a balanced tree, with high fanout. The internal nodes are used as a directory. The leaf nodes contain pointers to the actual data, and are stored as a linked list to allow for partial or complete scanning.
Within each internal node is a series of keys and pointers. To search for tuples which match a query predicate q, one starts at the root node. For each pointer on the node, if the associated key is consistent with q, i.e. the key does not rule out the possibility that data stored below the pointer may match q, then one traverses the subtree below the pointer, until all the matching data is found. As an illustration, we review the notion of consistency in some familiar tree structures. In B+-trees, queries are in the form of range predicates (e.g. "find all i such that c1 ≤ i ≤ c2"), and keys logically delineate a range in which the data below a pointer is contained. If the query range and a pointer's key range overlap, then the two are consistent and the pointer is traversed. In R-trees, queries are in the form of region predicates (e.g. "find all i such that (x1, y1, x2, y2) overlaps i"), and keys delineate the bounding box in which the data below a pointer is contained. If the query region and the pointer's key box overlap, the pointer is traversed.
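Both of these consistency checks reduce to simple overlap tests. The following is a minimal sketch of ours, not code from the paper; the function names are hypothetical:

```python
# Hypothetical sketch of the two consistency checks described above.

def range_consistent(key_lo, key_hi, q_lo, q_hi):
    """B+-tree case: a key range [key_lo, key_hi] is consistent with a
    query range [q_lo, q_hi] iff the two ranges overlap."""
    return key_lo <= q_hi and q_lo <= key_hi

def box_consistent(key, q):
    """R-tree case: boxes as (x1, y1, x2, y2) tuples; a key box is
    consistent with a query box iff the two boxes overlap."""
    kx1, ky1, kx2, ky2 = key
    qx1, qy1, qx2, qy2 = q
    return kx1 <= qx2 and qx1 <= kx2 and ky1 <= qy2 and qy1 <= ky2
```

In both cases a successful test only means the subtree may contain matches, so the pointer must be traversed to find out.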
Note that in the above description the only restriction on a key is that it must logically match each datum stored below it, so that the consistency check does not miss any valid data. In B+-trees and R-trees, keys are essentially "containment" predicates: they describe a contiguous region in which all the data below a pointer are contained. Containment predicates are not the only possible key constructs, however. For example, the predicate "elected official(i) ∧ has criminal record(i)" is an acceptable key if every data item i stored below the associated pointer satisfies the predicate. As in R-trees, keys on a node may "overlap", i.e. two keys on the same node may hold simultaneously for some tuple.
This flexibility allows us to generalize the notion of a search key: a search key may be any arbitrary predicate that holds for each datum below the key. Given a data structure with such flexible search keys, a user is free to form a tree by organizing data into arbitrary nested sub-categories, labelling each with some characteristic predicate. This in turn lets us capture the essential nature of a database search tree: it is a hierarchy of partitions of a dataset, in which each partition has a categorization that holds for all data in the partition. Searches on arbitrary predicates may be conducted based on the categorizations. In order to support searches on a predicate q, the user must provide a Boolean method to tell if q is consistent with a given search key. When this is so, the search proceeds by traversing the pointer associated with the search key. The grouping of data into categories may be controlled by a user-supplied node splitting algorithm, and the characterization of the categories can be done with user-supplied search keys. Thus by exposing the key methods and the tree's split method to the user, arbitrary search trees may be constructed, supporting an extensible set of queries. These ideas form the basis of the GiST, which we proceed to describe in detail.
3 The Generalized Search Tree
In this section we present the abstract data type (or "object") Generalized Search Tree (GiST). We define its structure, its invariant properties, its extensible methods and its built-in algorithms. As a matter of convention, we refer to each indexed datum as a "tuple"; in an Object-Oriented or Object-Relational DBMS, each indexed datum could be an arbitrary data object.
3.1 Structure
A GiST is a balanced tree of variable fanout between kM and M, where 2/M ≤ k ≤ 1/2, with the exception of the root node, which may have fanout between 2 and M. The constant k is termed the minimum fill factor of the tree. Leaf nodes contain (p, ptr) pairs, where p is a predicate that is used as a search key, and ptr is the identifier of some tuple in the database. Non-leaf nodes contain (p, ptr) pairs, where p is a predicate used as a search key and ptr is a pointer to another tree node. Predicates can contain any number of free variables, as long as any single tuple referenced by the leaves of the tree can instantiate all the variables. Note that by using "key compression", a given predicate p may take as little as zero bytes of storage. However, for purposes of exposition we will assume that entries in the tree are all of uniform size. Discussion of variable-sized entries is deferred to Section 6. We assume in an implementation that given an entry E = (p, ptr), one can access the node on which E currently resides. This can prove helpful in implementing the key methods described below.
3.2 Properties
The following properties are invariant in a GiST:
1. Every node contains between kM and M index entries unless it is the root.
2. For each index entry (p, ptr) in a leaf node, p is true when instantiated with the values from the indicated tuple (i.e. p holds for the tuple.)
3. For each index entry (p, ptr) in a non-leaf node, p is true when instantiated with the values of any tuple reachable from ptr. Note that, unlike in R-trees, for some entry (p′, ptr′) reachable from ptr, we do not require that p′ → p, merely that p and p′ both hold for all tuples reachable from ptr′.
4. The root has at least two children unless it is a leaf.
5. All leaves appear on the same level.
Property 3 is of particular interest. An R-tree would require that p′ → p, since bounding boxes of an R-tree are arranged in a containment hierarchy. The R-tree approach is unnecessarily restrictive, however: the predicates in keys above a node N must hold for data below N, and therefore one need not have keys on N restate those predicates in a more refined manner. One might choose, instead, to have the keys at N characterize the sets below based on some entirely orthogonal classification. This can be an advantage in both the information content and the size of keys.
3.3 Key Methods
In principle, the keys of a GiST may be arbitrary predicates. In practice, the keys come from a user-implemented object class, which provides a particular set of methods required by the GiST. Examples of key structures include ranges of integers for data from Z (as in B+-trees), bounding boxes for regions in R^n (as in R-trees), and bounding sets for set-valued data, e.g. data from P(Z) (as in RD-trees, described in Section 4.3.) The key class is open to redefinition by the user, with the following set of six methods required by the GiST:
• Consistent(E, q): given an entry E = (p, ptr), and a query predicate q, returns false if p ∧ q can be guaranteed unsatisfiable, and true otherwise. Note that an accurate test for satisfiability is not required here: Consistent may return true incorrectly without affecting the correctness of the tree algorithms. The penalty for such errors is in performance, since they may result in exploration of irrelevant subtrees during search.
• Union(P): given a set P of entries (p1, ptr1), ..., (pn, ptrn), returns some predicate r that holds for all tuples stored below ptr1 through ptrn. This can be done by finding an r such that (p1 ∨ ... ∨ pn) → r.
• Compress(E): given an entry E = (p, ptr) returns an entry (π, ptr) where π is a compressed representation of p.
• Decompress(E): given a compressed representation E = (π, ptr), where π = Compress(p), returns an entry (r, ptr) such that p → r. Note that this is a potentially "lossy" compression, since we do not require that p ↔ r.
• Penalty(E1, E2): given two entries E1 = (p1, ptr1), E2 = (p2, ptr2), returns a domain-specific penalty for inserting E2 into the subtree rooted at E1. This is used to aid the Split and Insert algorithms (described below.) Typically the penalty metric is some representation of the increase of size from E1.p1 to Union({E1, E2}). For example, Penalty for keys from R^2 can be defined as area(Union({E1, E2})) − area(E1.p1) [Gut84].
• PickSplit(P): given a set P of M + 1 entries (p, ptr), splits P into two sets of entries P1, P2, each of size at least kM. The choice of the minimum fill factor for a tree is controlled here. Typically, it is desirable to split in such a way as to minimize some badness metric akin to a multi-way Penalty, but this is left open for the user.
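To make the six methods concrete, here is a minimal sketch of ours (not the paper's C code) of a key class for bounding boxes in R^2, with boxes as (x1, y1, x2, y2) tuples. Compress and Decompress are identities, and the naive PickSplit stands in for a real heuristic; all names are hypothetical:

```python
# Sketch of a GiST key class for bounding boxes in the plane.

def union(boxes):
    """Smallest box enclosing all the given boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

def consistent(p, q):
    """Overlap test; may err toward True, never toward False."""
    return p[0] <= q[2] and q[0] <= p[2] and p[1] <= q[3] and q[1] <= p[3]

def area(b):
    return (b[2] - b[0]) * (b[3] - b[1])

def penalty(p1, p2):
    """area(Union({E1, E2})) - area(E1.p1), the metric from [Gut84]."""
    return area(union([p1, p2])) - area(p1)

compress = decompress = lambda e: e   # identity: the simplest legal choice

def picksplit(entries):
    """Naive halving; a real method would minimize a badness metric."""
    h = len(entries) // 2
    return entries[:h], entries[h:]
```

The point of the sketch is how little is needed: the generic tree algorithms below supply everything else.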
The above are the only methods a GiST user needs to supply. Note that Consistent, Union, Compress and Penalty have to be able to handle any predicate in their input. In full generality this could become very difficult, especially for Consistent. But typically a limited set of predicates is used in any one tree, and this set can be constrained in the method implementation.
There are a number of options for key compression. A simple implementation can let both Compress and Decompress be the identity function. A more complex implementation can have Compress((p, ptr)) generate a valid but more compact predicate r, p → r, and let Decompress be the identity function. This is the technique used in SHORE's R-trees, for example, which upon insertion take a polygon and compress it to its bounding box, which is itself a valid polygon. It is also used in prefix B+-trees [Com79], which truncate split keys to an initial substring. More involved implementations might use complex methods for both Compress and Decompress.
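The polygon-to-bounding-box technique can be sketched as follows (our illustration, with hypothetical names, not SHORE's code). Every point satisfying the polygon predicate also satisfies the box predicate, so p → r holds and the compression is lossy but valid:

```python
# Sketch of lossy-but-valid compression: polygon -> bounding box.

def compress_polygon(vertices):
    """Compress a polygon, given as a list of (x, y) vertices, to its
    axis-aligned bounding box (x1, y1, x2, y2)."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return (min(xs), min(ys), max(xs), max(ys))

def decompress(box):
    """Identity: the box is itself a usable (weaker) predicate."""
    return box
```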
3.4 Tree Methods
The key methods in the previous section must be provided by the designer of the key class. The tree methods in this section are provided by the GiST, and may invoke the required key methods. Note that keys are Compressed when placed on a node, and Decompressed when read from a node. We consider this implicit, and will not mention it further in describing the methods.
3.4.1 Search
Search comes in two flavors. The first method, presented in this section, can be used to search any dataset with any query predicate, by traversing as much of the tree as necessary to satisfy the query. It is the most general search technique, analogous to that of R-trees. A more efficient technique for queries over linear orders is described in the next section.
Algorithm Search(R, q)
Input: GiST rooted at R, predicate q
Output: all tuples that satisfy q
Sketch: Recursively descend all paths in tree whose keys are consistent with q.
S1: [Search subtrees] If R is not a leaf, check each entry E on R to determine whether Consistent(E, q). For all entries that are Consistent, invoke Search on the subtree whose root node is referenced by E.ptr.
S2: [Search leaf node] If R is a leaf, check each entry E on R to determine whether Consistent(E, q). If E is Consistent, it is a qualifying entry. At this point E.ptr could be fetched to check q accurately, or this check could be left to the calling process.
Note that the query predicate q can be either an exact match (equality) predicate, or a predicate satisfiable by many values. The latter category includes "range" or "window" predicates, as in B+ or R-trees, and also more general predicates that are not based on contiguous areas (e.g. set-containment predicates like "all supersets of {6, 7, 68}".)
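The general Search method above can be sketched in a few lines. This is our pseudocode made concrete under assumed shapes (not the paper's implementation): a node is an (is_leaf, entries) pair, each entry is (key, ptr) where ptr is a child node on internal nodes and a tuple identifier on leaves, and `consistent` is the user-supplied key method:

```python
# Sketch of Algorithm Search: descend every path whose key is
# consistent with q, collecting qualifying leaf entries.

def search(node, q, consistent):
    is_leaf, entries = node
    results = []
    for key, ptr in entries:
        if consistent(key, q):
            if is_leaf:
                results.append(ptr)                  # S2: qualifying entry
            else:
                results.extend(search(ptr, q, consistent))  # S1: recurse
    return results
```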
3.4.2 Search In Linearly Ordered Domains
If the domain to be indexed has a linear ordering, and queries are typically equality or range-containment predicates, then a more efficient search method is possible using the FindMin and Next methods defined in this section. To make this option available, the user must take some extra steps when creating the tree:
1. The flag IsOrdered must be set to true. IsOrdered is a static property of the tree that is set at creation. It defaults to false.
2. An additional method Compare(E1, E2) must be registered. Given two entries E1 = (p1, ptr1) and E2 = (p2, ptr2), Compare reports whether p1 precedes p2, p1 follows p2, or p1 and p2 are ordered equivalently. Compare is used to insert entries in order on each node.
3. The PickSplit method must ensure that for any entries E1 on P1 and E2 on P2, Compare(E1, E2) reports "precedes".
4. The methods must assure that no two keys on a node overlap, i.e. for any pair of entries E1, E2 on a node, Consistent(E1, E2.p) = false.
If these four steps are carried out, then equality and range-containment queries may be evaluated by calling FindMin and repeatedly calling Next, while other query predicates may still be evaluated with the general Search method. FindMin/Next is more efficient than traversing the tree using Search, since FindMin and Next only visit the non-leaf nodes along one root-to-leaf path. This technique is based on the typical range-lookup in B+-trees.
Algorithm FindMin(R, q)
Input: GiST rooted at R, predicate q
Output: minimum tuple in linear order that satisfies q
Sketch: descend leftmost branch of tree whose keys are Consistent with q. When a leaf node is reached, return the first key that is Consistent with q.
FM1: [Search subtrees] If R is not a leaf, find the first entry E in order such that Consistent(E, q).¹ If such an E can be found, invoke FindMin on the subtree whose root node is referenced by E.ptr. If no such entry is found, return NULL.
FM2: [Search leaf node] If R is a leaf, find the first entry E on R such that Consistent(E, q), and return E. If no such entry exists, return NULL.
Given one element E that satisfies a predicate q, the Next method returns the next existing element that satisfies q, or NULL if there is none. Next is made sufficiently general to find the next entry on non-leaf levels of the tree, which will prove useful in Section 4. For search purposes, however, Next will only be invoked on leaf entries.
¹ The appropriate entry may be found by doing a binary search of the entries on the node. Further discussion of intra-node search optimizations appears in Section 6.
Algorithm Next(R, q, E)
Input: GiST rooted at R, predicate q, current entry E
Output: next entry in linear order that satisfies q
Sketch: return next entry on the same level of the tree if it satisfies q. Else return NULL.
N1: [next on node] If E is not the rightmost entry on its node, and N is the next entry to the right of E in order, and Consistent(N, q), then return N. If ¬Consistent(N, q), return NULL.
N2: [next on neighboring node] If E is the rightmost entry on its node, let P be the next node to the right of R on the same level of the tree (this can be found via tree traversal, or via sideways pointers in the tree, when available [LY81].) If P is non-existent, return NULL. Otherwise, let N be the leftmost entry on P. If Consistent(N, q), then return N, else return NULL.
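A range lookup via FindMin plus repeated Next can be sketched as follows. This is our illustration under assumed shapes, not the paper's code: a node is a dict with a "leaf" flag and ordered "entries", and leaves carry a "next" sideways pointer [LY81]; Next's N1/N2 cases are folded into the scanning loop:

```python
# Sketch of FindMin (FM1/FM2) and a FindMin-then-repeated-Next scan.

def find_min(node, q, consistent):
    """Descend the leftmost consistent branch; return (leaf, index) of
    the first consistent leaf entry, or None."""
    while True:
        i = next((j for j, (k, _) in enumerate(node["entries"])
                  if consistent(k, q)), None)
        if i is None:
            return None                     # FM1/FM2: no consistent entry
        if node["leaf"]:
            return node, i                  # FM2: first qualifying entry
        node = node["entries"][i][1]        # FM1: descend

def scan(root, q, consistent):
    """Collect all qualifying tuples in order via FindMin + Next."""
    pos = find_min(root, q, consistent)
    if pos is None:
        return []
    leaf, i = pos
    out = []
    while leaf is not None:
        if i == len(leaf["entries"]):       # N2: hop to the right neighbor
            leaf, i = leaf["next"], 0
            continue
        key, ptr = leaf["entries"][i]
        if not consistent(key, q):          # N1: keys are ordered, so stop
            break
        out.append(ptr)
        i += 1
    return out
```

Note how only one root-to-leaf path is descended; the rest of the scan moves sideways along the leaf level.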
3.4.3 Insert
The insertion routines guarantee that the GiST remains balanced. They are very similar to the insertion routines of R-trees, which generalize the simpler insertion routines for B+-trees. Insertion allows specification of the level at which to insert. This allows subsequent methods to use Insert for reinserting entries from internal nodes of the tree. We will assume that level numbers increase as one ascends the tree, with leaf nodes being at level 0. Thus new entries to the tree are inserted at level l = 0.
Algorithm Insert(R, E, l)
Input: GiST rooted at R, entry E = (p, ptr), and level l, where p is a predicate such that p holds for all tuples reachable from ptr.
Output: new GiST resulting from insert of E at level l.
Sketch: find where E should go, and add it there, splitting if necessary to make room.
I1. [invoke ChooseSubtree to find where E should go] Let L = ChooseSubtree(R, E, l).
I2. If there is room for E on L, install E on L (in order according to Compare, if IsOrdered.) Otherwise invoke Split(R, L, E).
I3. [propagate changes upward] AdjustKeys(R, L).
ChooseSubtree can be used to find the best node for insertion at any level of the tree. When the IsOrdered property holds, the Penalty method must be carefully written to assure that ChooseSubtree arrives at the correct leaf node in order. An example of how this can be done is given in Section 4.1.
Algorithm ChooseSubtree(R, E, l)
Input: subtree rooted at R, entry E = (p, ptr), level l
Output: node at level l best suited to hold entry with characteristic predicate E.p
Sketch: Recursively descend tree minimizing Penalty.
CS1. If R is at level l, return R;
CS2. Else among all entries F = (q, ptr′) on R find the one such that Penalty(F, E) is minimal. Return ChooseSubtree(F.ptr′, E, l).
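The descent in CS1/CS2 can be sketched as follows (our illustration under assumed shapes, not the paper's code): a node is a dict with a "level" and "entries" of (key, child) pairs, and `penalty` is the user-supplied key method:

```python
# Sketch of ChooseSubtree: descend toward level l, at each step taking
# the child whose key incurs the smallest Penalty for the new entry.

def choose_subtree(node, key, level, penalty):
    while node["level"] != level:                           # CS1
        _, child = min(node["entries"],
                       key=lambda entry: penalty(entry[0], key))  # CS2
        node = child
    return node
```

With an interval penalty like the one defined for B+-tree keys in Section 4.1, this descent lands on the unique subtree whose range already covers the new key.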
The Split algorithm makes use of the user-defined PickSplit method to choose how to split up the elements of a node, including the new tuple to be inserted into the tree. Once the elements are split up into two groups, Split generates a new node for one of the groups, inserts it into the tree, and updates keys above the new node.
Algorithm Split(R, N, E)
Input: GiST R with node N, and a new entry E = (p, ptr).
Output: the GiST with N split in two and E inserted.
Sketch: split keys of N along with E into two groups according to PickSplit. Put one group onto a new node, and Insert the new node into the parent of N.
SP1: Invoke PickSplit on the union of the elements of N and {E}, put one of the two partitions on node N, and put the remaining partition on a new node N′.
SP2: [Insert entry for N′ in parent] Let E_N′ = (q, ptr′), where q is the Union of all entries on N′, and ptr′ is a pointer to N′. If there is room for E_N′ on Parent(N), install E_N′ on Parent(N) (in order if IsOrdered.) Otherwise invoke Split(R, Parent(N), E_N′).²
SP3: Modify the entry F which points to N, so that F.p is the Union of all entries on N.
² We intentionally do not specify what technique is used to find the Parent of a node, since this implementation interacts with issues related to concurrency control, which are discussed in Section 6. Depending on techniques used, the Parent may be found via a pointer, a stack, or via re-traversal of the tree.
Step SP3 of Split modifies the parent node to reflect the changes in N. These changes are propagated upwards through the rest of the tree by step I3 of the Insert algorithm, which also propagates the changes due to the insertion of N′.
The AdjustKeys algorithm ensures that keys above a set of predicates hold for the tuples below, and are appropriately specific.
Algorithm AdjustKeys(R, N)
Input: GiST rooted at R, tree node N
Output: the GiST with ancestors of N containing correct and specific keys
Sketch: ascend parents from N in the tree, making the predicates be accurate characterizations of the subtrees. Stop after root, or when a predicate is found that is already accurate.
PR1: If N is the root, or the entry which points to N has an already-accurate representation of the Union of the entries on N, then return.
PR2: Otherwise, modify the entry E which points to N so that E.p is the Union of all entries on N. Then AdjustKeys(R, Parent(N).)
Note that AdjustKeys typically performs no work when IsOrdered = true, since for such domains predicates on each node typically partition the entire domain into ranges, and thus need no modification on simple insertion or deletion. The AdjustKeys routine detects this in step PR1, which avoids calling AdjustKeys on higher nodes of the tree. For such domains, AdjustKeys may be circumvented entirely if desired.
3.4.4 Delete
The deletion algorithms maintain the balance of the tree, and attempt to keep keys as specific as possible. When there is a linear order on the keys they use B+-tree-style "borrow or coalesce" techniques. Otherwise they use R-tree-style reinsertion techniques. The deletion algorithms are omitted here due to lack of space; they are given in full in [HNP95].
4 The GiST for Three Applications
In this section we briefly describe implementations of key classes used to make the GiST behave like a B+-tree, an R-tree, and an RD-tree, a new R-tree-like index over set-valued data.
4.1 GiSTs Over Z (B+-trees)
In this example we index integer data. Before compression, each key in this tree is a pair of integers, representing the interval contained below the key. Particularly, a key ⟨a, b⟩ represents the predicate Contains([a, b), v) with variable v.
The query predicates we support in this key class are Contains(interval, v), and Equal(number, v). The interval in the Contains query may be closed or open at either end. The boundary of any interval of integers can be trivially converted to be closed or open. So without loss of generality, we assume below that all intervals are closed on the left and open on the right.
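The boundary conversion mentioned above is trivial for integers; a one-function sketch (ours, with a hypothetical name) that normalizes any closed/open combination to the [a, b) form used in the rest of this section:

```python
# Normalize an integer interval with arbitrary boundary conventions to
# half-open [a, b) form.

def to_half_open(lo, hi, lo_closed=True, hi_closed=True):
    """Return (a, b) such that [a, b) contains exactly the integers in
    the given interval."""
    a = lo if lo_closed else lo + 1
    b = hi + 1 if hi_closed else hi
    return a, b
```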
The implementations of the Contains and Equal query predicates are as follows:
• Contains([x, y), v): If x ≤ v < y, return true. Otherwise return false.
• Equal(x, v): If x = v return true. Otherwise return false.
Now, the implementations of the GiST methods:
• Consistent(E, q): Given entry E = (p, ptr) and query predicate q, we know that p = Contains([xp, yp), v), and either q = Contains([xq, yq), v) or q = Equal(xq, v). In the first case, return true if (xp < yq) ∧ (yp > xq) and false otherwise. In the second case, return true if xp ≤ xq < yp, and false otherwise.
• Union({E1, ..., En}): Given E1 = ([x1, y1), ptr1), ..., En = ([xn, yn), ptrn), return [MIN(x1, ..., xn), MAX(y1, ..., yn)).
• Compress(E = ([x, y), ptr)): If E is the leftmost key on a non-leaf node, return a 0-byte object. Otherwise return x.
• Decompress(E = (π, ptr)): We must construct an interval [x, y). If E is the leftmost key on a non-leaf node, let x = −∞. Otherwise let x = π. If E is the rightmost key on a non-leaf node, let y = ∞. If E is any other key on a non-leaf node, let y be the value stored in the next key (as found by the Next method.) If E is on a leaf node, let y = x + 1. Return ([x, y), ptr).
• Penalty(E = ([x1, y1), ptr1), F = ([x2, y2), ptr2)): If E is the leftmost pointer on its node, return MAX(y2 − y1, 0). If E is the rightmost pointer on its node, return MAX(x1 − x2, 0). Otherwise return MAX(y2 − y1, 0) + MAX(x1 − x2, 0).
• PickSplit(P): Let the first ⌊|P|/2⌋ entries in order go in the left group, and the last ⌈|P|/2⌉ entries go in the right. Note that this guarantees a minimum fill factor of M/2.
Finally, the additions for ordered keys:
ffl IsOrdered = true
ffl Compare(E 1 = (p 1 ; ptr 1 ); E 2 = (p 2 ; ptr 2 )) Given
p 1 = [x 1 ; y 1 ) and p 2 = [x 2 ; y 2 ), return #precedes# if
x 1 ! x 2 , #equivalent# if x 1 = x 2 , and #follows# if x 1 ?
x 2 .
There are a number of interesting features to note in this
set of methods. First, the Compress and Decompress methods
produce the typical "split keys" found in B+-trees, i.e., n − 1
stored keys for n pointers, with the leftmost and rightmost
boundaries on a node left unspecified (i.e., −∞ and ∞). Even
though GiSTs use key/pointer pairs rather than split keys, this
GiST uses no more space for keys than a traditional B+-tree,
since it compresses the first pointer on each node to zero bytes.
Second, the Penalty method allows the GiST to choose the
correct insertion point. Inserting (i.e., Unioning) a new key
value k into an interval [x, y) will cause the Penalty to be
positive only if k is not already contained in the interval. Thus
in step CS2, the ChooseSubtree method will place new data
in the appropriate spot: any set of keys on a node partitions
the entire domain, so in order to minimize the Penalty,
ChooseSubtree will choose the one partition in which k is already
contained. Finally, observe that one could fairly easily
support more complex predicates, including disjunctions of
intervals in query predicates, or ranked intervals in key predicates
for supporting efficient sampling [WE80].
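To make the interval arithmetic above concrete, the key class can be sketched in a few lines of C. The type and function names here are our own illustration, not part of any GiST release; Penalty is shown for the interior (neither leftmost nor rightmost) case.

```c
#include <stdbool.h>

/* A B+-tree key before compression: the half-open interval [x, y). */
typedef struct { int x, y; } Interval;

/* Contains([x, y), v): true iff x <= v < y. */
bool contains(Interval k, int v) { return k.x <= v && v < k.y; }

/* Consistent for a Contains query: the two half-open intervals
 * must overlap, i.e. (x_p < y_q) && (y_p > x_q). */
bool consistent_contains(Interval p, Interval q) {
    return p.x < q.y && p.y > q.x;
}

/* Union of n entries: [MIN(x_i), MAX(y_i)). */
Interval key_union(const Interval *e, int n) {
    Interval u = e[0];
    for (int i = 1; i < n; i++) {
        if (e[i].x < u.x) u.x = e[i].x;
        if (e[i].y > u.y) u.y = e[i].y;
    }
    return u;
}

/* Penalty, interior case: MAX(y2 - y1, 0) + MAX(x1 - x2, 0),
 * i.e. how far E's interval must grow to absorb F's. */
int penalty(Interval e, Interval f) {
    int right = f.y - e.y > 0 ? f.y - e.y : 0;
    int left  = e.x - f.x > 0 ? e.x - f.x : 0;
    return right + left;
}
```

Note how the Penalty is zero exactly when F's interval already lies inside E's, matching the ChooseSubtree argument above.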
4.2 GiSTs Over Polygons in R² (R-trees)
In this example, our data are 2-dimensional polygons on
the Cartesian plane. Before compression, the keys in this
tree are 4-tuples of reals, representing the upper-left and
lower-right corners of rectilinear bounding rectangles for 2-d
polygons. A key (x_ul, y_ul, x_lr, y_lr) represents the predicate
Contains((x_ul, y_ul, x_lr, y_lr), v), where (x_ul, y_ul) is the upper
left corner of the bounding box, (x_lr, y_lr) is the lower right
corner, and v is the free variable. The query predicates we
support in this key class are Contains(box, v), Overlap(box, v),
and Equal(box, v), where box is a 4-tuple as above.
The implementations of the query predicates are as follows:
• Contains((x1_ul, y1_ul, x1_lr, y1_lr), (x2_ul, y2_ul, x2_lr, y2_lr)):
Return true if
(x1_lr ≥ x2_lr) ∧ (x1_ul ≤ x2_ul) ∧ (y1_lr ≤ y2_lr) ∧ (y1_ul ≥ y2_ul).
Otherwise return false.
• Overlap((x1_ul, y1_ul, x1_lr, y1_lr), (x2_ul, y2_ul, x2_lr, y2_lr)):
Return true if
(x1_ul ≤ x2_lr) ∧ (x2_ul ≤ x1_lr) ∧ (y1_lr ≤ y2_ul) ∧ (y2_lr ≤ y1_ul).
Otherwise return false.
• Equal((x1_ul, y1_ul, x1_lr, y1_lr), (x2_ul, y2_ul, x2_lr, y2_lr)):
Return true if
(x1_ul = x2_ul) ∧ (y1_ul = y2_ul) ∧ (x1_lr = x2_lr) ∧ (y1_lr = y2_lr).
Otherwise return false.
Now, the GiST method implementations:
• Consistent(E, q): Given entry E = (p, ptr), we
know that p = Contains((x1_ul, y1_ul, x1_lr, y1_lr), v), and
q is either Contains, Overlap or Equal on the argument
(x2_ul, y2_ul, x2_lr, y2_lr). For any of these queries, return
true if Overlap((x1_ul, y1_ul, x1_lr, y1_lr), (x2_ul, y2_ul, x2_lr, y2_lr)),
and return false otherwise.
• Union({E_1, ..., E_n}): Given
E_1 = ((x1_ul, y1_ul, x1_lr, y1_lr), ptr_1), ...,
E_n = ((xn_ul, yn_ul, xn_lr, yn_lr), ptr_n),
return (MIN(x1_ul, ..., xn_ul), MAX(y1_ul, ..., yn_ul),
MAX(x1_lr, ..., xn_lr), MIN(y1_lr, ..., yn_lr)).
• Compress(E = (p, ptr)): Form the bounding box
of polygon p, i.e., given a polygon stored as a set
of line segments l_i = (xi_1, yi_1, xi_2, yi_2), form π whose
upper-left corner is the minimum x and maximum y, and
whose lower-right corner is the maximum x and minimum y,
taken over all segment endpoints. Return (π, ptr).
• Decompress(E = ((x_ul, y_ul, x_lr, y_lr), ptr)): The
identity function, i.e., return E.
• Penalty(E_1, E_2): Given E_1 = (p_1, ptr_1) and E_2 =
(p_2, ptr_2), compute q = Union({E_1, E_2}), and return
area(q) − area(E_1.p). This metric of "change in area" is
the one proposed by Guttman [Gut84].
• PickSplit(P): A variety of algorithms have been proposed
for R-tree splitting. We thus omit this method
implementation from our discussion here, and refer the
interested reader to [Gut84] and [BKSS90].
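As with the B+-tree key class, these methods reduce to a handful of coordinate comparisons. The following C sketch (the types and names are ours; y increases upward, so y_ul ≥ y_lr) shows the box predicates together with Union and Guttman's change-in-area Penalty:

```c
#include <stdbool.h>

/* A bounding-box key: upper-left (xul, yul), lower-right (xlr, ylr),
 * with y increasing upward, so yul >= ylr and xul <= xlr. */
typedef struct { double xul, yul, xlr, ylr; } Box;

/* Contains(b1, b2): b1 spatially contains b2. */
bool box_contains(Box b1, Box b2) {
    return b1.xlr >= b2.xlr && b1.xul <= b2.xul &&
           b1.ylr <= b2.ylr && b1.yul >= b2.yul;
}

/* Overlap(b1, b2): the boxes share at least one point. */
bool box_overlap(Box b1, Box b2) {
    return b1.xul <= b2.xlr && b2.xul <= b1.xlr &&
           b1.ylr <= b2.yul && b2.ylr <= b1.yul;
}

/* Union: the smallest box covering both arguments. */
Box box_union(Box b1, Box b2) {
    Box u;
    u.xul = b1.xul < b2.xul ? b1.xul : b2.xul;
    u.yul = b1.yul > b2.yul ? b1.yul : b2.yul;
    u.xlr = b1.xlr > b2.xlr ? b1.xlr : b2.xlr;
    u.ylr = b1.ylr < b2.ylr ? b1.ylr : b2.ylr;
    return u;
}

static double area(Box b) { return (b.xlr - b.xul) * (b.yul - b.ylr); }

/* Penalty: Guttman's "change in area" metric. */
double box_penalty(Box e1, Box e2) {
    return area(box_union(e1, e2)) - area(e1);
}
```

Consistent for any of the three query predicates is then simply box_overlap applied to the entry's box and the query's box.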
The above implementations, along with the GiST algorithms
described in the previous sections, give behavior identical
to that of Guttman's R-tree. A series of variations on R-trees
have been proposed, notably the R*-tree [BKSS90] and
the R+-tree [SRF87]. The R*-tree differs from the basic R-tree
in three ways: in its PickSplit algorithm, which has a variety
of small changes, in its ChooseSubtree algorithm, which
varies only slightly, and in its policy of reinserting a number
of keys during node split. It would not be difficult to implement
the R*-tree in the GiST: the R*-tree PickSplit algorithm
can be implemented as the PickSplit method of the GiST, the
modifications to ChooseSubtree could be introduced with a
careful implementation of the Penalty method, and the reinsertion
policy of the R*-tree could easily be added into the
built-in GiST tree methods (see Section 7). R+-trees, on the
other hand, cannot be mimicked by the GiST. This is because
the R+-tree places duplicate copies of data entries in multiple
leaf nodes, thus violating the GiST principle of a search tree
being a hierarchy of partitions of the data.
Again, observe that one could fairly easily support more
complex predicates, including n-dimensional analogs of the
disjunctive queries and ranked keys mentioned for B+-trees,
as well as the topological relations of Papadias, et
al. [PTSE95]. Other examples include arbitrary variations of
the usual overlap or ordering queries, e.g., "find all polygons
that overlap more than 30% of this box", or "find all polygons
that overlap 12 to 1 o'clock", which for a given point p returns
all polygons that are in the region bounded by two rays that
exit p at angles 90° and 60° in polar coordinates. Note that
this infinite region cannot be defined as a polygon made up of
line segments, and hence this query cannot be expressed using
typical R-tree predicates.
4.3 GiSTs Over P(Z) (RD-trees)
In the previous two sections we demonstrated that the GiST
can provide the functionality of two known data structures:
B+-trees and R-trees. In this section, we demonstrate that the
GiST can provide support for a new search tree that indexes
set-valued data.
The problem of handling set-valued data is attracting
increasing attention in the Object-Oriented database community
[KG94], and is fairly natural even for traditional relational
database applications. For example, one might have a
university database with a table of students, and for each
student an attribute courses_passed of type setof(integer). One
would like to efficiently support containment queries such as
"find all students who have passed all the courses in the
prerequisite set {101, 121, 150}."
We handle this in the GiST by using sets as containment
keys, much as an R-tree uses bounding boxes as containment
keys. We call the resulting structure an RD-tree (or "Russian
Doll" tree). The keys in an RD-tree are sets of integers, and
the RD-tree derives its name from the fact that as one traverses
a branch of the tree, each key contains the key below it in the
branch. We proceed to give GiST method implementations
for RD-trees.
Before compression, the keys in our RD-trees are sets of
integers. A key S represents the predicate Contains(S, v) for
set-valued variable v. The query predicates allowed on the
RD-tree are Contains(set, v), Overlap(set, v), and Equal(set, v).
The implementation of the query predicates is straightforward:
• Contains(S, T): Return true if S ⊇ T, and false otherwise.
• Overlap(S, T): Return true if S ∩ T ≠ ∅, and false otherwise.
• Equal(S, T): Return true if S = T, and false otherwise.
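These three predicates can be sketched in C, assuming sets are represented as sorted arrays of distinct integers; that representation is our illustrative choice, not one prescribed here.

```c
#include <stdbool.h>

/* A set of integers, stored as a sorted array of distinct elements. */
typedef struct { const int *el; int n; } IntSet;

/* Contains(S, T): true iff S is a superset of T. */
bool set_contains(IntSet s, IntSet t) {
    int i = 0;
    for (int j = 0; j < t.n; j++) {
        while (i < s.n && s.el[i] < t.el[j]) i++;
        if (i == s.n || s.el[i] != t.el[j]) return false;
    }
    return true;
}

/* Overlap(S, T): true iff the intersection is non-empty. */
bool set_overlap(IntSet s, IntSet t) {
    int i = 0, j = 0;
    while (i < s.n && j < t.n) {
        if (s.el[i] == t.el[j]) return true;
        if (s.el[i] < t.el[j]) i++; else j++;
    }
    return false;
}

/* Equal(S, T): element-wise equality of the sorted arrays. */
bool set_equal(IntSet s, IntSet t) {
    if (s.n != t.n) return false;
    for (int i = 0; i < s.n; i++)
        if (s.el[i] != t.el[i]) return false;
    return true;
}
```

The prerequisite query from above is then set_contains(courses, prereqs) with prereqs = {101, 121, 150}.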
Now, the GiST method implementations:
• Consistent(E = (p, ptr), q): Given our keys and predicates,
we know that p = Contains(S, v), and either q =
Contains(T, v), q = Overlap(T, v) or q = Equal(T, v).
For all of these, return true if Overlap(S, T), and false
otherwise.
• Union({E_1 = (S_1, ptr_1), ..., E_n = (S_n, ptr_n)}):
Return S_1 ∪ ... ∪ S_n.
• Compress(E = (S, ptr)): A variety of compression
techniques for sets are given in [HP94]. We briefly
describe one of them here. The elements of S are
sorted, and then converted to a set of n disjoint ranges
{[l_1, h_1], [l_2, h_2], ..., [l_n, h_n]} where l_i ≤ h_i, and
h_i < l_{i+1}. The conversion uses the following algorithm:

    Initialize: consider each element a_m ∈ S
        to be a range [a_m, a_m].
    while (more than n ranges remain) {
        find the pair of adjacent ranges
            with the least interval
            between them;
        form a single range of the pair;
    }

The resulting structure is called a rangeset. It can be
shown that this algorithm produces a rangeset of n items
with minimal addition of elements not in S [HP94].
• Decompress(E = (rangeset, ptr)): Rangesets are easily
converted back to sets by enumerating the elements
in the ranges.
• Penalty(E_1 = (S_1, ptr_1), E_2 = (S_2, ptr_2)): Return
|S_1 ∪ S_2| − |S_1|. Alternatively, return the
change in a weighted cardinality, where each element of
Z has a weight, and |S| is the sum of the weights of the
elements in S.
• PickSplit(P): Guttman's quadratic algorithm for R-tree
split works naturally here. The reader is referred to
[Gut84] for details.
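The rangeset construction used by Compress can be sketched in C as follows. This is a naive quadratic rendering of the merge loop above, with names of our own choosing; [HP94] describes the actual techniques.

```c
/* A closed range [lo, hi] of integers. */
typedef struct { int lo, hi; } Range;

/* Build a rangeset of at most maxranges disjoint ranges from a
 * sorted array of m distinct integers, by repeatedly merging the
 * pair of adjacent ranges with the smallest gap between them.
 * Writes the ranges into out and returns how many were produced. */
int compress_rangeset(const int *el, int m, Range *out, int maxranges) {
    int n = 0;
    /* Initialize: each element a becomes the singleton range [a, a]. */
    for (int i = 0; i < m; i++)
        out[n++] = (Range){ el[i], el[i] };
    while (n > maxranges) {
        /* Find the adjacent pair with the least interval between them. */
        int best = 0;
        for (int i = 1; i < n - 1; i++)
            if (out[i + 1].lo - out[i].hi < out[best + 1].lo - out[best].hi)
                best = i;
        /* Form a single range of the pair, shifting the rest down. */
        out[best].hi = out[best + 1].hi;
        for (int i = best + 1; i < n - 1; i++)
            out[i] = out[i + 1];
        n--;
    }
    return n;
}
```

For example, the elements {1, 2, 3, 10, 11, 20} compressed to at most 3 ranges yield {[1, 3], [10, 11], [20, 20]}: the small gaps inside {1, 2, 3} and {10, 11} are absorbed first, adding no elements outside S.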
This GiST supports the usual R-tree query predicates, has
containment keys, and uses a traditional R-tree algorithm for
PickSplit. As a result, we were able to implement these methods
in Illustra's extensible R-trees, and get behavior identical
to what the GiST behavior would be. This exercise gave
us a sense of the complexity of a GiST class implementation
(ca. 500 lines of C code), and allowed us to do the performance
studies described in the next section. Using R-trees did limit
our choices for predicates and for the split and penalty algorithms,
which will merit further exploration when we build
RD-trees using GiSTs.
5 GiST Performance Issues
In balanced trees with non-overlapping keys, such as B+-trees,
the maximum number of nodes to be
examined (and hence I/Os) is easy to bound: for a point
query over duplicate-free data it is the height of the tree, i.e.,
O(log n) for a database of n tuples. This upper bound cannot
be guaranteed, however, if keys on a node may overlap,
as in an R-tree or GiST, since overlapping keys can cause
searches in multiple paths in the tree. The performance of a
GiST varies directly with the amount that keys on nodes tend
to overlap.
There are two major causes of key overlap: data overlap,
and information loss due to key compression. The first issue
is straightforward: if many data objects overlap significantly,
then keys within the tree are likely to overlap as well.

[Figure 2: Space of Factors Affecting GiST Performance.
Compression loss (x axis) and data overlap (y axis) each range
from 0 to 1; B+-trees over duplicate-free data lie at the origin.]

For example, any dataset made up entirely of identical items will
produce an inefficient index for queries that match the items.
Such workloads are simply not amenable to indexing
techniques, and should be processed with sequential scans instead.
Loss due to key compression causes problems in a slightly
more subtle way: even though two sets of data may not
overlap, the keys for these sets may overlap if the
Compress/Decompress methods do not produce exact keys.
Consider R-trees, for example, where the Compress method
produces bounding boxes. If objects are not box-like, then the
keys that represent them will be inaccurate, and may indicate
overlaps when none are present. In R-trees, the problem
of compression loss has been largely ignored, since most
spatial data objects (geographic entities, regions of the brain,
etc.) tend to be relatively box-shaped.³ But this need not be
the case. For example, consider a 3-d R-tree index over the
dataset corresponding to a plate of spaghetti: although no
single spaghetto intersects any other in three dimensions, their
bounding boxes will likely all intersect!
The two performance issues described above are displayed
as a graph in Figure 2. At the origin of this graph are trees with
no data overlap and lossless key compression, which have the
optimal logarithmic performance described above. Note that
B+-trees over duplicate-free data are at the origin of the graph.
As one moves towards 1 along either axis, performance can
be expected to degrade. In the worst case on the x axis, keys
are consistent with any query, and the whole tree must be
traversed for any query. In the worst case on the y axis, all the
data are identical, and the whole tree must be traversed for any
query consistent with the data.
In this section, we present some initial experiments we
have done with RD-trees to explore the space of Figure 2. We
chose RD-trees for two reasons:
1. We were able to implement the methods in Illustra R-trees.
2. Set data can be "cooked" to have almost arbitrary overlap,
as opposed to polygon data, which is contiguous within its
boundaries and hence harder to manipulate. For example,
it is trivial to construct n distant "hot spots" shared by all
sets in an RD-tree, but it is geometrically difficult to do the
same for polygons in an R-tree. We thus believe that
set-valued data is particularly useful for experimenting with
overlap.

³ Better approximations than bounding boxes have been considered for
doing spatial joins [BKSS94]. However, this work proposes using bounding
boxes in an R*-tree, and only using the more accurate approximations in
main memory during post-processing steps.

[Figure 3: Performance in the Parameter Space. A 3-d surface
plotting average number of I/Os (z axis) against compression loss
(x axis) and data overlap (y axis). This surface was generated
from data presented in [HNP95]. Compression loss was calculated
as (numranges − 20)/numranges, while data overlap was
calculated as overlap/10.]
To validate our intuition about the performance space, we
generated 30 datasets, each corresponding to a point in the
space of Figure 2. Each dataset contained 10000 set-valued
objects. Each object was a regularly spaced set of ranges,
much like a comb laid on the number line (e.g., {[1, 10],
[100001, 100010], [200001, 200010], ...}). The "teeth" of
each comb were 10 integers wide, while the spaces between
teeth were 99990 integers wide, large enough to accommodate
one tooth from every other object in the dataset. The 30
datasets were formed by changing two variables: numranges,
the number of ranges per set, and overlap, the amount that
each comb overlapped its predecessor. Varying numranges
adjusted the compression loss: our Compress method only
allowed for 20 ranges per rangeset, so a comb of t > 20 teeth
had t − 20 of its inter-tooth spaces erroneously included in
its compressed representation. The amount of overlap was
controlled by the left edge of each comb: for overlap 0, the
first comb was started at 1, the second at 11, the third at 21,
etc., so that no two combs overlapped. For overlap 2, the first
comb was started at 1, the second at 9, the third at 17, etc.
The 30 datasets were generated by forming all combinations
of numranges in {20, 25, 30, 35, 40} and overlap in {0, 2, 4,
6, 8, 10}.
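For illustration, the comb construction just described can be sketched in C. This is our own reconstruction of the generator, not the original experimental harness; the constant and function names are ours.

```c
#include <stdio.h>

#define TOOTH_WIDTH 10     /* each tooth spans 10 integers */
#define TOOTH_PITCH 100000 /* start-to-start distance between teeth:
                              10-wide tooth + 99990-wide gap */

/* Left endpoint of tooth t in comb number obj. With overlap 0 the
 * combs interleave exactly (starts 1, 11, 21, ...); with overlap k
 * each comb intrudes k integers into its predecessor's teeth
 * (e.g. overlap 2 gives starts 1, 9, 17, ...). */
long comb_tooth_lo(int obj, int t, int overlap) {
    return 1 + (long)obj * (TOOTH_WIDTH - overlap) + (long)t * TOOTH_PITCH;
}

/* Emit one comb object as a set of closed ranges. */
void print_comb(int obj, int numranges, int overlap) {
    printf("{");
    for (int t = 0; t < numranges; t++) {
        long lo = comb_tooth_lo(obj, t, overlap);
        printf("%s[%ld, %ld]", t ? ", " : "", lo, lo + TOOTH_WIDTH - 1);
    }
    printf("}\n");
}
```

A dataset for a given (numranges, overlap) point is then 10000 calls to print_comb, one per object.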
For each of the 30 datasets, five queries were performed.
Each query searched for objects overlapping a different tooth
of the first comb. The query performance was measured in
number of I/Os, and the five numbers averaged per dataset. A
chart of the performance appears in [HNP95]. More illustrative
is the 3-d plot shown in Figure 3, where the x and y axes
are the same as in Figure 2, and the z axis represents the
average number of I/Os. The landscape is much as we expected:
it slopes upwards as we move away from 0 on either axis.
While our general insights on data overlap and compression
loss are verified by this experiment, a number of performance
variables remain unexplored. Two issues of concern
are hot spots and the correlation factor across hot spots. Hot
spots in RD-trees are integers that appear in many sets. In
general, hot spots can be thought of as very specific predicates
satisfiable by many tuples in a dataset. The correlation factor
for two integers j and k in an RD-tree is the likelihood that
if one of j or k appears in a set, then both appear. In general,
the correlation factor for two hot spots p, q is the likelihood
that if p ∨ q holds for a tuple, p ∧ q holds as well. An interesting
question is how the GiST behaves as one denormalizes
data sets to produce hot spots, and correlations between them.
This question, along with similar issues, should prove to be a
rich area of future research.
6 Implementation Issues
In previous sections we described the GiST, demonstrated its
flexibility, and discussed its performance as an index for
secondary storage. A full-fledged database system is more than
just a secondary storage manager, however. In this section we
point out some important database system issues which need
to be considered when implementing the GiST. Due to space
constraints, these are only sketched here; further discussion
can be found in [HNP95].
• In-Memory Efficiency: The discussion above shows
how the GiST can be efficient in terms of disk access.
To streamline the efficiency of its in-memory computation,
we open the implementation of the Node object to
extensibility. For example, the Node implementation for
GiSTs with linear orderings may be overloaded to support
binary search, and the Node implementation to support
hB-trees can be overloaded to support the specialized
internal structure required by hB-trees.
• Concurrency Control, Recovery and Consistency:
High concurrency, recoverability, and degree-3 consistency
are critical factors in a full-fledged database system.
We are considering extending the results of Kornacker
and Banks for R-trees [KB95] to our implementation
of GiSTs.
• Variable-Length Keys: It is often useful to allow
keys to vary in length, particularly given the Compress
method available in GiSTs. This requires particular care
in the implementation of tree methods like Insert and Split.
• Bulk Loading: In unordered domains, it is not clear how
to efficiently build an index over a large, pre-existing
dataset. An extensible BulkLoad method should be
implemented for the GiST to accommodate bulk loading
for various domains.
• Optimizer Integration: To integrate GiSTs with a
query optimizer, one must let the optimizer know which
query predicates match each GiST. The question of
estimating the cost of probing a GiST is more difficult, and
will require further research.
• Coding Details: We propose implementing the GiST in
two ways. The Extensible GiST will be
runtime-extensible, like POSTGRES or Illustra, for maximal
convenience; the Template GiST will be compile-time-extensible,
like SHORE, for maximal efficiency. With a
little care, these two implementations can be built off of
the same code base, without replication of logic.
7 Summary and Future Work
The incorporation of new data types into today's database
systems requires indices that support an extensible set of
queries. To facilitate this, we isolated the essential nature of
search trees, providing a clean characterization of how they
are all alike. Using this insight, we developed the Generalized
Search Tree, which unifies previously distinct search tree
structures. The GiST is extremely extensible, allowing arbitrary
data sets to be indexed and efficiently queried in new
ways. This flexibility opens the question of when and how
one can generate effective search trees.
Since the GiST unifies B+-trees and R-trees into a single
structure, it is immediately useful for systems which require
the functionality of both. In addition, the extensibility of the
GiST also opens up a number of interesting research problems
which we intend to pursue:
• Indexability: The primary theoretical question raised by
the GiST is whether one can find a general characterization
of workloads that are amenable to indexing. The
GiST provides a means to index arbitrary domains for
arbitrary queries, but as yet we lack an "indexability
theory" to describe whether or not trying to index a given
data set is practical for a given set of queries.
• Indexing Non-Standard Domains: As a practical matter,
we are interested in building indices for unusual domains,
such as sets, terms, images, sequences, graphs,
video and sound clips, fingerprints, molecular structures,
etc. Pursuit of such applied results should provide an
interesting feedback loop with the theoretical explorations
described above. Our investigation into RD-trees for
set data has already begun: we have implemented RD-trees
in SHORE and Illustra, using R-trees rather than
the GiST. Once we shift from R-trees to the GiST, we
will also be able to experiment with new PickSplit methods
and new predicates for sets.
• Query Optimization and Cost Estimation: Cost estimates
for query optimization need to take into account
the costs of searching a GiST. Currently such estimates
are reasonably accurate for B+-trees, and less so for R-trees.
Recently, some work on R-tree cost estimation
has been done [FK94], but more work is required to
bring this to bear on GiSTs in general. As an additional
problem, the user-defined GiST methods may be
time-consuming operations, and their CPU cost should be
registered with the optimizer [HS93]. The optimizer must
then correctly incorporate the CPU cost of the methods
into its estimate of the cost for probing a particular GiST.
• Lossy Key Compression Techniques: As new data domains
are indexed, it will likely be necessary to find new
lossy compression techniques that preserve the properties
of a GiST.
• Algorithmic Improvements: The GiST algorithms for
insertion are based on those of R-trees. As noted in
Section 4.2, R*-trees use somewhat modified algorithms,
which seem to provide some performance gain for spatial
data. In particular, the R*-tree policy of "forced reinsert"
during split may be generally beneficial. An investigation
of the R*-tree modifications needs to be carried
out for non-spatial domains. If the techniques prove
beneficial, they will be incorporated into the GiST, either as
an option or as default behavior. Additional work will be
required to unify the R*-tree modifications with R-tree
techniques for concurrency control and recovery.
Finally, we believe that future domain-specific search tree
enhancements should take into account the generality issues
raised by GiSTs. There is no good reason to develop new,
distinct search tree structures if comparable performance can be
obtained in a unified framework. The GiST provides such a
framework, and we plan to implement it in an existing extensible
system, and also as a standalone C++ library package,
so that it can be exploited by a variety of systems.
Acknowledgements
Thanks to Praveen Seshadri, Marcel Kornacker, Mike Olson,
Kurt Brown, Jim Gray, and the anonymous reviewers for
their helpful input on this paper. Many debts of gratitude are
due to the staff of Illustra Information Systems: thanks to
Mike Stonebraker and Paula Hawthorn for providing a flexible
industrial research environment, and to Mike Olson, Jeff
Meredith, Kevin Brown, Michael Ubell, and Wei Hong for
their help with technical matters. Thanks also to Shel Finkelstein
for his insights on RD-trees. Simon Hellerstein is
responsible for the acronym GiST. Ira Singer provided a hardware
loan which made this paper possible. Finally, thanks
to Adene Sacks, who was a crucial resource throughout the
course of this work.
References
[Aok91] P. M. Aoki. Implementation of Extended Indexes
in POSTGRES. SIGIR Forum, 25(1):2±9, 1991.
[BKSS90] Norbert Beckmann, Hans­Peter Kriegel, Ralf
Schneider, and Bernhard Seeger. The R*­
tree: An Efficient and Robust Access Method
For Points and Rectangles. In Proc. ACM-
SIGMOD International Conference on Manage­
ment of Data, pages 322±331, Atlantic City, May
1990.
[BKSS94] Thomas Brinkhoff, Hans­Peter Kriegel, Ralf
Schneider, and Bernhard Seeger. Multi­Step
Processing of Spatial Joins. In Proc. ACM­
SIGMOD International Conference on Manage­
ment of Data, Minneapolis, May 1994, pages
197±208.
[CDF + 94] Michael J. Carey, David J. DeWitt, Michael J.
Franklin, Nancy E. Hall, Mark L. McAuliffe,
Jeffrey F. Naughton, Daniel T. Schuh, Mar­
vin H. Solomon, C. K. Tan, Odysseas G. Tsatalos,
Seth J. White, and Michael J. Zwilling. Shoring
Up Persistent Applications. In Proc. ACM­
SIGMOD International Conference on Manage­
ment of Data, Minneapolis, May 1994, pages
383±394.
[CDG + 90] M.J. Carey, D.J. DeWitt, G. Graefe, D.M. Haight,
J.E. Richardson, D.H. Schuh, E.J. Shekita, and
S.L. Vandenberg. The EXODUS Extensible
DBMS Project: An Overview. In Stan Zdonik
and David Maier, editors, Readings In Object­
Oriented Database Systems. Morgan­Kaufmann
Publishers, Inc., 1990.
[Com79] Douglas Comer. The Ubiquitous B­Tree. Com­
puting Surveys, 11(2):121±137, June 1979.
[FB74] R. A. Finkel and J. L. Bentley. Quad­Trees:
A Data Structure For Retrieval On Composite
Keys. ACTA Informatica, 4(1):1±9, 1974.
[FK94] Christos Faloutsos and Ibrahim Kamel. Beyond
Uniformity and Independence: Analysis of R­
trees Using the Concept of Fractal Dimension.
In Proc. 13th ACM SIGACT­SIGMOD­SIGART
Symposium on Principles of Database Systems,
pages 4±13, Minneapolis, May 1994.
[Gro94] The POSTGRES Group. POSTGRES Reference
Manual, Version 4.2. Technical Report M92/85,
Electronics Research Laboratory, University of
California, Berkeley, April 1994.
[Gut84] Antonin Guttman. R­Trees: A Dynamic Index
Structure For Spatial Searching. In Proc. ACM­
SIGMOD International Conference on Manage­
ment of Data, pages 47±57, Boston, June 1984.
[HNP95] Joseph M. Hellerstein, Jeffrey F. Naughton, and
Avi Pfeffer. Generalized Search Trees for
Database Systems. Technical Report #1274, Uni­
versity of Wisconsin at Madison, July 1995.
[HP94] Joseph M. Hellerstein and Avi Pfeffer. The RD­
Tree: An Index Structure for Sets. Technical Re­
port #1252, University of Wisconsin at Madison,
October 1994.
[HS93] Joseph M. Hellerstein and Michael Stonebraker.
Predicate Migration: Optimizing Queries With
Expensive Predicates. In Proc. ACM­SIGMOD
International Conference on Management of
Data, Washington, D.C., May 1993, pages 267-276.
[Ill94] Illustra Information Technologies, Inc. Illustra
User's Guide, Illustra Server Release 2.1, June
1994.
[Jag90] H. V. Jagadish. Linear Clustering of Objects With
Multiple Attributes. In Proc. ACM­SIGMOD In­
ternational Conference on Management of Data,
Atlantic City, May 1990, pages 332±342.
[KB95] Marcel Kornacker and Douglas Banks. High­
Concurrency Locking in R­Trees. In Proc. 21st
International Conference on Very Large Data
Bases, Zurich, September 1995.
[KG94] Won Kim and Jorge Garza. Requirements For
a Performance Benchmark For Object­Oriented
Systems. In Won Kim, editor, Modern Database
Systems: The Object Model, Interoperability and
Beyond. ACM Press, June 1994.
[KKD89] Won Kim, Kyung­Chang Kim, and Alfred Dale.
Indexing Techniques for Object­Oriented Data­
bases. In Won Kim and Fred Lochovsky, edi­
tors, Object­Oriented Concepts, Databases, and
Applications, pages 371±394. ACM Press and
Addison­Wesley Publishing Co., 1989.
[Knu73] Donald Ervin Knuth. Sorting and Searching,
volume 3 of The Art of Computer Programming.
Addison­Wesley Publishing Co., 1973.
[LJF94] King­Ip Lin, H. V. Jagadish, and Christos Falout­
sos. The TV­Tree: An Index Structure for High­
Dimensional Data. VLDB Journal, 3:517±542,
October 1994.
[LS90] David B. Lomet and Betty Salzberg. The hB­
Tree: A Multiattribute Indexing Method. ACM
Transactions on Database Systems, 15(4), De­
cember 1990.
[LY81] P. L. Lehman and S. B. Yao. Efficient Locking
For Concurrent Operations on B-trees. ACM
Transactions on Database Systems, 6(4):650±
670, 1981.
[MCD94] Maurício R. Mediano, Marco A. Casanova, and
Marcelo Dreux. V-Trees: A Storage Method
For Long Vector Data. In Proc. 20th Inter­
national Conference on Very Large Data Bases,
pages 321±330, Santiago, September 1994.
[PSTW93] Bernd­Uwe Pagel, Hans­Werner Six, Heinrich
Toben, and Peter Widmayer. Towards an Anal­
ysis of Range Query Performance in Spatial
Data Structures. In Proc. 12th ACM SIGACT­
SIGMOD­SIGART Symposium on Principles of
Database Systems, pages 214±221, Washington,
D. C., May 1993.
[PTSE95] Dimitris Papadias, Yannis Theodoridis, Timos
Sellis, and Max J. Egenhofer. Topological Rela­
tions in the World of Minimum Bounding Rect­
angles: A Study with R­trees. In Proc. ACM­
SIGMOD International Conference on Manage­
ment of Data, San Jose, May 1995.
[Rob81] J. T. Robinson. The k­D­B­Tree: A Search
Structure for Large Multidimensional Dynamic
Indexes. In Proc. ACM­SIGMOD International
Conference on Management of Data, pages 10±
18, Ann Arbor, April/May 1981.
[SRF87] Timos Sellis, Nick Roussopoulos, and Christos
Faloutsos. The R+­Tree: A Dynamic Index For
Multi­Dimensional Objects. In Proc. 13th Inter­
national Conference on Very Large Data Bases,
pages 507±518, Brighton, September 1987.
[Sto86] Michael Stonebraker. Inclusion of New Types
in Relational Database Systems. In Proceedings
of the IEEE Fourth International Conference on
Data Engineering, pages 262±269, Washington,
D.C., February 1986.
[WE80] C. K. Wong and M. C. Easton. An Efficient
Method for Weighted Sampling Without
Replacement. SIAM Journal on Computing,
9(1):111±113, February 1980.