Indexing for String Queries using Generalized
Search Trees
Jeff Foster
jfoster@cs.berkeley.edu
Megan Thomas
mct@cs.berkeley.edu
Abstract
We present five index trees designed for supporting string searches. We discuss Counted Trees, Substring Trees, and Regular Expression Trees, all of which suffer the same problem -- their attempts at approximating a large set of data lead to an almost complete lack of information in the interior nodes. A variant of B-Trees, called a Prefix-Suffix Tree, avoids this difficulty by severely restricting the form of queries. Finally, we show that very high redundancy also avoids this pitfall, using a Suffix B-Tree.
1 Introduction
Modern benchmarks such as TPC-D [Tra96] contain many "like" queries, a restricted form of string pattern matching. However, current databases only index for prefix and suffix searches. As more users want to ask substring and possibly even regular expression queries, it is becoming important to index for these as well. The most common implementation of general "like" queries is a simple scan. Although this may be acceptable for small relations, we would like to do better for large databases.

We investigated several indexing strategies for supporting various kinds of string queries. We developed our indices in the GiST framework, a generalized search tree package. GiST is an abstraction of several common indexing techniques, including B-Trees and R-Trees, and provides support for concurrency and transactional semantics.
1.1 Description of GiST
A GiST is a balanced tree whose nodes contain sets of entries, which are pairs $(k, p)$ where $k$ is a key and $p$ a pointer. For interior nodes, $p$ points to a child node, and for leaves $p$ contains a pointer to the corresponding tuple in the database. A key $k$ represents a predicate that is true of any tuples reachable from $p$. For example, in B-Trees $k$ is an interval meaning "all tuples reachable from $p$ are within this interval." For R-Trees, $k$ is a two-dimensional rectangle.
GiST also requires the user to define six methods:

• Consistent takes an entry $(k_1, p)$ and a predicate $k_2$ and determines whether both $k_1$ and $k_2$ could simultaneously be true.

• Union takes a set of entries $P$ and returns a new predicate that holds for all tuples reachable from $P$.

• Penalty takes two entries $e_1$ and $e_2$ and returns some value indicating how expensive it would be to insert $e_2$ into the node pointed to by $e_1$.

• PickSplit takes a node and divides its entries into two groups, trying to cluster like entries together.
The other two methods are Compress and Decompress, which we will not discuss. They are used mainly for packing and unpacking keys for storage on disk. Although we will skip them, they are important in practice, because the fanout of a GiST is determined by the disk's page size. Thus, compact disk representations increase fanout. In our experiments, we assume a page size of 8KB.
For a full description of GiST, see [HNP95].
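To make the four-method contract concrete, the following minimal Python sketch shows the interface a GiST extension supplies. The class and method names are our own illustration and do not reflect the actual C interface of the GiST package.

    from abc import ABC, abstractmethod
    from typing import Any, List, Tuple

    class GistKeyMethods(ABC):
        """The four search/insert methods a GiST extension must define.
        (Compress and Decompress, which pack keys onto disk pages, are
        omitted here just as they are in the text.)"""

        @abstractmethod
        def consistent(self, key: Any, query: Any) -> bool:
            """Could 'key' and 'query' hold simultaneously?  Returning
            False lets the search prune the subtree under 'key'."""

        @abstractmethod
        def union(self, keys: List[Any]) -> Any:
            """A predicate that holds for every tuple reachable from 'keys'."""

        @abstractmethod
        def penalty(self, key1: Any, key2: Any) -> float:
            """How expensive it would be to insert key2 into the node
            whose key is key1."""

        @abstractmethod
        def pick_split(self, keys: List[Any]) -> Tuple[List[Any], List[Any]]:
            """Divide an overflowing node's entries into two groups,
            clustering like entries together."""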
1.2 Related Work
There is an extensive literature on string searching. A good general introduction is [Aho90]. One of the earliest indexing techniques for fast substring searches is the suffix tree, a compact representation that encodes every possible suffix of the dataset [McC76]. Krishnan, Vitter, and Iyer have shown how to use a pruned suffix tree to estimate the selectivity of substring searches, though not how to do the actual searches [KVI96].

By combining suffix trees with other data structures, Baeza-Yates and Gonnet allow arbitrary regular expression searching in a relatively compact space [BYG89]. Unfortunately, though that work is designed for a very large dataset, the Oxford English Dictionary, it does not support transactional semantics; the OED does not often change.
All but one of the trees we describe are based on the following idea: given a set of strings, we want some compact, approximate representation of that set. Thus, we want to find patterns in the strings. Learning patterns from sets of strings is an essential task in the world of genetic databases; see [BJEG95] for a good survey of this area. In that setting, approximate matching is often good enough; [WM92] describes the agrep tool, which allows approximate searches for regular expressions. See [KM95] for a much more theoretical treatment of approximate matching.

Of course, searching for substrings and regular expressions is not all there is to textual databases. [Loe94] provides a fairly general overview of some techniques that treat data as structured text, rather than as byte strings. [GNU94] describes a framework that adds string searching and manipulation primitives to database queries. That work may be too theoretical to be realistic, but it must be good because the paper uses the word "stringological."
2 The Trees
We have designed five kinds of GiST trees to support string queries. Counted Trees treat strings as bags of letters. Substring Trees use longest-common substrings as keys. Regular Expression Trees use approximate regular expressions as keys. Prefix-Suffix Trees work like B-Trees but allow simultaneous prefix and suffix queries. Finally, Suffix B-Trees are B-Trees with replicated keys that provide fast substring queries.
Throughout the rest of our discussion, we will use the following terminology. Let $\Sigma$ be a finite alphabet from which the strings are drawn, and let $\epsilon$ be the empty string. Define $n = |\Sigma|$, and number the elements of $\Sigma$ with $0$ through $n-1$. If $x$ is a string, $x_i$ is the $i$th character of $x$, $0 \le i < |x|$. We use this same subscript notation for an element of an $n$-vector.

For each tree, we first describe the key data structure, then define the Union, Penalty, PickSplit, and Consistent methods. We define these in terms of keys rather than entries in order to make things slightly simpler. Whenever we use the word key where GiST requires an entry, we really mean an entry containing that key and some pointer.
3 Counted Trees
The simplest tree we considered is a Counted Tree. This tree treats strings as bags of letters, and at interior nodes simply stores the maximum number of times each particular letter appears in any single word below this node. Each key $c$ is a mapping $\Sigma \to \mathbb{N}$, where $c(a)$ is the maximum frequency of $a$ in any sub-key. We represent this information as a vector $c = (c_0, \ldots, c_{n-1})$ of integers. To keep the vectors short, in the implementation we ignore case and punctuation.
The methods are as follows:
• Union$(c, d)$ is the pointwise maximum of the two vectors: $(c \cup d)_i = \max(c_i, d_i)$.

• Penalty$(k_1, k_2)$ is just $\|(k_1 \cup k_2) - k_1\|_1$, i.e., how much the sum of all the counts stored in $k_1$ will increase by adding $k_2$.

• PickSplit uses Guttman's quadratic split algorithm, with the penalty metric as stated above [Gut84]. Let $P = \{k_1, k_2, \ldots, k_n\}$ be a set of keys, and let $P_1$ and $P_2$ be the output of PickSplit. We compute for each pair $(k_i, k_j)$ the distance between the two keys, using the $L_1$ norm (the Manhattan distance). The keys that are farthest apart are chosen as seeds, and one is added to $P_1$ and the other to $P_2$. We then distribute the remaining keys to $P_1$ and $P_2$ according to Penalty.
This tree supports the sub-multiset query: Consistent takes a key $k$ and a string $x$, and returns true if the frequencies of the letters in $x$ are at most the frequencies in $k$. By storing the actual strings at the leaves, we can restrict this further to substring queries.
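As an illustration, here is a minimal Python sketch of the Counted Tree key operations just described, assuming keys are 26-entry count vectors over a case-folded alphabet; the function names are our own.

    def make_key(word: str) -> list:
        """A leaf key: per-letter frequencies, ignoring case and punctuation."""
        counts = [0] * 26
        for ch in word.lower():
            if 'a' <= ch <= 'z':
                counts[ord(ch) - ord('a')] += 1
        return counts

    def union(c: list, d: list) -> list:
        """Pointwise maximum: (c u d)_i = max(c_i, d_i)."""
        return [max(ci, di) for ci, di in zip(c, d)]

    def penalty(k1: list, k2: list) -> int:
        """L1 increase in k1's counts caused by absorbing k2."""
        return sum(u - a for u, a in zip(union(k1, k2), k1))

    def consistent(key: list, query: str) -> bool:
        """Sub-multiset test: every letter of the query occurs at least
        as often in the key."""
        return all(q <= k for q, k in zip(make_key(query), key))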
3.1 Performance
Clearly this tree is only a slight generalization of a tree that stores sets at the leaves and uses set union as the Union method. Given the theoretical inefficiency of such a workload, as documented in [HKP97], we knew from the beginning that this probably would not work. One potential saving grace might be that $n$, the size of $\Sigma$, is fixed, and thus the argument in [HKP97] about subset queries does not directly apply.

Unfortunately, it still fails to be efficient in much the same cases. The only time an interior node contains any useful information is when a string with an infrequent letter appears below that node. When tested on a file containing words from an English dictionary, nearly all of the keys contain the common letters and few contain the uncommon ones. Searching for "xylo" works very well; searching for "ate" requires scanning the entire index structure.
Figure 1: Clustering in a counted tree. [3-D plot titled "Counted Tree Clustering": pairwise distances under the Penalty metric between keys on one leaf page; both horizontal axes are leaf number (0-15), the vertical axis is distance (0-10).]
We had hoped that, while the Counted Tree itself would probably not be useful for searching, it would produce interesting clustering at the leaf nodes. Data clustered by this tree could be bulk-loaded into a different type of tree that might be good at searching but poor at clustering. Unfortunately, the Counted Tree had little success at clustering interesting datasets. Figure 1 shows a 3-dimensional plot of the distances, according to the Penalty metric, between keys stored on the same page. This dataset was a file of English words. If the clustering had worked well, then distances on this page should be relatively small. As the graph indicates, they vary between 0 and 10. Since the maximum word length on this page is only 11 and the average length over the whole dataset is 7, having keys on the same page be 10 apart is not very good.

We believe this happens because, after accumulating tens of strings in a node, the key for that node contains a vector that is near (in the $L_1$ sense) almost any word. All of the common letters are present, as are scattered uncommon ones. Therefore virtually any word can be added to the node without changing the key very much. Although this word may be close to the key, it is not necessarily close to specific words in the node.
4 Substring Trees
Among the most common string queries are substring searches. Therefore, it seems natural to try to index for these by storing substrings as keys. In a Substring Tree, each key is the longest common substring (LCS) of the sub-keys.

One possible definition of $\mathrm{LCS}(x, y)$ would be a string $z$ such that $x = \alpha z \beta$ and $y = \gamma z \delta$, where Greek letters stand for arbitrary strings. However, that definition seems overly restrictive. Instead, we use the longest common subsequence definition from [CLR90]. Rather than restating this formally, we shall just say that $\mathrm{LCS}(x, y)$ is a string $z$ where all the letters of $z$ appear in sequence in both $x$ and $y$, except that an arbitrary number of other letters are allowed in between. For example, $\mathrm{LCS}(joe, jeff) = je$.
This looser definition is still impractical on its own, because unless the two strings are nearly identical we lose all information -- anything could appear in the gaps. Accordingly, we record the maximum distance between any two sequential letters of the LCS, plus the distances from the beginnings and endings of the words. Thus, in our system, $\mathrm{LCS}(joe, jeff) = (0, j, 1, e, 2)$, meaning that $j$ always appears first, there is at most one letter between $j$ and $e$, and at most two letters to the right of $e$. Our algorithm for computing this information is a straightforward modification of the one given in [CLR90].

We write an LCS as a pair $(x, d)$, where $x$ contains the letters of the longest common subsequence and $d$ contains the distances between the letters, e.g. $\mathrm{LCS}(joe, jeff) = (je, (0, 1, 2))$.
The GiST methods for this tree are

• Union$(k_1, k_2) = \mathrm{LCS}(k_1, k_2)$, as just described.

• Penalty$(k_1, k_2)$. Let $k = (x, d) = \mathrm{LCS}(k_1, k_2)$. Penalty returns $|x| \times 2^{l}$, where $l = \sum_{i=0}^{|d|} 2^{d_i}$. The idea here is that longer strings are better, but that a larger gap between letters is exponentially bad, since any letter could fill in the gap, or there might be no letter in the gap.

• PickSplit again uses the quadratic split algorithm as we described before. This time the distance between two keys $k_1$ and $k_2$ is measured as $\mathrm{Penalty}(k_1 \cup k_2) - \mathrm{Penalty}(k_1)$, the increase in the penalty value when $k_2$ is added to $k_1$.

The query that this tree supports is substring search. Consistent takes a key $k$ and a string $x$, and returns true if $x$ could appear as a substring of any of the strings $k$ represents. At the leaves, this is the usual substring search, but we need to take the gaps into account in the interior nodes. Fortunately, a minor, and slightly more efficient, modification of the LCS algorithm can handle this.
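The following Python sketch shows one way such a key could be computed for a pair of strings: the standard dynamic-programming LCS of [CLR90], extended to recover the gap vector from one alignment. It is our own illustration of the idea, not the paper's implementation.

    def lcs_with_gaps(x: str, y: str):
        """Longest common subsequence of x and y plus the gap vector d:
        d[0] bounds the letters before the first LCS letter, d[t] the
        letters between LCS letters t-1 and t, d[-1] the letters after
        the last one."""
        m, n = len(x), len(y)
        # Standard suffix-based LCS length table [CLR90].
        L = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m - 1, -1, -1):
            for j in range(n - 1, -1, -1):
                L[i][j] = (1 + L[i + 1][j + 1] if x[i] == y[j]
                           else max(L[i + 1][j], L[i][j + 1]))
        # Recover one alignment: positions of the LCS letters in x and y.
        z, px, py = [], [], []
        i = j = 0
        while i < m and j < n:
            if x[i] == y[j]:
                z.append(x[i]); px.append(i); py.append(j)
                i += 1; j += 1
            elif L[i + 1][j] >= L[i][j + 1]:
                i += 1
            else:
                j += 1
        if not z:
            return "", []
        d = [max(px[0], py[0])]                  # letters before first match
        for t in range(1, len(z)):
            d.append(max(px[t] - px[t - 1], py[t] - py[t - 1]) - 1)
        d.append(max(m - 1 - px[-1], n - 1 - py[-1]))  # letters after last
        return "".join(z), d

    # lcs_with_gaps("joe", "jeff") == ("je", [0, 1, 2])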
4.1 Performance
Unfortunately, although the Consistent method is efficient, the keys it operates on contain little information. In a moderately varied dataset, the probability of having a long LCS for an arbitrary pair of keys is low. Because each key is the LCS of many other keys, there is little chance that the LCS will, in fact, contain any string.

Figure 2 shows a series of histograms capturing the trend towards less and less information in LCSs as the number of sub-keys increases. Each diagram shows how many LCSs of $n$ distinct strings have a given length. When $n$ is 1, the length of the LCS is simply the string length. The data is 102 random words from an English dictionary. As can clearly be seen, the fraction of LCSs containing a non-empty string is negligible once more than a few words are unioned. When taking the LCS of an 8KB page, which may contain hundreds of words, there is little clustering any penalty metric can do. In general, interior nodes contain no information.

This is not to say that there is no clustering in this tree, but simply that the vast majority of the nodes have none. We loaded a 38,470-word spelling dictionary into the Substring Tree and looked at the root node. Out of 20 entries, only four contained keys longer than three letters: superc, ion, lies, and consus.
Figure 2: Histograms illustrating how the length of the LCS of $n$ strings decreases as $n$ increases. The data is 102 random words from an English dictionary. [Five panels, $n = 1$ through $5$; each plots the fraction of LCSs against LCS length, from 0 to 12.]
For each of these, the largest gap between letters was, respectively, nine, five, seven, and six. Thus, for substring searches of five or fewer letters, there would be no pruning of the search at the root node. Considering that the average word length in this file is eight letters, having substring searches of fewer than five letters descend every path from the root is unacceptable. The results were even worse when we tried a data file from the TPC-D benchmark [Tra96] with longer strings -- the average length in this dataset was 19 letters. In this case, loading the data yielded no substrings longer than one letter at the root.
5 Regular Expression Trees
The most general string searches commonly available use regular expressions as patterns. The properties of regular expressions and their correspondence with deterministic finite automata (DFAs) are well known [HU79]. Within the GiST framework, using DFAs as keys struck us as a natural technique for indexing both general regular expression and substring queries.
A Regular Expression Tree is a tree whose keys contain DFAs. The DFA at a key accepts a superset of all the languages accepted by its sub-keys. To check whether a key matches a regular expression query, we intersect the DFA for the query with the DFA at the key and test whether the result is non-empty.
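A minimal sketch of this Consistent test, assuming DFAs are represented as (start state, accepting set, transition dictionary) triples of our own devising: intersect by the standard product construction and check whether any accepting pair is reachable.

    from collections import deque

    def intersection_nonempty(dfa1, dfa2):
        """Do two DFAs accept a common string?  Each DFA is a triple
        (start, accepting_states, trans), where trans maps
        (state, symbol) -> state.  Breadth-first search of the product
        automaton; emptiness holds iff no accepting pair is reachable."""
        (s1, acc1, t1), (s2, acc2, t2) = dfa1, dfa2
        seen = {(s1, s2)}
        queue = deque([(s1, s2)])
        while queue:
            q1, q2 = queue.popleft()
            if q1 in acc1 and q2 in acc2:
                return True          # reachable accepting pair: non-empty
            for (p, sym), r1 in t1.items():
                if p != q1:
                    continue
                r2 = t2.get((q2, sym))   # both DFAs must move on sym
                if r2 is not None and (r1, r2) not in seen:
                    seen.add((r1, r2))
                    queue.append((r1, r2))
        return False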
The union of two keys $M_1$ and $M_2$ is a new DFA $M$ such that $L(M) \supseteq L(M_1) \cup L(M_2)$, where $L(M)$ is the language accepted by DFA $M$. The easiest way to generate such a DFA is to use the standard union algorithm [HU79] and then minimize the result. If we apply only this algorithm, the $\supseteq$ becomes an equality. Unfortunately, recall that the fanout of GiST trees is determined by the number of keys that can fit on a disk page. Even compactly represented DFAs tend to be bulky. Thus, if we only use the standard algorithms for minimization, it may not be possible to fit a DFA within a page, which defeats the purpose of having an index. Additionally, the union and minimization algorithms are, respectively, $O(n^2)$ and $O(n^3)$ in the number of states. While CPU time is not normally important, if intermediate DFAs are allowed to get large this time can become quite significant.
In order to produce small DFAs, we resort to approximation. Suppose that DFA $M$ takes up too much space. Then create a DFA with one fewer state by picking a pair of states of $M$ and collapsing them. Repeat until the DFA is small enough. If this still is insufficient, we can also make the following approximation: suppose some state $s$ has two transitions to state $t$, one when the input symbol is $a$ and one when the input symbol is $c$. Rather than store two separate transitions, we can store one transition with label $[a\text{-}c]$, meaning take the transition on symbols $a$, $b$, and $c$.

In order to pick the states to collapse, we use the following simple-minded heuristic: we pick a state $q$ at random, and then we pick a state $r$ such that $q$ has a transition to $r$. If there is no such state, we collapse $q$ with a random state. This heuristic is clearly not optimal; even so, the approach shows absolutely no signs of working, even on a relatively small set of words.
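The sketch below shows one way the collapsing step could work, under our own DFA representation (a dictionary mapping (state, symbol) to state). Collapsing two states can break determinism, so conflicting targets are merged in turn; the resulting quotient automaton accepts a superset of the original language, which is what the key invariant requires. Accepting-state sets would be merged the same way.

    def collapse_states(trans, q, r):
        """Collapse states q and r, then repair determinism by further
        collapsing any targets that conflict on the same input symbol.
        'trans' maps (state, symbol) -> state.  A sketch, not the
        paper's code."""
        parent = {}

        def find(s):                 # union-find representative
            while s in parent:
                s = parent[s]
            return s

        pending = [(q, r)]
        while pending:
            a, b = pending.pop()
            a, b = find(a), find(b)
            if a == b:
                continue
            parent[b] = a            # merge b into a
            # Re-scan all edges under the new state identification:
            seen = {}
            for (src, sym), dst in trans.items():
                key = (find(src), sym)
                d = find(dst)
                if key in seen and seen[key] != d:
                    pending.append((seen[key], d))   # conflict: merge targets
                else:
                    seen[key] = d
        # Rewrite the transition table over representatives.
        return {(find(src), sym): find(dst)
                for (src, sym), dst in trans.items()}

    # trans = collapse_states(trans, q, r)  # q, r chosen by the heuristic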
Basically, this tree suffers from the same difficulties as the Counted Tree and the Substring Tree. In order to have a small, accurate regular expression accepting a large set of strings, the strings have to have many commonalities. For a database like a set of words from the English language, that does not happen. Figure 3 shows three snapshots of the union method taken after four, 64, and 128 strings were unioned and then minimized. These strings were the first 128 from a list of English words in alphabetical order. The topmost DFA has perfect information, but by the time only 64 strings have been thrown in, the union method has started making large approximations. By the time we get to 128 strings, the DFA contains no information.

Figure 3: A progression of three finite automata created by the union method. The topmost DFA is the union of four strings, the middle the union of 64 strings, and the bottom the union of 128 strings. [The bottom automaton has collapsed to a single state that loops on $[a\text{-}z]$.]
6 Prefix-Suffix Trees

A B-Tree [Com79] can efficiently support prefix queries on strings. The keys of the B-Tree are ranges of strings, and a prefix $x$ matches a range $[y, z]$ if either $x$ is lexically ordered between $y$ and $z$ or $x$ is a prefix of $y$. By reversing strings before inserting them, B-Trees can handle suffix queries in the same fashion.
We propose a new kind of tree, a Prefix-Suffix Tree, which can handle simultaneous prefix and suffix queries. Rather than storing plain strings or reversed strings, we store folded strings. Let $x$ be a string of length $m$. Define $\mathrm{fold}(x)$ as a string of length $m$ such that
$$\mathrm{fold}(x)_{2i} = x_i, \qquad \mathrm{fold}(x)_{2i+1} = x_{m-i-1} \text{ if } 2i+1 < m,$$
for $0 \le i < \lceil m/2 \rceil$. For example, $\mathrm{fold}(megan) = mneag$.
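A direct transcription of the definition into Python (our own sketch):

    def fold(x: str) -> str:
        """Interleave a string with its reverse: prefix characters at
        even positions, suffix characters at odd positions.
        fold("megan") == "mneag"."""
        m = len(x)
        out = []
        for i in range((m + 1) // 2):        # 0 <= i < ceil(m/2)
            out.append(x[i])                 # fold(x)[2i]   = x[i]
            if 2 * i + 1 < m:
                out.append(x[m - i - 1])     # fold(x)[2i+1] = x[m-i-1]
        return "".join(out)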
Aside from this modification, the tree works just like a B-Tree:

• Union takes two intervals and returns the smallest interval containing both.

• Penalty$(k_1, k_2)$ returns how much the interval $k_1$ will expand if $k_2$ is added.

• PickSplit$(P)$ divides $P$ into $P_1$ and $P_2$, where the first half of $P$ goes in $P_1$ and the second half goes in $P_2$.
A more thorough description of an implementation of B-Trees with GiST appears in [HNP95].

The Consistent method for this tree takes a key $k$ and a query of the form $x\Sigma^* y$. Strings $x$ and $y$ are folded together, and Consistent returns true if the folded string is a prefix of $k$. The only tricky aspect is that $x$ and $y$ may be of different lengths, in which case the folded string must be truncated. For example, the query $t\Sigma^* as$ is folded into $ts$. There is no point in representing it as $ts\Sigma a$, since in a prefix search this matches anything $ts$ matches. Only when comparing a query to a leaf node do we necessarily do a full comparison on $x$ and $y$.
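A sketch of the query-folding step, following the truncation rule just described (again our own Python, with hypothetical names):

    def fold_query(x: str, y: str) -> str:
        """Fold a query x(Sigma*)y into the prefix that Consistent
        compares against keys: alternate prefix characters with suffix
        characters taken from the end of y, truncating at the first
        position whose character the query does not determine."""
        out = []
        for i in range(len(x)):
            out.append(x[i])                 # even position 2i: prefix char
            if i >= len(y):
                break                        # odd position unknown: truncate
            out.append(y[len(y) - 1 - i])    # odd position 2i+1: suffix char
        return "".join(out)

    # fold_query("t", "as") == "ts"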
6.1 Performance
In contrast with the last three ideas we discussed, this tree does work well for many queries. In general, as long as $x$ and $y$ are long enough, the tree will be able to eliminate interior nodes from consideration. Specifically, the length of the fold of a query $x\Sigma^* y$ is $2 \times \min(|x|, |y|) + [\,|x| > |y|\,]$, where the bracket is 1 if $|x| > |y|$ and 0 otherwise. The last term takes into account that the folding begins with the prefix. If this quantity is too small, e.g. if $x = \epsilon$, then, even if $y$ is long, the folded query string will match any interior node.

However, if all the queries were to omit the prefix, then presumably a B-Tree with reversed strings would be the right index to use. If we restrict the queries to have $|x| = |y|$, then the new tree is optimal. The moral here is that by sacrificing the generality of full substring search, we were able to achieve good performance. Further, while this tree cannot handle full regular expression queries, it is capable of answering queries a normal B-Tree cannot.
7 Suffix B-Trees

Our last tree is what we term a Suffix B-Tree. As we stated previously, B-Trees can efficiently support prefix queries. In a Suffix B-Tree, we take advantage of this efficiency by turning all substring searches into prefix searches.

Rather than inserting each string once into a B-Tree, we insert it many times. For every string $x$ in the original data file, we insert $x$ and all suffixes of $x$, keeping the same pointer for each insertion. For example, if the original key was $foo$ with pointer 42, then we insert the entries $(foo, 42)$, $(oo, 42)$, and $(o, 42)$. Aside from that, we make no changes to the regular B-Tree implementation.
After doing this, substring searches are easy. A substring $s$ appears in $x$ iff $s$ is a prefix of some suffix of $x$. Thus, to search for a substring $s$, we do a prefix search. Since prefix searches in B-Trees are extremely efficient, now so are substring searches.
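A toy sketch of the idea in Python, with a sorted list standing in for the real B-Tree; insert stores every suffix with the original pointer, and substring search becomes a prefix range scan. The class and method names are our own.

    import bisect

    class SuffixBTree:
        """Every suffix of every inserted string is stored with the
        original tuple pointer, so a substring search becomes a prefix
        (range) search over suffixes."""

        def __init__(self):
            self.entries = []                # sorted (suffix, pointer) pairs

        def insert(self, key: str, ptr: int) -> None:
            for i in range(len(key)):        # e.g. foo -> foo, oo, o
                bisect.insort(self.entries, (key[i:], ptr))

        def search_substring(self, s: str):
            """All pointers whose string contains s: a prefix scan from s."""
            lo = bisect.bisect_left(self.entries, (s, float("-inf")))
            ptrs = set()
            for suffix, ptr in self.entries[lo:]:
                if not suffix.startswith(s):
                    break
                ptrs.add(ptr)
            return ptrs

    # t = SuffixBTree(); t.insert("foo", 42); t.search_substring("oo") -> {42}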
The reason this tree works where our first three trees failed is redundancy. If the original dataset contained $m$ strings with an average length of $l$, then the Suffix B-Tree will contain $lm$ leaf-level entries, and the height of the tree grows correspondingly. In general this is unacceptable for large values of $l$, though for a small, fixed $l$ it works wonderfully.
8 Future Work
In [HKP97], it has been shown that set inclusion workloads have potentially the worst possible access overhead, while B-Trees have the best. We believe that the substring queries we have looked at are an important intermediate point on this spectrum, and we plan to look for some theoretical, rather than statistical, justification for why low redundancy does not work unless the form of queries is severely restricted. Additionally, there should be some middle ground: it should be possible to restrict the form of queries a little and add a little redundancy to get, from an engineering perspective, the optimal balance between expressiveness and redundancy.
There is a large body of string data for which $|\Sigma|$ is very small: genetic data. Although in theory a small finite number is no different from a large finite number, in practice there is quite a difference. Perhaps by restricting $|\Sigma|$ to be small we could build better indices.

Finally, it may just be that GiST is not the right framework for building string indices. As we stated before, there is a significant body of related work, though none of the solutions we have seen support transactional semantics. Another approach would be to start from one of those solutions and try to add concurrency and locking.
9 Conclusions
We have described five kinds of index trees that were designed for string searches. The first three trees we discussed -- Counted Trees, Substring Trees, and Regular Expression Trees -- all fell into the same trap. Counted Trees only work for queries containing infrequent letters and fail completely at clustering. Substring Trees and Regular Expression Trees both contain virtually no information in the interior nodes.

The only trees that work well are the last two we described: Prefix-Suffix Trees and Suffix B-Trees. Prefix-Suffix Trees succeed by severely restricting the form of queries. Suffix B-Trees avoid the trap through very high redundancy.
These last trees sidestep a fundamental problem
with the others: small approximations of large
sets of strings contain little information.
Acknowledgments
Many thanks to Joe Hellerstein and Steve Fink
for listening to our crazy ideas and making good
suggestions.
References
[Aho90] A. V. Aho. Algorithms for Finding Patterns in Strings. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science, volume A, pages 257-300. Elsevier Science Publishers, 1990.

[BJEG95] A. Brāzma, I. Jonassen, I. Eidhammer, and D. Gilbert. Approaches to the automatic discovery of patterns in biosequences. Technical Report No. 113, Department of Informatics, University of Bergen, Bergen, Norway, December 1995.

[BYG89] R. Baeza-Yates and G. Gonnet. Fast Text Searching for Regular Expressions or Automaton Searching on Tries. In G. Ausiello, M. Dezani-Ciancaglini, and S. Ronchi Della Rocca, editors, ICALP '89, volume 372 of Lecture Notes in Computer Science, pages 46-62. Springer-Verlag, Stresa, Italy, July 1989.

[CLR90] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. McGraw-Hill, 1990.

[Com79] D. Comer. The Ubiquitous B-Tree. ACM Computing Surveys, June 1979.

[GNU94] G. Grahne, M. Nykänen, and E. Ukkonen. Reasoning about Strings in Databases. In Proceedings of the 13th ACM Symposium on Principles of Database Systems, pages 303-314, Minneapolis, MN, 1994.

[Gut84] A. Guttman. R-trees: A Dynamic Index Structure for Spatial Searching. In Proc. 1984 ACM-SIGMOD Conference on Management of Data, Boston, MA, 1984.

[HKP97] J. Hellerstein, E. Koutsoupias, and C. Papadimitriou. On the Analysis of Indexing Schemes. In PODS '97, 1997. To appear.

[HNP95] J. Hellerstein, J. Naughton, and A. Pfeffer. Generalized Search Trees for Database Systems. In Proceedings of the 21st VLDB Conference, Zurich, Switzerland, 1995.

[HU79] J. Hopcroft and J. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, 1979.

[KM95] J. Knight and E. Myers. Approximate Regular Expression Pattern Matching with Concave Gap Penalties. Algorithmica, 14(1):85-121, July 1995.

[KVI96] P. Krishnan, J. Vitter, and B. Iyer. Estimating Alphanumeric Selectivity in the Presence of Wildcards. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data (SIGMOD '96), pages 282-293, Montreal, Canada, May 1996.

[Loe94] A. Loeffen. Text Databases: A Survey of Text Models and Systems. ACM SIGMOD Record, 23(1), March 1994.

[McC76] E. M. McCreight. A Space-Economical Suffix Tree Construction Algorithm. Journal of the ACM, 23(2):262-272, 1976.

[Tra96] Transaction Processing Performance Council, San Jose, CA. TPC Benchmark D Standard Specification, 1996. Revision 1.2.2.

[WM92] S. Wu and U. Manber. Fast Text Searching Allowing Errors. Communications of the ACM, 35(10):83-91, October 1992.