Indexing for String Queries using Generalized
Search Trees
Jeff Foster
jfoster@cs.berkeley.edu
Megan Thomas
mct@cs.berkeley.edu
Abstract
We present five index trees designed for supporting string searches. We discuss Counted Trees, Substring Trees, and Regular Expression Trees, all of which suffer the same problem -- their attempts at approximating a large set of data lead to an almost complete lack of information in the interior nodes. A variant of B-Trees, called a Prefix-Suffix Tree, avoids this difficulty by severely restricting the form of queries. Finally, we show that very high redundancy also avoids this pitfall, using a Suffix B-Tree.
1 Introduction
Modern benchmarks such as TPC-D [Tra96] contain many "like" queries, a restricted form of string pattern matching. However, current databases only index for prefix and suffix searches. As more users want to ask substring and possibly even regular expression queries, it is becoming important to index for these as well. The most common implementation of general "like" queries is a simple scan. Although this may be acceptable for small relations, we would like to do better for large databases.

We investigated several indexing strategies for supporting various kinds of string queries. We developed our indices in the GiST framework, a generalized search tree package. GiST is an abstraction of several common indexing techniques, including B-Trees and R-Trees, and provides support for concurrency and transactional semantics.
1.1 Description of GiST
A GiST is a balanced tree whose nodes contain sets of entries, which are pairs $(k, p)$ where $k$ is a key and $p$ a pointer. For interior nodes, $p$ points to a child node, and for leaves $p$ contains a pointer to the corresponding tuple in the database. A key $k$ represents a predicate that is true of any tuples reachable from $p$. For example, in B-Trees $k$ is an interval meaning "all tuples reachable from $p$ are within this interval." For R-Trees, $k$ is a two-dimensional rectangle.
GiST also requires the user to define six methods:

• Consistent takes an entry $(k_1, p)$ and a predicate $k_2$ and determines whether both $k_1$ and $k_2$ could simultaneously be true.

• Union takes a set of entries $P$ and returns a new predicate that holds for all tuples reachable from $P$.

• Penalty takes two entries $e_1$ and $e_2$ and returns some value indicating how expensive it would be to insert $e_2$ into the node pointed to by $e_1$.

• PickSplit takes a node and divides its entries into two groups, trying to cluster like entries together.
The other two methods are Compress and Decompress, which we will not discuss. They are used mainly for packing and unpacking keys for storage on disk. Although we will skip them, they are important in practice, because the fanout of a GiST is determined by the disk's page size. Thus, compact disk representations increase fanout. In our experiments, we assume a page size of 8KB.
For a full description of GiST, see [HNP95].
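To make the four-method contract concrete, the following minimal Python sketch shows the interface a GiST extension supplies. The class and method names are our own illustration and do not reflect the actual C interface of the GiST package.

    from abc import ABC, abstractmethod
    from typing import Any, List, Tuple

    class GistKeyMethods(ABC):
        """The four search/insert methods a GiST extension must define.
        (Compress and Decompress, which pack keys onto disk pages, are
        omitted here just as they are in the text.)"""

        @abstractmethod
        def consistent(self, key: Any, query: Any) -> bool:
            """Could 'key' and 'query' hold simultaneously?  Returning
            False lets the search prune the subtree under 'key'."""

        @abstractmethod
        def union(self, keys: List[Any]) -> Any:
            """A predicate that holds for every tuple reachable from 'keys'."""

        @abstractmethod
        def penalty(self, key1: Any, key2: Any) -> float:
            """How expensive it would be to insert key2 into the node
            whose key is key1."""

        @abstractmethod
        def pick_split(self, keys: List[Any]) -> Tuple[List[Any], List[Any]]:
            """Divide an overflowing node's entries into two groups,
            clustering like entries together."""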
1.2 Related Work
There is an extensive literature on string searching. A good general introduction is [Aho90]. One of the earliest indexing techniques for fast substring searches is the suffix tree, a compact representation that encodes every possible suffix of the dataset [McC76]. Krishnan, Vitter, and Iyer have shown how to use a pruned suffix tree to estimate the selectivity of substring searches, though not how to do the actual searches [KVI96].

By combining suffix trees with other data structures, Baeza-Yates and Gonnet allow arbitrary regular expression searching in a relatively compact space [BYG89]. Unfortunately, though that work is designed for a very large dataset, the Oxford English Dictionary, it does not support transactional semantics; the OED does not often change.
All but one of the trees we describe are based on the following idea: given a set of strings, we want some compact, approximate representation of that set. Thus, we want to find patterns in the strings. Learning patterns from sets of strings is an essential task in the world of genetic databases; see [BJEG95] for a good survey of this area. In that setting, approximate matching is often good enough; [WM92] describes the agrep tool, which allows approximate searches for regular expressions. See [KM95] for a much more theoretical treatment of approximate matching.

Of course, searching for substrings and regular expressions is not all there is to textual databases. [Loe94] provides a fairly general overview of some techniques that treat data as structured text, rather than as byte strings. [GNU94] describes a framework that adds string searching and manipulation primitives to database queries. That work may be too theoretical to be realistic, but it must be good because the paper uses the word "stringological."
2 The Trees
We have designed five kinds of GiST trees to support string queries. Counted Trees treat strings as bags of letters. Substring Trees use longest-common substrings as keys. Regular Expression Trees use approximate regular expressions as keys. Prefix-Suffix Trees work like B-Trees but allow simultaneous prefix and suffix queries. Finally, Suffix B-Trees are B-Trees with replicated keys that provide fast substring queries.
Throughout the rest of our discussion, we will use the following terminology. Let $\Sigma$ be a finite alphabet from which the strings are drawn, and let $\epsilon$ be the empty string. Define $n = |\Sigma|$, and number the elements of $\Sigma$ with $0$ through $n-1$. If $x$ is a string, $x_i$ is the $i$th character of $x$, $0 \le i < |x|$. We use this same subscript notation for an element of an $n$-vector.

For each tree, we first describe the key data structure, then define the Union, Penalty, PickSplit, and Consistent methods. We define these in terms of keys rather than entries in order to make things slightly simpler. Whenever we use the word key where GiST requires an entry, we really mean an entry containing that key and some pointer.
3 Counted Trees
The simplest tree we considered is a Counted Tree. This tree treats strings as bags of letters, and at interior nodes simply stores the maximum number of times each particular letter appears in any single word below this node. Each key $c$ is a mapping $\Sigma \to \mathbb{N}$, where $c(a)$ is the maximum frequency of $a$ in any sub-key. We represent this information as a vector $c = (c_0, \ldots, c_{n-1})$ of integers. To keep the vectors short, in the implementation we ignore case and punctuation.
The methods are as follows:
• Union$(c, d)$ is the pointwise maximum of the two vectors: $(c \cup d)_i = \max(c_i, d_i)$.

• Penalty$(k_1, k_2)$ is just $\|(k_1 \cup k_2) - k_1\|_1$, i.e., how much the sum of all the counts stored in $k_1$ will increase by adding $k_2$.

• PickSplit uses Guttman's quadratic split algorithm, with the penalty metric as stated above [Gut84]. Let $P = \{k_1, k_2, \ldots, k_n\}$ be a set of keys, and let $P_1$ and $P_2$ be the output of PickSplit. We compute for each pair $(k_i, k_j)$ the distance between the two keys, using the $L_1$ norm (the Manhattan distance). The keys that are farthest apart are chosen as seeds, and one is added to $P_1$ and the other to $P_2$. We then distribute the remaining keys to $P_1$ and $P_2$ according to Penalty.
This tree supports the sub-multiset query: Consistent takes a key $k$ and a string $x$, and returns true if the frequencies of the letters in $x$ are at most the frequencies in $k$. By storing the actual strings at the leaves, we can restrict this further to substring queries.
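As an illustration, here is a minimal Python sketch of the Counted Tree key operations just described, assuming keys are 26-entry count vectors over a case-folded alphabet; the function names are our own.

    def make_key(word: str) -> list:
        """A leaf key: per-letter frequencies, ignoring case and punctuation."""
        counts = [0] * 26
        for ch in word.lower():
            if 'a' <= ch <= 'z':
                counts[ord(ch) - ord('a')] += 1
        return counts

    def union(c: list, d: list) -> list:
        """Pointwise maximum: (c u d)_i = max(c_i, d_i)."""
        return [max(ci, di) for ci, di in zip(c, d)]

    def penalty(k1: list, k2: list) -> int:
        """L1 increase in k1's counts caused by absorbing k2."""
        return sum(u - a for u, a in zip(union(k1, k2), k1))

    def consistent(key: list, query: str) -> bool:
        """Sub-multiset test: every letter of the query occurs at least
        as often in the key."""
        return all(q <= k for q, k in zip(make_key(query), key))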
3.1 Performance
Clearly this tree is only a slight generalization of a tree that stores sets at the leaves and uses set union as the Union method. Given the theoretical inefficiency of such a workload, as documented in [HKP97], we knew from the beginning that this probably would not work. One potential saving grace might be that $n$, the size of $\Sigma$, is fixed, and thus the argument in [HKP97] about subset queries does not directly apply.

Unfortunately, it still fails to be efficient in much the same cases. The only time an interior node contains any useful information is when a string with an infrequent letter appears below that node. When tested on a file containing words from an English dictionary, nearly all of the keys contain the common letters and few contain the uncommon ones. Searching for "xylo" works very well; searching for "ate" requires scanning the entire index structure.
Figure 1: Clustering in a counted tree. [3-D plot titled "Counted Tree Clustering": pairwise distances under the Penalty metric between keys on one leaf page; both horizontal axes are leaf number (0-15), the vertical axis is distance (0-10).]
We had hoped that, while the Counted Tree itself would probably not be useful for searching, it would produce interesting clustering at the leaf nodes. Data clustered by this tree could be bulk-loaded into a different type of tree that might be good at searching but poor at clustering. Unfortunately, the Counted Tree had little success at clustering interesting datasets. Figure 1 shows a 3-dimensional plot of the distances, according to the Penalty metric, between keys stored on the same page. This dataset was a file of English words. If the clustering had worked well, then distances on this page should be relatively small. As the graph indicates, they vary between 0 and 10. Since the maximum word length on this page is only 11 and the average length over the whole dataset is 7, having keys on the same page be 10 apart is not very good.

We believe this happens because, after accumulating tens of strings in a node, the key for that node contains a vector that is near (in the $L_1$ sense) almost any word. All of the common letters are present, as are scattered uncommon ones. Therefore virtually any word can be added to the node without changing the key very much. Although this word may be close to the key, it is not necessarily close to specific words in the node.
4 Substring Trees
Among the most common string queries are substring searches. Therefore, it seems natural to try to index for these by storing substrings as keys. In a Substring Tree, each key is the longest common substring (LCS) of the sub-keys.

One possible definition of $\mathrm{LCS}(x, y)$ would be a string $z$ such that $x = \alpha z \beta$ and $y = \gamma z \delta$, where Greek letters stand for arbitrary strings. However, that definition seems overly restrictive. Instead, we use the longest common subsequence definition from [CLR90]. Rather than restating this formally, we shall just say that $\mathrm{LCS}(x, y)$ is a string $z$ where all the letters of $z$ appear in sequence in both $x$ and $y$, except that an arbitrary number of other letters are allowed in between. For example, $\mathrm{LCS}(joe, jeff) = je$.
This looser definition is still impractical on its own, because unless the two strings are nearly identical we lose all information -- anything could appear in the gaps. Accordingly, we record the maximum distance between any two sequential letters of the LCS, plus the distances from the beginnings and endings of the words. Thus, in our system, $\mathrm{LCS}(joe, jeff) = (0, j, 1, e, 2)$, meaning that $j$ always appears first, there is at most one letter between $j$ and $e$, and at most two letters to the right of $e$. Our algorithm for computing this information is a straightforward modification of the one given in [CLR90].

We write an LCS as a pair $(x, d)$, where $x$ contains the letters of the longest common subsequence and $d$ contains the distances between the letters, e.g. $\mathrm{LCS}(joe, jeff) = (je, (0, 1, 2))$.
The GiST methods for this tree are

• Union$(k_1, k_2) = \mathrm{LCS}(k_1, k_2)$, as just described.

• Penalty$(k_1, k_2)$. Let $k = (x, d) = \mathrm{LCS}(k_1, k_2)$. Penalty returns $|x| \times 2^{l}$, where $l = \sum_{i=0}^{|d|} 2^{d_i}$. The idea here is that longer strings are better, but that a larger gap between letters is exponentially bad, since any letter could fill in the gap, or there might be no letter in the gap.

• PickSplit again uses the quadratic split algorithm as we described before. This time the distance between two keys $k_1$ and $k_2$ is measured as $\mathrm{Penalty}(k_1 \cup k_2) - \mathrm{Penalty}(k_1)$, the increase in the penalty value when $k_2$ is added to $k_1$.

The query that this tree supports is substring search. Consistent takes a key $k$ and a string $x$, and returns true if $x$ could appear as a substring of any of the strings $k$ represents. At the leaves, this is the usual substring search, but we need to take the gaps into account in the interior nodes. Fortunately, a minor, and slightly more efficient, modification of the LCS algorithm can handle this.
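The following Python sketch shows one way such a key could be computed for a pair of strings: the standard dynamic-programming LCS of [CLR90], extended to recover the gap vector from one alignment. It is our own illustration of the idea, not the paper's implementation.

    def lcs_with_gaps(x: str, y: str):
        """Longest common subsequence of x and y plus the gap vector d:
        d[0] bounds the letters before the first LCS letter, d[t] the
        letters between LCS letters t-1 and t, d[-1] the letters after
        the last one."""
        m, n = len(x), len(y)
        # Standard suffix-based LCS length table [CLR90].
        L = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m - 1, -1, -1):
            for j in range(n - 1, -1, -1):
                L[i][j] = (1 + L[i + 1][j + 1] if x[i] == y[j]
                           else max(L[i + 1][j], L[i][j + 1]))
        # Recover one alignment: positions of the LCS letters in x and y.
        z, px, py = [], [], []
        i = j = 0
        while i < m and j < n:
            if x[i] == y[j]:
                z.append(x[i]); px.append(i); py.append(j)
                i += 1; j += 1
            elif L[i + 1][j] >= L[i][j + 1]:
                i += 1
            else:
                j += 1
        if not z:
            return "", []
        d = [max(px[0], py[0])]                  # letters before first match
        for t in range(1, len(z)):
            d.append(max(px[t] - px[t - 1], py[t] - py[t - 1]) - 1)
        d.append(max(m - 1 - px[-1], n - 1 - py[-1]))  # letters after last
        return "".join(z), d

    # lcs_with_gaps("joe", "jeff") == ("je", [0, 1, 2])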
4.1 Performance
Unfortunately, although the Consistent method is efficient, the keys it operates on contain little information. In a moderately varied dataset, the probability of having a long LCS for an arbitrary pair of keys is low. Because each key is the LCS of many other keys, there is little chance that the LCS will, in fact, contain any string.

Figure 2 shows a series of histograms capturing the trend towards less and less information in LCSs as the number of sub-keys increases. Each diagram shows how many LCSs of $n$ distinct strings have a given length. When $n$ is 1, the length of the LCS is simply the string length. The data is 102 random words from an English dictionary. As can clearly be seen, the fraction of LCSs containing a non-empty string is negligible once more than a few words are unioned. When taking the LCS of an 8KB page, which may contain hundreds of words, there is little clustering any penalty metric can do. In general, interior nodes contain no information.

This is not to say that there is no clustering in this tree, but simply that the vast majority of the nodes have none. We loaded a 38,470-word spelling dictionary into the Substring Tree and looked at the root node. Out of 20 entries, only four contained keys longer than three letters: superc, ion, lies, and consus.
Figure 2: Histograms illustrating how the length of the LCS of $n$ strings decreases as $n$ increases. The data is 102 random words from an English dictionary. [Five panels, $n = 1$ through $5$; each plots the fraction of LCSs against LCS length, from 0 to 12.]
For each of these, the largest gap between letters was, respectively, nine, five, seven, and six. Thus, for substring searches of five or fewer letters, there would be no pruning of the search at the root node. Considering that the average word length in this file is eight letters, having substring searches of fewer than five letters descend every path from the root is unacceptable. The results were even worse when we tried a data file from the TPC-D benchmark [Tra96] with longer strings -- the average length in this dataset was 19 letters. In this case, loading the data yielded no substrings longer than one letter at the root.
5 Regular Expression Trees
The most general string searches commonly available use regular expressions as patterns. The properties of regular expressions and their correspondence with deterministic finite automata (DFAs) are well known [HU79]. Within the GiST framework, using DFAs as keys struck us as a natural technique for indexing both general regular expression and substring queries.
A Regular Expression Tree is a tree whose keys contain DFAs. The DFA at a key accepts a superset of all the languages accepted by its sub-keys. To check whether a key matches a regular expression query, we intersect the DFA for the query with the DFA at the key and test whether the result is non-empty.
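A minimal sketch of this Consistent test, assuming DFAs are represented as (start state, accepting set, transition dictionary) triples of our own devising: intersect by the standard product construction and check whether any accepting pair is reachable.

    from collections import deque

    def intersection_nonempty(dfa1, dfa2):
        """Do two DFAs accept a common string?  Each DFA is a triple
        (start, accepting_states, trans), where trans maps
        (state, symbol) -> state.  Breadth-first search of the product
        automaton; emptiness holds iff no accepting pair is reachable."""
        (s1, acc1, t1), (s2, acc2, t2) = dfa1, dfa2
        seen = {(s1, s2)}
        queue = deque([(s1, s2)])
        while queue:
            q1, q2 = queue.popleft()
            if q1 in acc1 and q2 in acc2:
                return True          # reachable accepting pair: non-empty
            for (p, sym), r1 in t1.items():
                if p != q1:
                    continue
                r2 = t2.get((q2, sym))   # both DFAs must move on sym
                if r2 is not None and (r1, r2) not in seen:
                    seen.add((r1, r2))
                    queue.append((r1, r2))
        return False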
The union of two keys $M_1$ and $M_2$ is a new DFA $M$ such that $L(M) \supseteq L(M_1) \cup L(M_2)$, where $L(M)$ is the language accepted by DFA $M$. The easiest way to generate such a DFA is to use the standard union algorithm [HU79] and then minimize the result. If we apply only this algorithm, the $\supseteq$ becomes an equality. Unfortunately, recall that the fanout of GiST trees is determined by the number of keys that can fit on a disk page. Even compactly represented DFAs tend to be bulky. Thus, if we only use the standard algorithms for minimization, it may not be possible to fit a DFA within a page, which defeats the purpose of having an index. Additionally, the union and minimization algorithms are, respectively, $O(n^2)$ and $O(n^3)$ in the number of states. While CPU time is not normally important, if intermediate DFAs are allowed to get large this time can become quite significant.
In order to produce small DFAs, we resort to approximation. Suppose that DFA $M$ takes up too much space. Then create a DFA with one fewer state by picking a pair of states of $M$ and collapsing them. Repeat until the DFA is small enough. If this still is insufficient, we can also make the following approximation: suppose some state $s$ has two transitions to state $t$, one when the input symbol is $a$ and one when the input symbol is $c$. Rather than store two separate transitions, we can store one transition with label $[a\text{-}c]$, meaning take the transition on symbols $a$, $b$, and $c$.

In order to pick the states to collapse, we use the following simple-minded heuristic: we pick a state $q$ at random, and then we pick a state $r$ such that $q$ has a transition to $r$. If there is no such state, we collapse $q$ with a random state. This heuristic is clearly not optimal; even so, the approach shows absolutely no signs of working, even on a relatively small set of words.
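The sketch below shows one way the collapsing step could work, under our own DFA representation (a dictionary mapping (state, symbol) to state). Collapsing two states can break determinism, so conflicting targets are merged in turn; the resulting quotient automaton accepts a superset of the original language, which is what the key invariant requires. Accepting-state sets would be merged the same way.

    def collapse_states(trans, q, r):
        """Collapse states q and r, then repair determinism by further
        collapsing any targets that conflict on the same input symbol.
        'trans' maps (state, symbol) -> state.  A sketch, not the
        paper's code."""
        parent = {}

        def find(s):                 # union-find representative
            while s in parent:
                s = parent[s]
            return s

        pending = [(q, r)]
        while pending:
            a, b = pending.pop()
            a, b = find(a), find(b)
            if a == b:
                continue
            parent[b] = a            # merge b into a
            # Re-scan all edges under the new state identification:
            seen = {}
            for (src, sym), dst in trans.items():
                key = (find(src), sym)
                d = find(dst)
                if key in seen and seen[key] != d:
                    pending.append((seen[key], d))   # conflict: merge targets
                else:
                    seen[key] = d
        # Rewrite the transition table over representatives.
        return {(find(src), sym): find(dst)
                for (src, sym), dst in trans.items()}

    # trans = collapse_states(trans, q, r)  # q, r chosen by the heuristic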
Basically, this tree suffers from the same difficulties as the Counted Tree and the Substring Tree. In order to have a small, accurate regular expression accepting a large set of strings, the strings have to have many commonalities. For a database like a set of words from the English language, that does not happen. Figure 3 shows three snapshots of the union method taken after four, 64, and 128 strings were unioned and then minimized. These strings were the first 128 from a list of English words in alphabetical order. The topmost DFA has perfect information, but by the time only 64 strings have been thrown in, the union method has started making large approximations. By the time we get to 128 strings, the DFA contains no information.

Figure 3: A progression of three finite automata created by the union method. The topmost DFA is the union of four strings, the middle the union of 64 strings, and the bottom the union of 128 strings. [The bottom automaton has collapsed to a single state that loops on $[a\text{-}z]$.]
6 Prefix-Suffix Trees

A B-Tree [Com79] can efficiently support prefix queries on strings. The keys of the B-Tree are ranges of strings, and a prefix $x$ matches a range $[y, z]$ if either $x$ is lexically ordered between $y$ and $z$ or $x$ is a prefix of $y$. By reversing strings before inserting them, B-Trees can handle suffix queries in the same fashion.
We propose a new kind of tree, a Prefix-Suffix Tree, which can handle simultaneous prefix and suffix queries. Rather than storing plain strings or reversed strings, we store folded strings. Let $x$ be a string of length $m$. Define $\mathrm{fold}(x)$ as a string of length $m$ such that
$$\mathrm{fold}(x)_{2i} = x_i, \qquad \mathrm{fold}(x)_{2i+1} = x_{m-i-1} \text{ if } 2i+1 < m,$$
for $0 \le i < \lceil m/2 \rceil$. For example, $\mathrm{fold}(megan) = mneag$.
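A direct transcription of the definition into Python (our own sketch):

    def fold(x: str) -> str:
        """Interleave a string with its reverse: prefix characters at
        even positions, suffix characters at odd positions.
        fold("megan") == "mneag"."""
        m = len(x)
        out = []
        for i in range((m + 1) // 2):        # 0 <= i < ceil(m/2)
            out.append(x[i])                 # fold(x)[2i]   = x[i]
            if 2 * i + 1 < m:
                out.append(x[m - i - 1])     # fold(x)[2i+1] = x[m-i-1]
        return "".join(out)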
Aside from this modification, the tree works just like a B-Tree:

• Union takes two intervals and returns the smallest interval containing both.

• Penalty$(k_1, k_2)$ returns how much the interval $k_1$ will expand if $k_2$ is added.

• PickSplit$(P)$ divides $P$ into $P_1$ and $P_2$, where the first half of $P$ goes in $P_1$ and the second half goes in $P_2$.
A more thorough description of an implementation of B-Trees with GiST appears in [HNP95].

The Consistent method for this tree takes a key $k$ and a query of the form $x\Sigma^* y$. Strings $x$ and $y$ are folded together, and Consistent returns true if the folded string is a prefix of $k$. The only tricky aspect is that $x$ and $y$ may be of different lengths, in which case the folded string must be truncated. For example, the query $t\Sigma^* as$ is folded into $ts$. There is no point in representing it as $ts\Sigma a$, since in a prefix search this matches anything $ts$ matches. Only when comparing a query to a leaf node do we necessarily do a full comparison on $x$ and $y$.
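A sketch of the query-folding step, following the truncation rule just described (again our own Python, with hypothetical names):

    def fold_query(x: str, y: str) -> str:
        """Fold a query x(Sigma*)y into the prefix that Consistent
        compares against keys: alternate prefix characters with suffix
        characters taken from the end of y, truncating at the first
        position whose character the query does not determine."""
        out = []
        for i in range(len(x)):
            out.append(x[i])                 # even position 2i: prefix char
            if i >= len(y):
                break                        # odd position unknown: truncate
            out.append(y[len(y) - 1 - i])    # odd position 2i+1: suffix char
        return "".join(out)

    # fold_query("t", "as") == "ts"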
6.1 Performance
In contrast with the last three ideas we discussed, this tree does work well for many queries. In general, as long as $x$ and $y$ are long enough, the tree will be able to eliminate interior nodes from consideration. Specifically, the length of the fold of a query $x\Sigma^* y$ is $2 \times \min(|x|, |y|) + [\,|x| > |y|\,]$, where the bracket is 1 if $|x| > |y|$ and 0 otherwise. The last term takes into account that the folding begins with the prefix. If this quantity is too small, e.g. if $x = \epsilon$, then, even if $y$ is long, the folded query string will match any interior node.

However, if all the queries were to omit the prefix, then presumably a B-Tree with reversed strings would be the right index to use. If we restrict the queries to have $|x| = |y|$, then the new tree is optimal. The moral here is that by sacrificing the generality of full substring search, we were able to achieve good performance. Further, while this tree cannot handle full regular expression queries, it is capable of answering queries a normal B-Tree cannot.
7 Suffix B-Trees

Our last tree is what we term a Suffix B-Tree. As we stated previously, B-Trees can efficiently support prefix queries. In a Suffix B-Tree, we take advantage of this efficiency by turning all substring searches into prefix searches.

Rather than inserting each string once into a B-Tree, we insert it many times. For every string $x$ in the original data file, we insert $x$ and all suffixes of $x$, keeping the same pointer for each insertion. For example, if the original key was $foo$ with pointer 42, then we insert the entries $(foo, 42)$, $(oo, 42)$, and $(o, 42)$. Aside from that, we make no changes to the regular B-Tree implementation.
After doing this, substring searches are easy. A substring $s$ appears in $x$ iff $s$ is a prefix of some suffix of $x$. Thus, to search for a substring $s$, we do a prefix search. Since prefix searches in B-Trees are extremely efficient, now so are substring searches.
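A toy sketch of the idea in Python, with a sorted list standing in for the real B-Tree; insert stores every suffix with the original pointer, and substring search becomes a prefix range scan. The class and method names are our own.

    import bisect

    class SuffixBTree:
        """Every suffix of every inserted string is stored with the
        original tuple pointer, so a substring search becomes a prefix
        (range) search over suffixes."""

        def __init__(self):
            self.entries = []                # sorted (suffix, pointer) pairs

        def insert(self, key: str, ptr: int) -> None:
            for i in range(len(key)):        # e.g. foo -> foo, oo, o
                bisect.insort(self.entries, (key[i:], ptr))

        def search_substring(self, s: str):
            """All pointers whose string contains s: a prefix scan from s."""
            lo = bisect.bisect_left(self.entries, (s, float("-inf")))
            ptrs = set()
            for suffix, ptr in self.entries[lo:]:
                if not suffix.startswith(s):
                    break
                ptrs.add(ptr)
            return ptrs

    # t = SuffixBTree(); t.insert("foo", 42); t.search_substring("oo") -> {42}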
The reason this tree works where our first three trees failed is redundancy. If the original dataset contained $m$ strings with an average length of $l$, then the Suffix B-Tree will contain $lm$ leaf-level entries, and the height of the tree grows correspondingly. In general this is unacceptable for large values of $l$, though for a small, fixed $l$ it works wonderfully.
8 Future Work
In [HKP97], it has been shown that set inclusion workloads have potentially the worst possible access overhead, while B-Trees have the best. We believe that the substring queries we have looked at are an important intermediate point on this spectrum, and we plan to look for some theoretical, rather than statistical, justification for why low redundancy does not work unless the form of queries is severely restricted. Additionally, there should be some middle ground: it should be possible to restrict the form of queries a little and add a little redundancy to get, from an engineering perspective, the optimal balance between expressiveness and redundancy.
There is a large body of string data for which $|\Sigma|$ is very small: genetic data. Although in theory a small finite number is no different from a large finite number, in practice there is quite a difference. Perhaps by restricting $|\Sigma|$ to be small we could build better indices.

Finally, it may just be that GiST is not the right framework for building string indices. As we stated before, there is a significant body of related work, though none of the solutions we have seen support transactional semantics. Another approach would be to start from one of those solutions and try to add concurrency and locking.
9 Conclusions
We have described five kinds of index trees that were designed for string searches. The first three trees we discussed -- Counted Trees, Substring Trees, and Regular Expression Trees -- all fell into the same trap. Counted Trees only work for queries containing infrequent letters and fail completely at clustering. Substring Trees and Regular Expression Trees both contain virtually no information in the interior nodes.

The only trees that work well are the last two we described: Prefix-Suffix Trees and Suffix B-Trees. Prefix-Suffix Trees succeed by severely restricting the form of queries. Suffix B-Trees avoid the trap through very high redundancy.
These last trees sidestep a fundamental problem
with the others: small approximations of large
sets of strings contain little information.
Acknowledgments
Many thanks to Joe Hellerstein and Steve Fink
for listening to our crazy ideas and making good
suggestions.
References
[Aho90] A. V. Aho. Algorithms for Finding Patterns in Strings. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science, volume A, pages 257-300. Elsevier Science Publishers, 1990.

[BJEG95] A. Brāzma, I. Jonassen, I. Eidhammer, and D. Gilbert. Approaches to the automatic discovery of patterns in biosequences. Technical Report No. 113, Department of Informatics, University of Bergen, Bergen, Norway, December 1995.

[BYG89] R. Baeza-Yates and G. Gonnet. Fast Text Searching for Regular Expressions or Automaton Searching on Tries. In G. Ausiello, M. Dezani-Ciancaglini, and S. Ronchi Della Rocca, editors, ICALP '89, volume 372 of Lecture Notes in Computer Science, pages 46-62. Springer-Verlag, Stresa, Italy, July 1989.

[CLR90] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. McGraw-Hill, 1990.

[Com79] D. Comer. The Ubiquitous B-Tree. ACM Computing Surveys, June 1979.

[GNU94] G. Grahne, M. Nykänen, and E. Ukkonen. Reasoning about Strings in Databases. In Proceedings of the 13th ACM Symposium on Principles of Database Systems, pages 303-314, Minneapolis, MN, 1994.

[Gut84] A. Guttman. R-trees: A Dynamic Index Structure for Spatial Searching. In Proc. 1984 ACM-SIGMOD Conference on Management of Data, Boston, MA, 1984.

[HKP97] J. Hellerstein, E. Koutsoupias, and C. Papadimitriou. On the Analysis of Indexing Schemes. In PODS '97, 1997. To appear.

[HNP95] J. Hellerstein, J. Naughton, and A. Pfeffer. Generalized Search Trees for Database Systems. In Proceedings of the 21st VLDB Conference, Zurich, Switzerland, 1995.

[HU79] J. Hopcroft and J. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, 1979.

[KM95] J. Knight and E. Myers. Approximate Regular Expression Pattern Matching with Concave Gap Penalties. Algorithmica, 14(1):85-121, July 1995.

[KVI96] P. Krishnan, J. Vitter, and B. Iyer. Estimating Alphanumeric Selectivity in the Presence of Wildcards. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data (SIGMOD '96), pages 282-293, Montreal, Canada, May 1996.

[Loe94] A. Loeffen. Text Databases: A Survey of Text Models and Systems. ACM SIGMOD Record, 23(1), March 1994.

[McC76] E. M. McCreight. A Space-Economical Suffix Tree Construction Algorithm. Journal of the ACM, 23(2):262-272, 1976.

[Tra96] Transaction Processing Performance Council, San Jose, CA. TPC Benchmark D Standard Specification, 1996. Revision 1.2.2.

[WM92] S. Wu and U. Manber. Fast Text Searching Allowing Errors. Communications of the ACM, 35(10):83-91, October 1992.