Документ взят из кэша поисковой машины. Адрес оригинального документа : http://sp.cs.msu.ru/courses/match/LEAPS/IndexSupport93.pdf
Дата изменения: Thu Oct 30 21:24:47 2003
Дата индексирования: Mon Oct 1 22:59:02 2012
Кодировка:

Index Support
David Applied Research A. Brant

for Rule Activation
and Daniel P. Miranker Sciences Department

Laboratories The University Austin,

and Computer

of Texas at Austin TX 78713-8029 835-3381 edu

P.O. BOX 8029 (512)

brant@cs.utexas. ABSTRACT
Integrated rule and database systems are quickly moving from the research laboratory into commercial systems. However, the current generation of prototypes are designed to work with small rule sets involving limited inferencing. The problem of supporting large complex rule programs within database management systems still presents significant challenges. The basis for many of these challenges is providing support for rule activation. Rule activation is defined as the process of determining which rules are satisfied and what data satisfies them. In this paper we present performance results for the DATEX database rule system and its novel indexing technique for supporting rule activation. Our approach assumes that both the rule program and the database must be optimized synergistically. However, as an experimental result we have determined that DATEX requires very few changes to a standard DBMS environment, and we argue that these changes are reasonable for the problems being solved. Based on the performance of DATEX we believe we have demonstrated a satisfactory solution to the rule activation problem for complex rule programs operating within a database system. 1

INTRODUCTION

The integration of rule and database systems, sometimes referred to as an active database, is an important step forward for both expert systems and database management systems (DBMSS). Several experimental systems have been developed (Hanson [1991], McCarthy and Dayal [1989], Sellis, Lin, and Raschid [1989], Stonebraker, Rowe, and Hiroharna [1990], Widom, Cochrane, and Lindsay [199 l]). It has been argued that to achieve rule and database consistency, the only viable approach is to incorporate the rules and the rule activation mechanism into the DBMS (StonebraJcer [1992]). For alerters, triggers, and simple rule systems this is already being accomplished. In fact, several commercial DBMSS already provide a rule capability (INGRES [1990], Sybase [1990]). However, the computational and storage demands of alerters and triggers are trivial. compared to those imposed by medium to large scale expert systems operating within a DBMS. Perhaps this is also why so few of the database rule language projects deal with the hard problem of supporting a general rule

This work was supported in part by the Office of Naval Research, Grant NOO014-90-J-1366, dre ARL:UT Internal Research and Development Program, and the State of Texas.
Permission granted direct title that to copy that without fee ell or part are not made copyright and appaar, of this material is for provided commercial of the copying specific the copies or distributed notice notice

one where a large number of complex rules fire in a cyclic fashion for hundreds or thousands of cycles (or perhaps indefinitely). Stonebraker has referred to these as "hard expert systems" and suggests that, due to the enormous demands of these systems, they will likely remain outside the DBMS for the foreseeable future. Our work is directed precisely toward supporting these complex rule programs inside a tightly integrated database rule environment. The system is caIIed DA~X. A critical salient feature of hard rule systems, overlooked by many, is that they are not dynamic. They usually involve an investment of several man-years for initial development. The cost and dumtion of problems solved by hard expert systems are commensurate with this high development COSL which implies a recurring problem with indefinite life cycle costs (Kerschberg [1987], Kerschberg [1988]). We agree, even assert, that DBMS supprt of hard rule systems will require some explicit built-in mechanisms and will have a permanent effect on the DBMS. The cost and duration of the problems solved leads users of these systems to provide significant resources to achieve a solution. Techniques which require additional or non-standard DBMS support are warranted and should be investigated. Moreover, these systems present an unprecedented opportunity for database optimization techniques, especially multiple query optimization. We believe DATEX demonstrates that the resources and functionality necessary to support hard problems are quite reasonable and demonstrate the feasibility of incorporating these programs into a DBMS, Since performance is the ltilting factor for these hard systems, we have not been overly concerned with the semantics or syntax of sethe rule language. Instead, we have developed an evolutionary quence of rule compilers, activation methods, and execution environments. Lessons from each system were derived through the analysis of a suite of scalable benchmark programs, representative of expert systems. While the semantics of the rule language can certainly affect performance, many of the issues are orthogonal, especially if one assumes a fairly powerful and expressive language. The key aspects of DATEX rule activation are specialized irrdices, derived from a compile-time analysis of the rule program, and law matching. The rule activation problem has been defined as determining which rules to activate and when to activate them. This problem must be addressed whenever a record is added or modified. Rule activation is assumed to include both the identification of which rules to attempt to satisfy, and tinding the database records that actually satisfy those rules, i.e., the rule's irrstarttiaIions. For hard expert systems we assume that rules may allow an arbitrarily complex join to be specified by the condition elements. In this environment, rule activation becomes quite challenging, and can easily imply many multiway joins per database update. Stonebralcer [1992] suggests a taxonomy of approaches to the rule activation problem -- brute force, discrimination networks, and marking schemes. Brute force is easily excluded as a viable approach for large rule programs; it simply is not computationally tractable. Discrimination networks are used extensively in expert system shells, However, we have shown that traditional Rete (Forgy [1982]) and TREAT (Miranker [1990]) discrimination net-

program environment,

advantage,

the ACM

and the is given a fae

publication To copy

and its date otherwise,

is by permission permission.

of the Association or to republish,

for Computing raquires

Machinery. and/or

SIGMOD 151931Washington, 01993 ACM ~.8979J.~92-~/93

DC, USA /000~/0042...

$j .50

42

works are not appropriate for large scale expert systems due to their excessive space and time requirements, and that they are consistently outperformed in both time and space by LEAPS, our lazy matching technique (Miranker and Brant [1990], Miranker, Bran~ Lofaso, and Gadbois [1990], Brant, Grose, Lofaso, and Miranker [1991]). The fundamental weakness of state-saving discrimination networks like Rete and TREAT is that they do not scale well to solve large, data-intensive problems. The space required by their saved state information, and the time to process it, tends to grow as a polynomial of the size of the base relations (h-ant Grose, Lofaso, and Miranker [1991]). LEAPS avoids much of the work and storage done in Rete and TREAT by using reentrant update sensitive, data streams in rule activation as opposed to eagerly expanding a rule's entire search space during rule activation. In doing so, LEAPS is inherently a marking scheme. Marking hm been used effectively to provide high performance in other database rule systems (Stonebraker., Rowe, and Hirohama [1990]) and is intrinsic to the lazy matching technique and DATEX. Marking provides an extremely efficient way to identify the satisfied rules corresponding to a database update and is usually implemented by bit encoding at the base relation record level. In DA~X however, indirect markers of several varieties are used. First an index is built on the join attributes of the rule's conditions, and markers are kept with the index records to identify which database records satisfy the constants in the conditions. This is most appropriate for rule systems that change infrequently. We call this afiltering indez. Second, an ordered set of tuple identifiers is kept to indicate the database records that need to be processed for rule activation. The ordering is based on a given search heuristic. And, third, an ordered set of search vectors, composed of cursors, is kept for each non-empty data stream. These keep the current entry point for each partially expanded search space corresponding to a rule activation data stream. In evaluating the efficiency of DKI'EX we believe that the usual database cost metrics, storage space and disk accesses, continue to be the correct measures. These metrics have been used to guide our design and experiments. We first present an architectural overview of DATEX in section 2, followed by a description of, and a rationale for, the index used in rule activation. Section 4 presents the performance Esults and compares them to main-memory implementations. Section 5 summarizes our contributions and describes current and future directions. 2

language with object-otiented extensions has been specified and is in the next version. In order to facilitate this being implemented change, the role language compiler is designed to create and operate on an intermediate form that is language independent. Thus, multiple rule language front-ends can be easily accommodated. The current underlying DBMS is Genesis, an extensible database system that provides a high degree of flexibility in a rapid prototyping environment (Batory et al. [1988]). It was originally thought that the rule activation techniques used in DA~X would require extensive hooks and modifications to the DBMS. However, it turns out that few extensions are needed to the standard relational DBMS in order to provide support for DAXEX. In fact, a standam-i SQL interface is sufficient for correct behavior and the version now in development is based on Oracle. We note that the use of parallel query processing and lightweight-process cmnmunication will most likely be needed in the objective DAI13X system. This is being explored in conjunction with the work repn-ted on in this paper. The rule compiler takes the rules and the schema as input and generates the object code for the rules and an augmented schema as output. The schema is augmented to support the efficient processing of the rule activation mechanism and is specific to the rules that were compiled. That is, the changes to the schema are rule dependent. The augmented schema is then used by Genesis when ~rocessing the rules. The s;hema is augmented in several ways, They are: . Create ajltering index on the augmented base relations. This index is presented in detail in the next section. . For each base relation referenced by a rule condition, augment it by adding a timetag attribute. These relations are then stored in order by the value of the timetag. Tnetags are strictly increasing integers assigned when records are added or modMied. This is unnecessary if the DBMS assigns unique tuple identi6ers. , For each augmented relation that is referenced in a negated condition, c;eate a second relation which will shaabw~he original. Shadow relations are only needed to support negation and their use is described in Miranker, Brant, Lofaso, and Gadbois [1990]. In a version-based DBMS, even this extension is not needed. The compiled rules and the augmented schema are then used by Genesis to activate the rules. The primary database support for rule activation is the filtering index. 3

DATEX ARCHITECTURE A FILTERING INDEX TO SUPPORT RULE ACTIVATION

Figure 1 shows the DATEX architecture. While the current rule language of DATEX is 0PS5 (Forgy [1981]), an SQL derived

Rule Object Code

Augmented DB Schema

t

t
Genesis DBMS
I

I
FIGURE

1. DATEX

Architecture

Rule activation in DA~X is supported through an index designed to provide a value-based join path combined with specific information pertaining to the constant tests within conditions of a rule. The design of the filtering index has been guided by several types of information which may be used to accomplish efficient rule activation. Much of the work that has been done in the expert system domain is applicable (McDermott, Newell, and Moore [1978], Gupta [1987], Miranker [1990], Bran~ Grose, Lofaso, and Miranker [1991]). At a minimum, we may learn from the in-depth studies that characterize the rules in hard expert systems. From these studies we know that for a typical rule program 1. from one rule firing to the nexi most of the database is unchanged, 2. most rule actions result in changes to the database (but the number of changes is small), 3. rule actions that modify existing records are much more common than those which add new records, 4. the changes have a small scope (few rules affected), 5. at any given time, only a small subset of rides are activated, 6. most condition elements contain constants, 7. we can effectively use the constant tests within conditions to identify rules that are NOT active, 8. constant tests provide good selectivity, i.e., they are highly selective, 9. rules are com~osed of 2-16 conjunctive elements. 10. rules have 2 ~ 5 join attributes~and

43

11. the join selectivity on records that satisfy the constant tests is medium, i.e., about 0.5. These pieces of information have led to various breakthroughs and improvements in rule activation within expert system shells. The first and foremost of these was the development of discriminafilter the tion networks such as Rete and TREAT. These networks
data based on constant tests within each condition and store the results in the a-memories. These provide links to the base relations

and as such are indices. In addition to forming an index on the base relations, the a-memories can be used to identify which rules are not satisfiable. That is, if a rule's constant tests are not satisfied by at least one base relation record, then do not perform the joins. This provides a powerful filter to determine which joins need to be performed. It has been shown that the use of TREAT provides an effective rule activation mechanism in a database context, and it is used in the Ariel system (Wang and Hanson [1990], Hanson [1991]). While discrimination networks provide a powerful tool in rule activation, the lazy matching technique used in the LEAPS approach has dramatically surpassed them in performance. Even more importantly for databases, LEAPS uses much less space than discrimination networks for its internal state. Less state means less state maintenance, which implies faster state maintenance. When replacing a discrimination network with LEAPS it was recognized that constant tests would still be required for any efficient rule activation since they provide such a powerful filter on the joins to be performed. Therefore, we needed to keep the information provided by the u-memories of the network. However, it wasn' t at all clear that the et-memories themselves were needed. This issue led to the development of a filtering index (FI). Even though et-memories provide an index to the base relations, they may be of little advantage in computing a join on those records, since they are not organized by join attribute value. One solution would be to implement the rx-memories as a file structure and build a join attribute, vahre-based, index on top of them. This was the solution used in Ariel. However, since updates are so common in these systems, we argue that the overhead to support join indices on top of rx-memories is not justified. Instead, we looked for a technique that would involve a single index, which would require no more entries than c&memories, and still provide both constant test filtering and join support. The result is an attribute-based index clustered by values on a given attribute. The FI file records are defined as (rel, attr,val.tt, s_bitO, .. ..s_bit31). where rel identifies a base relation, attr identifies a join attribute in rel, val is a value for attr, tt is the timetag of the record in rel having the value val, is a bit string defined below. and S_bitO,. . . . ,s_bit31 In discrimination networks, when a tuple is inserted into a base relation, a sequence of constant tests are performed, one test for each constant that appears in the rules' conditions. For FI records the constants are enumerated and mapped to a bit vector (s_bito,...,s_bit31). Whereas discrimination nets employ a linear (in the number of rules) method to update ct-memones, DATEX uses a log time algorithm (in the number of constants in the rule source) to determine the bit string (Nishyama et al. [1992]). In general, a record is added to the FI file for each join attribute of that tuple. The purpose of the bil string is to avoid accessing base relation records that do not satisfy the constant tests in a given predicate during the join operation, and thus replace the function of the amemoryl. Through analysis
of our benchmark suite it was deter-

the FI. Moreover, the El performs much better than the AMEM for most cases. To investigate the performance of the FI compared to an amemory index, we considered a file structure based on a-memories. The AMEM file has records defined as (cond, rel. tt), where cond identities a specific condition of a rule, rel is the base relation on which cmd is deEned, and u is the tirnetag of a record in rel that satisfies cond. For ease of representation we assume a single file, called the BR file, contains the base-relations. Records in the BR file have the is form (@lJLattr,val, ,.., attr,vat), where rel is a base relation, tt a timetag identifying a unique record in rel, and the att~ val are the attribute/value pairs for the record. Figure 2 shows an example of a two-way equijoin based on an AMEM to BR Ele access path. The join is performed in a nested loop manner by order of timestamps. As can be seen in the figure, a total of 12 records are read to accomplish the join. All records from the AMEM file must be read and, for each of those, their corresponding base-relation file record must also be read. The only parameter restricting the number of records read is the selectivity of the constant tests.

Condition
Condition

Rule (OPS5 Syntax) (p example 1+(R `A 1 "B ) 2~ (S `B `C 2)
---> (...))

AMEM
(cond, rel,

File
tt) (rel,

BR File
tt, attr, attr, val) valr

`R, .R,

O, A,l, l, A,l, A,O, B,at

B,a B,b. B;b Ct2 l

R,3, St2, S,4,

B, b,C,

"S,5, 1 82 3 84 Get Get Get Get the Join Get Get the Join Get Get Get Get the Join Get Get the Join first AMEM record for

B, b,C,2 .

cond=l

appropriate first

BR record. for cond.2 .

AMEM record

appropriate BR records criteria next

BR record from steps is satisfied. for

compare 2 and 4. cond.2.

5 8 6

AMEM record

appropriate BR records criteria next

BR record compare from steps 2 and 6. is not satisfied. for cond.1.

7 8
8 9

AMEM record

appropriate first

BR record. for cond=2.

mined that 32 different constant tests for each base relation are sufficient. The Ff index records are identical to any standard index, with the exception of this additional bit vector (one word). 3.1

Join Access Path Analysis

8
10

AMEM record

appropriate BR records criteria next

BR record compare from steps 8 and 10. is not satisfied. for cond=2. compare 8 and 12.

An analytical comparison was performed to determine the expected performance of the two indexing alternatives, one defined for a-memories (the AMEM file) and one for the FI. The result was that there is no case where the AMEM performs better than

11 812

AMEM record

appropriate BR records criteria

BR record from steps is satisfied. relation AMEM Rwithrelation file

FIGURE2.Joining 1. Wenotethat while the functionality paths may have different lengths. isreplaced, the access

Susing

44

Figure 3 shows the same example as that in Fig. 1 using the FI tile instead of the AMEM tile as an index on the BR tile. In this example, two fewer records are read from the BR file. The key functional difference is that the FI file can filter on both the selectivity of the constant tests and the join values. The actual performance of each method is subject to many factors. However, the two dominant parameters are select selectivity and join selectivity. The overriding cost factor in using either index is the number of disk accesses incurred during a join. This is measured by examining the cost to equijoin one tuple, T, with a relation, R, using the previously defined file structures. The following parametem are needed q n = number of tuples in R o~el = constant test select operator selectivity for R in the query being executed q ~join = join operator selectivity, T join R, in the query being executed q b = block size in bytes = 8192 q [Fl = FI tile record size in bytes = 20

q

q q

tAMEM=AMEM filerecordsizein bytes=8 tDB=BRfilerecord sizeinbytes=l 024(this isstandard manyexpertsystems ) f= blocking factor for B+ tree = 0,7 BF1 = number of blocks for the FJ file = n x tFI nx20 -- . = 0.0035n b xf 8192 X 0.7 ~,4~~M = number of blocks for the AMEM file =
n x `A.MEA4

for

nx8 = 0.0014n b xf = 8192x0.7 BDB = number of blocks for the BR file = n x tDB n x 1024 --= =0.1786n bxf 8192x0.7 Using the AMEM/BR access path, we scan the records of the AMEMtile that have acondition identifier corresponding to the appropriate condition element forthequery being processed. & suming scold start, the number of disk accesses for scanning the AMEMfileis min (o,,ln, 0.0014n) . Thus, for all but the smallest values of o~,l, all 0.0014n blocks will be read. `Ihese will result in min (o$=ln, 0.1786n) disk accesses to the BR file. The total number of disk accesses for theAMEM/BRpatb is min(a$etn,0.0014n) +min(o,eln,0.1786 n). Using theFUBRaccess path, wescantherecords oftheFIfile that satisfy the equijoin value from tuple T. Again assuming a cold start, the number of disk accesses for scanning the FI file is rnin (ajoinn, 0.0035 n). These will result in min ( G$e[ojo in n, 0.1786n) disk accesses to the BR file. The total number of disk accesses for the FUBR path is min((sjoinn, 0.0035n) +min(c~gl(sjonn, 0.1786 n). Clearly, thedominant factor inthese equations is the number of accesses to the BR tile. Therefore, the FUBR path is favored whenever

Rule (p rule_l (R `A 1 `B ) (s "B "c 2j ---> (...)) FI File (rel, attr, val, tt, s_bit) (rel, val, BR File tt, attr, attr, val)

o o o
I 2 3

Get first FI passes select Get BR record.

record for bit test.

relation

R That path ~join. over

min (o~elcjon n, O.1786n)

0.1786 n).

Get first FI record for relation passes select bit test: Records from steps 1 and 3 satisfy the criteria. Get BR record.

S join

is, the FI/BR path can never be worse than the AMEM/BR and will be considerably better for low values of o$,1 and Figure 4 shows a plot of the performance advantage FUBR AMEMLBR. The x-axis is Join Selectivity (~join,) the z-axis is

o
4 6

o"
~

Get next FI passes select Get BR record. first FI

record blt

for test.

relation

R

0
8

Get

record

for

::&::&n satisfy

S the

?;EZeEtZ~;eEtaE;t2$Ztnot join criteria.

Q

G.t

BR

record,

FIGURE

3. Joining relation relation S using FI

R with

FIGURE

4. D~k Access Comparison AMEM File

of FI File and

45

Select Selectivity such that y=

(osel), with

the advantage

factor on the y-axis

4.1 FI File Results

and Analysis

min (O~el, 0.1786) min (ts~e[cjoin, 0.1786)

nzirz(z, 0.1786) = min (n, 0.1786)

A particular y-axis value corresponds to the performance increase factor of PI over AMEM, i.e., a value of 10 means that FI is 10 times faster than AMEM. As can be seen from the graph, the FI/ BR path is favored for most values of ~joti and O$eh For moderate and o~el the performance factor is protO 10W values Of nounced and no combination favors the AMEIWBR path. This analysis led to the adoption of the FI file as an index for DATEX. The overall size of the index and number of updates required to process a database change was also examined. A comparison was made between the space requirements for an AMEM system versus an FI based system2. This comparison was based on the largest problem size that has been run for our scalable benchmarks and corresponds to databases in the megabyte range. While these are small by database standards, they still represent an important step forward for hard problems on active databases. The object of the analysis was to examine the number of FI entries that could be expected in DAI13X as opposed to the number of cx-memory entries in LEAPS. The result is in Fig. 5. In general the FI approach was roughly equivalen~ though somewhat better than, the AMEM in size.
Oj~~

[1

`
.

One of the important aspects of using the FI file was the aggregation of the constant test select information from all relevant condition elements into a single bit mask. In the process of computing a join, this scheme could be no more effective than et-memories. However, it could be much worse. Consider the case of two different equijoins involving the same base relation. Let the condition element for join 1 have a O$el of 0.02 and let the condition element for join 2 have a G,ez of 0.5. When computing join 1 using the FI file, for every tuple that satisfies the constant test, 25 tuples will be read that do not. This is the undesirable effect of aggregating the constant test select criteria. To address this concern the number of accesses made for computing all equijoins was compared with the number that passed the filter, i.e., the appropriate select bit was set. The results are shown in Table 3. For manners and wrdtzdb, the results were extremely good, with the number of tuples passing the filter averaging over 90%. For waltz and weaver the average was 50?4 to 60%. One possible explanation for the low percentage of weaver tuples passing the filter is that weaver contains a very large number of condition elements on relatively few base relations. Thus, the negative effects of aggregation are more pronounced. Table 1. Name manners waltz waltzdb Program Summaries Comment Finds a seating arrangement for dinner guests by depth-first search. Waltz line labeling for simple scenes by constraint propagation. A more database oriented version of Waltz line labeling for complex scenes by constraint propagation. A VLSI router using a blackboard technique. Size Set 2 32 1800 576 791 Buffer Set 2 Set 3 64 2664 864 1311 Accesses Set 3 Set 4 Set 4 128 3600 1152 1831

No. of Rules 8 33 35

4

PERFORMANCE

RESULTS

data on DAI'EX was to evaluate the design decisions made during the analysis phase of the development. Important issues were execution time, FI behavior, and disk buffer hit rates. The benchmark programs and data sets used in evaluating DA~X are shown in Tables 1 and 2.

The purpose for gathering performance

weaver

637

u
45,000 40,000 35,000

Shared alpha-memo

entries

Table 2, Program manners waltz waltzdb weaver Table 3.

Initial

Database Set 1 16 864 288 531

Filtering

of Equijoin Set 1

30,000 me 25,000 20,000 15,000 10,000 5,000 0 L.El manners FIGURE 5. Comparison waltz waltzdb manners: Raw Filtered % Passed waltz Raw Filtered % Passed waltzdb: Raw Filtered % Passed weavec Raw Filtered % Passed

1,279 1,039 81 62,505 32,246 52

7,839 6,847 87 126,075 64,941 .52 5,634,112 5,582,740 99 8,539,384 5,171,189 61

53,220 49,188 92 184,755 95,121 51 11,642,854 11,568,338 99 14,934,424 9,467,849 63

384,415 368,159 96 248,325 127,816 51 19,805,788 19,708,128 100 22,849,464 14,905,309 65

weaver and

iiiikl
If

1,779,562 1,751,334 98

of Shared Alpha-Memory FI Entries

5,864,876 3,415,058 58

2. llese results assume that u-memory sharing is implemented. not, the size of the AMEM tile would be 6 to 40 times larger.

The FI file is used to index equijoins only, i.e., = and +. All non-eqnijoins are produced by scanning the database directly. Therefore, no constant test selections are made a priori for joins on condition elements involving non-equijoins. This turned out to

46

have been a poor decision. Table 4 shows the number of accesses made for the non-equijoins (raw) versus the number that passed the constant test filters (filtered). By dividing the filtered amount by the raw, a O$el value can be derived. For all runs of all programs except weaver, the values of o~el were very low, ranging from 0.01 to 0.11 with a mean of 0.05. Weaver showed a g~el of 0.56. For manners, waltzdb, and weaver, the impact of having a join index for non-equijoins should be substantial, however, equijoin accesses are dominant in those programs. On the other hand, waltz is dominated by non-equijoin accesses and should demonstrate a signtilcant speedup over the current DATEX system. Table 4. Type manners: Raw Filtered
o~el

from a main-memory base. Table 5. Qpe manners: Buffer Disk Hit Rate waltz: Buffer Disk Hit Rate waltzdb: Buffer Disk Hit Rat weaver: Buffer Disk Hit Rate Table 6. ~pe manners: Buffer Disk Hit Rate waltz: Buffer Disk Hit Rate waltzdb: Buffer Disk Hit Rate weave~ Buffer Disk Hit Rate Table 7. Program manners: Set 1 Set 2 Set 3 Set 4 waltz: Set 1 Set 2 Set 3 Set 4 wakzdb: Set 1 Set 2 Set 3 Set 4 weave~ Set 1 Set 2 Set 3 Set 4 Disk Buffer

expert system shell to a disk-resident

data-

Hit Rate for FI File Accesses Set 2 Set 3 Set 4

Set 1

7,092 2 100.0 791,755 32 100.0 4,427,726 5,698 99.9 9,743,095 19 100.0 Dkk Buffer

35,303 6 100.0 2,167.149 440,030 79.7 13,177,002 225,957 98.3 14,075,878 23 100.0

186,300 25 100.0 3,928,367 1,194,672 69.6 26,526,927 591,040 97.8 24,316,040 37 100.0

1,099,681 23,990 97.8 6,461,663 2,238,361 65.4 44,504,770 1,115,422 97.5 37,104,651 69,808 99.8

Filtering

of Non-Equijoin Set 1 Set 2

Buffer

Accesses Set 3 Set 4

818 0.7; 107,411 8,559 0.08 277,499 5,418 0.02 1,985,375 1,085,415 0.55

2,922 189 0.06 216,299 17,035 0.08 843,343 9,682 0.01 2,710,056 1,501,178 0.55

10,970 381 0.03 316,811 24,859 0.08 1,714,339 13,946 0.01 4,168,456 2,323,498 0.56

42,426 765 0.02 429,570 29,464 0.07 2,890,487 18.210 0.01 5,658,856 3,161,818 0.56

waltz Raw Filtered
G.el

waltzdb: Raw Filtered cr~el weaver: Raw Filtered G.el 4.2 Buffer

Hlt Rate for BR File Accesses Set 2 Set 3 Set 4

Set 1

23,905 200 99.2 833,769 15,666 98.1 13,007,084 730.050 94.4 24,559,493 41,451 99.8 DATEX

99,288 2,488 97.5 1,820,951 32,184 98.2 46,235,110 2,544,482 94.5 39,950.082 412,566 99.0

541,619 27,693 94.9 2,675,493 47,311 98.2 95,248,329 7,000,782 92.6 84,465,546 2,881,329 96.6

3,934,357 355,179 91.0 3,633,409 63,324 98.3 161,633,113 9,308,836 94.2 131,730,488 4,934,730 96.3

Hit Rate

A key measure for any system based on disk-resident data is the buffer hit rate. We measured the effectiveness of the buffering for both the FI and the BR file. The FI file data (Table 5) was expected to show very high hit rates. This is due to the relatively small size of the ~ file and the kuge number of records per disk block. The FI and BR files each had fifty 8K blocks assigned to them. This assignment was fixed for all benchmarks. The results were as expected, with the exception of waltz. As can be seen in the table, waltz had a hit rate that vaned from 100% for the smallest problem, to 65.4% for the largest. While we expected that due to the fixed buffer assignment, we would see a drop in hit rates as the data set size increased, we were somewhat surprised at the magnitude of the drop in waltz. This suggests that more thought should be given to buffer management issues. The benchmarks were performed using a least recently used replacement policy. Another reactive replacement approach might make more sense, but we suspect that prefetching is likely to be effective as well. This is due to the seaming of tiles for executing the joins. The buffer hit rates for the BR file are shown in Table 6. Overall these were quite acceptable. The downward trend is still present, however, and should continue as the data set increases. Simply increasing the buffer size should be both reasonable and sufficient. The typical BR tile size during the execution of the largest benchmarks was approximately 1000 - 1200 blocks. Ten percent of that would suggest a buffer of 100-120 blocks. Therefore, for the largest data sets, our 50 block buffers were undersized by most standards. 4.3 Execution

versus OPS5.C Execution 0PS5.C 1.0 13.8 425.8 15,838.5 343.3 988.0 2,963.0 3,831.8 641.5 2,109.6 4,341.6 8,033.3 170.3 255.8 552.6 1,053.7

Time (in seconds) Increase Factor

DATEX

17,4 69.3 365,6 2,640,7 1,901.2 6,939.2 14,259.5 25,102.1 6,852.8 23,443.3 51,722.3 83,412.7 12,551.4 19,055.0 37,326.5 58,618.0

17.4 5.0 0.9 0.2 5.5 ::; 6.6 10.7 11.1 11.9 10.4 73,7 74.5 67.5 55.6

Time Results,

Parallelism

and Future
main-memory,
(Miranker

Directions
DATEX
discrimination

was compared with the fastest known,
network based, expert system

-- 0PS5.C

and Imfaso [1991]). the times range from average of 22.7 times FI techniques nearly

When compared with this system (Table 7), 5 times jizrter to 74.5 times slower with an slower. The combination of LEAPS and the overcomes the overhead incurred by moving

47

A number of optimization problems remain. DATEX guarantees that every join is supported by FI entries by greedily adding new links. We do not yet have any results on finding the minimum size subset of attributes needed to establish this guarantee. Since rule predicates can be large, there is great flexibility in selecting a query plan. It is not inconceivable that an optimization method would determine that only a subset of the base relations need to be augmented with FI entries. Although this DPZEX prototype targets a uniprocessor and otherwise ignores transaction management, the philosophy of compile time analysis has been applied in a sister project on the concurrent execution of rule languages, CREL (Kou, Miranker, and Browne [1991]). The CREL system includes a parallelizing compiler. The compiler takes as input a rule program that has been written by a programmer who assumes that the entire program will be executed as a single uninterrupted transaction. The output of the compiler is a collection of rule programs, each containing a small subset of the rules in the original, such that each new rule program can be run as an independent transaction in a database environment. The concurrent execution of all the subprograms is guaranteed to be correct. The compiler is based on standard techniques presented in the literature on serializability. Assuming an adequate parallel query processor, we exhapolate that the synthesis of DATEX and CREL will result in the parallel execution of disk-resident, gigabyte, expert system problems in elapsed times measured in minutes. 5

CONCLUSION

The DATEX system demonstrates that solving hard expert system problems in tight integration with DBMSS is not far out of reach. The critical aspects of a solution embodied in DATEX include an effective, in both time and space, marking scheme for nde activation and a specialized indexing method derived from a compile time analysis of the source rule program. We claim that acceptable performance from such an active database system is attainable only if the database contains explicit support for the inference engine. We further claim that in light of the scope of problems solved by expert systems, that we have reduced the cost of such support to an acceptable level. Although the solutions described herein demonstrate that an active database inference engine can approach the execution speed of a main-memory expert-system shell, it is clear that additional speed is required for large databases and that the use of large-scale parallel computers may be a necessary part of an active database system. 6

REFERENCES

D., et. al. [1988], "GENESIS: An Extensible Database Management System:' IEEE Transactions on Software Engineering, 14:11, pp. 1711-1730. Brant, D., T. Grose, B. Lofaso, and D. Miranker [1991], "Effects of Database Size on Rule System Performance: Five Case Studies," Proceedings of the 171h International Conference on Very Large Databases, pp. 287-296. Forgy, C. [1981], "OPS5 User's Manual", Tech, Rep. CMU-CS81-135, Carnegie-Mellon University. Forgy, C. [1982], "RETE: A Fast Match Algorithm for the Many
Pattern/N4any Object Pattern Match Problem:' Artificial

Batory,

Benjamin/Cummings Publishing Company, Inc.,Menlo Park CA. Kou, C. M., D.P. Miranker, and J.C. Browne [1991], "On the Performance of the CREL System," Journal of Parallel and Distributed Computing, 13:4, pp. 424-441. McCarthy, D., and U. Dayal [1989], "The Architecture of an Active Database Management System~' Proceedings of the 1989 International Conference on ACM-SIGMOD Management of Data, pp. 215-224. McDermott J., A. Newell, and J. Moore [1978], "The Efficiency of Certain Production System h-nplementations~' In Pattern-directed Inference Systems, D. Waterman and F. Hayes-Roth (eds.), Academic Press. Miranker, D. [1990], TREAT: A New and EflcieniMa~chAlgorithm for AI Production Systems, Pittman/Morgan-Kaufman, LOS Altos, CA. Miranker, D.P., D. Brant, B. Lofaso, and D. Gadbois [1990], "On the Performance of Lazy Matching in Production Systems;' Proceedings of the 1990 National Conference on Artificial Intelligence, pp. 685-692. Miranker, D.P., and D.A. Brant [1990], "An Algorithmic Basis for Integrating Production Systems and Database Systems," Proceeding of the Sixth International Conference on Data Engineering, pp. 353-360. Miranker, D., and B, Lofaso [1991], "The Organization and Performance of a TREAT-Based Production System Compiler," IEEE Transactions on Knowledge and Data Engineering, 3:1, pp.3-10. Nishyama, S., K. Goolsbey, and D.P. Miranker [1992], Constant Tests in a Production System "Optimizing Environment," Tech. Rep., Dept. of Computer Sciences, University of Texas at Austin. Sellis, T., C-C. Lin, and L. Raschid [1989], "Data Intensive Production Systems: The DIPS Approach," SIGMOD Record. Stonebraker, M., L. Rowe, and M. Hirohama [1990], "The Implementation of POSTGRES~' IEEE Transactions on Knowledge and Data Engineering, 2:1, pp.125-142. Stonebraker, M. [1992], "The Integration of Rule Systems and on Knowledge Database Systems;' IEEE Transactions and Data Engineering, 4:5, pp. 415-423. Sybase [19901, Sybase V4.O Reference Manual, Svbase Corp., Emeryv;lle, CA. Wang. Y-W. and E. Hanson [19901. "A Performance Comparison of the Rete and TREAT .&gorithms for Testing D;tabase Tech. Rep. WSU-CS-90-18, Dept. of Rule Conditions," Computer Science and Engineering, Wright State University. Widom, J., R. Cochrane, and B. Lindsay [1991], `Tmplernenting Set-Oriented Production Rules as an Extension to Starburst," Proceedings of the 17th International Conference on Very Lurge Databases, pp. 275-285.
".

Intelligence, 19, pp. 17-37. Gupta, A. [1987], Parallelism in Production Systems, Pittmanl Morgan-Kaufman, Los Altos, CA. Hanson, E. [1991], "The Design and Implementation of the Ariel Active Database Rule System," Tech. Rep. WSU-CS-9106, Wright State Univemity. INGRES [1990], INGRES Version 6.3 Reference Manual, INGRES Products Division, Alameda, CA. Kerschberg, L. [1987], (cd.) Proceedings of the Firsl International Conference on Expert Database Systems, Benjamin/ Cummings Publishing Company, Inc., Menlo Park CA. L. [1988], (cd.) Proceedings of the Second Kerschberg, Corsference on Expert Database Systems, International

48