Soar/PSM-E: Investigating Match Parallelism in a Learning Production System
Milind Tambe, Dirk Kalp, Anoop Gupta1, Charles Forgy, Brian Milnes, Allen Newell Department of Computer Science, Carnegie Mellon University, Pittsburgh, Pa 15213

Abstract
Soar is an attempt to realize a set of hypotheses on the nature of general intelligence within a single system. Soar uses a production system (rule-based system) to encode its knowledge base. Its learning mechanism, chunking, adds productions continuously to the production system. The process of searching for relevant knowledge, matching, is known to be a performance bottleneck in production systems. PSM-E is a C-based implementation of the OPS5 production system on the Encore Multimax that has achieved significant speedups in matching. In this paper we describe our implementation, Soar/PSM-E, of Soar on the Encore Multimax, built on top of PSM-E. We first describe the extensions and modifications required to PSM-E in order to support Soar, especially the capability of adding productions at run time as required by chunking. We present the speedups obtained on Soar/PSM-E and discuss some effects of chunking on parallelism. We also analyze the performance of the system and identify the bottlenecks limiting parallelism. Finally, we discuss the work in progress to deal with some of them.


1. Introduction
Soar is an architecture for a system that is to be capable of general intelligence [8]. Its development started as an AI system in 1981 and it is now also under exploration as a model of human cognition [14]. Soar has been exercised on a large variety of tasks: many of the classic AI mini tasks such as the blocks world and the Towers of Hanoi, as well as large tasks such as the R1 computer configuration task [16], the Neomycin medical diagnosis task [19] and the Cypress algorithm design task [18]. Soar exhibits a wide range of problem-solving, learning and human-like performance. Soar uses a production system that provides a single representation for its knowledge base. Each knowledge base item is represented by a condition-action rule, or production, that fires whenever its conditions match elements in working memory. Soar uses chunking [9] as its sole learning mechanism; chunking creates new productions, chunks, that summarize the results of problem solving and adds them to the existing set of productions. These chunks then fire in appropriate later situations, providing a learning-transfer mechanism. Large and complex systems built in Soar are slow in their execution; this limits their utility. The dominating factor in this slowdown is the production matching procedure. As chunking adds new productions, the demands of matching increase, so it is important to optimize the match as much as possible. Researchers have been exploring many alternative ways of speeding up the execution of the matching procedure in production systems: efforts have focused on high-performance uniprocessor implementations [4, 10], as well as parallel implementations [5, 6, 17, 12, 15, 21]. Most results for parallel processing have been simulation results or the results of parallelization of slow Lisp-based implementations. The PSM-E implementation [6] of the OPS5 production system [2] on the Encore Multimax multiprocessor differs from other efforts: its C-based compiler writes highly optimized machine code to achieve significant speedups on actual production systems2. In this paper, we describe the Soar/PSM-E implementation of Soar, which uses PSM-E for match. We investigate the available parallelism of the matching procedure for Soar systems. Soar's chunking mechanism provides a new dimension in the study of parallel production system match that is absent in other investigations concerned with parallelism in non-learning production systems.
1 Department of Computer Science, Stanford University, Stanford, CA 94305.

2 Certain production systems showed ten to twenty times speedup over the original CMU Lisp implementation of OPS5 [6].


Chunking continuously creates new productions; Soar's production system must be able to incrementally compile chunks without large overheads, as all the gains of a highly optimized system such as PSM-E could be lost to such overheads. Because chunks arise automatically as a result of the problem solving, they present various computational phenomena that do not appear in non-learning production systems [20]. Soar/PSM-E provides a useful tool for studying the previously unexplored effects of chunking on parallelism. This paper focuses on the parallel processing aspects of the Soar matcher, using the C-based PSM-E implementation. Hence, all measurements presented are of the match time. The other mechanisms of Soar, which have been left in their original Lisp form, are unoptimized and dominate the system's total run time; they will be the subject of future work. The paper is organized as follows: Section 2 presents background information on the OPS5 production system, the matching procedure used in OPS5 and the PSM-E implementation. Section 3 presents a brief overview of Soar. Section 4 describes the structure of the Soar/PSM-E system. Section 5 explains the extensions to PSM-E for Soar and discusses the various tradeoffs involved. Section 6 presents the results of our measurements of Soar programs and analyzes the factors limiting our speedups. Section 7 proposes future work to improve the match speedups for Soar programs.


2. Background
Soar uses a variation of OPS5 as its production system. It also uses OPS5's RETE matching algorithm. In this section we briefly describe the OPS5 production system language and the RETE matching algorithm. We then present a brief overview of the PSM-E implementation of OPS5. Readers familiar with OPS5, RETE or our previous publications on the PSM-E system may wish to skip one or more of these subsections.
2.1. OPS5
An OPS5 production system is composed of a database of temporary assertions, called the working memory (WM), and a set of if-then rules, or productions, that constitute the production memory (PM). The assertions in WM, called working memory elements (wmes), are record structures with a fixed set of named access functions, called attributes, much like Pascal records. Each production is a list of condition elements (CEs), corresponding to the if part of the rule (the left-hand side or LHS), and a set of actions corresponding to the then part of the rule (the right-hand side or RHS). A CE is a pattern that tests for the existence, or absence, of a wme in WM. A CE may be optionally negated, i.e., preceded with a dash ("-"), signifying that it tests for the absence of any matching wme. Each of the non-negated CEs of a production must match a wme, and none of the negated CEs may match, before the production is ready to fire, or is satisfied.
Each CE is composed of a set of tests for a wme's attributes' values. All of the CE's attribute value tests must be matched for that wme to match the CE. There are two types of attribute tests: constant and equality. Constant tests check that an attribute of a wme holds some constant, usually a symbol or number. Most attribute tests are constant tests. An equality test binds a variable (syntactically any symbol enclosed in "<" and ">") to an attribute's value in the scope of a single production. The variables in these tests will, on first occurrence, match any value, but if the variable appears again in a later test, then the attribute tested must hold the same value. When the CEs of a production are all matched in conjunction by some list of wmes, an instantiation of that production, which is the list of the matching wmes, is created and added to the conflict set (CS). OPS5 uses a selection procedure called conflict resolution to choose a single production's instantiation from the CS, which is then fired. When a production fires, the RHS actions associated with that production are executed, in the context of the LHS's variable bindings. Actions add or remove wmes and perform input/output. Production systems repeatedly cycle through three phases: match, select and fire. The matcher first updates the CS with all of the current matches for the productions. Conflict resolution selects one of these instantiations, removes it, and then fires it. Figure 2-1 displays an OPS5 production and an instantiation for the production.
An OPS5 production:

    (p blue-block-is-graspable
        (block ^name ^color blue)
       -(block ^on )
        (hand ^state free)
    -->
        (write "Block" "is graspable"))

An instantiation for the production:

    ( (block ^name b1 ^color blue ^state blocked)
      (hand ^state free ^name robot-1-hand) )

Figure 2-1: An OPS5 production and its instantiation.
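To make these definitions concrete, the following is a minimal C sketch; it is not taken from PSM-E (which compiles productions to machine code), and the structure and function names are illustrative. It shows one plausible representation of a wme with named attributes, together with the two kinds of attribute tests, constant and equality (variable binding).

    #include <string.h>

    #define MAX_ATTRS 8

    typedef struct {                 /* a working memory element (wme)   */
        const char *class_name;      /* e.g. "block" or "hand"           */
        const char *attr[MAX_ATTRS]; /* attribute names, e.g. "^color"   */
        const char *val[MAX_ATTRS];  /* attribute values, e.g. "blue"    */
        int         nattrs;
    } Wme;

    /* Look up an attribute's value in a wme; NULL if absent. */
    static const char *wme_get(const Wme *w, const char *attr)
    {
        for (int i = 0; i < w->nattrs; i++)
            if (strcmp(w->attr[i], attr) == 0)
                return w->val[i];
        return 0;
    }

    /* A constant test: the attribute must hold a given constant. */
    int constant_test(const Wme *w, const char *attr, const char *constant)
    {
        const char *v = wme_get(w, attr);
        return v != 0 && strcmp(v, constant) == 0;
    }

    /* An equality test: bind a variable on its first occurrence and     */
    /* require the same value on every later occurrence.                 */
    int equality_test(const Wme *w, const char *attr, const char **binding)
    {
        const char *v = wme_get(w, attr);
        if (v == 0) return 0;
        if (*binding == 0) { *binding = v; return 1; }  /* first occurrence: bind   */
        return strcmp(*binding, v) == 0;                /* later occurrence: compare */
    }

Under this sketch, the first CE of blue-block-is-graspable would apply a constant test on ^color and an equality test on ^name, and the negated CE would reuse the binding made by that equality test when checking ^on.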

2.2. The RETE Matching Algorithm
The RETE matching algorithm [3] is a highly efficient algorithm for match that is also suitable for parallel implementation. The algorithm gains its efficiency from two sources. First, it stores the partial results of match from previous cycles for use in subsequent cycles, exploiting the fact that only a small fraction of WM changes on each cycle. Second, it attempts to perform tests that are common to CEs of the same and different productions only once, by sharing them in a directed acyclic graph structure, or network. The algorithm performs match using a special kind of dataflow network that is compiled from the LHSs of productions before the production system is actually executed. An example production and the network for this production are shown in Figure 2-2.

    (p blue-block-is-graspable
        (block ^name ^color blue)
       -(block ^on )
        (hand ^state free)
    -->
        (modify 1 ^state graspable))

Figure 2-2: An example production and its network. The network contains constant test nodes for block and hand and for the tests color = blue and state = free, memory nodes, two-input nodes (an and node and a not node testing Eq(CE1.name, CE2.on)), and a terminal P node.

This data-flow network passes items called tokens across its arcs between its nodes. Each node has one or two input arcs and zero or one output arc. There are three types of one-input nodes: constant, production (P) and memory; and two types of two-input nodes: and and not. Each token contains an add or delete flag and a partial instantiation (PI), i.e., a list of wmes matching CEs.
1. Constant test nodes implement the constant attribute tests. The top of the network is composed only of these and forms a network that discriminates wmes based on the constants they contain. Each constant node receives a token, on its one input, that contains only one wme in its PI. If that wme has the required constant for some attribute's value, then the node sends the token out its single output arc.
2. P nodes are the terminal nodes of the network. They add and delete instantiations of productions from the CS, and so need no output link. When a token arrives on its input, if it has an add flag, its PI is added to the CS as an instantiation for its production. If the token has a delete flag and the instantiation has not yet been removed by being fired, the production node removes the PI from the CS.
3. Memory nodes hold the partial state of the match by storing the set of PIs whose tokens have appeared at their single input link. If the token, on its one input arc, has the add flag, its PI is added to the set; with the delete flag it is removed. Memories always place their input token on their output arc. These nodes make RETE a state-saving algorithm: it need only see the changes to WM on each cycle to calculate the CS's contents.
4. And nodes combine PIs into larger PIs. These nodes have two input arcs, called left and right; each must be preceded by a memory node. When a token arrives at an arc, its PI is compared with all of the PIs in the opposite arc's preceding memory. Any pair which preserves the variable bindings of the production that compiled into the node is joined into a new PI. These new PIs are sent out the output link for further matching. The and node ensures that the value bound to every variable in a CE is consistent with the value bound to the same variable in the preceding CEs of that production. And nodes join the wmes that match CEs in a left-to-right linear fashion.
5. Not nodes also have two input arcs: left and right. Not nodes keep a memory of the items that have passed in on their left input and count the number of items in their right arc's preceding memory which match them. A not node puts its input item from its left link onto its output only if there are no PIs in the right memory that block it by matching. Not nodes implement negated conditions by only passing partial instantiations that do not have a match for a negated CE.
The majority of node activations are for constant test nodes. However, with suitable indexing techniques, these may be reduced by almost half [5]. Since one-input nodes are much simpler to execute than two-input nodes, close to 90% of the processing time in an optimized implementation is spent in the two-input nodes.
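As a rough illustration of the and node's behavior, the C sketch below (not PSM-E code; the token, memory and consistency-check types are simplified stand-ins, and bounds checks are omitted) compares an incoming token's PI with every PI stored in the opposite memory and emits a joined PI for each pair whose variable bindings are consistent.

    typedef struct Wme Wme;                     /* a working memory element */

    #define MAX_WMES 16

    typedef struct {                            /* a partial instantiation (PI) */
        const Wme *wmes[MAX_WMES];
        int        nwmes;
    } PI;

    typedef struct { int add; PI pi; } Token;   /* add (1) or delete (0) flag plus a PI */

    typedef struct {                            /* the PIs stored in a memory node */
        PI  *items;
        int  count;
    } Memory;

    /* Placeholder: in the real network this test is compiled from the    */
    /* production's equality (variable) tests; here every pair passes.    */
    static int bindings_consistent(const PI *left, const PI *right)
    {
        (void)left; (void)right;
        return 1;
    }

    /* Placeholder for sending the joined PI on for further matching; in  */
    /* PSM-E this would push a successor node activation onto a queue.    */
    static void emit(Token t)
    {
        (void)t;
    }

    /* Left activation of an and node: join the incoming PI with every PI */
    /* in the opposite (right) memory that preserves the bindings.        */
    void and_node_left_activation(Token in, const Memory *right_mem)
    {
        for (int i = 0; i < right_mem->count; i++) {
            const PI *r = &right_mem->items[i];
            if (!bindings_consistent(&in.pi, r))
                continue;
            Token out = in;                       /* keep the add/delete flag  */
            for (int j = 0; j < r->nwmes; j++)    /* append the right-hand wmes */
                out.pi.wmes[out.pi.nwmes++] = r->wmes[j];
            emit(out);
        }
    }

A symmetric right activation would compare against the left memory; a not node differs only in counting matching PIs and passing the left item when that count is zero.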

2.3. PSM-E: Parallel Implementation of OPS5 on the Encore Multimax
PSM-E [6] is a highly optimized C-based parallel implementation of the OPS5 production system on the Encore Multimax3. It produces a machine-coded version of the RETE data-flow network.

3 The Encore Multimax is a 16-CPU shared memory multiprocessor. The machine employed for this research used the NS32032 processor, which runs at approximately 0.75 MIPS.


Before starting a run, the PSM compiler generates a tree-structured representation of the RETE network for the current set of productions. This data structure is used to generate OPS83 [4] style assembly code for the network. The system then uses the assembling, linking and loading facilities provided by the operating system to create the executable image. PSM-E consists of one control process that selects and then fires an instantiation and one or more match processes that actually perform the RETE match. PSM-E exploits parallelism at the granularity of node activations. Previous work has demonstrated that to achieve significant speedups via parallelism in production systems, it is necessary to exploit parallelism at a very fine granularity [5]. A node activation consists of the address of the code for a node in the RETE network and an input token for that node. These node activations are called tasks and are held in one or more shared task queues. Each individual match process performs match by picking up a task from one of these queues, processing the task and, if any new tasks are generated, pushing them onto one of the queues. When the task queues become empty, one production system cycle ends; the control process applies conflict resolution to select and fire an instantiation from the CS. Exploiting parallelism at the level of node activations allows PSM-E to achieve significant speedup for match using up to 13 processes [6].
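The sketch below shows the general shape of a match process's loop under this scheme. It is illustrative only, not the PSM-E source: a single mutex-protected queue stands in for PSM-E's shared task queues, pthreads stand in for Encore processes, and the end-of-cycle synchronization (detecting that all processes are idle, not merely that the queue is momentarily empty) is omitted.

    #include <pthread.h>
    #include <stddef.h>

    typedef struct Token Token;      /* as in the RETE sketch above */

    typedef struct Task {            /* a node activation */
        void (*node_code)(Token *);  /* address of the compiled code for the node */
        Token *token;                /* the input token for that node             */
        struct Task *next;
    } Task;

    typedef struct {                 /* one shared task queue */
        Task            *head;
        pthread_mutex_t  lock;
    } TaskQueue;

    void push_task(TaskQueue *q, Task *t)
    {
        pthread_mutex_lock(&q->lock);
        t->next = q->head;
        q->head = t;
        pthread_mutex_unlock(&q->lock);
    }

    static Task *pop_task(TaskQueue *q)
    {
        pthread_mutex_lock(&q->lock);
        Task *t = q->head;
        if (t != NULL)
            q->head = t->next;
        pthread_mutex_unlock(&q->lock);
        return t;
    }

    /* The body of one match process: execute node activations until the  */
    /* queue drains, ending the match phase of the current cycle. A       */
    /* node's code may itself call push_task to add new activations.      */
    void match_process(TaskQueue *q)
    {
        Task *t;
        while ((t = pop_task(q)) != NULL)
            t->node_code(t->token);
    }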


3. Soar
The goal of the Soar project is to build a system capable of general intelligent action. Soar is based on the problem space hypothesis [13], which states that all goal-oriented behavior is search in problem spaces. The problem space determines the set of legal states and operators that can be used during the processing to attain the goal. The states represent situations. There is an initial state, representing the initial situation, and a set of desired final states. An operator is applied to a state in the problem space to yield a new state for that problem space. When an operator application generates a desired state, the goal is achieved. Soar is composed of three modules: Decide, chunking and a production system. Decide is a universal subgoaling mechanism [7], and is responsible for the creation and deletion of all of the system's goals, as well as the selection of problem spaces, states and operators. We also refer to Decide as the performance system of Soar. Decide works in a two-phase loop: elaborate and decide. In the elaboration phase, all the productions in the system are matched (by the production system) to determine the CS. However, unlike OPS5, all of the instantiations in the CS are then fired in parallel. This constitutes one elaboration cycle within the elaboration phase. If this results in new instantiations, then the elaboration phase continues by entering another elaboration cycle. This process continues until quiescence is reached, i.e., when no more instantiations are generated. At the end of the elaboration phase Soar enters the decision phase. If a decision can be reached about the problem space, state or operator (the context element) to be used, then the wmes related to the new context element are added to the system and the older wmes are removed. If a decision cannot be reached, then an impasse results and the system creates a subgoal to solve the impasse. This subgoal allows Soar to bring to bear the full power of its problem-solving capability on the impasse. All of Soar's goals are sewn together in a stack called the context stack. Each goal entry in the context stack is represented using three wmes: one for the problem space chosen to solve this goal, one for the state from which problem solving is progressing and one for the current operator. Chunking is Soar's sole learning mechanism. It generalizes and caches the results of problem solving as new productions. These productions may then fire in similar situations, preventing impasses and reducing the problem-solving effort. Chunking works by recording the wmes of each instantiation and the wmes created by firing that instantiation. When a wme is created that is accessible from any context other than the most recent context, chunking builds a new chunk to summarize the creation of this result wme. Chunking performs a dependency analysis by searching backward through the instantiation records to find the wmes that existed before the result context and that were used to generate this result. It then constructs a new production whose LHS is based on these wmes and whose RHS reconstructs the result. This chunk can then recreate the result wmes without hitting the impasse; this prevents the impasse from recurring and speeds problem solving. Soar systems can be run without chunking, i.e., with chunking turned off. We will refer to this as the without-chunking run. We will refer to the runs with chunking turned on as during-chunking runs. In these runs, the performance system learns while it is solving a problem. After chunking on an input, Soar systems are sometimes run on the same input to test the efficiency of the learned rules. We will refer to these as the after-chunking runs. It is already well established that the addition of chunks improves the performance of Soar a great deal when viewed in terms of the subproblems required and the number of decisions within a subproblem [19]. However, in one of the examples presented in this paper, chunking causes an increase in total match time and hence total run time. A high-level view that measures gains in terms of the number of decisions ignores certain computational effects [20]. These effects may cause the time per decision to increase drastically, completely offsetting match-time gains. In this paper, however, we will not address such cost/benefit issues of chunking, and concentrate instead on the total match time. The production system in Soar is similar to OPS5, with modifications, some of which are listed below:
· RETE must support the run-time addition of productions, unlike OPS5, and must update the memory nodes of the production with the current contents of WM, and the CS with the instantiations for this production.


· OPS5's negated condition elements cannot test for the absence of a conjunction of matching wmes. Soar adds LHS syntax and modifies the RETE to support these conjunctive negations.
· The Soar CS differs from OPS5 in that all productions may be fired in parallel.
· Soar systems use collections of smaller wmes to represent data that an OPS5 program would typically represent in a single wme.
· Soar productions only add wmes. The decision module keeps track of which wmes are accessible from the context stack, and automatically garbage collects inaccessible wmes.
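The elaborate-decide loop described above can be summarized in the following C-style sketch. It is purely illustrative: the function names are hypothetical stand-ins (the real control flow is split between Soar's Lisp code and the PSM-E control and match processes), and the stub bodies do no work.

    /* Hypothetical stand-ins for the pieces of Soar's decision cycle. */
    static int  match_new_instantiations(void)   { return 0; }  /* run the match; count new CS entries */
    static void fire_all_instantiations(void)    { }            /* unlike OPS5: fire every instantiation */
    static int  decide(void)                     { return 1; }  /* 1 = a context decision was reached   */
    static void create_subgoal_for_impasse(void) { }

    /* One decision cycle: elaborate to quiescence, then decide. */
    void decision_cycle(void)
    {
        for (;;) {                                  /* elaboration phase */
            int n = match_new_instantiations();
            if (n == 0)                             /* quiescence: nothing new matched */
                break;
            fire_all_instantiations();              /* one elaboration cycle */
        }
        if (!decide())                              /* decision phase */
            create_subgoal_for_impasse();           /* impasse: subgoal to resolve it */
    }

Chunks created while solving a subgoal are handed to the matcher at the end of an elaboration cycle, as described in Sections 4 and 5.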
We used three Soar programs to examine the various aspects of the implementation and the results of parallelism:
1. Cypress-Soar [18], an algorithm design system with 196 productions. We chose a run that derives the quick-sort algorithm.
2. Eight-puzzle-Soar, a system that solves the eight-puzzle mini task, with 71 productions [9].
3. Strips-Soar, a system that plans in the domain of robot control [1], with 105 productions.

Figure 4-1: The organization of the Soar/PSM-E system. Soar (without a matcher) sends chunks and context decisions to, and receives instantiations from, the PSM-E control process, which feeds the task queue serviced by the PSM-E match processes.

4. Organization of the Soar/PSM-E System
The Soar/PSM-E implementation of Soar on the Encore consists of one Soar process, which provides all of the functionality except the capability of matching; a PSM-E control process; and one or more PSM-E match processes. The number of match processes remains fixed for the duration of a particular run. We are planning a full port of Soar to C, but our current structure allows us to concentrate on our primary goal of investigating parallelism in the match. Unfortunately, as Lisp and C processes cannot share memory on the Encore, this arrangement causes some data structures to be duplicated in Soar and PSM-E. Further, Soar and PSM-E can communicate only through Unix4 pipes. Soar/PSM-E operates in a mode where Soar uses PSM-E as a matching engine. Both Soar and PSM-E keep a copy of WM. As the Decide module adds or deletes wmes, it sends messages to PSM-E to repeat those operations on its wmes. If this adds new instantiations to the PSM-E CS, then as PSM-E fires these instantiations it also sends copies of them over to Soar, so that Soar may fire them as well. Both Soar and PSM-E thus fire these instantiations, updating their copies of WM, and the match repeats. If new chunks are created, Soar passes them over to PSM-E at the end of the elaboration cycle. This organization is depicted in Figure 4-1.
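The following C sketch gives a feel for the PSM-E side of this exchange. It is entirely illustrative: the message tags (ADD, DEL, CHUNK, END) and helper functions are hypothetical, not the actual Soar/PSM-E protocol, and the stubs do no real work.

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical helpers; the real ones update PSM-E's copy of WM,  */
    /* compile chunks, and run the parallel match.                     */
    static void add_wme(const char *wme)                      { (void)wme; }
    static void delete_wme(const char *wme)                   { (void)wme; }
    static void compile_chunk(const char *prod)               { (void)prod; }
    static void match_and_send_instantiations(FILE *to_soar)  { (void)to_soar; }

    /* Read one batch of WM changes (and any new chunks) from Soar over */
    /* a pipe, then rematch and report the resulting instantiations.    */
    void psm_e_control_loop(FILE *from_soar, FILE *to_soar)
    {
        char line[4096];

        while (fgets(line, sizeof line, from_soar) != NULL) {
            if (strncmp(line, "ADD ", 4) == 0)
                add_wme(line + 4);
            else if (strncmp(line, "DEL ", 4) == 0)
                delete_wme(line + 4);
            else if (strncmp(line, "CHUNK ", 6) == 0)
                compile_chunk(line + 6);      /* only at end of an elaboration cycle */
            else if (strncmp(line, "END", 3) == 0) {
                match_and_send_instantiations(to_soar);
                fflush(to_soar);
            }
        }
    }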

In the above description of Soar/PSM-E, we have used PSM-E to mean a modified version of the PSM-E implementation of OPS5. In the next section, we describe the modifications required to PSM-E in order to support Soar.

5. Extensions Required to Support Soar
The implementation of Soar on top of PSM-E required many changes to PSM-E. In this section, we describe the most significant of these changes: the capability of adding productions at run time in order to support chunking in Soar. The first subsection describes the run-time code generation for new productions. Soar also requires that the memory nodes of a newly added production be updated with the current contents of WM, so that the chunk can be made immediately available for use. The second subsection describes our solution to this updating problem.

5.1. Run Time Addition of Productions
The problem of adding a new production at run time on PSM-E is significant because it requires that the production be compiled into machine code, like the rest of the system. Interpretive techniques, though simple, are not suitable because execution speed is very important. To exploit the benefits of node sharing, the new code must also be integrated into the existing machine code for the rest of the network. Recall that the RETE network shares common tests and nodes between different productions to save work at run time. Sharing is especially important in Soar, since chunks are generated from the existing set of productions. Therefore, schemes for compiling the chunk as a separate piece of code are ruled out, because they are too inefficient. Soar adds chunks only at the end of an elaboration cycle, i.e., when the match is quiescent.

4 Unix is a registered trademark of Bell Laboratories.


This eliminates the complexity of synchronizing and modifying the code while the match processes are executing. However, the code generation for the newly added production must be efficient, so that this processing does not itself become a serial bottleneck. This efficiency requirement rules out using the assembler provided by the operating system, since forking off a process to assemble and then link and load generates an unacceptably large amount of overhead. To speed up the compilation of the chunks, the production system compiler used to generate the assembly code on PSM-E was modified to generate machine code directly and was included as part of the run-time system. Two mechanisms are used to make the compilation faster and provide sharing. These mechanisms are:
· A tree data structure that provides a high-level description of the RETE network. This is similar to the structure shown in Figure 2-2.
· A jumptable to integrate the new code with the existing code. Figure 5-1 shows a jumptable, with an indirect jump to Label-12 through the second index in the table.
The run-time addition of productions is illustrated below with a simplified example.
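As a rough C analogue of this mechanism (it is only a sketch: function pointers stand in for the indirect jumps patched into PSM-E's generated machine code, and the node numbering is illustrative), the jumptable can be pictured as an array of code addresses through which every inter-node transfer is routed, so that a new successor can be spliced in by rewriting two entries.

    typedef struct Token Token;

    /* A jumptable entry is the code to run after queuing a node's       */
    /* activation; in PSM-E these are machine-code addresses reached by  */
    /* indirect jumps, approximated here with function pointers.         */
    typedef void (*CodeLabel)(Token *);

    #define JUMPTABLE_SIZE 4096
    static CodeLabel jumptable[JUMPTABLE_SIZE];   /* filled in when the network is compiled */

    static void queue_activation(int jt_index, Token *t)   /* stub: push onto a task queue */
    {
        (void)jt_index; (void)t;
    }

    /* Code for the parent node: queue the c1 activation, then continue  */
    /* through jumptable index 50, which initially holds next-code.      */
    void parent_success(Token *t)
    {
        queue_activation(50, t);      /* c1 */
        jumptable[50](t);
    }

    /* Code generated for the new node cnew (given index 100): queue the */
    /* cnew activation, then continue through whatever index 50 used to  */
    /* do, which the splice below moves to index 100.                    */
    static void queue_cnew_then_continue(Token *t)
    {
        queue_activation(100, t);     /* cnew */
        jumptable[100](t);
    }

    /* Splice cnew in after the parent node, mirroring the assignments   */
    /*   Jumptable[100] := Jumptable[50];                                */
    /*   Jumptable[50]  := code for queuing the node cnew;               */
    void splice_in_cnew(void)
    {
        jumptable[100] = jumptable[50];
        jumptable[50]  = queue_cnew_then_continue;
    }

After the splice, a successful parent activation queues c1, falls through to queue cnew, and only then reaches the original next-code, which is exactly the three-step sequence described below.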


Figure 5-1: The jumptable mechanism. A jumptable entry (Table[2] in the figure) holds an indirect jump to a code label such as Label-12.

Production P-old:

    (p blue-block-is-graspable
        (block ^name ^color blue)
       -(block ^on )
        (hand ^state free)
    -->
        (modify 1 ^state graspable))

Production P-new:

    (p blue-block-can-be-placed-on-table
        (block ^name ^color blue)
       -(block ^on )
        (place ^table free)
    -->
        (modify 1 ^state on-table))

Figure 5-2: Adding a production at run-time. P-new shares the nodes for its first two CEs with P-old; its third CE requires new nodes (including the constant test table = free and the node cnew, jumptable index 100), which are added alongside the existing successor c1 (jumptable index 50) of the shared parent node.

Figure 5-2 shows a production P-new (depicted by bold lines in the figure) being added to an existing production P-old in the system. P-new shares some nodes, corresponding to its first two CEs, with existing nodes in P-old. To locate these shared nodes, a high-level data structure, similar to the one shown in the figure, is maintained by the system. In Soar/PSM-E, when an activation for the parent node in P-old succeeds, it queues an activation for the successor node c1 into the shared task queue. The jumptable contains an entry (with the index of 50 in this case) for the node c1. This jumptable entry maintains a pointer to next-code, the section of the code to be executed by a process after it places the c1-activation into the task queue. When the node cnew is added as the successor of the parent node, it requires the parent node activation to queue a cnew-activation along with the c1-activation. Therefore, a successful parent node activation has to execute the following sequence:
1. Place the c1-activation into the task queue.
2. Place the cnew-activation into the task queue.
3. Execute next-code.

An area of shared memory is reserved for generating cnew-code, the code for the node cnew. The node cnew is given an entry in the jumptable (with the index of 100 in this case). The following sequence of assignments is then used:
    Jumptable[100] := Jumptable[50];
    Jumptable[50]  := code for queuing the node cnew;
This causes the successful parent-activation to place both the c1-activation and the cnew-activation into the task queue before it goes on to execute next-code, as it did before. A process which picks up the cnew-activation then executes the cnew-code. Thus the jumptable maintains a link between any two sections of code between which the code for a new node could in principle be inserted.


An entry for a node in the jumptable points to the section of code to be executed after queueing that node in the task queue. The process of integrating the new code then reduces to changing entries in the jumptable. In reality, the procedure of adding new nodes is complicated by a large number of special cases. Two issues can be raised about the jumptable:
1. The overhead of the jumptable during match in the three programs has been measured to be about 1-3%, much less than the 20-30% loss due to an unshared network (see below).
2. When there are two or more successors to a node, only one jumptable entry is maintained for all of the successors together. Thus the size of the jumptable has not been an issue.
Tables 5-1 and 5-2 present some data about the chunks added in the three systems. In Table 5-1, the first column names the tasks. The second column lists the average number of CEs in the task's Soar productions. The third column gives the average number of CEs in the chunks. The fourth column gives the amount of code generated per chunk. The fifth column gives the amount of memory used per two-input node.
Table 5-1: Data about the chunks added in the three systems.

    Task    Avg. nbr. of CEs    Avg. nbr. of CEs    Avg. nbr. of    Avg. nbr. of bytes/
            in the task Ps      in the chunks       bytes/chunk     2-input node
            18                  36                  7,900           219
            13                  34                  8,500           250
            26                  51                  15,500          304

system. The time to compile the chunks appears in the third column. The fourth column gives the time to compile the chunks when sharing in the two-input nodes is turned off. Sharing requires that the RETE data-structure be searched for points to share. The numbers in Table 5-2 show that even with that overhead, the code generation time for the version with sharing is less than the time to generate code for the unshared version. This is because sharing reduces the amount of code generated. We are currently investigating the use of parallelism to reduce this time.

5.2. Run Time Update of State
As mentioned in Section 2.2, RETE is a state-saving algorithm, i.e., it saves the partial results of the match in the various memory nodes in the network. When chunks are added at run time, the unshared memory nodes of the chunks are empty. These empty memories must be updated with PIs representing the partial matches of the WM contents to the new production. The updating procedure for a newly added chunk has to ensure that no duplicate state is added to pre-existing memory nodes, which already contain the required tokens. The update procedure must also not become a serial bottleneck by being very complex. A simple method of updating the node memories for the new production would be to pass the contents of WM back through the network and permit only those node tasks associated with the new production to execute. However, some of the nodes associated with the new production are shared with the existing network, and therefore such a scheme would add duplicate state to those nodes, resulting in incorrect behavior in the execution of the program. The updating must therefore be confined to only the new nodes. The identification of the new, unshared nodes is facilitated by the fact that the RETE network is linear, i.e., once one node in the production loses sharing, all its descendants remain unshared. Therefore, a simple node ID scheme allows the identification of the nodes to be updated. All nodes in the network have incrementally assigned unique ID numbers, and a newly added node is always assigned an ID greater than that of any other existing node in the network. Identifying the IDs of the last shared node and the first new node allows the determination of all nodes that are required to be updated. The algorithm runs the entire WM through the normal network with only two loca