Документ взят из кэша поисковой машины. Адрес оригинального документа : http://acat02.sinp.msu.ru/presentations/medvedeva/Article.doc
Дата изменения: Thu Jul 25 20:19:18 2002
Дата индексирования: Mon Oct 1 20:21:55 2012
Кодировка:


The Self-Organization of the Cellular Environment and the Reproduction of
the Network Logical Structure

M.V. Medvedeva, V.A. Koloskov,
State Technical University, Kursk, Russia
mariya_medvedeva@hotmail.com

Keywords: fault tolerance, multicontroller network, recovery from faults,
parallel distributed computations, cellular algorithm, backup copies.

1. Introduction
The discrete multicontroller networks used for group logical control of
objects in transport, industry, electric communications, power engineering,
research experiment are characterised by the hard requirements to
reliability, fault tolerance. To satisfy this requirements, the facilities
of the fault tolerance support should independently and fast recover the
network functions, which were lost as a result of the presence of faults,
that is especially important for onboard control systems. Thus, the
perspective facilities providing the fault tolerance should ensure the
autonomy, scalability and functional flexibility of the multicontroller
network.
To support the fault tolerance, it is necessary to include the backup
copies of elements and program units into the network composition. The
strategies of support of fault tolerance [1,2], which are based on the
replacement of a faulty element with an element capable of working, require
the reloading of the program units to their new locations in a network.
That increases the idle time of a network that is inadmissible for the
control networks. The preliminary allocation of backup copies of program
units in a network will permit to use the switching of elements on
execution the programs, which are required after the recovery, instead of
their reloading. The effective switching on the new program units has to be
supported by the homogeneous facilities, that would allow to preserve
autonomy and scalability of the control network. Cellular feature of the
algorithms of switching requires to have the minimum composition of backup
copies in each element, that ensure the system recustomizing under the
arbitrary combinations of faults.
Today there are no solutions concerning the design of the cellular
algorithms of processing the faults, the task of distribution of the
program units' backup copies in a network and definition of the backup
copies' composition for each element allowing for the topology of a network
was not considered by the present time. When developing the optimum
schedule, it is supposed that the backup copies are located in the common
memory [3] instead of their distribution among the network elements, or it
is assumed that the maximum quality of backup copies is located in each
element [5]. The storage of copies of all program units in the backup
elements reduces the task of recovery to the trivial replacement of the
faulty element and the backup copy. Such replacement results in loss of the
homogeneity, scalability of a network and in essential waste of time during
exchanges between elements in a recovered network.
This paper presents the design of the algorithms of fast recustomizing of
a network at the presence of faults, which allow to construct independent,
fault-tolerant and scalable multicontroller network. Such opportunities of
the algorithms are caused by the chosen approach to the recovery based both
on the allocation of program units' copies allowing for the topology of a
network and on the parallel and distributed computations of not intersected
routes from faulty to backup nodes, where each microcontroller of the route
is recustomized on execution of the new program unit. The quantity and
composition of backup copies stored in each microcontroller of the network
is determined by the number of possible directions of movement from faulty
to neighbouring microcontrollers through their physical links. In the
offered approach, the functions of the network recovery are defined by the
system of cellular operations, which is interpreted by the adaptive network
of the cells of the microcontrollers customizing with the same topology, as
the initial multicontroller network has. It has allowed to integrate the
cell of customizing into the microcontroller and to create the adaptive
microcontroller with constant composition of the backup copies, which does
not change with scaling a network. The discrete scalable networks of
adaptive microcontrollers provide the autonomy and minimal waste of time
when recovering the network functions by means of the identical cellular
rules, which are independent of quantity and allocation of faulty elements.
The algorithms offered in the given paper eliminate the limitations on the
location of the backup elements in a mesh that differs them from the
algorithm designed by the authors earlier [4] permitting to arrange the
backup elements as a column or a row on the of mesh boundary. The paper
considers the multicontroller networks presented by the mesh or the torus,
where each microcontroller can execute the functions of the primary and the
backup element of the network.
The algorithms are considered at certain assumptions. The loading of
program units and their copies, and also the customizing of the
microcontrollers to the primary or backup functions are executed before the
beginning of the network's run. On-line fault detection is a difficult
problem, however, in this paper, we assume the existence of fault detection
mechanisms and focus on how such information may be used for the recovery
of the multicontroller network's functions. Thus, it is supposed, that
faults are detected by means of the microcontroller self-diagnosing, and
the routing algorithms find the required program unit in the network with
faulty elements.
The following section presents the models of fault tolerance support,
Section 3 gives the examples of the formal record of the offered cellular
algorithms as the systems of the cellular operations of parallel
recustomizing of a network, and also a variant of the structure
implementation of the homogeneous device of recustomizing. Section 4 shows
the simulation results of the cellular algorithms. The brief discussion of
the obtained results is given at the end of the paper.

2. Models of Support of Multicontroller Network's Fault Tolerance.

The structure of the initial multicontroller network with faulty and
backup elements as well as forming the routes of recustomizing are defined
by the orthogonal graph meshes with M*N nodes, М - is a number of nodes
along the vertical axis, N - along the horizontal axis. Wrapping the meshes
in the vertical and horizontal directions is possible, when wrapped, the
mesh becomes a torus. The distance between the nearby nodes in all
directions k ( { 1,2,3,4} (1 -to the right, 2 -upwards, 3 -downwards, 4 -
to the left) is constant and it is equal to the d mesh step, the distance
between the remote nodes is defined by the orthogonal metrics.
The GS(V,U) structure's graph is the undirected mesh each node of which
[pic] has the weight {((i, (j),[pic], [pic])( where [pic] ( ((((( - is the
state of element (capable of working/faulty), [pic] ( ((((( - is the
customizing of element (primary/backup), ((i, (j) ( {(i,j),(i,j-
1),(i+1,j),(i-1,j),(i,j+1)} - is a number (a mark) of the executing program
unit. The edges U determine the presence of physical links between the
microcontrollers. Only in case of absence of faults the equality ((i, (j)=
(i,j) is correct. As a result of recustomizing (Fig. 1), some set of nodes
takes the functions of neighbouring elements upon itself. Thus, in case of
faults the numbers of program units are formed from the physical
coordinates (i,j-1),(i+1,j),(i-1,j),(i,j+1) of the neighbouring nodes.

[pic]
Fig. 1. Example of recustomizing of the structure's graph Gs



When simultaneous constructing of the recustomizing routes in a graph mesh
with arbitrary allocation of the backup and faulty nodes, the problem of
ambiguity of choice of the node-follower and the node-predecessor is being
solved for every node vij ( V of the route. The following principles are
fixed in a basis of the proposed algorithm:
1) the minimization of the accumulated length of the recustomizing routes,
that provides the minimization of the recustomizing microcontrollers'
quantity;
2) the elimination of the routes intersection by the choice of the
directions with the fast completion of routes, that provides the
increase of the algorithm's correcting ability at the complicated
combinations of faults.
To satisfy the principles given above, the offered algorithm envisages the
parallel and distributed computation of the characteristics of the nearest
backup nodes' reachability in the k ( {1,2,3,4} directions and the
characteristics of quality of the real routes of recustomizing for all
nodes [pic] of the mesh.
The two directed graph meshes [pic](V, [pic]), [pic](V,[pic]) are used for
the description of the recustomizing routes' parallel constructing. The
nodes V of the graphs [pic],[pic] are marked by the weight ([pic],[pic]).
The graph [pic] describes the variants of the potential reachability of
backup elements, which are determined according to Rule 1 (Section 2.1) for
all nodes of the mesh. The graph [pic] describes the quality
characteristics of the recustomizing routes from the faults' location to
the backup nodes. The offered variants of the cellular algorithm's
implementation have the different rules of forming the digraph [pic], which
will be considered in Section 2.2.

2.1. Model of Reachability of Backup Elements

The lengths {[pic]} of the potential routes from the node vij ( [pic] to
the nearest backup nodes in the directions k ( { 1,2,3,4} are used as the
characteristics of the backup nodes' reachability. Each outgoing arc, which
goes from the node vij ( [pic] in the k direction, has the weight [pic]
characterizing the length of route from the vij node to the backup node in
the k direction.
The constructing of the graph [pic] is executed by the parallel
determination of the entering arcs' weights {[pic]} ((p,q) ( {(i,j-
1),(i+1,j),(i-1,j),(i,j+1)}) for all the nodes in accordance with Rule 1.
When the movement to the backup nodes is prohibited, the arcs are marked by
the value LX of the length of the route of the backup reachability, which
is always more than any real length of the route in the mesh of the given
dimensionality.
Rule 1. If the node vij ( [pic](V, [pic]) with the weight
([pic],[pic])=(0,0) has at least one outgoing arc with the weight [pic] <
LX, all the entering arcs get values of weights [pic] = [pic]= . = min
{[pic], [pic], ...}+ d; [pic] = [pic] = ... = LX when [pic] = 1; [pic]=
[pic]=... = d when ([pic],[pic])=(0,1).
The examples of the initial states of the graph meshes [pic], which were
designed according to Rule 1 for torus 4(4 with different variants of the
backup nodes and faulty nodes allocation, are given in Fig. 2.

[pic] [pic]
Fig. 2. Example of the graph [pic]constructing

This section proposes the two rules of the constructing of the digraph
[pic] using the following characteristics when marking the graph arc:
- the length of the route from the fault location to the route's current
node vij (Rule 2);
- the quantity of the permitted directions of the backup nodes'
reachability for the node vij (Rule 4).
As any microcontroller can be customized on execution of only one program
unit, the given work solves the problem of correcting the intersected
routes (Rule 3).
Rule 2. If the outgoing arc of the node vij ( [pic], has the weight
[pic]&[pic]LX, the corresponding outgoing arc of the graph [pic] gets the
weight [pic] = d if ([pic],[pic]) = (1,0), or [pic] = min ([pic], [pic],
...( + d ([pic][pic][pic]0) if ([pic],[pic]) = (0,0), or [pic] = 0 if
([pic],[pic]) = (0,1).
The digraph [pic] designed according to Rule 2 describes lengths of the
recustomizing routes from the faults' locations to the backup nodes. If the
node vij ( [pic] is included into the route, one of the entering arcs
always has the weight [pic][pic]0 ((p,q)({(i,j-1),(i+1,j),(i-
1,j),(i,j+1)}), which is equal to the length of the route from the fault's
location to vij. For the nodes not included into the routes all the
entering and outgoing arcs have a zero weight. Rule 2 defines the marking
of outgoing arcs for the nodes of the graph [pic], where every node can be
an initial (([pic],[pic]) = (1,0)), an intermediate (([pic],[pic]) =
(0,0)), or a final (([pic],[pic]) = (0,1)) point of the route.
The predecessor for the node vij at the intersection of the several routes
is selected from the direction having the minimum length of the route from
the fault location to the vij node. Such selection forms the favourable
conditions for fast completion of the long paths. If in some node vij(
[pic] several entering arcs are intersected, Rule 3 is correct.
Rule 3. The weights {[pic]} of the entering arcs of the node vij (
[pic](V, [pic]) change their values on maximum ratings LX in case the
weights {[pic]} of these arcs in the graph [pic](V,[pic]) have positive,
but not maximum values.
The examples of the final states of the [pic] graph meshes constructed
according to Rules 1,2,3 for the torus with the given variants of
allocation of faulty and backup nodes (Fig. 1) are shown in Fig. 3. The
results of the microcontroller's recustomizing on execution of the new
program units are represented by the values ((i, (j) in the nodes of the
graph [pic].
[pic] [pic]
Fig. 3. Example of the graph [pic] constructing (Rules 2,3)

The following Rule 4 is correct when the values mij of the quantity of the
permitted directions of the backup nodes' reachability in the graph [pic]
(or the quantity of outgoing arcs not equal to the LX) are used as the
weights of outgoing arcs of the node vij ( [pic]. The value mij is called
the degree of outcome of the node.
Rule 4. If the outgoing arc of the node vij( [pic] with the degree of
outcome mij has the weight [pic]&[pic]LX, the corresponding outgoing arc of
the graph [pic] is marked by the weight [pic] = [pic], if ([pic],[pic]) =
(1,0) or [pic].
According to Rule 4 for the route's node vij, at least one of its entering
arcs has the weight [pic], which is equal to the quantity of the outgoing
arcs in the graph [pic] for the node-predecessor from the k direction. In
such a manner as in Rule 2, the node-follower for the node vij is selected
from the direction of the minimal reachability of the backup nodes ([pic]).
The selection of the node-follower for the given variant of constructing of
[pic] is executed (also according to Rule 3) from the direction having the
minimum degree of outcome. In this case, there are the favourable
conditions for fast completion of the routes with the worst possibilities
of the directions' change, that increases the correcting ability of the
algorithm.
The examples of the graph meshes' constructing according to Rules 3,4 and
the results of the nodes' recustomizing for toruses with the above
considered (in Fig. 2) combinations of faults are given in Fig. 4.
[pic] [pic]
Fig. 4. Example of the graph [pic] constructing (Rules 3,4)

Fig. 5 shows the three stages of evolution of the graph meshes [pic]
during the correction of the intersected routes, which are determined
according to Rules 2,3 and 3,4.
The next section gives the example of the variant of the proposed
algorithm's formal record (Rules 1, 2, 3) used in further work for the
algorithms' simulation and the synthesis of elements of the network of
microcontrollers' recustomizing.

2.2. Cellular Algorithms of Parallel Recustomizing

The cellular algorithm of the recustomizing of multicontroller network is
specified by the system of parallel substituting operations over the set of
pairs (cells) (Sij,(i,j)), where Sij - is the word of the cell's state (the
processed data), (i,j) - is the name of a cell which coincides with its
coordinates in a graph mesh (i=1,2,...,M; j=1,2,...,N). Each cell (i,j) has
an access to the neighbouring cells data (i,j-1),(i+1,j),(i-1,j),(i,j+1).
The parallel computations are the iterative procedure over the data array.
At each step all the operations, which are ordered by the algorithm for all
cells, are executed, and the intermediate value of the cellular array
S={(Sij,(i,j)) [pic] i( 1,2,...,M, [pic] j ( 1,2,...,N } is generated. The
process of computation is completed, when any cell cannot change its data
Sij. The stable state of the cellular array corresponds to the result of
parallel processing of the initial array with the data about faults in the
graph mesh.
[pic] [pic] [pic]
(a)
[pic] [pic] [pic]
(b)
Fig. 5. Example of elimination of conflicts with use of Rules 2,3 (a) and
Rules 3,4 (b)

The word Sij is defined allowing for the specificity of computations being
executed in the cell. For the offered cellular algorithms, the word of the
cell's state will be:
[pic]
where ([pic])({0,1} - are the variables of the state and customizing of
the node [pic]; [pic] - is the digital vector, the k position of which is
the record of the quantity of the potentially recustomized microcontrollers
of the route from the node [pic] to the backup node in the k ( {1,2,3,4}
direction,[pic]= LX if the movement to the backup in the k direction is
prohibited (LX - is the value of the length of route, which is more than
any real length of route in the mesh of given dimensionality); [pic] - is
the digital vector, the k position of which is the record of the quantity
of the recustomized microcontrollers of the route from the faulty node to
the [pic], if the k direction belongs to the route, [pic]=0, if the k
direction does not belong to the route; [pic]=([pic]) - is the set of
variables of the correction of the competitive directions characterizing
the possibility ([pic]=0) or the prohibition ([pic]>0) of choice of the
node-predecessor from the k direction; [pic], [pic] - are the digital
vectors having the positive values in the positions of the node-follower
([pic]>0) and the node-predecessor ([pic]>0) of the route.
According to Rules 1, 2, 3 of processing the models [pic], [pic], the
cellular algorithm is represented by the system Ф = {(1, (2, (3, (4, (5,
(6, (7}. Allowing for the designations [pic], [pic] the operations {(1((7}
will be:
[pic] [pic]
[pic] [pic].
[pic] for [pic]((i,j-1),(i+1,j),(i-1,j),(i,j+1) we find from the formulas:
[pic] [pic]
[pic]
[pic]
[pic],
[pic]
[pic]
[pic]
The system Ф2 of the cellular operations for the variant of the
parallel algorithm, which implements Rules 1,3,4, is composed similarly to
the Ф1. The systems Ф2 and Ф1 have the different rules of execution the
operation (7: [pic]
[pic]
The technical decisions of the hardware implementation of the
microcontroller recustomizing devices have been created for each variant of
the cellular algorithms. The cellular algorithms of recustomizing are
represented by the network of microcontrollers recustomizing on the
corresponding copies' execution (Fig. 6). Each microcontroller ([pic]) has
its own recustomizing device ([pic]), which forms the address of the
attached copy using the data from the neighbouring devices and the state of
its own microcontroller.
[pic]
Fig. 6. The structure of the fault-tolerant multicontroller network

The hardware implementation of the cellular algorithm Ф1 (the network of
the devices for the microcontrollers recustomizing) has the links'
structure shown in Fig. 7. The composition of the recustomizing device
(Fig. 8) includes the block [pic] of calculating the functions of backup
elements' reachability (operation [pic]) and the block [pic] of calculating
the functions of route's quality and microcontroller recustomizing
(operations [pic],[pic]).
[pic] [pic]
|Fig. 7. Links structure of the |Fig. 8. The composition of the |
|recustomizing device |recustomizing device |


3. The Simulation Results

The results of the cellular algorithms' simulation have shown the
correctness of the offered approach to the parallel recustomizing of
network microcontrollers with the distributed allocation of the backup
elements.
The examples of modeling of the cellular algorithms (Ф1, Ф2) are
illustrated by the final states of the cellular arrays (Tab. 1,2) for the
one combination of faults (Fig. 3,4). The tables show the values of the
variables of the cellular array [pic][pic]; [pic]; [pic];
[pic][pic][pic][pic].
|Table 1. Final state of the |Table 2. Final state of the |
|cellular array ([pic]) |cellular array ([pic]) |
| |0303030|0202020|LXLXLXL|LX01010| | |0202020|0202020|LXLXLXL|0101010|
| |3 |2 |X |1 | | |2 |2 |X |1 |
| |0001000|0000000|0000000|0002000| | |0000000|0100000|0000000|0400000|
| |0 |0 |0 |0 | | |0 |0 |0 |0 |
| |0000000|0000000|0000000|0200000| | |0000000|0000000|0000000|0000000|
| |0 |0 |0 |0 | | |0 |0 |0 |1 |
|2|00 |01 |00 |00 | |2|00 |01 |00 |00 |
| |0202020|0101010|0202020|0202020| | |0202020|0101010|0202020|0202020|
| |2 |1 |2 |2 | | |2 |1 |2 |2 |
| |0002000|0000000|0000000|0000000| | |0000000|0000040|0000000|0000000|
| |0 |3 |0 |0 | | |0 |0 |0 |0 |
| |0000000|0000000|0000000|0000000| | |0000000|0000000|0000000|0000000|
| |0 |0 |0 |0 | | |0 |0 |0 |0 |
|3|00 |00 |00 |10 | |3|00 |00 |00 |10 |
| |0303030|0202020|0303030|LXLXLXL| | |0303030|0202020|0202020|LXLXLXL|
| |3 |2 |3 |X | | |3 |2 |2 |X |
| |0000000|0000000|0000000|0000000| | |0000000|0300000|0400000|0000000|
| |0 |0 |0 |0 | | |0 |0 |0 |0 |
| |0000000|0000000|0000000|0000000| | |0000000|0000000|0000000|0000000|
| |0 |0 |0 |0 | | |0 |0 |0 |0 |
|4|10 |00 |01 |00 | |4|10 |00 |01 |00 |
| |LXLXLXL|0303030|LXLX01L|LX02020| | |LXLXLXL|0202020|01LXLX0|0202020|
| |X |3 |X |2 | | |X |2 |1 |2 |
| |0000000|0000000|0000010|0001000| | |0000000|0000000|0200000|0400000|
| |0 |0 |0 |0 | | |0 |0 |0 |0 |
| |0000000|0000000|0202000|0100000| | |0000000|0000000|0003040|0000000|
| |0 |0 |2 |0 | | |0 |0 |0 |0 |


For the different variants of the algorithm with the change of the network
dimension and the backup nodes' allocation, we have researched the
following characteristics by means of the method of statistical trials: the
ability of the algorithm to neutralize a random distribution of t faulty
elements; the value of time required for the recovery (the quantity of the
cellular algorithm's iterations); the alterability of the multicontroller
network (the part of the recustomized microcontrollers in the total
quantity of the mesh nodes). The graphs of the dependence of the correcting
ability [pic] upon the quantity t of faults for the toruses (curves 1,2)
and for the unwrapped meshes (curves 3,4) with 10[pic]10 dimension and with
the quantity of backup elements equal to 10 are shown in Fig. 9. [pic] was
estimated by the part of the corrected combinations of t faulty elements in
the total quantity of combinations. To compare the different variants, Fig.
9 shows the graphs for the two modifications of the operation (3 of the
cellular algorithm.
The curves 1,3 are obtained for the algorithm Ф2, curves 2,4 - for the
algorithm Ф1. The increase of the correcting ability of the first variant
(curves 1,3) is explained by the lowering of the number of the corrections
of directions' conflicts (including fatal) with a small (5-7 %) elongation
of the routes, that is confirmed by the analysis of the evolution of
cellular arrays with a significant number of faults.
The decrease of the recustomized elements' quantity for the distributed
allocation of the backup nodes in comparison with the column (row) of the
backup elements for the torus and the mesh is 40-50 %.
[pic]
Fig. 9. Graph of the correcting ability's dependence upon the quantity of
faults


Conclusion

The given paper presents the parallel algorithms of fast adaptation of the
multicontroller network at the presence of faults permitting to construct
independent, scalable, fault-tolerant networks with decentralized and
distributed facilities of recovery after the faults detection. The
microcontroller of this network includes the logical microcontroller with
backup copies of the neighbouring microcontrollers program units and the
device of recustomizing. The feature of the parallel algorithms of recovery
is the cellular processing of faults by means of the rules, which are
independent of the quantity and allocation of faults in the network, and
the recustomizing of the microcontrollers capable of working to the copies
of the neighbouring nodes' programs.
The paper offers the variants of the two cellular algorithms based on
parallel construction of the not intersected paths from faulty nodes to
backup nodes and on recustomizing of the routes' microcontrollers. The
proposed algorithms implement the rules of the evaluation of the backup
nodes' reachability providing the construction of the paths in rational
directions and preventing the hit of the paths to deadlocks. In the first
algorithm the value of the maximum current length of route is used as the
criterion of direction's priority at the resolution of conflicts of the
intersected routes, that has allowed to minimize the total quantity of the
recustomized microcontrollers. In the second algorithm the priority is
given to the direction with minimum quantity of the directions of the
backup elements' reachability that has allowed to increase the correcting
ability of the algorithm at the significant number of faults.
The simulation results of the proposed algorithms have shown their high
performance.

References

1. Bae M.M., Bose B. " Resource placement in torus-based networks ", IEEE
Transactions on Computers, vol. 46, no. 10, pp. 1083-1092,1997.
2. Banerjee P., Peerce M. "Design and evaluation of hardware strategies
for reconfiguring hypercubes and meshes under faults", IEEE Transactions on
Computers, vol. 43, no. 7, pp. 841-848, 1994.
3. Bertossi, A. A., Mancini, L., "Fault-tolerant LPT task scheduling in
multiproceccor system", Microprocessors and Microsystems, vol. 16, no. 2,
pp. 91-99, 1992.
4. Koloskov, V. A., Medvedeva, M.V., Medvedev, A. V. "Cellular Self-
Organization of a Fault-tolerant multi-microcontroller", Automatic Control
and Computer Sciences, no. 2, pp. 50-57, 2000.
5. Krishna, C.M., Shin, K. G., "On Scheduling Tasks with a Quick Recovery
from Failure", IEEE Transactions on Computers, vol. C-35, no. 5, pp. 448-
455, May 1986.
-----------------------
t

4

3

2

K(t)

1