

Correlation Engine Prototype


V.Pose, JINR Dubna
B.Panzer-Steindel, CERN IT



Overview


The Correlation Engine Prototype was developed during three months of work
in the CERN IT Division, financed by the CERN-Intas project.

The CERN monitoring prototype, part of the fabric management work package
(WP4) of the DataGrid project, gathers monitoring data from farm nodes at
CERN into a central monitoring database. Performing correlations on the
data in the monitoring database should help to:

. foresee exceptions on individual nodes and on node groups
. analyse performance of the farm.

The results of the correlations can therefore be used to reduce a system
administrator's workload in the following ways:

. to trigger automatic remedy actions
. to gather additional monitoring data that give a detailed view of the
current exceptional state of nodes; this additional information
should reduce the effort a system administrator needs to decide how
to treat the affected node(s).

The correlation engine prototype was designed to make it easy to add new
correlations of monitoring data and new actions triggered in case of
exceptions. New functionality is added in the form of new engines.
The current prototype is written in Perl, and the engines are subroutines
placed in a separate Perl module. Two engines are implemented: ayt and
procpu.

The ayt engine tries to connect to hosts for which the monitoring data are
missing or out of date, and to obtain some basic information about their
state.

The procpu engine:

. looks for nodes with a number of processes higher than usual and a
CPU usage lower than usual
. looks for nodes with a very high number of processes
. connects to such a node using the ayt engine
. gathers more information about the node state and stores it as a
report in a text file in a file database.

Currently no remedy actions are implemented.
The results of the correlation engine can be accessed through a
web-interface. The main pages of the web-interface are shown on the poster.




ayt engine


The ayt engine does the following for a given node:

1. tests whether the data read from the monitoring database are up to date:
   o the timeout is the sampling period of the metric plus a general
     wait interval
   o the sampling period for each metric is set in the configuration
     module Cfg.pm
2. if the data are out of date, tests the node for a ping response
3. if the node responds to ping, tries to open telnet port 23
4. if the port is open, tries to log in through a telnet session
5. if the login succeeds, runs the ps command
6. if ps succeeds, runs df -x afs
7. if df succeeds, runs df -t afs
8. if df -t afs succeeds, runs ls on a directory on the AFS partition
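
The staleness test and the step-by-step escalation above can be sketched as
follows. This is an illustrative Python sketch, not the prototype's Perl
code: the names is_stale, run_chain and WAIT_INTERVAL, the value of the wait
interval, and the stand-in probes are all assumptions made for the example.

```python
import time

WAIT_INTERVAL = 60  # general wait interval in seconds (assumed value)


def is_stale(sample_time, sampling_period, now=None):
    """Return True if a sample is older than sampling period + wait interval."""
    now = time.time() if now is None else now
    return now - sample_time > sampling_period + WAIT_INTERVAL


def run_chain(probes):
    """Run probes in order and stop at the first one that fails.

    `probes` is a list of (name, callable) pairs; each callable returns a
    true value on success. Returns the (name, ok) results obtained so far.
    """
    results = []
    for name, probe in probes:
        ok = bool(probe())
        results.append((name, ok))
        if not ok:
            break  # later steps depend on this one, so stop here
    return results


if __name__ == "__main__":
    # Stand-in probes: the node answers ping, but the telnet port is closed.
    probes = [
        ("ping", lambda: True),
        ("telnet port 23", lambda: False),
        ("login", lambda: True),
        ("ps", lambda: True),
    ]
    print(run_chain(probes))
```

Because each step runs only if the previous one succeeded, the last entry in
the result shows how far the node could be reached.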



procpu engine


The procpu engine does the following for a given node:

. tests the data read from the monitoring database against the threshold
sets configured in the configuration module Cfg.pm:
   o each threshold set can contain a minimum and a maximum threshold
     for each metric, so a given metric can have different thresholds
     in different sets
. runs the ayt engine on the node
. uses the telnet session opened by ayt to collect the information included
in the report.
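
A threshold-set test of this kind might look like the sketch below. It is
written in Python for illustration only; the set names, metric names and
numbers are invented, and it assumes that a set is triggered when every
metric it lists lies within the given minimum and maximum. The source does
not spell out the exact matching rule.

```python
# Hypothetical threshold sets; metric names and numbers are made up.
THRESHOLD_SETS = {
    "many_procs_low_cpu": {"nproc": {"min": 200}, "cpu_usage": {"max": 10.0}},
    "very_many_procs": {"nproc": {"min": 1000}},
}


def matches(threshold_set, sample):
    """True if every metric in the set is present and within its bounds."""
    for metric, bounds in threshold_set.items():
        value = sample.get(metric)
        if value is None:
            return False  # no data for this metric: cannot decide
        if "min" in bounds and value < bounds["min"]:
            return False
        if "max" in bounds and value > bounds["max"]:
            return False
    return True


def triggered_sets(sample, threshold_sets=THRESHOLD_SETS):
    """Return the names of all threshold sets that the sample matches."""
    return [name for name, tset in threshold_sets.items() if matches(tset, sample)]


if __name__ == "__main__":
    print(triggered_sets({"nproc": 250, "cpu_usage": 3.0}))
```

Keeping several named sets lets the same metric (here the number of
processes) carry a different limit in each correlation.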



Report


A report produced by the procpu engine for a node contains:

. the monitoring data read from the monitoring database and their
timestamps
. the thresholds set for the different metrics which are correlated
. values for the same metrics measured by the correlation engine
. a list (top 10) of open files sorted by number of links to the open
file
. a list (top 10) of processes running the same command sorted by the
number of processes running the command
. a list (top 10) of hosts to which the tested node is connected
through TCP sorted by the number of connections to the host
. a list (top 5) of states of internet sockets sorted by the number of
connections in that state
. virtual memory usage summary
. a list (top 5) of processes with top virtual memory usage
. a list (top 5) of processes with top physical memory usage.
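
Several of the report items are top-N lists ranked by a count. A minimal
Python sketch of such a ranking is shown below; the function name top_counts
and the sample command names are assumptions, and the prototype itself
produces these lists in Perl from command output on the node.

```python
from collections import Counter


def top_counts(items, n):
    """Return the n most frequent items as (item, count) pairs."""
    return Counter(items).most_common(n)


if __name__ == "__main__":
    # Made-up command names, as they might be extracted from ps output.
    commands = ["httpd", "sshd", "httpd", "bash", "httpd", "sshd"]
    for command, count in top_counts(commands, 10):
        print(f"{count:5d}  {command}")
```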




Implementation


The correlation engine prototype consists of the following Perl modules:

. main module ce.pl
. module containing the code of the engines Engine.pm
. library module Celib.pm
. configuration module Cfg.pm.

The web-interface consists of a few CGI scripts written in Perl.
Data are exchanged between the correlation engine and the web-interface
through text files.

The main module ce.pl implements in particular the following functionality:

. initializes common data structures of the engines from the
configuration module Cfg.pm
. reads data from the monitoring database
. periodically runs the engines
. saves the results of the engines into a file database.

The Engine.pm module currently contains the code for the ayt and procpu
engines.

The library module Celib.pm contains subroutines used by the engines. In
particular:

. an envelope to run commands on the node where the correlation engine
is running:
   o implements timeout functionality:
      . on timeout the process running the command is killed
   o measures command execution time
   o saves stdout, stderr and exit code of the command
. an envelope to run commands on a remote node:
   o implements timeout functionality:
      . in case of timeout sends the telnet BREAK signal
   o measures command execution time
   o saves stdout and stderr of the command
. a subroutine to limit the execution frequency of an engine on
a given node:
   o a minimal time interval is set for each engine in the
     configuration module Cfg.pm.
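
The envelope for local commands can be illustrated with the Python sketch
below. It is not the Celib code: the function name run_local and the layout
of the returned dictionary are assumptions. It relies on the standard
subprocess module, which kills the child process when the timeout expires.

```python
import subprocess
import sys
import time


def run_local(argv, timeout):
    """Run a command with a timeout and collect its results.

    Returns a dict with stdout, stderr, exit_code (None if the command was
    killed on timeout), a timed_out flag and the elapsed time in seconds.
    """
    start = time.monotonic()
    try:
        done = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
        result = {"stdout": done.stdout, "stderr": done.stderr,
                  "exit_code": done.returncode, "timed_out": False}
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child and waits for it before raising;
        # partial output is not kept in this sketch.
        result = {"stdout": "", "stderr": "", "exit_code": None, "timed_out": True}
    result["elapsed"] = time.monotonic() - start
    return result


if __name__ == "__main__":
    # Use the running Python interpreter as a portable stand-in command.
    print(run_local([sys.executable, "-c", "print('hello')"], timeout=10))
```

The remote envelope differs mainly in how a timeout is handled: there is no
local process to kill, so a BREAK is sent over the telnet session instead.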

The configuration module Cfg.pm contains subroutines returning hashes with
the following configuration information:

. nodes watched by the correlation engine grouped by clusters
. metrics read from the monitoring database with the following attributes:
   o sampling period
   o description
. engines executed by the ce.pl module with the following attributes:
   o name of the subroutine to call in Engine.pm
   o minimal execution interval of the engine on a node
   o engine-specific information, e.g. the threshold sets for the
     procpu engine.
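
The engine entries and the way the minimal execution interval could be
enforced are sketched below in Python. The dictionary layout, the interval
values, the node names and the function may_run are assumptions made for
illustration; in the prototype this information is returned as Perl hashes
by Cfg.pm.

```python
import time

# Hypothetical engine configuration; intervals are in seconds.
ENGINES = {
    "ayt": {"sub": "ayt", "min_interval": 300},
    "procpu": {"sub": "procpu", "min_interval": 600},
}

_last_run = {}  # (engine, node) -> time of the last accepted run


def may_run(engine, node, now=None):
    """Return True, and record the run, if the minimal interval has elapsed."""
    now = time.time() if now is None else now
    last = _last_run.get((engine, node))
    if last is not None and now - last < ENGINES[engine]["min_interval"]:
        return False  # the engine ran on this node too recently
    _last_run[(engine, node)] = now
    return True


if __name__ == "__main__":
    print(may_run("ayt", "node001"))  # first call for this pair is accepted
    print(may_run("ayt", "node001"))  # an immediate second call is refused
```

Tracking the last run per engine and node means that a node stuck in an
exceptional state is not probed again on every pass of the main loop.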

[pic]

Figure 1. Main screen of the web interface of the Correlation Engine
Prototype


. the Nodes box contains 2 nodes currently in an exceptional state
. the Report buttons show the last report or the last 10 reports for the
selected nodes
. the history links provide a 24-hour or 72-hour history of exceptions
. the Threshold section on the right shows the thresholds used for each
cluster

[pic]

Figure 2. The cluster status page shows the status of the 3 watched
clusters



[pic]

Figure 3.1. 24-hour history of exceptions, page 1


[pic]

Figure 3.2. 24-hour history of exceptions, continuation of page 1


[pic]

Figure 4. Search form for archived reports