Äîêóìåíò âçÿò èç êýøà ïîèñêîâîé ìàøèíû. Àäðåñ îðèãèíàëüíîãî äîêóìåíòà : http://www.eso.org/~qc/dfos/cascadeMonitor.html
Äàòà èçìåíåíèÿ: Fri Feb 20 17:54:08 2015
Äàòà èíäåêñèðîâàíèÿ: Sun Apr 10 00:54:42 2016
Êîäèðîâêà:

cascadeMonitor

Common DFOS tools:
Documentation

dfos = Data Flow Operations System, the common tool set for DFO

make printable

new:

see also:

- v1.2.3: muc09/10 added; monthly mode for phoenix (mcalib mode)

mucMonitor

databases	none
dfos tools	none
output	cascadeMonitor.html, exported to http://qcweb.hq.eso.org/ALL/
upload/download	upload: cascadeMonitor.html (see output)

cascadeMonitor

Description

tool monitors condor execution

This tool visualizes the status of the current condor processing cascade on a MUC blade. It helps understanding the dependencies in a cascade, analyze recipe performance and interplay with the condor processing nodes on a muc blade. It further links to the other cascade monitors on the host and to the mucMonitor.

The tool has two main modes:

in DATE mode (-d), it visualizes the latest cascade of that date (or, if -c is used, that specified cascade), in a detailed overview ("high-resolution")
in ALL_DATE mode (-D), it visualizes all cascades executed on that date on the host ("low-resolution")

[It has also a monthly mode which is a special feature needed for phoenix 2.0, mcalib production.]

DATE mode:

Condor cascade for KMOS
Last update: 2013-02-07T15:06:04 (UT) by kmos@muc01 (0d 00h:00m:03s ago)
Browser refresh: every 30 sec | System load past minute: 0.19 Force browser refresh with Ctrl+R

The top panel has update information. In mode -d, the tool can be run in watch mode. The cadence of the tool for the watch mode is configurable and should be roughly adapted to the typical execution time of ABs (usually 30-60 sec is a good value). The browser refresh is adapted to the same value.

ALL_DATE mode:

Daily condor processing for muc02
Last update: 2013-02-07T15:06:04 (UT) by uves@muc02

In that mode the tool is called once (either on the command-line or by the JOBS_AUTO file). It collects information from all accounts on the host.

Navigation and monitor pages

The horizontal navigation has links to the overview page ('ALL') and to the detail pages (linked by instrument name). Y ou can switch to the cascadeMonitors of the other operational users of the same muc blade.

muc02 cascades: ALL giraffe uves xshooter queue
other

ALL (output of the ALL_DATE mode)

Processing date: 2013-03-03

You can navigate forward/backwards in time.

The plots shows all cascades executed on the specified DFO date and their execution times. All modes are included: CALIB, QCCALIB, and SCIENCE. The width of each green symbol corresponds roughly to the total execution time. The timescale is accurate to roughly one hour. Condor logs come in the time zone set by the account and are corrected to UTC. Here is one of three plots (muc02 has three operational accounts):

giraffe (GIRAFFE)

Max: 8 MAX: 100
0
	21	22	23	00	01	02	03	04	05	06	07	08	09	10	11	12	13	14	15	16	17	18	19	20
UT hours

uves (UVES)

...

xshooter (XSHOOTER)

...

Detail pages (last cascade per instrument)

This mode is called during, and immediately after, job execution and provides details of the condor processing. You find the output behind the instrument links (e.g. giraffe (GIRAFFE)). At any given time, there is only one output page, usually the one corresponding to the last executed cascade.

Job: CALIB_2013-02-09
Cascade: /data23/giraffe/condor/CALIB_2013-02-09-1360601594.23658609
Status: finished

The name of the analyzed job is displayed, as well as the cascade ID and its path. The status is either active or finished.

Total number of jobs: 28; finished: 28; exec_time: 18.7 m; scheduler: 0.2 m
Average consumption: 36.2% (2.9 of 8 cores)

These are parameters of the cascade. The 'scheduler' parameter displays the time used by the condor scheduler to analyze the cascade dependencies before it is actually launched. The average consumption is based on the time average of the number of used cores (part c) of the display).

Condor concepts

Condor knows four different AB states:

done processing job finished (either with or without succes, doesn't make a difference for condor)

blocked cannot be executed due to dependencies

waiting could be executed but currently not enough cores

active executing

There are two fundamentally important limitations to a cascade: dependency limited (some ABs need to be finished first, before others can execute, because of virtual calibrations required), and core limited (there are more executable ABs than cores). A dependency-limited cascade cannot benefit from more available cores, while a core-limited cascade would execute faster if more cores were available.

Condor has an internal watch process which loops every 10-15 sec and finds all status changes in the processing queue. These might be due to ABs having finished, started processing, or have their status changed from blocked to waiting. All visualizations on the detailed cascade monitor are state changes, i.e. they apply to one or more ABs having changed their state since the last state change. In general they do not refer to an individual AB although under special conditions this might be the case.

Main table

The output is divided in three parts: the finished jobs ("Done", green); the blocked jobs (the ones blocked by dependencies, red); the active and waiting jobs ("Used", grey ).

The display has numbers of ABs over elapsed time. The scaling factor of the time axis is configurable (to adapt to the specifics of the supported instrument). The cascade always starts with all ABs blocked (red) and ends with all ABs finished (green).

a) Done (finished jobs): 28

Total: 28 14:34:54 done: 0 (dot.2) [14:35:27] elapsed: 1.38 min; done: 2 (dot.3) [14:35:54] elapsed: 1.83 min; done: 5 (dot.4) [14:36:19] elapsed: 2.25 min; done: 8 (dot.5) [14:36:29] elapsed: 2.42 min; done: 8 (dot.6) [14:36:35] elapsed: 2.52 min; done: 9 (dot.7) [14:36:40] elapsed: 2.60 min; done: 10 (dot.8) [14:36:45] elapsed: 2.68 min; done: 11 (dot.9) [14:36:55] elapsed: 2.85 min; done: 14 (dot.10) [14:37:02] elapsed: 2.97 min; done: 15 (dot.11) [14:37:12] elapsed: 3.13 min; done: 17 (dot.12) [14:37:27] elapsed: 3.38 min; done: 19 (dot.13) [14:37:33] elapsed: 3.48 min; done: 20 (dot.14) [14:39:41] elapsed: 5.62 min; done: 23 (dot.17) [14:43:36] elapsed: 9.53 min; done: 24 (dot.18) [14:43:49] elapsed: 9.75 min; done: 24 (dot.19) [14:50:57] elapsed: 16.88 min; done: 25 (dot.20) [14:51:19] elapsed: 17.25 min; done: 25 (dot.21) [14:52:04] elapsed: 18.00 min; done: 26 (dot.22) [14:52:15] elapsed: 18.18 min; done: 26 (dot.23) [14:52:30] elapsed: 18.43 min; done: 27 (dot.24) The blue line spans the total number of jobs.

b) Blocked (still waiting): 0; processing: 0

Time: 0 blocked: 18 (dot.2) blocked: 13 (dot.3) blocked: 11 (dot.4) blocked: 9 (dot.7) blocked: 9 (dot.8) blocked: 8 (dot.10) blocked: 7 (dot.12) blocked: 3 (dot.17) blocked: 3 (dot.18) blocked: 1 (dot.21) blocked: 1 (dot.22) blocked: 0 (dot.23) blocked: 0 (dot.24) 18.7m (this is the total execution time of the cascade)

c) Cores used (max: 8)

Cores: 8 used+wait: 10 (dot.2) used+wait: 13 (dot.3) used+wait: 12 (dot.4) used+wait: 10 (dot.7) used+wait: 9 (dot.8) used+wait: 6 (dot.10) used+wait: 4 (dot.12) used+wait: 2 (dot.17) used+wait: 1 (dot.18) used+wait: 2 (dot.21) used+wait: 1 (dot.22) used+wait: 2 (dot.23) used+wait: 1 (dot.24) less than 8 cores used (the number of condor execution nodes on that blade): cascade is dependency-limited

these jobs are not blocked but wait for free nodes (node-limited)

Move your mouse over an icon and see some related technical information. dot.21 refers to the text file CALIB_2013-01-13.dot.21 in /data23/giraffe/condor/CALIB_2013-02-09-1360601594.23658609.

Output

-D: local as $DFO_LOG_DIR/<date>/cascadeMonitor_<date>.html
exported to qcweb/ALL/<muc>/<year>/cascadeMonitor_<muc>_<date>.html, e.g. http://qcweb.hq.eso.org/ALL/muc02/2013/cascadeMonitor_muc02_LAST.html
-d: local as $DFO_MON_DIR/cascadeMonitor.html
exported to qcweb/${DFO_INSTRUMENT}/monitor/cascadeMonitor.html
there is also the technical file cm_output_<date>.html exported to qcweb/$DFO_INSTRUMENTS}/logs/<date>/ (see below)

How to install

standard installation with dfosInstall

How to use

Type cascadeMonitor -h for a quick help, cascadeMonitor -v for the version number. Type

cascadeMonitor -D 2013-03-13

for the ALL_DATE mode: "low-resolution" overview of all cascades on the host for specified date. Type

cascadeMonitor -d 2013-01-13

to create the detailed ("high-resolution") cascade monitor for CALIB_2013-01-13 and your $DFO_INSTRUMENT;

cascadeMonitor -d 2013-01-03 -m SCIENCE

to create the cascade monitor for SCIENCE_2013-01-03 and your $DFO_INSTRUMENT;

cascadeMonitor -c CALIB_2013-02-09-1360601594.23658609

to visualize the specified cascade.

During execution of JOBS_NIGHT, or if you type it on the command-line, the tool runs in a loop, like

watch -n 30 cascadeMonitor -d 2013-01-13

The tool stops looping when it discovers that the cascade is finished (no .lock file found).

The supported condor cascade types are:

CALIB (condor AB processing of CALIB data, the dfo standard)
QCCALIB (condor QC reports of CALIB data, optional mode of parallel QC reporting)
SCIENCE (condor AB processing of SCIENCE data; for phoenix project)
QCSCIENCE (condor QC reports of SCIENCE data in parallel mode, for phoenix project)

Operations

While calling on the command line is possible anytime, it is called in watch mode by createJob. It is called twice (at the end of the cascade) within autoDaily, once with option -d for the cascade details, and once with option -D for the overview.

On the AB monitor, you also have a button to start the cascadeMonitor manually if you are interested in the cascade monitor as processing progresses.

The output is linked to the local AB monitor (link casc ).

Configuration file

The configuration file config.cascadeMonitor has the following keys:

Section 1: General configuration
MAX_CORES	8	number of cores on muc blades available for condor processing
TIME_SCALE	3	scaling factor for time axis; larger --> image gets compressed; default: 3
WATCH_CADENCE	30	cadence in sec of 'watch -n $WATCH_CADENCE ...' call in JOBS_NIGHT

Technical details

For the mode -D, the tool needs to collect information from all operational accounts on the muc blade. This is achieved in the following way:

the tool writes the core output table as cm_output_<date>.html into qcweb/$DFO_INSTRUMENT/logs/<date>
the tool gets the core output tables for the other instruments from their qcweb/$DFO_INSTRUMENT/logs/<date>/cm_output_<date>.html
this works in the same way for the other accounts
if no cm_output_<date>.html exists (because there are no data being processed) the tag "not available" is displayed

For the mode -d, no such scheme is necessary since it runs stand-alone on the account.

The list of operational accounts is read from the mucMonitor configuration file.

done	processing job finished (either with or without succes, doesn't make a difference for condor)
blocked	cannot be executed due to dependencies
waiting	could be executed but currently not enough cores
active	executing

Common DFOS tools: Documentation

cascadeMonitor

Description

Output

How to install

How to use

Configuration file

Common DFOS tools:
Documentation