Document retrieved from a search engine cache. Original document address: http://www.eso.org/~qc/dfos/dfoMonitor.html
Last modified: Mon Feb 1 16:58:17 2016
Indexed: Sun Apr 10 00:39:47 2016
dfoMonitor

Common DFOS tools:
Documentation

dfos = Data Flow Operations System, the common tool set for DFO

v4.1.2:
- new cronjob checks for cleanupRawdisp and 'dfosCron -t autoDaily'

The tool manages the daily workflow. See the pages under 'Operations' for information about that workflow.
 

v4.2:
- new button 'reference' for calling rawdisp2reference (here)

v4.2.1:
- minor change for PHOENIX (sci_Certif flag supported again)




[ used databases ] sara..transfer_requests (transfer status); observations..data_products; ngas..ngas_files (NGAS access)
[ used dfos tools ] qcdate, ngasClient; checks output from autoDaily, createReport, ingestProducts
[ output used by ] $DFO_MON_DIR/dfoMonitor.html; tool called by autoDaily and some tools of the daily workflow
[ upload/download ] upload: XDM / ngasWatcher / transferWatcher files to WISQ; HTML output to qcweb
topics: description | installation checks | PHOENIX | XDM | AB number | data transfer | checkboxes: autoDaily | HC monitor | calChecker | last ABs | Links: service user links | output pages | how to use | configuration | status | operational aspects | technical: decision making

dfoMonitor

[ top ] Description

enabled for parallel execution

This tool provides the central GUI for monitoring and managing the QC daily workflow. It scans the active DFO dates, reads and displays their process status, and offers the next reasonable workflow steps. It serves as the standard interface to the whole daily workflow, with the goal of offering all required functionality and interactivity.

With version 4.0, the tool supports both the standard dfos workflow and the PHOENIX workflow. For PHOENIX, the dfoMonitor is mainly a passive monitor and has many functions switched off. In the following, the 'dfos only' functions are marked.

The monitor offers action buttons to launch the next step interactively. Mouse over any of these buttons to get a short description of the offered functionality. Enabling of workflow steps is based on the rule set described below.

At some steps, the monitor writes additional information like 'number of raw files'. At some steps links are offered to related information:

Links are offered to the nightlogs. Data reports are linked to the HTML output generated by createReport.

There is a link to the daily calChecker result pages ("CAL"). They are permanently stored under $DFO_LST_DIR/CALCHECK.

The link 'status' is an extraction from DFO_STATUS for the corresponding date, intended as an overview of the current processing status.

The tool displays filtered files, as detected by filterRaw. If an entry exists in $DFO_LST_DIR/filt_<instr>_<date>.txt, the corresponding box is colored yellow, and a link to the list is offered.

The tool displays whether the night had SM or VM (or both) SCIENCE runs. This information is extracted from the data reports.

There is the standard 'refresh' option, plus options to update the load and disk status.


[ top ] Installation checks. The tool monitors the DRS_TYPE (as configured in config.createAB): condor on (CON) or off (anything else). The configured $DFS_RELEASE is displayed and compared to the default dfs setting under /home/flowmgr . If configured, $MIDAS_CHECK compares the default MIDAS version (defined in /home/flowmgr/dfs) to $MIDVERS. Finally, the currently enabled pipeline version is displayed (with the detmon pipeline being filtered out).

[ top ] Monitoring of AB number. If the number of ABs in $DFO_AB_DIR exceeds a certain limit, the AB monitor (getStatusAB) becomes slow, and this also slows down autoDaily. There have been cases where autoDaily effectively got stuck. To make potential issues visible, the total number of ABs in $DFO_AB_DIR is monitored. It scores red if a hard-coded threshold is hit. Currently this threshold is 2500.

N_ABs:
530
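
The threshold check described above can be sketched in a few lines of Python. This is only an illustration, not the tool's actual code: the function names, the assumption that ABs are the files ending in '.ab' directly under the directory, and the 'green' label for the non-alert case are all assumptions. Only the threshold of 2500 and the red score come from the text.

```python
from pathlib import Path

AB_THRESHOLD = 2500  # hard-coded limit quoted in the text


def count_abs(ab_dir):
    """Count the files ending in '.ab' directly under ab_dir (assumed AB naming)."""
    return sum(1 for p in Path(ab_dir).glob("*.ab") if p.is_file())


def score_ab_count(n_abs, threshold=AB_THRESHOLD):
    """Return 'red' once the threshold is reached, else 'green' (label assumed)."""
    return "red" if n_abs >= threshold else "green"


if __name__ == "__main__":
    # The example value displayed above (N_ABs: 530) is well below the limit.
    print(score_ab_count(530))
```

In practice count_abs would be pointed at the directory behind $DFO_AB_DIR.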

[ top ] Disk space, XDM. The data disk space is monitored because no automatic processing is possible when the data disk is full. A quick overview is provided:

data disk: 120.5 GB (30%)

When clicked, it is updated in the background ('.ash' mechanism).

The XDM (eXtended Disk space Monitor) provides detailed feedback about the disk space usage on the data disk. It monitors the following data disk directories:

disk space on $DATA_DISK (total: 870 GB)
RAW: $DFO_RAW_DIR
CAL: $DFO_CAL_DIR
SCI: $DFO_SCI_DIR
DFS: $DFS_PRODUCT
LST: $DFO_LST_DIR
(RAW to LST: updated each time dfoMonitor is called)
*HDR: $DFO_HDR_DIR
*PLT: $DFO_PLT_DIR
*LOG: $DFO_LOG_DIR
(*HDR, *PLT, *LOG: these values are normally read from the DFO_STATUS file and therefore static! They are updated on demand, using [refresh], which will take a couple of seconds. They are also updated eventually when they get removed from DFO_STATUS, i.e. if 5000 new entries make them outdated so that they are auto-removed.)
SUM: sum of all above
OTH: all other data on $DATA_DISK in non-standard folders
FREE: remaining free disk space

Disk space used by the directories is listed in GB; the bar also indicates usage in percentage. The disk space score is green if less than 80% disk volume is occupied, and red if more than that.

If a quota is defined in the config file (DATA_QUOTA), it is indicated and taken into account.
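
As an illustration of the scoring rule, here is a minimal Python sketch. It assumes that a DATA_QUOTA of q percent means only q% of the disk volume counts as available; the text only says the quota is "taken into account", so this interpretation, the function name and the handling of exactly 80% are assumptions.

```python
def disk_score(used_gb, total_gb, quota_percent=100.0):
    """Return (score, occupied_percent) for the data disk.

    Assumption: with a quota of q percent, occupancy is measured against
    q% of the total disk volume rather than against the full volume.
    """
    available_gb = total_gb * quota_percent / 100.0
    occupied_percent = 100.0 * used_gb / available_gb
    # The text leaves exactly 80% unspecified; it is treated as red here.
    score = "green" if occupied_percent < 80.0 else "red"
    return score, occupied_percent


if __name__ == "__main__":
    print(disk_score(450.0, 1000.0))        # no quota configured
    print(disk_score(450.0, 1000.0, 50.0))  # DATA_QUOTA = 50
```

The two calls show how the same usage can score differently once a quota is configured.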

The XDM is exported to http://www.eso.org/observing/dfo/quality/WISQ/XDM/XDM.html and linked to the WISQ monitor on the navigation bar.


[ top ] PHOENIX notifications

phoenix is the workflow tool for automatic science processing. It is used by the IDP accounts on muc08, muc09 and muc10. Find more information here.

The following section is relevant only for an instrument that has a PHOENIX process set up on muc08/muc09/muc10. For all other instruments this section can be ignored.

For the stream part of a phoenix process, it is desirable to have a signal from the operational account that a certain set of master calibrations has been finished and is available for phoenix processing. This 'batch unit' for phoenix is one month, for no other than pragmatic reasons.

To become 'phoenix-enabled' the dfoMonitor needs to be configured. There are the configuration keys PHOENIX_ENABLED and PHOENIX_ACCOUNT to be defined properly, see further below.

Here is a quick overview of the workflow. Find more information on the phoenix page.

a) If PHOENIX_ENABLED and PHOENIX_ACCOUNT are set, histoMonitor, when encountering a new month upon being called from 'finishNight', sends a signal (email) to the QC scientist that a new month has started, meaning that a set of certified master calibrations is available for the previous month. A new status flag 'phoenix_Ready' is written into DFO_STATUS, along with the previous month (format YYYY-MM).

b) This flag is caught by dfoMonitor and used to flag that month on the main output page:

PHOENIX: 2013-06

If no new PHOENIX job is on the ToDo list, this field is empty:

PHOENIX: none

 

c) Then the QC scientist can launch this new PHOENIX job.

d) When this step has been done, the QC scientist can confirm it by pushing the button 'done' on the dfoMonitor. This triggers a dialogue in which the user is asked to confirm the execution, and then this month is removed from the DFO_STATUS file. If there is more than one month, all months will be offered for confirmation, one after the other.
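
The new-month detection in step a) can be illustrated with a small Python sketch. It is not histoMonitor's actual logic: the function name and the idea of comparing the previously finished night with the current one are assumptions made for illustration. Only the YYYY-MM format of the flagged month comes from the text.

```python
from datetime import date


def finished_month(previous_night, current_night):
    """Return the month of previous_night as 'YYYY-MM' if current_night
    falls into a different month, otherwise None."""
    same_month = (current_night.year, current_night.month) == (
        previous_night.year,
        previous_night.month,
    )
    if same_month:
        return None
    return f"{previous_night.year:04d}-{previous_night.month:02d}"


if __name__ == "__main__":
    print(finished_month(date(2013, 6, 30), date(2013, 7, 1)))
```

A returned value would correspond to the month written along with the 'phoenix_Ready' flag.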


[ top ] Data transfer links (dfos only). This checkbox has links related to the data transfer system (DTS), plus two rows for status checks of NGAS access ("ngas") and of the health of the transfer process ("transfer"), plus two buttons to launch queries. The ngas status is checked each time the dfoMonitor tool is launched, by launching an ngas download with ngasClient (the file is hard-coded as $TEST_FILE). If an error occurs, its code is displayed. As a timeout mechanism, the monitor waits at most 60 seconds for ngasClient and then aborts. For performance reasons, the DTS test and the ngas download run in the background, and the result from the previous execution is displayed. This is usually good enough, since dfoMonitor is called by many tools and the displayed result is therefore usually sufficiently up to date.

"Transfer" is checked with a query to the sara database, which hosts file names and transfer status values. All CALIB files with transfer status < 6 (meaning not yet in the primary archive) are found, if the delay is more than 1 hr and less than 72 hrs. The one with the longest delay is displayed. If none is found, the "transfer" status is ok, otherwise nok. There is also an indication for delays of files of any type, but this is not used for the nok alert. This is motivated by the fact that for incremental processing, and for the closure of the QC loop with Paranal, CALIB files are by far the most important files. To avoid false alerts, delays of less than 1 hour are not evaluated. Delays of more than 72 hours are disregarded as well, since it is assumed that they might be due to database inconsistencies. This is not always true, but the tool cannot decide this.

The complete query result is displayed upon launching the red action button (line labelled as "longest delay"). The green action button launches the inverse query, all archived files with status 6 and their delay values (time between OLAS archiving on Paranal and in the primary archive in Garching).
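
The selection rule for the "transfer" flag (CALIB files, status < 6, delay between 1 and 72 hours) can be sketched in Python as follows. The record type, field names and function names are hypothetical, and the sketch filters an in-memory list rather than querying the sara database.

```python
from dataclasses import dataclass


@dataclass
class TransferRecord:
    filename: str
    category: str       # dpr.catg, e.g. "CALIB" or "SCIENCE"
    status: int         # transfer status; 6 means "in the primary archive"
    delay_hours: float


def longest_calib_delay(records, min_h=1.0, max_h=72.0):
    """Return the not-yet-archived CALIB record with the longest delay
    inside the (min_h, max_h) window, or None if there is none."""
    candidates = [
        r for r in records
        if r.category == "CALIB" and r.status < 6 and min_h < r.delay_hours < max_h
    ]
    return max(candidates, key=lambda r: r.delay_hours, default=None)


def transfer_flag(records):
    """'nok' if any CALIB file is delayed within the window, else 'ok'."""
    return "ok" if longest_calib_delay(records) is None else "nok"
```

Delayed SCIENCE files, or CALIB files outside the window, leave the flag at ok, in line with the rule that only CALIB delays raise the alert.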

Finally, the link to the DataTransfer monitor displays the complete information for all files, plus statistics. The 'total' link relates to the Evalso monitor which is running on Paranal to measure the current Evalso bandwidth from/to Paranal. The link called 'Reuna' measures the bandwidth of the Reuna link (Chile to Europe). Both are useful to monitor the current DTS bandwidth and for analysing transfer issues.

The Evalso monitor is also displayed in the bottom monitor panel called "system".

Data Transfer:   Monitors: DataTransfer | Evalso: total | Reuna
ngas
transfer: no CALIB file delayed by >1 hr

In case of problems, flags will turn red, e.g.:

Data Transfer:   Monitors: DataTransfer | Evalso: total | Reuna
ngas
transfer: longest CALIB delay: VISIR.2008-11-08T08:13:11.123.fits CALIB (2.5 hrs)
          longest delay (any dpr.catg): VISIR.2008-11-01T08:13:11.123.fits SCIENCE (54.4 hrs)

The ngas and the transfer flags are exported to the web server and embedded in the calChecker and the HC monitor.
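
The 60-second timeout around the ngasClient call mentioned in this section can be illustrated with a generic Python helper. The helper name is hypothetical and the command in the demo is a stand-in: the actual ngasClient arguments are not given in this document and are deliberately not guessed here.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s=60.0):
    """Run cmd; return its exit code, or None if it exceeded timeout_s."""
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return None
    return result.returncode


if __name__ == "__main__":
    # Stand-in command; replace with the real download call in practice.
    print(run_with_timeout([sys.executable, "-c", "print('ok')"]))
```

A non-zero return value would correspond to the error code displayed by the monitor, and None to the aborted (timed-out) case.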


[ top ] autoDaily checkbox (dfos only). This checkbox is intended to make the current status of the processing scheme more transparent. It checks for:

autoDaily?
enabled as cronjob
cleanupRD enabled
dfosCron monitoring enabled

This box must be green for dfos installations. The configured cronjob pattern is visible when hovering the mouse.

The activities of autoDaily are displayed in real-time underneath the XDM. If autoDaily is not running, this box displays:

autoDaily: no dates

If there is autoDaily activity, messages will inform about progress. You can follow the workflow by clicking on the 'log' link:

autoDaily: list_data_dates
log autoDaily running!
calling createAB

[ top ] HC monitor checkbox (dfos only). This checkbox monitors the proper update pattern of HC reports. It checks for the existence and proper scheduling of the following jobs:

HC monitor updates?
JOBS_TREND enabled as cronjob
JOBS_HEALTH existing
JOBS_NAVBAR enabled as cronjob

[ top ] calChecker checkbox (dfos only). The first checkbox checks for the existence and the proper scheduling of the calChecker cronjob (to be called every half hour). The second one checks if once a day the FULL mode is called, as a safety mechanism.

calChecker?
enabled as cronjob
FULL: enabled as cronjob

[ top ] AB checkboxes (dfos only). These checkboxes are used to monitor the autoDaily execution. The following information is displayed:

Last created AB: GIRAF.2012-09-13T13:07:07.895_tpl.ab (age: 20.4 h)
  Last processed AB: GIRAF.2012-09-13T13:07:07.895_tpl.ab
Last autoDaily: 2012-09-14T09:30:45 (age: 0.0 h)

The last autoDaily execution is written into the file $DFO_MON_DIR/autoDWatcher.html and exported to the HC web site. It is included there in the monitor page http://www.eso.org/observing/dfo/quality/ALL/qc1_info.html, ready to be inspected by the QC shiftleader. It is automatically flagged red if its age exceeds 6 hours.
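
The age check can be sketched in Python as follows. The function names and the 'green' label are assumptions; the timestamp format follows the "Last autoDaily" example above, and the 6-hour limit is the one quoted in the text.

```python
from datetime import datetime


def age_hours(last_run, now):
    """Hours elapsed between a 'YYYY-MM-DDThh:mm:ss' timestamp and now."""
    started = datetime.strptime(last_run, "%Y-%m-%dT%H:%M:%S")
    return (now - started).total_seconds() / 3600.0


def autodaily_flag(age_h, max_age_h=6.0):
    """Return 'red' if the last autoDaily run is older than max_age_h hours."""
    return "red" if age_h > max_age_h else "green"


if __name__ == "__main__":
    age = age_hours("2012-09-14T09:30:45", datetime(2012, 9, 14, 16, 0, 45))
    print(age, autodaily_flag(age))
```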


[ top ] Service links. They come in the blue row between the header part and the date result part, divided into three parts:


[ top ] User links. Underneath the main section with the currently open dates, you find a tool bar with the possibility to launch dfos tools directly. These links can be defined by the user in the configuration file.

Graphical links. Further down there is standard information linked with graphical symbols, like

[ top ] System links. The monitor page displays the 4 GANGLIA performance reports for your host, plus the Evalso bandwidth monitor:

  H D w m        
performance load_report cpu_report mem_report network_report Evalso bandwidth
example

Use the H D w m links for easy switching between hour|day|week|month timescales for the Ganglia reports.

The respective server name is read via unix 'hostname'. These reports are produced by SOS under the main URL http://mucmp.hq.eso.org/ganglia/.

For more information about GANGLIA check out the help link on the dfoMonitor in the system monitor "GANGLIA" box.


[ top ] HTML output. The result HTML page is stored locally under $DFO_MON_DIR/dfoMonitor.html. It is also copied, with stripped-off functionalities, to the DFO web server (http://qcweb.hq.eso.org/~qc/<instr>/monitor).

The extended disk space monitor XDM is exported as a separate page. To have it included in the WISQ information system, it goes to the overview page http://www.eso.org/observing/dfo/quality/WISQ/XDM/XDM.html.

The tool uses the standard '.esh' and '.ash' mechanism to make the browser interactive. Find a description how to implement this here. The '.ash' functionality is used to interactively update the load or disk status in the background.

autoDaily (dfos only). dfoMonitor is enabled for autoDaily, the wrapper tool for automatically processing the initial part of the daily workflow. The status table on the top right part of the monitor page displays whether autoDaily is currently being executed, monitors the execution status and offers a link to the execution log.

dfoMonitor has some additional options (-a, -m, -q) which are not required for command-line usage but have been introduced for autoDaily.

The tool displays the ingestion status of calibration products (under 'cdb') and of science products (under 'sci'). This is useful to get a reminder about data sets not yet ingested, and is included here to deal with the situation that ingestProducts is usually called off-line. The tool checks for files list_ingest_CALIB_$DATE.txt in $DFO_LST_DIR.

To support incremental processing, the tool offers a special blue button for preliminary certification of TODAY's CALIB data. There you can provide feedback to Paranal staff (comments about ABs, certification flags). The workflow calls certifyProducts -L ("certifyP-light"). No data are moved, the AB monitor is updated and exported. See more on the certifyProducts page.

The tool manages the following off-line jobs (all under $DFO_JOB_DIR; dfos only):

Managing means: check if the file contains valid entries; offer links to watch, edit, and execute. The open tasks appear under 'ToDo', either in grey (nothing to do) or in yellow (something to do):

ToDo:
off-line processing: JOBS_NIGHT    watch [edit] [launch]
ingest products:     JOBS_INGEST   watch [edit] [launch]
cleanup (fits->hdr): JOBS_CLEANUP  watch [edit] [launch]

The tool also offers links to some log subdirectories ($DFO_MON_DIR/AUTO_DAILY and CRON_LOGS) and to the DFO_STATUS file, with the status flags. There is also an active link to launch the statistics tool, 'extractStat -i', which may be useful e.g. for the quarterly statistics.

It is possible to directly edit the configured values for the calibration memory depth, N_MCAL_LIST and N_VCAL_LIST, in the top 'CAL' section. You can also call the utility tools refreshVCAL and productExplorer there.

Notes: [edit] 
2007-01-28: 4 files Medusa2 instead of Argus; hide!
2007-01-27: some STD ~ok for fibre_effic correction, see what 2007-01-28 data look like.

POSTIT. There is the option to post notes, reminders etc. of temporary character into a text file and include them in the monitor (POSTIT function). Just click on the 'edit' link and create or edit the file $DFO_MON_DIR/DFO_POSTIT. The text will display in the dfoMonitor after refreshing.

Monitor navigation bar. There is also a user-friendly way to edit the monitor navigation bar. A link is offered to the configuration file (config.gui_navbar) which can be edited in the same way as the tool configuration file. The monitor navigation bar is included in all monitors.

NOTE: to update the navigation bar, first edit the config.gui_navbar file, then call dfoMonitor. All other monitors will then show the updated navigation bar after execution of the respective tool.


[ top ] Output


[ top ] How to use

Type dfoMonitor -h for on-line help (there is extended help available from the html page), and dfoMonitor -v for the version number. Type

dfoMonitor

to create or refresh the dfoMonitor.html page.

There are also hidden options -a (switch off check for autoDaily running); -m (to display the status message for autoDaily); -q (quiet mode, no logging). These are used by autoDaily.

The option -N is available for execution without ngas checking, on the command line.


[ top ] Configuration file

The tool reads its own config file plus some others. config.dfoMonitor defines:

Section 1: general parameters
XTERM_GEOM          e.g. 120x25+10+500          size and location of pop-up xterm (used by .esh functionality)
IMG_URL, DFOS_URL                               URLs for the images and for DFOS documentation
N_DFO_HDR           20                          number of latest directories $DFO_HDR_DIR/ to scan (has impact on performance!)
DATA_DISK           /data23/giraffe             name of data disk hosting the data
DATA_QUOTA          50                          quota on $DATA_DISK in percent (optional, default: 100)
CREATEAB_VCAL       NO                          call createAB -m SCIENCE with flag -N, optional (default: NO)
MIDAS_CHECK         YES                         YES|NO: if YES, displays actual versus default MIDAS version (optional, default: YES)
GANGLIA_FREQ        HOUR | DAY | WEEK | MONTH   time range of monitors (default: HOUR); managed by tool

CONDOR configuration:
CONDOR_CONFIG       /home/condor/condor_config  pathname of (global or special) condor config file (default: /home/condor/condor_config)
                                                dfo: condor_config; qc cluster: condor_config.QC
                                                muc: /etc/condor/condor_config

PHOENIX configuration (optional, only needed if a dfos account is monitoring a PHOENIX account):
PHOENIX_ENABLED     YES | NO                    if YES, a message is sent when a new month is started by histoMonitor
PHOENIX_ACCOUNT     sciproc@muc08               account and hostname for PHOENIX processing

Section 2: URLs for QC1 and trending
These URLs show up under the 'QC' section of the dfo Monitor. You can have two lines of QC related links. The first one is recommended for the current trending plots. The second one can be anything which you think is useful. You can use the following reserved strings to customize your trending bar:

  • SINGLE_VBAR: |
  • DOUBLE_VBAR: ||
  • BREAK: <br>
  • SPACE: &nbsp;

2.1: First line
Syntax:

  • ITEMxx: should be unique
  • LABEL: can be any string w/o blanks, will mark the link (use underscore for blanks)
  • DISPLAY: string to be displayed on the JavaScript 'onMouseOver' event (string w/o blanks; use underscore for blanks)
  • URL: complete URL to trending page
QC1_URL ITEM01 | bias | bias_trending | http://www.eso.org/qc/GIRAFFE/img/CURRENT/trend_bias_current.gif
(there can be multiple entries)
2.2: Second line
Other trending links, including links to the QC1_plotter.
QC2_URL ITEM01 | HealthCheck | HealthCheck_Monitor | http://www.eso.org/qc/ALL/daily_qc1.html
(there can be multiple entries)
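
To make the entry syntax concrete, here is a small Python sketch that splits one QC1_URL or QC2_URL line into its four fields and turns underscores in LABEL and DISPLAY back into blanks. The function name and the returned dictionary layout are assumptions; this is not how dfoMonitor itself reads the file.

```python
def parse_qc_url(line):
    """Split a 'QCn_URL ITEMxx | LABEL | DISPLAY | URL' entry into its fields."""
    key, rest = line.split(None, 1)
    # Split on the first three separators only, so the URL is kept whole.
    item, label, display, url = (field.strip() for field in rest.split("|", 3))
    return {
        "key": key,
        "item": item,
        "label": label.replace("_", " "),      # underscores stand for blanks
        "display": display.replace("_", " "),  # text shown on mouse-over
        "url": url,
    }


if __name__ == "__main__":
    entry = (
        "QC1_URL ITEM01 | bias | bias_trending | "
        "http://www.eso.org/qc/GIRAFFE/img/CURRENT/trend_bias_current.gif"
    )
    print(parse_qc_url(entry))
```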

[ top ] Status information

dfoMonitor reads status file information. The disk occupancies for "HDR", "PLT" and "LOG" are written into DFO_STATUS. The workflow wrapper called for science AB creation sets the flag 'sci_Verified' once the dialogue is continued beyond the AB certification step.

[ top ] Operational aspects (dfos only)


[ top ] dfoMonitor decision making

Find here a description of how dfoMonitor decides about the DFO status of a specific DATE. For each main step of the workflow, three fundamental states can be defined: WAIT, OFFER and DONE.

The WAIT and DONE status per workflow step is based on finding the corresponding status flag in DFO_STATUS (no matter when the step was executed).

The OFFER status is based on the last entry per DATE in DFO_STATUS.

Usually these three states are reached sequentially. But there are some cases where the OFFER state is kept although the step has already been executed. This applies to the createAB option, which is offered as long as the certifyProducts/moveProducts step has not been finished. The reason is that you may want to re-execute all or selected ABs when you discover an error or a bad product. This applies to the CALIB and SCIENCE modes separately.

The monitor has three colours to code these states: WAIT is coded grey, OFFER is coded yellow, DONE is coded green. As a special case, the raw_Incomplete status is coded red.

Per workflow step: OFFER (condition(s) to offer action; action offered) and DONE (condition; action)

entry for DATE
  general conditions for entry:
  • $DFO_RAW_DIR/<DATE> exists
  • fits_Requested is set
  • removed is not set
  • DATE is in latest $N_DFO_HDR dates on local disk
  current date always labelled as "today"
  action offered: none, except for the 'remove from dfoMonitor' option after finishNight

complete?
  green if raw_Complete set, otherwise yellow; current date: always yellow

VCAL/MCAL
  condition for entry: CALIB products for DATE in $DFO_CAL_DIR/MCAL and VCAL, resp.
  blue if DATE is contained in MCAL/VCAL; check also the select list on top

createAB (CALIB)
  OFFER condition: last status entry: raw_Complete or cal_AB or cal_Queued or cal_QC
  OFFER action: launch 'createAB -m CALIB'
  DONE condition: cal_AB set

CALIB ABs
  OFFER condition: last status entry: cal_AB or cal_Queued or cal_QC
  OFFER action: link to AB status page, number of ABs
  DONE: n/a (DONE state not offered)

certifyProducts and moveProducts (CALIB)
  OFFER condition: last status entry: cal_QC
  OFFER action: launch 'certifyProducts -m CALIB' plus 'moveProducts -m CALIB'
  DONE condition: cal_Certif set

certifyProducts -L
  OFFER condition: last status entry: cal_QC and DATE=$TODAY
  OFFER action: launch 'certifyProducts -m CALIB -L' plus update getStatusAB
  DONE: no flag set; no DONE since provisional

createAB (SCIENCE)
  OFFER condition: last status entry: cal_Updated or sci_AB
  OFFER action: launch 'createAB -m SCIENCE', then 'verifyAB'; interactive stop for verification; if OK call 'moveProducts -m SCIENCE'
  DONE condition: sci_AB set

SCIENCE ABs
  OFFER condition: last status entry: sci_AB
  OFFER action: link to AB status page, number of ABs; if N_AB = 0, 'finishNight' offered
  DONE: n/a (DONE state not offered)

finish
  OFFER condition: last status entry: cal_Updated or sci_Updated or (sci_AB and N_AB = 0)
  OFFER action: launch 'finishNight'
  DONE condition: finished set
  DONE action: offer the 'remove from dfoMonitor' option under DATE
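
The OFFER conditions above depend only on the last status entry per DATE (plus the number of science ABs for 'finish'), so they can be written as a simple lookup. The following Python sketch covers the four launch steps only; the function name and the returned list are assumptions, and the sketch leaves out the WAIT/DONE handling, the 'today' special cases and the follow-up calls of each step.

```python
def offered_steps(last_flag, n_sci_abs=None):
    """Return the launch actions offered for a DATE, given the last status
    flag found for that DATE in DFO_STATUS."""
    offers = []
    if last_flag in {"raw_Complete", "cal_AB", "cal_Queued", "cal_QC"}:
        offers.append("createAB -m CALIB")
    if last_flag == "cal_QC":
        offers.append("certifyProducts -m CALIB")
    if last_flag in {"cal_Updated", "sci_AB"}:
        offers.append("createAB -m SCIENCE")
    if last_flag in {"cal_Updated", "sci_Updated"} or (
        last_flag == "sci_AB" and n_sci_abs == 0
    ):
        offers.append("finishNight")
    return offers


if __name__ == "__main__":
    for flag in ("raw_Complete", "cal_QC", "cal_Updated", "sci_Updated"):
        print(flag, offered_steps(flag))
```

Note how createAB stays on offer while the last entry is cal_QC, which reflects the rule that ABs can be re-executed until certification has finished.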

The dfoMonitor supports standard situations. It relies on the internal consistency of the DFO_STATUS file. If the status file is filled inconsistently, the interpretation of the DFO status by dfoMonitor may be wrong.

The standard QC operational scheme DFOS (also called XXLight: science ABs created and moved; no processing, no certification) is supported in the following way: only the buttons 'createAB' and 'moveProducts' exist for SCIENCE. Affected workflows are marked by the DFOS logo.

If the dfoMonitor runs in the PHOENIX environment, it displays the PHOENIX logo.

For non-standard situations the dfoMonitor can also be used, to launch the key tools of the daily workflow from the GUI. For that purpose you may want to use the 'dfos' tool bar at bottom which can be used anytime, independent of status flags. Its main advantage over command-line calls (which are of course also an option) is that the underlying helper scripts know about the command syntax and guide the user interactively. Hence the dfoMonitor user does not really need to remember the exact dfos command syntax.