Original document: http://theory.sinp.msu.ru/pipermail/ru-ngi/2012q1/000398.html
Last modified: Fri Feb 10 13:03:24 2012
Fri, Feb 10, 2012 at 11:52:53AM +0400, Victor Kotlyar wrote:
> I have noticed that over the last two days the ATLAS pilot has changed its behavior.
>
> Some "long" jobs have been launched:
>
> resources_used.cput = 48:32:40
> resources_used.mem = 1151908kb
> resources_used.vmem = 2387800kb
> resources_used.walltime = 48:47:45
>
> In the PanDA monitor the number of analysis jobs at our site has dropped,
> and the production section shows 0 together with the word "test" (same as for RRC-KI).
At our site the situation with ATLAS jobs has been more or less stable over the last 3 days:
{{{
08.02.2012
==========
*queue atlas, 3391 jobs, failed 0.06%, killed 0.00%, canceled 0.03%:
1 canceled jobs
3388 jobs with code 0
2 jobs with code 1
quality assessor says: wow, shit, 2% or even less of errors?
Does our cluster work at all? Or you're killing every job? ;))
Memory consumption
0 - 1Mb ==> 19 ]
1Mb - 100Mb ==> 3221 ]==============================
100Mb - 1Gb ==> 126 ]=
1Gb - 2Gb ==> 22 ]
2Gb - 3Gb ==> 3 ]
Vmem consumption
0 - 1Mb ==> 19 ]
1Mb - 100Mb ==> 5 ]
100Mb - 1Gb ==> 3272 ]==============================
1Gb - 2Gb ==> 90 ]
2Gb - 3Gb ==> 2 ]
3.2Gb - 3.5Gb* ==> 3 ]
CPU time consumption
0 - 1min ==> 3224 ]==============================
1min - 10min ==> 101 ]
10min - 1hour ==> 62 ]
6hours - 1day ==> 4 ]
Walltime consumption
0 - 1min ==> 50 ]
1min - 10min ==> 3009 ]==============================
10min - 1hour ==> 312 ]===
1hour - 6hours ==> 15 ]
6hours - 1day ==> 4 ]
1day - 2days ==> 1 ]
09.02.2012
==========
*queue atlas, 11395 jobs, failed 0.00%, killed 0.00%, canceled 0.01%:
1 canceled jobs
11394 jobs with code 0
quality assessor says: wow, shit, 2% or even less of errors?
Does our cluster work at all? Or you're killing every job? ;))
Memory consumption
0 - 1Mb ==> 1764 ]=====
1Mb - 100Mb ==> 8884 ]==============================
100Mb - 1Gb ==> 433 ]=
1Gb - 2Gb ==> 314 ]=
Vmem consumption
0 - 1Mb ==> 1764 ]=====
1Mb - 100Mb ==> 165 ]
100Mb - 1Gb ==> 8955 ]==============================
1Gb - 2Gb ==> 461 ]=
2Gb - 3Gb ==> 50 ]
CPU time consumption
0 - 1min ==> 10690 ]==============================
1min - 10min ==> 275 ]
10min - 1hour ==> 112 ]
1hour - 6hours ==> 261 ]
6hours - 1day ==> 57 ]
Walltime consumption
0 - 1min ==> 2230 ]=======
1min - 10min ==> 8408 ]==============================
10min - 1hour ==> 392 ]=
1hour - 6hours ==> 280 ]
6hours - 1day ==> 84 ]
1day - 2days ==> 1 ]
10.02.2012
==========
*queue atlas, 2162 jobs, failed 0.00%, killed 0.00%, canceled 0.00%:
2162 jobs with code 0
quality assessor says: wow, shit, 2% or even less of errors?
Does our cluster work at all? Or you're killing every job? ;))
Memory consumption
0 - 1Mb ==> 165 ]===
1Mb - 100Mb ==> 1547 ]==============================
100Mb - 1Gb ==> 129 ]==
1Gb - 2Gb ==> 302 ]=====
2Gb - 3Gb ==> 19 ]
Vmem consumption
0 - 1Mb ==> 165 ]===
1Mb - 100Mb ==> 14 ]
100Mb - 1Gb ==> 1610 ]==============================
1Gb - 2Gb ==> 298 ]=====
2Gb - 3Gb ==> 75 ]=
CPU time consumption
0 - 1min ==> 1729 ]==============================
1min - 10min ==> 228 ]===
10min - 1hour ==> 80 ]=
1hour - 6hours ==> 125 ]==
Walltime consumption
0 - 1min ==> 199 ]===
1min - 10min ==> 1526 ]==============================
10min - 1hour ==> 285 ]=====
1hour - 6hours ==> 152 ]==
}}}
Yesterday there were indeed a few long jobs, but so far that is peanuts:
less than 1/2 percent of the total.
But if ATLAS has changed something in its job distribution or job launch
strategy, or in anything else, then of course we would like to know about it.
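
As an aside, here is a minimal sketch of how a batch-system walltime value such as the quoted `resources_used.walltime = 48:47:45` could be mapped onto the walltime bins used in the report above. It is not the script that produced the report. The function names `hms_to_seconds` and `walltime_bin`, the half-open bin boundaries, and the catch-all "over 2days" label are assumptions made for illustration; only the bin labels themselves are taken from the report.

```python
# Walltime bins in seconds. Labels are copied from the report;
# treating each bin as a half-open interval [low, high) is an assumption.
BINS = [
    (0, 60, "0 - 1min"),
    (60, 600, "1min - 10min"),
    (600, 3600, "10min - 1hour"),
    (3600, 6 * 3600, "1hour - 6hours"),
    (6 * 3600, 24 * 3600, "6hours - 1day"),
    (24 * 3600, 48 * 3600, "1day - 2days"),
]


def hms_to_seconds(value):
    """Convert an 'HH:MM:SS' string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def walltime_bin(value):
    """Return the label of the bin that an 'HH:MM:SS' walltime falls into."""
    secs = hms_to_seconds(value)
    for low, high, label in BINS:
        if low <= secs < high:
            return label
    # Hypothetical catch-all: the report shows no bin beyond "1day - 2days".
    return "over 2days"


if __name__ == "__main__":
    # The job quoted at the top of the thread ran slightly longer than 48 hours.
    print(walltime_bin("48:47:45"))  # prints "over 2days"
```

Under these assumed boundaries, the job quoted at the top of the thread would fall just outside the largest bin that appears in our statistics.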
--
Eygene Ryabinkin, National Research Centre "Kurchatov Institute"
Always code as if the guy who ends up maintaining your code will be
a violent psychopath who knows where you live.