Document taken from a search engine cache.
Original document address: http://theory.sinp.msu.ru/pipermail/ru-ngi/2012q1/000398.html
Last modified: Fri Feb 10 13:03:24 2012. Indexed: Tue Oct 2 03:14:30 2012.

Fri, Feb 10, 2012 at 11:52:53AM +0400, Victor Kotlyar wrote:
> I have noticed that over the last two days the ATLAS pilot has changed its behavior.
>
> Some "long" jobs were started:
>
> resources_used.cput = 48:32:40
> resources_used.mem = 1151908kb
> resources_used.vmem = 2387800kb
> resources_used.walltime = 48:47:45
>
> In the panda monitor the number of analysis jobs at our site has dropped, and the
> production section shows 0 and the word "test" (same as for RRC-KI)

At our site the situation with ATLAS jobs has been more or less stable over the last 3 days:

{{{
08.02.2012
==========

*queue atlas, 3391 jobs, failed 0.06%, killed 0.00%, canceled 0.03%:
  1 canceled jobs
  3388 jobs with code 0
  2 jobs with code 1
quality assessor says: wow, shit, 2% or even less of errors? Does our cluster work at all? Or you're killing every job? ;))

Memory consumption
  0 - 1Mb         ==>    19 ]
  1Mb - 100Mb     ==>  3221 ]==============================
  100Mb - 1Gb     ==>   126 ]=
  1Gb - 2Gb       ==>    22 ]
  2Gb - 3Gb       ==>     3 ]

Vmem consumption
  0 - 1Mb         ==>    19 ]
  1Mb - 100Mb     ==>     5 ]
  100Mb - 1Gb     ==>  3272 ]==============================
  1Gb - 2Gb       ==>    90 ]
  2Gb - 3Gb       ==>     2 ]
  3.2Gb - 3.5Gb*  ==>     3 ]

CPU time consumption
  0 - 1min        ==>  3224 ]==============================
  1min - 10min    ==>   101 ]
  10min - 1hour   ==>    62 ]
  6hours - 1day   ==>     4 ]

Walltime consumption
  0 - 1min        ==>    50 ]
  1min - 10min    ==>  3009 ]==============================
  10min - 1hour   ==>   312 ]===
  1hour - 6hours  ==>    15 ]
  6hours - 1day   ==>     4 ]
  1day - 2days    ==>     1 ]

09.02.2012
==========

*queue atlas, 11395 jobs, failed 0.00%, killed 0.00%, canceled 0.01%:
  1 canceled jobs
  11394 jobs with code 0
quality assessor says: wow, shit, 2% or even less of errors? Does our cluster work at all? Or you're killing every job? ;))

Memory consumption
  0 - 1Mb         ==>  1764 ]=====
  1Mb - 100Mb     ==>  8884 ]==============================
  100Mb - 1Gb     ==>   433 ]=
  1Gb - 2Gb       ==>   314 ]=

Vmem consumption
  0 - 1Mb         ==>  1764 ]=====
  1Mb - 100Mb     ==>   165 ]
  100Mb - 1Gb     ==>  8955 ]==============================
  1Gb - 2Gb       ==>   461 ]=
  2Gb - 3Gb       ==>    50 ]

CPU time consumption
  0 - 1min        ==> 10690 ]==============================
  1min - 10min    ==>   275 ]
  10min - 1hour   ==>   112 ]
  1hour - 6hours  ==>   261 ]
  6hours - 1day   ==>    57 ]

Walltime consumption
  0 - 1min        ==>  2230 ]=======
  1min - 10min    ==>  8408 ]==============================
  10min - 1hour   ==>   392 ]=
  1hour - 6hours  ==>   280 ]
  6hours - 1day   ==>    84 ]
  1day - 2days    ==>     1 ]

10.02.2012
==========

*queue atlas, 2162 jobs, failed 0.00%, killed 0.00%, canceled 0.00%:
  2162 jobs with code 0
quality assessor says: wow, shit, 2% or even less of errors? Does our cluster work at all? Or you're killing every job? ;))

Memory consumption
  0 - 1Mb         ==>   165 ]===
  1Mb - 100Mb     ==>  1547 ]==============================
  100Mb - 1Gb     ==>   129 ]==
  1Gb - 2Gb       ==>   302 ]=====
  2Gb - 3Gb       ==>    19 ]

Vmem consumption
  0 - 1Mb         ==>   165 ]===
  1Mb - 100Mb     ==>    14 ]
  100Mb - 1Gb     ==>  1610 ]==============================
  1Gb - 2Gb       ==>   298 ]=====
  2Gb - 3Gb       ==>    75 ]=

CPU time consumption
  0 - 1min        ==>  1729 ]==============================
  1min - 10min    ==>   228 ]===
  10min - 1hour   ==>    80 ]=
  1hour - 6hours  ==>   125 ]==

Walltime consumption
  0 - 1min        ==>   199 ]===
  1min - 10min    ==>  1526 ]==============================
  10min - 1hour   ==>   285 ]=====
  1hour - 6hours  ==>   152 ]==
}}}

Yesterday there were, of course, a few long jobs, but so far that is peanuts, less than 1/2 percent of the total. But if ATLAS has changed something in its strategy for distributing or launching jobs, or in anything else, then we would certainly like to know about it.

--
Eygene Ryabinkin, National Research Centre "Kurchatov Institute"

Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.
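
For readers who want to produce a similar per-bucket summary from their own batch accounting data, here is a minimal Python sketch of the histogram layout used in the report above. The script that generated the report is not shown in this message, so everything here is an assumption inferred from the output: the function name `render_histogram` is hypothetical, the bar width of 30 is taken from the longest bar, rounding down is inferred from the bar lengths, and skipping empty buckets is inferred from the buckets missing on some days. The column padding is chosen only for readability.

```python
BAR_WIDTH = 30  # assumed: the widest bar in the report is 30 '=' characters


def render_histogram(title, buckets, bar_width=BAR_WIDTH):
    """Render (label, count) pairs as text bars, skipping empty buckets.

    The fullest bucket gets a bar of `bar_width` characters; every other
    bar is scaled proportionally and rounded down.
    """
    nonempty = [(label, count) for label, count in buckets if count > 0]
    if not nonempty:
        return title
    peak = max(count for _, count in nonempty)
    lines = [title]
    for label, count in nonempty:
        # Integer division rounds down, so small buckets get an empty bar.
        bar = "=" * (count * bar_width // peak)
        lines.append(f"  {label:<15} ==> {count:>5} ]{bar}")
    return "\n".join(lines)


# Walltime counts for 09.02.2012, copied from the report.
walltime_0902 = [
    ("0 - 1min", 2230),
    ("1min - 10min", 8408),
    ("10min - 1hour", 392),
    ("1hour - 6hours", 280),
    ("6hours - 1day", 84),
    ("1day - 2days", 1),
]

print(render_histogram("Walltime consumption", walltime_0902))
```

With these counts the bars come out the same length as in the 09.02.2012 walltime section, which suggests (but does not prove) that the original tool scales its bars the same way.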