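35th Weekly Operation Report on DIRAC Distributed Computing
YAN Tian
From 2015-09-02 to 2015-09-16
Note: two weeks are merged into one report because the cloud computing summer school on Sep. 7-11 left no time for a separate one.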
Weekly Running Jobs by User

item                   value
active users           3
max running jobs       922
average running jobs   410
total executed jobs    38.0k

Notes:
CEPC production user weiyq runs sim+rec jobs.
BES users zhanglei and zhus run sra jobs.
Final Status of Running Jobs

Failed Reason        percent
upload failover      2.4%
stalled              7.3%
application error    35.9%
other                0.73%

Notes:
USTC jobs stall
CEPC input file problem
WHUIHEP-STORM failover
zhanglei testing
Output Data Generated and Transferred
Quality: good. WHUIHEP-STORM has some problems and needs failover.
16.6TB in total, ~1.85TB/day
Running Jobs by Site
8 sites in production:
OpenStack, OpenNebula
WHU, USTC, UMN
GRID.INFN-Torino, GRID.JINR
CLOUD.TORINO.it
Job Final Status at Each Site 1
(CEPC input file problem excluded)

site                jobs   done    main failure
OpenStack           2345   81.9%   15.2% app err
WHU                 2946   67.4%   31.4% pending request
GRID.INFN-Torino    9049   84.5%   10.7% app err
OpenNebula          4511   83.0%   13.6% app err
Job Final Status at Each Site 2
(CEPC input file problem excluded)

site            jobs   done    main failure
UMN             4044   51.1%   48.9% app err
JINR            1356   95.5%   3% stalled
CLOUD.Torino    1666   90.5%   4.9% killed, 3.9% app err
USTC            2666   11.0%   88.7% stalled
Failed Types at Site: Description
USTC has many stalled jobs. Under investigation.
UMN has not run jobs since Sep. 12. Under investigation.
All other sites are good, especially the two sites in Torino, both of which work well.
Most of the app err comes from zhanglei's testing jobs. It is not a site problem.
Cumulative User Jobs
Total user jobs: 38.0k

user        share
weiyq       46.4%
zhanglei    37.6%
zhus        16.0%
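Operations Log This Week, 1
9.4 Early morning: the military parade period ended and the WHU network was restored. Jobs started running normally.
Through 9.9 (six days): CEPC jobs ran normally. WHU had failovers, but they did not affect the success rate.
9.10 BES users Zhang Lei (zhanglei) and Zhu Shuai (zhus) started submitting jobs.
9.10 CEPC user weiyq submitted a new batch of jobs, and all of them failed with error #23 (meaning the events had already been used up). Investigation showed that he had not updated the input file list, so files with 20,000 events were treated as files with 200,000 events.
9.11 CEPC user weiyq prepared the input file list and resubmitted the jobs, but the submission was interrupted by a cefs I/O error.
9.11 Zhang Lei's jobs failed. After being shown how to retrieve the logs, he debugged and fixed the problem himself.
9.12 CEPC user weiyq resubmitted the jobs once more. Submission and execution were both normal.
9.13 Zhu Shuai did not know that the web interface address had changed and could not find his newly submitted jobs. Solved after he was given the new address. A notice of the new server address was posted in the QQ group.
9.14 A notice of the new server URL was posted on HyperNews.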
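Operations Log This Week, 2
9.14 Zhang Lei's job dataset hit download errors. Investigation showed that the dataset query is case-insensitive. Some of his datasets differ only in the case of evtType (diy vs. DIY), so a query also pulls in the earlier datasets.
9.14 Zhang Lei submitted jobs across round03 and round04, which produced two datasets. The later one (r04) reused the stream numbers of the earlier one (r03), so its data overwrites existing directories. There is also a risk of data loss (it is unclear which directory the r04 data was written to) and of other unpredictable errors. ZHAO Xianghu says gangaBOSS can currently only guarantee correct operation within a single round. Users are advised not to submit jobs across rounds.
9.15 The two INFN sites have been running well recently and have completed many jobs.
9.15 USTC has a large number of stalled jobs. Cause under investigation.
9.15 UMN has had no running jobs since 9.11. Cause under investigation.
9.15 "failover request failed" errors for CEPC jobs from WHU appear periodically. TCP tuning parameters were added to the StoRM SE; monitoring continues. Each CEPC job currently downloads 350MB and uploads 540MB.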
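The 9.14 dataset query issue (evtType diy vs. DIY) can be illustrated with a minimal sketch. This is not the gangaBOSS or DIRAC query code: the record layout, the dataset names, and both function names are hypothetical, and only the evtType values diy/DIY come from the log. It only shows why a case-insensitive match returns the earlier dataset together with the new one.

```python
# Hypothetical dataset records; only the evtType values diy/DIY are from the log.
datasets = [
    {"name": "old_dataset", "evtType": "diy"},
    {"name": "new_dataset", "evtType": "DIY"},
]


def query_case_insensitive(records, evt_type):
    """Match evtType ignoring case, as the log describes the current behaviour."""
    wanted = evt_type.lower()
    return [r["name"] for r in records if r["evtType"].lower() == wanted]


def query_case_sensitive(records, evt_type):
    """Match evtType exactly, so diy and DIY remain separate datasets."""
    return [r["name"] for r in records if r["evtType"] == evt_type]


if __name__ == "__main__":
    # The case-insensitive query also returns the earlier dataset.
    print(query_case_insensitive(datasets, "DIY"))
    # The exact match returns only the dataset the user asked for.
    print(query_case_sensitive(datasets, "DIY"))
```

Until the query itself is case-sensitive, giving each dataset an evtType that differs by more than letter case avoids the overlap.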
Operations Log Figure 1 (USTC)
Operations Log Figure 2 (USTC)