November Summary

From GridPP Wiki

Dashboard Snapshots

  • Production snapshot:

Production snapshot

  • Production errors:

Production errors snapshot

Errors

  • EXEPANDA_DQ2_STAGEIN (1898): this is mostly due to two causes:

1) The aggressive cleanup policy of ATLAS DDM deleted input files before lower-priority jobs, scheduled in PanDA but not yet submitted, could even start. It mostly concerned one task and affected all sites. The policy is being corrected.

2) One of our data servers crashed after a load peak last Friday night. The window was only a few hours, because I put the site offline as soon as I noticed. The server is back up, and we used the downtime to double the bandwidth, to test whether two 2 Gbit/s bonded links have some effect. It has not yet been stress tested.

  • EXEPANDA_JOBKILL_SIGTERM (1887): our CPUs are slow for certain tasks and 48 hours are not enough. Once the 48 hours are exhausted, the batch system kills the job. I increased the number of CPU hours that a job can run to 60. This still needs to be verified, as we have not yet had a similar task to test it with.

Update 04/12/09: this is actually a bug in the pilot factory, which sends a signal to the batch system to kill jobs. Discovered today while looking at the errors at other UK sites.

  • EXEPANDA_JOBEXPIRED_SIXDAYS (1294): low-priority tasks that couldn't run because higher-priority tasks had precedence are killed by the PanDA server after 6 days of waiting, even if they haven't been submitted to the site. Possibly due to slow CPUs or bad PanDA scheduling.

  • EXEPANDA_GET_NOSUCHFILE (332): caused by the DDM cleanup policy, for the same task as in EXEPANDA_DQ2_STAGEIN.
  • EXEPANDA_JOBDISPATCHER_HEARTBEAT (231): lost contact with the PanDA server, due to a problem on the server side caused by the DB being overloaded.
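
To put the numbers above in proportion, the counts can be tallied with a few lines of Python. This is only a minimal sketch: it assumes the figures in parentheses are occurrence counts for the month, and it covers only the five error codes listed here, not every error on the dashboard. The names ERROR_COUNTS and error_shares are illustrative and not part of any PanDA tool.

```python
# Error counts copied from the list above (error code -> count in parentheses).
# Assumption: each figure is the number of occurrences of that error.
ERROR_COUNTS = {
    "EXEPANDA_DQ2_STAGEIN": 1898,
    "EXEPANDA_JOBKILL_SIGTERM": 1887,
    "EXEPANDA_JOBEXPIRED_SIXDAYS": 1294,
    "EXEPANDA_GET_NOSUCHFILE": 332,
    "EXEPANDA_JOBDISPATCHER_HEARTBEAT": 231,
}


def error_shares(counts):
    """Return (code, count, percent of listed total) tuples, largest count first."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(code, n, 100.0 * n / total) for code, n in ranked]


if __name__ == "__main__":
    print(f"Total listed errors: {sum(ERROR_COUNTS.values())}")
    for code, n, pct in error_shares(ERROR_COUNTS):
        print(f"{code:<34} {n:>5} {pct:5.1f}%")
```

Read this way, the first two codes each account for roughly a third of the listed errors, and the two errors attributed to the DDM cleanup policy (DQ2_STAGEIN in part, GET_NOSUCHFILE) together make up a large share.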