- Production snapshot:
- Production errors:
- EXEPANDA_DQ2_STAGEIN (1898): this is mostly due to two causes:
1) The aggressive cleanup policy of ATLAS DDM deleted input files before lower-priority jobs, which were scheduled in PanDA but not yet submitted, could even start. This mostly concerned one task and affected all sites. The policy is being corrected.
2) One of our data servers crashed after a load peak last Friday night. The window was only a few hours long because I put the site offline as soon as I noticed. The server is back up, and we used the time to double the bandwidth to test whether two 2 Gbit/s bonded links have some effect. It has not been stress tested yet.
- EXEPANDA_JOBKILL_SIGTERM (1887): our CPUs are slow for certain tasks and
48 hours are not enough. Once the 48 hours are exhausted, the batch system kills the job. I increased the number of CPU hours that a job can run to 60. This still needs to be verified, as we have not yet had a similar task to test it with.
Update 04/12/09: this is actually a bug in the pilot factory, which sends a signal to the batch system to kill the jobs. Discovered today while looking at the errors at other UK sites.
- EXEPANDA_JOBEXPIRED_SIXDAYS (1294): low-priority tasks that couldn't
run because higher-priority tasks had precedence are killed by the PanDA server after 6 days of waiting, even if they have not been submitted to the site. Possibly due to slow CPUs or bad PanDA scheduling.
- EXEPANDA_GET_NOSUCHFILE (332): due to the cleanup policy, for the same task as under EXEPANDA_DQ2_STAGEIN.
- EXEPANDA_JOBDISPATCHER_HEARTBEAT (231): lost contact with the PanDA server, due
to a problem on the server side caused by the DB being overloaded.
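
To see how much each error code contributes, the counts in the list above can be ranked and turned into shares of the total. This is only a minimal sketch: it assumes the numbers in parentheses are failed-job counts for the reporting period, and the names `error_counts` and `error_shares` are made up for illustration and are not part of any PanDA tool.

```python
# Counts copied from the error list above.
# Assumption: each number is the count of failed jobs for that error code.
error_counts = {
    "EXEPANDA_DQ2_STAGEIN": 1898,
    "EXEPANDA_JOBKILL_SIGTERM": 1887,
    "EXEPANDA_JOBEXPIRED_SIXDAYS": 1294,
    "EXEPANDA_GET_NOSUCHFILE": 332,
    "EXEPANDA_JOBDISPATCHER_HEARTBEAT": 231,
}


def error_shares(counts):
    """Return (code, count, percent of total) tuples, largest count first."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(code, n, 100.0 * n / total) for code, n in ranked]


# Print one line per error code: name, count, share of all listed errors.
for code, n, pct in error_shares(error_counts):
    print(f"{code:35s} {n:5d} {pct:5.1f}%")
```

The shares are relative to the five codes listed here only, not to all jobs run at the site.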