Tier1 Operations Report 2010-05-12


RAL Tier1 Operations Report for 12th May 2010.

Review of Issues during week 5th to 12th May 2010.

  • Gdss397, part of ATLASDATADISK, failed over the bank holiday weekend and was only returned to production on the morning of Saturday 8th May. Apart from the delay caused by the bank holiday weekend itself, significant time was taken rebuilding the RAID array.
  • There was a problem with the batch system on Saturday 8th May: the batch system services (PBS, MAUI) needed restarting on the main batch system. There was a delay in getting the CREAM CEs back into production because a further daemon, missed when the main problem was resolved, also needed restarting.
  • On the evening of Monday 10th May disk server gdss68, part of cmsfarmread (Disk0Tape1), crashed. There were no migration candidates on the server at the time, so the impact was negligible. It was returned to service the next day.

Current operational status and issues.

  • We have been failing some SAM tests on the Site BDII. On investigation it appears we are not alone in this, and a GGUS ticket has been raised with the BDII developers.

Declared in the GOC DB:

  • The NGS CE has been declared as an Outage (since 6th May) for decommissioning.
  • At Risks on the remaining Castor instances while a database table is cleaned (as already done for the 'GEN' instance):
    • Monday 17th May: CMS
    • Wednesday 19th May: Atlas
    • Tuesday 25th May: LHCb
  • Tuesday 2nd June (During technical stop) - UPS test (implying site At Risk).

Advanced warning:

The following items remain to be scheduled:

  • CEs taken out of production in rotation (one at a time) while glexec is configured. (CE01 and CE06 done so far.)
  • Preventative maintenance work on transformers in R89. Being scheduled some weeks ahead; likely to lead to two 'At Risks'.
  • Network change to improve bandwidth to the tape system.

Entries in GOC DB starting between 5th and 12th May 2010.

There were two unscheduled entries in the GOC DB for this last week:

  • A problem on the batch system on Saturday (8th)
  • The decommissioning of the NGS service was added to the GOC DB rather late.
  • Castor GEN (srm-alice, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-t2k): SCHEDULED AT_RISK, 12/05/2010 09:30 to 12/05/2010 11:30 (2 hours). At Risk on Castor 'GEN' instance while redundant information is cleaned up from the Castor database.
  • lcgce06.gridpp.rl.ac.uk: SCHEDULED OUTAGE, 10/05/2010 10:00 to 12/05/2010 17:00 (2 days, 7 hours). Reconfiguring lcgce06 to enable mappings for glexec.
  • All CEs: UNSCHEDULED OUTAGE, 08/05/2010 06:00 to 08/05/2010 09:00 (3 hours). Problem with batch service; resolved during Saturday morning by on-call. (Outage added retrospectively.)
  • ce.ngs.rl.ac.uk: UNSCHEDULED OUTAGE, 06/05/2010 10:00 to 14/05/2010 16:00 (8 days, 6 hours). Downtime while decommissioning the CE.