Tier1 Operations Report 2010-05-19


RAL Tier1 Operations Report for 19th May 2010.

Review of Issues during week 12th to 19th May 2010.

  • gdss380 (lhcbmdst) crashed with a disk failure overnight on Friday/Saturday. On Sunday, GGUS ticket 58253 was raised about disk space, and a replacement server was put into lhcbmdst the same day. Problems with files being unavailable persisted until this morning (19th May).
  • A tape problem on Friday 14th caused the loss of 2 files belonging to the MINOS VO. The MINOS experiment representatives were informed by email.
  • A tape problem on Monday 17th caused the loss of a single file for Atlas who were informed (GGUS ticket).

Current operational status and issues.

  • We are still failing some SAM tests on the site BDII. This is not likely to be fixed in the current SAM setup - see GGUS ticket 58054.
  • Ongoing issues with the Atlas software server - lcg0617. There are plans in place to replace this machine.

Declared in the GOC DB:

  • At Risks on remaining Castor instances while a database table is cleaned (as already done for 'GEN' & 'CMS' instances).
    • Wednesday 19th May: Atlas (today)
    • Tuesday 25th May: LHCb
  • Wednesday 19th May: lcgce07.gridpp.rl.ac.uk Downtime for glexec upgrade.
  • Until Wednesday 26th May: lcgvo-02-21.gridpp.rl.ac.uk VO box not yet in production.
  • Tuesday 1st June (During technical stop) - UPS test (implying site At Risk).

Advanced warning:

The following items remain to be scheduled:

  • Oracle patching of databases. Will lead to "At Risks" on OGMA (Atlas 3D) on Tuesday 25th May, LUGH (LHCb 3D & LFC) on Thursday 27th May and SOMNUS (LFC, FTS) on Wednesday 2nd June.
  • Doubling of the network link to the network stack for the tape robot and Castor head nodes. Tuesday 1st June (during the technical stop). Will require a Castor stop, FTS drain and batch pause. Planned to be coincident with the UPS test.
  • CEs taken out of production in rotation (one at a time) while glexec configured. (CE01 and CE06 done so far, CE07 in progress.)
  • Preventative maintenance work on transformers in R89. Being scheduled some weeks ahead. Likely to lead to two 'At Risks'.
  • Closure of SL4 batch workers at RAL.

Entries in GOC DB starting between 12th and 19th May 2010.

There were 2 unscheduled entries in the GOC DB for this last week.

  • The scheduled downtime for lcgce07 was mistakenly set to end too early and had to be extended with an unscheduled outage.
  • lcgvo-02-21.gridpp.rl.ac.uk is a CMS VO box not yet in production.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgce07.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 17/05/2010 17:00 | 19/05/2010 17:00 | 2 days | Downtime to reconfigure glexec. Previous downtime for this machine was mistakenly set to end on 17/05/2010.
lcgvo-02-21.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 17/05/2010 11:00 | 26/05/2010 11:00 | 9 days | CMS SL5 Phedex VObox not yet in production.
lcgce07.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 17/05/2010 10:00 | 17/05/2010 17:00 | 7 hours | Downtime to reconfigure glexec. Alternate CEs are available for the affected VOs.
srm-cms.gridpp.rl.ac.uk | SCHEDULED | AT_RISK | 17/05/2010 09:30 | 17/05/2010 11:30 | 2 hours | At Risk on Castor CMS instance while redundant information cleaned up from Castor database.
lcgvo-02-21.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 14/05/2010 15:15 | 17/05/2010 11:00 | 2 days, 19 hours and 45 minutes | CMS SL5 Phedex VObox not yet in production.
srm-alice.gridpp.rl.ac.uk, srm-dteam.gridpp.rl.ac.uk, srm-hone.gridpp.rl.ac.uk, srm-ilc.gridpp.rl.ac.uk, srm-mice.gridpp.rl.ac.uk, srm-minos.gridpp.rl.ac.uk, srm-t2k.gridpp.rl.ac.uk | SCHEDULED | AT_RISK | 12/05/2010 09:30 | 12/05/2010 11:30 | 2 hours | At Risk on Castor 'GEN' instance while redundant information cleaned up from Castor database.
lcgce06.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 10/05/2010 10:00 | 12/05/2010 11:35 | 2 days, 1 hour and 35 minutes | Reconfiguring lcgce06 to enable mappings for glexec.