Tier1 Operations Report 2010-05-26

RAL Tier1 Operations Report for 26 May 2010

Review of Issues during week 19th to 26th May 2010.

  • gdss380 (lhcbmdst) was put back into service on Thursday 20th. It failed again on Saturday morning with symptoms similar to those of its previous failure.
  • Problem with the CIP not publishing GlueSA information on Thursday 20th May 2010. This was noticed and fixed before any tickets were raised.
  • gdss67 CMS farmRead had problems (FSPROBE errors) and was removed from service on morning of Thursday 20th.
  • On Thursday 20th there was a problem with gdss228 and gdss229 being deployed into atlasScratchDisk.
  • On Monday 24th gdss207 (aliceTap) was removed from service due to possible file system corruption.
  • On Tuesday 25th May 2010 there were problems with the Atlas SRM machines overnight (/var full).
  • On Tuesday morning, the CMS SRM failed tests because of a stager error.

Current operational status and issues.

  • We are still failing some SAM tests on the site BDII. This is not likely to be fixed in the current SAM setup - see GGUS ticket 58054.
  • Ongoing issues with the Atlas software server - lcg0617. The plans in place to replace this machine have now been approved.

Declared in the GOC DB

  • Wednesday 26th May: lcgce08.gridpp.rl.ac.uk Downtime for glexec upgrade.
  • Tuesday 1st June (During technical stop) - UPS test (implying site At Risk).
  • Wednesday 2nd June Oracle patching on SOMNUS (LFC and FTS).

Advance warning:

The following items remain to be scheduled:

  • Oracle patching of databases. Will lead to "At Risks" on LUGH (LHCb 3D & LFC) Thursday 27th May and SOMNUS (LFC, FTS) on Wednesday 2nd June.
  • Doubling of network link to network stack for tape robot and Castor head nodes. Tuesday 1st June (during technical stop). Will require a Castor stop, FTS drain and batch pause. Plan to make this coincide with the UPS test.
  • CEs taken out of production in rotation (one at a time) while glexec configured. (CE08 in progress.)
  • Preventative maintenance work on transformers in R89. Being scheduled some weeks ahead. Likely to lead to two 'At Risks'.
  • Closure of SL4 batch workers at RAL

Entries in GOC DB starting between 5th and 12th May 2010.

There was 1 unscheduled entry in the GOC DB for this last week.

  • lcgce07 had a scheduled downtime that was set to end too early, so it had to be extended with an unscheduled outage.


Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
ogma.gridpp.rl.ac.uk | SCHEDULED | AT_RISK | 25/05/2010 10:00 | 25/05/2010 12:00 | 2 hours | At Risk during application of Oracle PSU patches.
srm-lhcb.gridpp.rl.ac.uk | SCHEDULED | AT_RISK | 25/05/2010 09:30 | 25/05/2010 11:30 | 2 hours | At Risk on Castor LHCb instance while redundant information cleaned up from Castor database.
lcgce08.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 24/05/2010 10:00 | 26/05/2010 17:00 | 2 days, 7 hours | CE being reconfigured for glexec roles mapping.
srm-atlas.gridpp.rl.ac.uk | SCHEDULED | AT_RISK | 19/05/2010 09:30 | 19/05/2010 11:30 | 2 hours | At Risk on Castor 'Atlas' instance while redundant information cleaned up from Castor database.
lcgce07.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 17/05/2010 17:00 | 19/05/2010 17:00 | 2 days | Downtime to reconfigure glexec. Previous downtime for this machine was mistakenly set to end on 17/05/2010.
lcgvo-02-21.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 17/05/2010 11:00 | 25/05/2010 13:32 | 8 days, 2 hours and 32 minutes | CMS SL5 Phedex VObox not yet in production.
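
The Duration values listed above can be cross-checked against the Start and End timestamps. The following is a minimal sketch, assuming the timestamps are in DD/MM/YYYY HH:MM form as they appear in this report; the function names `downtime_duration` and `describe` are hypothetical helpers for illustration and are not part of any GOC DB tooling.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout, taken from how dates appear in this report.
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> timedelta:
    """Return the elapsed time between two 'DD/MM/YYYY HH:MM' strings."""
    return datetime.strptime(end, GOCDB_FORMAT) - datetime.strptime(start, GOCDB_FORMAT)


def describe(delta: timedelta) -> str:
    """Render a timedelta as whole days, hours and minutes."""
    total_minutes = int(delta.total_seconds()) // 60
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return f"{days} days, {hours} hours, {minutes} minutes"


if __name__ == "__main__":
    # Start/End pairs copied from the entries listed above.
    entries = {
        "lcgce08": ("24/05/2010 10:00", "26/05/2010 17:00"),
        "lcgce07": ("17/05/2010 17:00", "19/05/2010 17:00"),
        "lcgvo-02-21": ("17/05/2010 11:00", "25/05/2010 13:32"),
    }
    for host, (start, end) in entries.items():
        print(host, describe(downtime_duration(start, end)))
```

Running the check on the three multi-day outages is a quick way to confirm that a duration typed into the report matches the declared start and end times.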