Tier1 Operations Report 2010-07-21

RAL Tier1 Operations Report for 21st July 2010

Review of Issues during the week from 14th July to 21st July 2010.

  • On Thursday 15th July gdss332 (lhcbDst) and gdss420 (aliceTape) were in downtime from 10:00 to 12:00: gdss332 to have a faulty IPMI card replaced and gdss420 to have a RAID battery replaced.
  • On Monday 19th July work on the transformer was canceled.
  • On Monday 19th July the tape system was unavailable from 9:00 am to 11:30 am while the tape controller machine was swapped.
  • On Tuesday 20th July the tape system was unavailable from 8:00 am to 2:00 pm while the microcode on the tape robots was updated.
  • On Tuesday 20th July there was an intervention on the SAN underlying the Ogma, Lugh and Somnus databases. This intervention did not go to plan and was rolled back. There was an outage of about half an hour on all Ogma, Lugh and Somnus services while this was done. Further interventions planned for Wednesday and Thursday were canceled, and there is an ongoing issue with streaming to the Atlas 3D database (Ogma).

Current operational status and issues.

  • Today (Wednesday 21st July) there is an issue with 3D streaming to Ogma, carrying on from yesterday's intervention.
  • GDSS67 (CMSFarmRead) had problems (FSPROBE errors) and was removed from service on the morning of Thursday 20th May. It is to be removed from CMS.
  • GDSS207 (aliceTape) was removed from service on 5th July. We are currently awaiting parts for it.
  • GDSS187 (atlasFarm) was removed from service this morning with fsprobe errors.
  • As reported at the last meeting, one power supply (of two) for one of the (two) EMC disk arrays behind the LFC/FTS/3D services was moved to UPS power to test whether the electrical noise has been reduced sufficiently. As of this morning, it has reported three power problems. The test continues.
  • On Saturday 10th July transformer TX2 in R89 tripped. This is the same transformer that tripped some months ago, and on which remedial work was undertaken a fortnight ago. At the time of preparation of this report we await further information on this incident.
  • Dust in the Computer Room - particularly the HPD room: Remedial work on lagging pipes is ongoing.


Declared in the GOC DB

  • Note: the At Risk on LFC/FTS (Somnus) on Thursday 22nd July for the SAN multipath configuration update has been deleted.
  • Monday 2nd August to Tuesday 10th August: downtime for lcgce02.gridpp.rl.ac.uk to allow draining and de-commissioning.


Advance warning:

The following items remain to be scheduled:

  • Closure of SL4 batch workers at RAL-LCG2 announced for the start of August.
  • Doubling of network link to network stack for tape robot and Castor head nodes.

Entries in GOC DB starting between 14th July and 21st July 2010.

There were no unscheduled outages during the last week.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgce02.gridpp.rl.ac.uk, lcgce02.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 02/08/2010 01:00 | 10/08/2010 00:02 | 7 days, 23 hours and 2 minutes | Downtime for lcgce02 to allow draining and de-commissioning. This is part of the de-commissioning of the SL4 workers at RAL.
lcgfts.gridpp.rl.ac.uk, lfc-atlas.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk, lhcb-lfc.gridpp.rl.ac.uk | SCHEDULED | AT_RISK | 22/07/2010 10:00 | 22/07/2010 14:00 | 4 hours | At risk for Somnus (FTS and non-LCH LFC) database multipath reconfiguration. All database services on same SAN are being marked as at risk.
lcgfts.gridpp.rl.ac.uk, lfc-atlas.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk, lhcb-lfc.gridpp.rl.ac.uk | SCHEDULED | AT_RISK | 21/07/2010 10:00 | 21/07/2010 14:00 | 4 hours | At risk for Lugh database multipath reconfiguration. All database services on same SAN are being marked as at risk.
lcgfts.gridpp.rl.ac.uk, lfc-atlas.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk, lhcb-lfc.gridpp.rl.ac.uk | SCHEDULED | AT_RISK | 20/07/2010 11:00 | 20/07/2010 17:00 | 6 hours | At risk for Ogma database multipath reconfiguration. All database services on same SAN are being marked as at risk.
srm-alice.gridpp.rl.ac.uk, srm-atlas.gridpp.rl.ac.uk, srm-cms.gridpp.rl.ac.uk, srm-dteam.gridpp.rl.ac.uk, srm-hone.gridpp.rl.ac.uk, srm-ilc.gridpp.rl.ac.uk, srm-lhcb.gridpp.rl.ac.uk, srm-mice.gridpp.rl.ac.uk, srm-minos.gridpp.rl.ac.uk, srm-superb.gridpp.rl.ac.uk, srm-t2k.gridpp.rl.ac.uk | SCHEDULED | AT_RISK | 20/07/2010 08:00 | 20/07/2010 14:00 | 6 hours | During this period all Castor tape systems will be down for microcode updates to the tape robots. Castor disk services will remain up.
srm-alice.gridpp.rl.ac.uk, srm-atlas.gridpp.rl.ac.uk, srm-cms.gridpp.rl.ac.uk, srm-dteam.gridpp.rl.ac.uk, srm-hone.gridpp.rl.ac.uk, srm-ilc.gridpp.rl.ac.uk, srm-lhcb.gridpp.rl.ac.uk, srm-mice.gridpp.rl.ac.uk, srm-minos.gridpp.rl.ac.uk, srm-superb.gridpp.rl.ac.uk, srm-t2k.gridpp.rl.ac.uk | SCHEDULED | AT_RISK | 19/07/2010 09:00 | 19/07/2010 11:29 | 2 hours and 29 minutes | During this time Castor tape systems will be down while a new (spare) robot controller is installed and tested. Castor disk services will remain up.
 | SCHEDULED | AT_RISK | 19/07/2010 08:30 | 19/07/2010 08:48 | 18 minutes | At Risk for site during maintenance work on electrical supply (transformers).

Canceled as work is not now going ahead.