Tier1 Operations Report 2010-08-04

From GridPP Wiki

RAL Tier1 Operations Report for 4th August 2010

Review of Issues during the week from 28th July to 4th August 2010.

  • gdss187 (AtlasFarm) was removed from service on 21st July with fsprobe errors. It was returned to service on Wednesday 28th July.
  • gdss207 (AliceTape) was removed from service 4 weeks ago. It was returned to service on Thursday 29th July following a replacement of the RAID controller card.
  • On Thursday 29th July gdss419 (Atlas MCDisk) was taken out of production following a double disk failure. This is a RAID 6 system, and the data survived OK. The server was returned to production on Monday (2nd August).
  • On Friday afternoon, 30th July, the Castor Job Manager for the Atlas instance became stuck and had to be restarted. We failed a SAM test. The problem was rapidly resolved.
  • On Sunday 1st August there was a corruption of the Castor Atlas Stager database. This was fixed during the day but led to an extended outage for the Castor Atlas Service from around 11am to 8pm (local time). There was some knock-on effect for LHCb where we failed one SAM test.
  • gdss452 (AtlasDataDisk) failed and was taken out of production on Tuesday 3rd August. This was only a single disk failure, but the server crashed. It was returned to service around midday today (4th August).
  • On Wednesday 28th July a security change was made to the Somnus (LFC and FTS) databases to block access from any node other than those expected.
  • On Wednesday 28th July gLite software updates were applied to the RAL top-level BDIIs.
  • On Monday 2nd August the SL4 batch system was turned off.

Current operational status and issues.

  • gdss417 (Atlas MCDisk) failed early on Sunday morning, 1st August. This has been reported to Atlas as a Data Loss. However, further information received shortly before this meeting indicates that the data has been recovered. A Post Mortem will be produced.
  • gdss475 (LHCbUser) failed with a hardware fault early this morning (4th August). This is being investigated.
  • As reported at previous meetings, one power supply (of two) for one of the (two) EMC disk arrays behind the LFC/FTS/3D services was moved to UPS power to test whether the electrical noise has been reduced sufficiently. The test is ongoing but some errors have been seen.
  • On Saturday 10th July transformer TX2 in R89 tripped. This is the same transformer that tripped some months ago, and for which remedial work was undertaken. We still await further information on this.
  • Dust in the Computer Room - Remedial work on lagging pipes is ongoing. Only the pipes directly under the CRAC units remain to be done. All the work in the HPD room is complete.

Declared in the GOC DB

  • Monday 2nd August - Tuesday 10th August: lcgce02.gridpp.rl.ac.uk - downtime for lcgce02 to allow draining and de-commissioning.

Advance warning:

The following items remain to be scheduled:

  • Doubling of network link to network stack for tape robot and Castor head nodes.
  • Re-visit the SAN / multipath issue for the non-Castor databases.
  • Update firmware in RAID controller cards for a batch of disk servers.

Entries in GOC DB starting between 28th July and 4th August 2010.

There were 2 unscheduled entries during the last week. Both relate to the database problem behind the Castor Atlas instance on Sunday. One was an outage; the second was an 'At Risk', although this latter one was erroneously declared against lfc-atlas rather than srm-atlas.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgce02 | SCHEDULED | OUTAGE | 02/08/2010 01:00 | 10/08/2010 00:02 | 7 days, 23 hours and 2 minutes | Downtime for lcgce02 to allow draining and de-commissioning. This is part of the de-commissioning to the sl4 workers at RAL.
lfc-atlas | UNSCHEDULED | AT_RISK | 01/08/2010 21:52 | 02/08/2010 11:10 | 13 hours and 18 minutes | Atlas SRM is working again, but leaving it at RISK as a precaution.
srm-atlas | UNSCHEDULED | OUTAGE | 01/08/2010 20:05 | 01/08/2010 21:48 | 1 hour and 43 minutes | We are experiencing continuing problems with corruption on Oracle database for Atlas stager.
lcgfts | SCHEDULED | OUTAGE | 29/07/2010 08:00 | 29/07/2010 10:00 | 2 hours | The current FTS server is showing disk errors. This is an outage to drain the FTS service so we can fail over to the standby FTS host.
lcgbdii | SCHEDULED | AT_RISK | 28/07/2010 10:00 | 28/07/2010 13:00 | 3 hours | At risk to apply glite software updates to RAL top-level BDIIs
lcgfts, lfc-atlas, lfc.gridpp, lfc, lhcb-lfc | SCHEDULED | AT_RISK | 28/07/2010 09:00 | 28/07/2010 11:00 | 2 hours | At risk on LFC and FTS while security configuration is applied.
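
The Duration values in the entries above can be cross-checked against their Start and End timestamps. The following is a minimal sketch in Python, not part of any GOC DB tooling: the function name `downtime_duration` and the `entries` dictionary are hypothetical, and the timestamps are assumed to be in the day/month/year format shown in the listing, all in the same time zone.

```python
from datetime import datetime

# Timestamp layout as it appears in the listing above (assumed DD/MM/YYYY HH:MM).
FMT = "%d/%m/%Y %H:%M"

def downtime_duration(start, end):
    """Return (days, hours, minutes) between two timestamps in FMT layout."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    total_minutes = int(delta.total_seconds()) // 60
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return days, hours, minutes

# Hypothetical container holding three of the entries listed above.
entries = {
    "lcgce02": ("02/08/2010 01:00", "10/08/2010 00:02"),
    "lfc-atlas": ("01/08/2010 21:52", "02/08/2010 11:10"),
    "srm-atlas": ("01/08/2010 20:05", "01/08/2010 21:48"),
}

durations = {name: downtime_duration(*span) for name, span in entries.items()}
for name, (days, hours, minutes) in durations.items():
    print(f"{name}: {days} days, {hours} hours and {minutes} minutes")
```

For the three entries used here, the computed values agree with the Duration column as recorded.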