Tier1 Operations Report 2010-08-18

RAL Tier1 Operations Report for 18th August 2010

Review of Issues during the week from 11th to 18th August 2010.

  • There was a batch problem for some VOs (including LHCb) caused by a broken gridmapdir file on 11/12 August.
  • Gdss417 (Atlas MCDisk) Post Mortem (still to be finalised) at:
 http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20100801_Disk_Server_Data_Loss_Atlas
  • WMS03 was unavailable (as scheduled) for a glite upgrade. That system has now been returned to production.
  • Yesterday evening (17th) there was a very high peak in load on the LHCb 3D database (Lugh) which stopped access for a while.
  • There have been problems with the disk array behind the Castor pre-production (2.1.9 test) instance. An alternative array has been brought into use and Castor 2.1.9 tests are continuing.

Current operational status and issues.

  • Gdss381 (CMSTemp) failed on Monday (9th August) with a read-only file system. Since then it has shown FSProbe errors, and FSCK reported errors on one of the partitions. Investigations are ongoing. The RAID array is currently (Wednesday morning) rebuilding following a disk replacement; this should finish by the end of today. The system will then be checked before being returned to production.
  • On Tuesday (17th) there was an "At Risk" for glite updates to the Top-BDII nodes. The update was successful, but during the afternoon one of the nodes failed and was dropped out of the DNS round robin (which went from 5 nodes to 4; see the resolution check sketched after this list). The failed node is being checked for hardware problems.
  • As reported at previous meetings, one power supply (of the two) for one of the (two) EMC disk arrays behind the LSC/FTS/3D services was moved to UPS power to test whether the electrical noise has been reduced sufficiently. The test is ongoing, but some errors (roughly one per week) have been seen.
  • On Saturday 10th July transformer TX2 in R89 tripped. This is the same transformer that tripped some months ago, and for which remedial work was undertaken. I had reported last week that the cause appeared to be related to temperature. However, further investigations suggest the cause is related to earth leakage detection. Two (of the four) transformers still require checking as part of the planned work following the first failure of TX2. One of these (TX4) will be checked during the next technical stop (see below). Checking the other (TX1) would require us to rely on the failed transformer, TX2, while the checks are carried out; this will not proceed until the cause of the TX2 failure is fully understood and fixed.
  • There has been some discussion over batch scheduling as regards long jobs (Alice in this case) filling the farm during a 'quiet' period and preventing other VOs' jobs from starting until those jobs have finished. This is compounded by further contention between the other VOs (Atlas, CMS, LHCb) as job slots become free. We are looking at whether this could be improved (see the fairshare sketch after this list). We also note that the batch farm has been full this last week.
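
As a quick way of verifying the round-robin membership mentioned in the Top-BDII item above, a minimal Python sketch; the alias and the expected count of 5 come from the report, while port 2170 is the standard BDII port. This is an illustration, not our actual monitoring:

  import socket

  # Resolve the Top-BDII alias and count the distinct addresses returned
  # by the DNS round robin. Port 2170 is the standard BDII LDAP port.
  ALIAS = "lcgbdii.gridpp.rl.ac.uk"
  EXPECTED = 5  # nodes normally in the round robin (per the item above)

  addrs = {info[4][0] for info in socket.getaddrinfo(ALIAS, 2170, socket.AF_INET)}
  print(f"{len(addrs)} of {EXPECTED} nodes in round robin: {sorted(addrs)}")
  if len(addrs) < EXPECTED:
      print("WARNING: one or more Top-BDII nodes have dropped out of DNS")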
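
On the batch scheduling item, the usual lever is the scheduler's fairshare: each VO's recent usage is decayed over a sliding window, so a VO that filled the farm during a quiet period is automatically deprioritised once other VOs start submitting again. A minimal sketch of that decayed-usage calculation follows; the decay factor, window depth, usage history and targets are all illustrative assumptions, not the farm's actual configuration:

  # Sketch of decayed fairshare usage: recent accounting windows weigh
  # more than older ones, so a VO that filled the farm during a quiet
  # spell loses priority once others submit. DECAY, DEPTH, the history
  # and the targets below are illustrative assumptions; real schedulers
  # such as Maui expose equivalent knobs (FSDECAY, FSDEPTH).
  DECAY = 0.8  # per-window decay factor (assumed)
  DEPTH = 7    # number of historical windows considered (assumed)
  WEIGHTS = [DECAY ** i for i in range(DEPTH)]

  def decayed_usage(windows):
      """Weighted average of per-window usage fractions; windows[0] is newest."""
      return sum(w * u for w, u in zip(WEIGHTS, windows)) / sum(WEIGHTS)

  # Hypothetical history: alice filled the farm recently, atlas did not.
  history = {"alice": [0.9, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1],
             "atlas": [0.1, 0.1, 0.5, 0.5, 0.5, 0.5, 0.5]}
  targets = {"alice": 0.2, "atlas": 0.5}  # assumed fairshare targets

  for vo in history:
      usage = decayed_usage(history[vo])
      # A positive priority means the VO is under its target and its
      # jobs should start ahead of a VO with a negative priority.
      print(f"{vo}: decayed usage {usage:.2f}, priority {targets[vo] - usage:+.2f}")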

Declared in the GOC DB

  • None

Advanced warning:

The following items remain to be scheduled/announced:

  • Test of fail-over of OPN link to CERN (Tuesday morning 24th August).
  • Wednesday 1st September (during LHC technical stop) - Maintenance on one of the transformers in R89.
  • lcgwms01 (4-day drain period + upgrade to glite-WMS 3.1.29-0) - Thu 2 Sep to Thu 9 Sep
  • lcgwms02 (4-day drain period + upgrade to glite-WMS 3.1.29-0) - Thu 9 Sep to Thu 16 Sep
  • Weekend Power Outage in Atlas building 2/3 October. Plans are under way for the moves necessary to ensure continuity for any services still using equipment in the Atlas building.
  • Glite update on worker nodes.
  • Replacement of Site-BDIIs.
  • Update firmware in RAID controller cards for a batch of disk servers.
  • Doubling of network link to network stack for tape robot and Castor head nodes.
  • Re-visit the SAN / multipath issue for the non-castor databases.

Entries in GOC DB starting between 11th and 18th August 2010.

There were no unscheduled entries in the GOC DB for this last week.

Service                  | Scheduled? | Outage/At Risk | Start            | End              | Duration | Reason
lcgbdii.gridpp.rl.ac.uk  | SCHEDULED  | AT_RISK        | 17/08/2010 10:00 | 17/08/2010 13:00 | 3 hours  | At Risk on Top-BDII for glite update.
lcgwms03.gridpp.rl.ac.uk | SCHEDULED  | OUTAGE         | 12/08/2010 15:00 | 19/08/2010 15:00 | 7 days   | WMS update to version 3.1.29-0. From the start of the outage until Tuesday 17th August (10:00 UTC) WMS03 will be in draining mode, in which existing jobs will be allowed to finish and their output retrieved.