Tier1 Operations Report 2010-09-01


RAL Tier1 Operations Report for 1st September 2010

Review of Issues during the two weeks from 18th August to 1st September 2010.

This report covers a two-week period. Operationally it has been a challenging fortnight, with staff attendance at GridPP and the extended bank holiday, during which RAL was closed on both Monday and Tuesday, 30/31 August.

  • Gdss417 (Atlas MCDisk) Post Mortem (still to be finalised) at:
 http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20100801_Disk_Server_Data_Loss_Atlas
  • The system that failed in the Top-BDII set on 17th August, reducing the number of nodes in the service from 5 to 4, was returned to service on Monday 23rd August and the DNS entry was amended to point back to all five Top-BDII nodes; a quick check of the round-robin alias is sketched after this list.
  • There have been two instances of very high load on the LHCb 3D database (Lugh) - 17th & 21st August - which stopped access for a while.
  • Gdss381 (CMSTemp) failed on Monday 9th August with a read-only file system and subsequently showed FSProbe errors; FSCK reported errors on one of the partitions. It was returned to service on 24th August following investigations and a disk replacement.
  • On Tuesday 24th August the failover of the RAL-CERN OPN link was made successfully.
  • On Wednesday 25th August the link from RAL to Janet was upgraded to be a 20Gbit link with a 20Gbit failover. (Note added after meeting.)
  • The maintenance on one of the transformers in R89, planned for today, Wednesday 1st September (during the LHC technical stop), was cancelled.
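
The Top-BDII nodes sit behind a single DNS round-robin entry, so returning a node to service amounts to adding its address back to that alias. As a minimal sketch of how the restored rotation could be verified, the Python snippet below resolves an alias and lists the distinct addresses behind it; the alias name and port are assumptions for illustration and are not taken from this report.

 # Minimal sketch (Python): resolve a round-robin alias and count the
 # distinct addresses behind it. The alias name below is hypothetical;
 # substitute the real Top-BDII alias in use at the site.
 import socket

 ALIAS = "top-bdii.example.org"   # hypothetical round-robin alias
 PORT = 2170                      # standard BDII LDAP port

 addresses = sorted({info[4][0] for info in
                     socket.getaddrinfo(ALIAS, PORT, socket.AF_INET)})

 print("%s resolves to %d address(es):" % (ALIAS, len(addresses)))
 for addr in addresses:
     print("  %s" % addr)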

Current operational status and issues.

  • Gdss280 (CMSFarmRead) showed FSProbe errors and was taken out of production on Thursday 19th August.
  • Gdss547 (AtlasScratchDisk) was found on 23rd August to be unable to communicate with CERN. Investigations are still ongoing. It was put back into production in 'draining' mode on Thursday 26th August.
  • Gdss81 (AtlasDataDisk) had a problem (read-only file system) on Wednesday 25th August and was taken out of production. Investigations showed disk array problems and FSCK errors on one of the partitions. It was returned to production, but in 'draining' mode, on Friday 27th.
  • Multiple problems on some LHCb disk servers. Several servers serving the LHCbMDst and LHCbUser space tokens have failed: the server stops, but subsequent investigation finds no obvious reason for the failure (i.e. no hardware fault). Investigations are ongoing, and these servers have been put back into production. The list of these failures so far is:
Server    Date        Space Token
gdss472   2010-08-07  LHCbMdst
gdss475   2010-08-04  LHCbUser
gdss475   2010-08-25  LHCbUser
gdss470   2010-08-26  LHCbMdst
gdss470   2010-08-28  LHCbMdst
gdss468   2010-08-29  LHCbMdst

Over the weekend we experienced significant operational problems for LHCb. The number of job slots per LHCb disk server was reduced significantly: from the nominal value of 400 to 300 on Thursday 26th August, to 200 on Friday 27th (ahead of the long weekend), and down to 100 following the double disk server failure in the night of 28/29 August. The maximum number of concurrent LHCb batch jobs had been reduced to 1500 on Friday (27th) and was further reduced to 1000 on 29th August. On Wednesday morning, 1st September, the job slots were increased back to 200 across all LHCb disk servers. This issue has been complicated by the LHCbUser space token filling up, resulting in a SAM test failure on Friday 27th August.

  • As reported at previous meetings, one power supply (of two) for one of the (two) EMC disk arrays behind the LSC/FTS/3D services was moved to UPS power as a test of whether the electrical noise has been reduced sufficiently. The test is ongoing, but some errors (roughly one per week) have been seen.
  • On Saturday 10th July transformer TX2 in R89 tripped. This is the same transformer that tripped some months ago and for which remedial work was undertaken. We had reported last week that the cause appeared to be related to temperature; however, further investigations suggest the cause is related to earth-leakage detection. Two (of the four) transformers still require checking as part of the planned work following the first failure of TX2.
  • There has been some discussion over batch scheduling as regards long jobs (Alice in this case) filling the farm during a 'quiet' period and preventing other VOs' jobs from starting until those jobs have finished. This is compounded by further contention between the other VOs (Atlas, CMS, LHCb) as job slots become free. Ways of improving this are still being reviewed; a simple illustration of fair-share ordering is sketched below.
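
As context for that discussion, fair-share scheduling orders pending work so that VOs furthest below their target share of the farm are dispatched first. The snippet below is a minimal, illustrative sketch of that ordering only; the share targets, recent-usage figures and job list are invented for illustration and do not describe the RAL batch system configuration.

 # Illustrative fair-share ordering: pending jobs from VOs furthest below
 # their target share are dispatched first. All numbers are invented for
 # illustration and do not reflect the RAL batch configuration.
 target_share = {"atlas": 0.45, "cms": 0.25, "lhcb": 0.20, "alice": 0.10}
 recent_usage = {"atlas": 0.30, "cms": 0.20, "lhcb": 0.10, "alice": 0.40}

 def deficit(vo):
     # Positive when a VO has recently used less than its target share.
     return target_share[vo] - recent_usage[vo]

 pending_jobs = ["alice", "atlas", "lhcb", "cms", "atlas"]
 for vo in sorted(pending_jobs, key=deficit, reverse=True):
     print("dispatch job for %-5s (share deficit %+.2f)" % (vo, deficit(vo)))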

Declared in the GOC DB

  • 1-8 September. WMS01. Maintenance and update (glite-WMS 3.1.29). Includes time for drain ahead of intervention.
  • 9-16 September. WMS02. Maintenance and update (glite-WMS 3.1.29). Includes time for drain ahead of intervention.

Advance warning:

The following items remain to be scheduled/announced:

  • Replacement of Site-BDIIs. At Risk for these on Tuesday 7th September.
  • Weekend Power Outage in Atlas building 2/3 October. Plans under way for the necessary moves to ensure continuity for any services still using equipment in the Atlas building.
  • Glite update on worker nodes.
  • Update firmware in RAID controller cards for a batch of disk servers.
  • Doubling of network link to network stack for tape robot and Castor head nodes.
  • Re-visit the SAN / multipath issue for the non-castor databases.
  • During next LHC technical stop (18-21 October): UPS maintenance and checks on transformers.

Entries in GOC DB starting between 18th August and 1st September 2010.

There were no unscheduled entries in the GOC DB during this period.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgwms01.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 01/09/2010 10:00 | 08/09/2010 11:00 | 7 days, 1 hour | Maintenance and update (glite-WMS 3.1.29). Includes time for drain ahead of intervention.
Whole Site | SCHEDULED | AT_RISK | 24/08/2010 07:30 | 24/08/2010 10:00 | 2 hours and 30 minutes | At Risk for whole site: Failover test on RAL-CERN OPN link following configuration of the backup connection.
lcgwms03.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 12/08/2010 15:00 | 18/08/2010 11:24 | 5 days, 20 hours and 24 minutes | WMS update to version 3.1.29-0. From the start of the outage until Tuesday 17th August (10:00 UTC) WMS03 will be in draining mode, when existing jobs will be allowed to finish and output retrieved.