Tier1 Operations Report 2010-09-01


RAL Tier1 Operations Report for 1st September 2010

Review of Issues during the two weeks from 18th August to 1st September 2010.

This report covers a two-week period. Operationally it has been a challenging fortnight, with staff attendance at GridPP and the extended bank holiday, during which RAL was closed on both Monday and Tuesday, 30/31 August.

  • Gdss417 (Atlas MCDisk) Post Mortem (still to be finalised) at:
 http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20100801_Disk_Server_Data_Loss_Atlas
  • The system that failed in the Top-BDII set on 17th August, reducing the number of nodes in the service from 5 to 4, was returned to service on Monday 23rd August and the DNS entry was amended to point back to all five Top-BDII nodes; a quick check of the round-robin alias is sketched after this list.
  • There have been two instances of very high load on the LHCb 3D database (Lugh) - 17th & 21st August - which stopped access for a while.
  • Gdss381 (CMSTemp) failed on Monday 9th August with a read-only file system and subsequently showed FSProbe errors; FSCK reported errors on one of the partitions. It was returned to service on 24th August following investigations and a disk replacement.
  • On Tuesday 24th August the failover of the RAL-CERN OPN link was made successfully.
  • On Wednesday 25th August the link from RAL to Janet was upgraded to be a 20Gbit link with a 20Gbit failover. (Note added after meeting.)
  • The maintenance on one of the transformers in R89, planned for today, Wednesday 1st September (during the LHC technical stop), was cancelled.
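
The Top-BDII nodes sit behind a single DNS round-robin entry, so returning a node to service amounts to adding its address back to that alias. As a minimal sketch of how the restored rotation could be verified, the Python snippet below resolves an alias and lists the distinct addresses behind it; the alias name and port are assumptions for illustration and are not taken from this report.

 # Minimal sketch (Python): resolve a round-robin alias and count the
 # distinct addresses behind it. The alias name below is hypothetical;
 # substitute the real Top-BDII alias in use at the site.
 import socket

 ALIAS = "top-bdii.example.org"   # hypothetical round-robin alias
 PORT = 2170                      # standard BDII LDAP port

 addresses = sorted({info[4][0] for info in
                     socket.getaddrinfo(ALIAS, PORT, socket.AF_INET)})

 print("%s resolves to %d address(es):" % (ALIAS, len(addresses)))
 for addr in addresses:
     print("  %s" % addr)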

Current operational status and issues.

  • Gdss280 (CMSFarmRead) showed FSProbe errors and was taken out of production on Thursday 19th August.
  • Gdss547 (AtlasScratchDisk) was found on 23rd August to be unable to communicate with CERN. Investigations are still ongoing. It was put back into production in 'draining' mode on Thursday 26th August.
  • Gdss81 (AtlasDataDisk) had a problem (read-only file system) on Wednesday 25th August and was taken out of production. Investigations showed disk array problems and FSCK errors on one of the partitions. It was returned to production, but in 'draining' mode, on Friday 27th.
  • Multiple problems on some LHCb disk servers. Several servers serving the LHCbMDst and LHCbUser space tokens have failed: the server stops, but subsequent investigation finds no obvious reason for the failure (i.e. no hardware fault). Investigations are ongoing, and these servers have been put back into production. The list of these failures so far is:
Server    Date        Space Token
gdss472   2010-08-07  LHCbMdst
gdss475   2010-08-04  LHCbUser
gdss475   2010-08-25  LHCbUser
gdss470   2010-08-26  LHCbMdst
gdss470   2010-08-28  LHCbMdst
gdss468   2010-08-29  LHCbMdst

Over the weekend we experienced significant operational problems for LHCb. The number of job slots per LHCb disk server was reduced significantly: from the nominal value of 400 to 300 on Thursday 26th August, to 200 on Friday 27th (ahead of the long weekend), and down to 100 following the double disk server failure in the night of 28/29 August. The maximum number of concurrent LHCb batch jobs had been reduced to 1500 on Friday (27th) and was further reduced to 1000 on 29th August. On Wednesday morning, 1st September, the job slots were increased back to 200 across all LHCb disk servers. This issue has been complicated by the LHCbUser space token filling up, resulting in a SAM test failure on Friday 27th August.

  • As reported at previous meetings, one power supply (of two) for one of the (two) EMC disk arrays behind the LSC/FTS/3D services was moved to UPS power as a test of whether the electrical noise has been reduced sufficiently. The test is ongoing, but some errors (roughly one per week) have been seen.
  • On Saturday 10th July transformer TX2 in R89 tripped. This is the same transformer that tripped some months ago and for which remedial work was undertaken. We had reported last week that the cause appeared to be related to temperature; however, further investigations suggest the cause is related to earth-leakage detection. Two (of the four) transformers still require checking as part of the planned work following the first failure of TX2.
  • There has been some discussion over batch scheduling as regards long jobs (Alice in this case) filling the farm during a 'quiet' period and preventing other VOs' jobs from starting until those jobs have finished. This is compounded by further contention between the other VOs (Atlas, CMS, LHCb) as job slots become free. Ways of improving this are still being reviewed; a simple illustration of fair-share ordering is sketched below.
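
As context for that discussion, fair-share scheduling orders pending work so that VOs furthest below their target share of the farm are dispatched first. The snippet below is a minimal, illustrative sketch of that ordering only; the share targets, recent-usage figures and job list are invented for illustration and do not describe the RAL batch system configuration.

 # Illustrative fair-share ordering: pending jobs from VOs furthest below
 # their target share are dispatched first. All numbers are invented for
 # illustration and do not reflect the RAL batch configuration.
 target_share = {"atlas": 0.45, "cms": 0.25, "lhcb": 0.20, "alice": 0.10}
 recent_usage = {"atlas": 0.30, "cms": 0.20, "lhcb": 0.10, "alice": 0.40}

 def deficit(vo):
     # Positive when a VO has recently used less than its target share.
     return target_share[vo] - recent_usage[vo]

 pending_jobs = ["alice", "atlas", "lhcb", "cms", "atlas"]
 for vo in sorted(pending_jobs, key=deficit, reverse=True):
     print("dispatch job for %-5s (share deficit %+.2f)" % (vo, deficit(vo)))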

Declared in the GOC DB

  • 1-8 September. WMS01. Maintenance and update (glite-WMS 3.1.29). Includes time for drain ahead of intervention.
  • 9-16 September. WMS02. Maintenance and update (glite-WMS 3.1.29). Includes time for drain ahead of intervention.

Advance warning:

The following items remain to be scheduled/announced:

  • Replacement of Site-BDIIs. At Risk for these on Tuesday 7th September.
  • Weekend Power Outage in Atlas building 2/3 October. Plans under way for the necessary moves to ensure continuity for any services still using equipment in the Atlas building.
  • Glite update on worker nodes.
  • Update firmware in RAID controller cards for a batch of disk servers.
  • Doubling of network link to network stack for tape robot and Castor head nodes.
  • Re-visit the SAN / multipath issue for the non-castor databases.
  • During next LHC technical stop (18-21 October): UPS maintenance and checks on transformers.

Entries in GOC DB starting between 18th August and 1st September 2010.

There were no unscheduled entries in the GOC DB during this period.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgwms01.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 01/09/2010 10:00 | 08/09/2010 11:00 | 7 days, 1 hour | Maintenance and update (glite-WMS 3.1.29). Includes time for drain ahead of intervention.
Whole Site | SCHEDULED | AT_RISK | 24/08/2010 07:30 | 24/08/2010 10:00 | 2 hours and 30 minutes | At Risk for whole site: Failover test on RAL-CERN OPN link following configuration of the backup connection.
lcgwms03.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 12/08/2010 15:00 | 18/08/2010 11:24 | 5 days, 20 hours and 24 minutes | WMS update to version 3.1.29-0. From the start of the outage until Tuesday 17th August (10:00 UTC) WMS03 will be in draining mode, when existing jobs will be allowed to finish and output retrieved.