RAL Tier1 Operations Report for 17th October 2012

Review of Issues during the fortnight 3rd to 17th October 2012

There was a problem with the FTS during the evening of 4th October. The FTS has recently been patched - but the patch does not fix this problem (although it does slightly change its behaviour). There were problems again during the evening of Monday 8th Oct, and Monday 15th Oct.
On 5th October we declared one lost file to CMS. This was picked up by the checksum checker. The file was on a tape backed service class but had not yet been migrated to tape.
On Saturday morning 6th Oct one of the nodes in the TopBDII crashed. The BDII set ran with four out of five nodes available until the node was restarted on Monday morning.
The planned upgrade of the LHCb Castor instance to version 2.1.12 on 9th Oct. was cancelled at the end of the preceding afternoon. An increase in free disk space had been noticed in the period shortly after the upgrade of the Atlas Castor instance, and further upgrades were put on hold until the issue was understood (which it now is). See Tier1 BLOG entry.

Resolved Disk Server Issues

GDSS673 (CMSTape - D0T1) crashed on Friday evening 5th Oct. It was returned to production on Sunday evening (7th).
GDSS555 (AtlasDataDisk - D1T0) crashed on Wednesday afternoon (10th Oct). After the restart a memory test was run and the disk array was verified. The system was returned to production on Friday morning (12th Oct) when the disk verification was around one third complete without problems.

Current operational status and issues

On 12th/13th June the first stage of switching ready for the work on the main site power supply took place. The work on the two transformers is expected to take until 18th December and involves powering off one half of the resilient supply for 3 months while it is overhauled, then repeating the process with the other half. The work is running to schedule. One half of the new switchboard has been refurbished and was brought into service on 17 September.
High load observed on the uplink to one of the network stacks (stack 13), serving SL09 disk servers (~3PB of storage). Ongoing work by the Fabric team to improve the uplink.

Ongoing Disk Server Issues

GDSS454 (AtlasDataDisk - D1T0) failed yesterday afternoon (16th Oct). The RAID array is currently being verified. We have lost one file from this server.

Notable Changes made this last week

5th Oct - Migration of LHCb data from T10KA to T10KC tapes completed successfully. See Tier1 BLOG entry.
9th Oct - CMS castor instance made accessible from UK/European/global xrootd redirectors.
10th Oct - WMS03 upgraded to EMI version.
15th Oct - Oracle patches applied to the databases behind the Atlas conditions, FTS and non-LHC LFC services (databases SOMNUS & OGMA)
16th Oct - CMS Castor instance upgraded to version 2.1.12-10.
The LHCb 3D & LFC database systems have been withdrawn from service.
Updated errata rolled out across batch farm.
Continuing test of hyperthreading on one batch of worker nodes.
As stated before: CVMFS available for testing by non-LHC VOs (including "stratum 0" facilities).
A test queue ("gridTest") is available with (currently) four worker nodes running EMI-2/SL5. In addition a further ten nodes (one from each hardware generation/batch) installed with EMI-2/SL5 are running as part of the normal batch system.
Test instance of FTS version 3 available. The non-LHC VOs that use the existing service have been enabled on it and we are looking for one of the VOs to test it.

Declared in the GOC DB

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced.

Tuesday 23rd October: Upgrade of LHCb Castor instance to Version 2.1.12-10. (Re-scheduled after not being done on 9th Oct.)
Tuesday 30th October: Upgrade of GEN Castor instance to Version 2.1.12-10. (Re-scheduled from 23rd Oct.)
Upgrade of CEs to EMI version. If final tests go well, we propose doing this next Tuesday (23rd Oct.).

Listing by category:

Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
Castor:
- Upgrade to version 2.1.12. (As detailed above).
Networking:
- Install new Routing layer for Tier1 and update the way the Tier1 connects to the RAL network. (Plan to co-locate with replacement of UKlight network).
- Update Spine layer for Tier1 network.
- Replacement of UKLight Router.
- Addition of caching DNSs into the Tier1 network.
Grid Services:
- CEs being upgraded to EMI version now.
- Rolling upgrade of WMSs to EMI version underway.
- Enabling overcommit on WNs to make use of hyperthreading (will be implemented after the CE upgrades are complete).

Updates of Grid Services as appropriate. (Services now on EMI/UMD versions unless there is a specific reason not.)

Infrastructure:
- Intervention required on the "Essential Power Board". (An "At Risk"). Proposed Date 20th November.
- Remedial work on three (out of four) transformers. Will require two "At Risk" periods. Likely to be in November.
- Remedial work on the BMS (Building Management System) due to one of its three modules being faulty. Will require a further "At Risk".

Entries in GOC DB starting between 3rd and 17th October 2012

The only unscheduled outage in the GOC DB for this period is for the retirement of a CMS VO box.

Service	Scheduled?	Outage/At Risk	Start	End	Duration	Reason
lcgwms01	SCHEDULED	OUTAGE	17/10/2012 15:00	19/10/2012 13:00	1 day, 22 hours	EMI WMS update to v3.3.8
srm-cms	SCHEDULED	OUTAGE	16/10/2012 08:00	16/10/2012 14:00	6 hours	Upgrade of CMS Castor instance to Version 2.1.12-10.
lcgfts.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk,	SCHEDULED	WARNING	15/10/2012 09:00	15/10/2012 13:00	4 hours	Rolling application of Oracle Patches to database systems behind these services.
lcgwms01	SCHEDULED	OUTAGE	12/10/2012 10:00	17/10/2012 15:00	5 days, 5 hours	EMI WMS update to v3.3.8
lcgwms03	SCHEDULED	OUTAGE	04/10/2012 11:00	10/10/2012 12:00	6 days, 1 hour	EMI WMS update to v3.3.8
lcgvo-02-21.gridpp.rl.ac.uk,	UNSCHEDULED	OUTAGE	03/10/2012 15:30	31/10/2012 23:00	28 days, 8 hours and 30 minutes	System being decommissioned (This is a CMS VOBOX)
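
The Duration column can be checked from the Start and End columns. The sketch below is illustrative only and assumes the times in the table are UK local time (an assumption, not stated in this report); the function name elapsed is hypothetical. Under that assumption the VOBOX entry spans the end of British Summer Time on 28th October 2012, which would account for its duration being 8 hours and 30 minutes over the 28 days rather than 7 hours and 30 minutes.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo  # standard library from Python 3.9

# Assumption: GOC DB times in the table are UK local time.
UK = ZoneInfo("Europe/London")
FMT = "%d/%m/%Y %H:%M"  # matches e.g. "03/10/2012 15:30"


def elapsed(start: str, end: str) -> timedelta:
    """Return the real elapsed time between two table timestamps.

    Both values are converted to UTC before subtracting, because
    subtracting two datetimes that share the same tzinfo object
    ignores any change of UTC offset between them.
    """
    s = datetime.strptime(start, FMT).replace(tzinfo=UK).astimezone(timezone.utc)
    e = datetime.strptime(end, FMT).replace(tzinfo=UK).astimezone(timezone.utc)
    return e - s


if __name__ == "__main__":
    # lcgwms01 outage, entirely within British Summer Time
    print(elapsed("12/10/2012 10:00", "17/10/2012 15:00"))
    # CMS VOBOX decommissioning, spans the clock change
    print(elapsed("03/10/2012 15:30", "31/10/2012 23:00"))
```

If the table times were in fact UTC, the same subtraction would give 28 days, 7 hours and 30 minutes for the last entry.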

Open GGUS Tickets


GGUS ID	Level	Urgency	State	Creation	Last Update	VO	Subject
87455	Green	Urgent	In Progress	2012-10-17	2012-10-17	Atlas	RAL-LCG2_HIMEM: jobs failing with stage-in errors
86705	Red	Less Urgent	In Progress	2012-10-03	2012-10-09	SNO+	RAL jobs returning errors
86690	Red	Urgent	In Progress	2012-10-03	2012-10-11	T2K	JPKEKCRC02 missing from FTS ganglia metrics
86152	Red	Less Urgent	In Progress	2012-09-17	2012-09-19		correlated packet-loss on perfsonar host
68853	Red	Less Urgent	In Progress	2011-03-22	2012-08-10	N/A	Retirement of SL4 and 32bit DPM Head nodes and Servers

Availability Report


Day	OPS	Alice	Atlas	CMS	LHCb	Comment
03/10/12	100	100	99.1	100	100	Single failure to connect to srm-atlas.
04/10/12	100	100	100	100	100
05/10/12	100	100	100	100	100
06/10/12	100	100	100	100	100
07/10/12	100	100	100	100	100
08/10/12	100	100	100	100	100
09/10/12	100	100	100	100	100
10/10/12	100	100	100	100	100
11/10/12	100	95.9	100	100	100	Timeout for the CE test job exceeded.
12/10/12	100	84.7	100	100	100	Timeout for the CE test job exceeded.
13/10/12	100	100	100	100	100
14/10/12	100	100	100	100	100
15/10/12	95.8	100	100	100	100	CE tests failed. Testing of new EMI CEs affected BDII data.
16/10/12	100	100	100	85.3	100	CMS Castor instance upgraded to version 2.1.12.
