RAL Tier1 Operations Report for 3rd October 2012

Review of Issues during the fortnight 19th September to 3rd October 2012

During the evening of Thursday 20th Sep there was a problem with the Atlas Castor instance. This was traced to a problem in the Castor database (an "orphaned sub-request"). The problem caused the Atlas Castor instance to be unavailable for some hours and lasted just into the following day.
On Sunday evening (30th September) there was a problem with the Atlas SRM database. This is believed to be an Oracle bug, and unrelated to the Castor 2.1.12 update of some days earlier. SRM_Atlas was unavailable for several hours.
Yesterday (Tuesday 2nd October) four files were reported lost to LHCb. The losses were discovered when attempting to recall the files from tape. All four files were on the same tape. No other files on that tape are affected.

Resolved Disk Server Issues

GDSS399 (LhcbRawRdst - D0T1) was taken out of production on Monday (1st Oct). A failed disk was replaced but the rebuild did not go normally - it was very slow. The machine was rebooted and the RAID rebuild tracked to ensure it was OK. System returned to production this morning (3rd Oct).

Current operational status and issues

On 12th/13th June the first stage of switching ready for the work on the main site power supply took place. The work on the two transformers is expected to take until 18th December and involves powering off one half of the resilient supply for 3 months while it is overhauled, then repeating the process with the other half. The work is running to schedule. One half of the new switchboard has been refurbished and was brought into service on 17th September.
The migration of LHCb data from the T10KA to the T10KC tapes is progressing. 193 LHCb tapes are left to migrate and the process should be finished by next week.
High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3PB of storage). This appears to be having a negative impact on the efficiency of CMS jobs. The Fabric team are looking at improving the uplink.

Ongoing Disk Server Issues

Notable Changes made this last week

On Tuesday 25th Sep. the Atlas Castor instance was upgraded to version 2.1.12-10.
On Tuesday 25th Sep. Oracle updates were applied to the Atlas TAGS database.
The Rolling (transparent) migration of LHCb LFC front ends to EMI-2 on Virtual Machines has been completed.
On Monday 1st October the FTS front ends were moved to virtual machines and a patch applied that addresses the problem of the 'wrong' proxy being picked up.
Updated errata are being rolled out across the batch farm.
The test of hyperthreading on one batch of worker nodes (the Dell 2011 batch) is continuing. No problems have been observed with a 50% over-commit. Pending Change Control approval, we expect to increase the over-commit on all hyperthreaded nodes at the start of October. Comments/concerns from VOs are welcome.
As stated before: CVMFS available for testing by non-LHC VOs (including "stratum 0" facilities).
A test queue ("gridTest") is available with (currently) four worker nodes running EMI-2/SL5. In addition, a further ten nodes (one from each hardware generation/batch) installed with EMI-2/SL5 are running as part of the normal batch system.
Test instance of FTS version 3 available. The non-LHC VOs that use the existing service have been enabled on it and we are looking for one of the VOs to test it.

Declared in the GOC DB

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
Castor:
- Upgrade to version 2.1.12. (As detailed above).
Networking:
- Install new Routing layer for Tier1 and update the way the Tier1 connects to the RAL network. (Plan to co-locate with replacement of UKLight network).
- Update Spine layer for Tier1 network.
- Replacement of UKLight Router.
- Addition of caching DNSs into the Tier1 network.
Grid Services:
- Updates of Grid Services as appropriate. (Services now on EMI/UMD versions unless there is a specific reason not.)
Infrastructure:
- Intervention required on the "Essential Power Board". (An "At Risk"). Proposed Date 20th November.
- Remedial work on three (out of four) transformers. Will require two "At Risk" periods. Likely to be in November.
- Remedial work on the BMS (Building Management System) due to one of its three modules being faulty. Will require a further "At Risk".

Entries in GOC DB starting between 19th September and 3rd October 2012

There were two unscheduled outages (one followed by a 'warning') in the GOC DB for this period. Both refer to srm-atlas and are detailed above.

Service	Scheduled?	Outage/At Risk	Start	End	Duration	Reason
srm-atlas	UNSCHEDULED	WARNING	01/10/2012 00:11	01/10/2012 12:00	11 hours and 49 minutes	Following the ATLAS SRM database problems, we are returning to service as an AT-RISK until tomorrow.
srm-atlas	UNSCHEDULED	OUTAGE	30/09/2012 21:00	01/10/2012 00:08	3 hours and 8 minutes	Atlas SRM failing due to problems in the underlying databases.
srm-atlas	SCHEDULED	OUTAGE	25/09/2012 09:00	25/09/2012 12:10	3 hours and 10 minutes	Upgrade of Atlas Castor instance to Version 2.1.12-10.
srm-atlas	UNSCHEDULED	OUTAGE	20/09/2012 21:30	21/09/2012 01:15	3 hours and 45 minutes	Atlas SRMs failed due to an orphaned subrequest causing database queries to block for the Atlas Castor instance.
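
The Duration column above can be cross-checked against the Start and End timestamps. The following is a minimal sketch in Python, not part of any GOC DB tooling: the names ENTRIES, FMT, duration_minutes and format_duration are illustrative assumptions, and the timestamps are simply copied from the table (day/month/year, 24-hour clock).

```python
from datetime import datetime

# Start/end timestamps copied from the table above (assumed day/month/year).
ENTRIES = [
    ("WARNING", "01/10/2012 00:11", "01/10/2012 12:00"),
    ("OUTAGE", "30/09/2012 21:00", "01/10/2012 00:08"),
    ("OUTAGE", "25/09/2012 09:00", "25/09/2012 12:10"),
    ("OUTAGE", "20/09/2012 21:30", "21/09/2012 01:15"),
]

FMT = "%d/%m/%Y %H:%M"


def duration_minutes(start, end):
    """Return the whole minutes elapsed between two timestamps in FMT."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


def format_duration(minutes):
    """Render minutes in the same wording as the Duration column."""
    hours, mins = divmod(minutes, 60)
    return f"{hours} hours and {mins} minutes"


if __name__ == "__main__":
    for kind, start, end in ENTRIES:
        print(kind, start, end, format_duration(duration_minutes(start, end)))
```

Run against the four rows, the computed durations agree with those listed in the table.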

Open GGUS Tickets


GGUS ID	Level	Urgency	State	Creation	Last Update	VO	Subject
86570	Yellow	Very Urgent	In Progress	2012-10-01	2012-10-01		Moving to SHA2 GGUS certificate
86152	Red	Less Urgent	In Progress	2012-09-17	2012-09-19		correlated packet-loss on perfsonar host
85077	Red	Less Urgent	In Progress	2012-08-13	2012-09-17	Biomed	CE lcgce05.gridpp.rl.ac.uk job cannot register file on SE srm-biomed.gridpp.rl.ac.uk
68853	Red	Less Urgent	On hold	2011-03-22	2012-09-04	N/A	Retirement of SL4 and 32bit DPM Head nodes and Servers

Availability Report


Day	OPS	Alice	Atlas	CMS	LHCb	Comment
19/09/12	100	100	100	100	100
20/09/12	100	100	86.7	100	100	Castor Database blocking issue late evening
21/09/12	100	100	97.3	100	100	Continuation of above after midnight
22/09/12	100	100	100	100	100
23/09/12	100	100	100	100	100
24/09/12	100	100	99.0	100	100	Single failure to connect to srm-atlas.
25/09/12	100	100	85.3	100	100	Scheduled Atlas Castor Stager 2.1.12-10 update.
26/09/12	100	100	100	100	100
27/09/12	100	100	100	100	100
28/09/12	100	100	100	100	100
29/09/12	100	100	100	100	100
30/09/12	100	100	78.7	100	100	Problem with Atlas SRM database.
01/10/12	100	100	100	100	100
02/10/12	100	93.4	100	100	100	CE test failed when job aborted by VO.
