Tier1 Operations Report 2013-06-26


RAL Tier1 Operations Report for 26th June 2013

Review of Issues during the week 19th to 26th June 2013.
  • Last week's report referred to an announced scheduled maintenance on both CERN Primary & Backup links overnight Tue/Wed 25/26 June. This was subsequently clarified to only affect the Backup Link - and no break in the connectivity of the Primary Link was seen during the announced time window.
  • There was a problem with the Atlas SRM database late on Saturday evening (22nd June). This was resolved by the call-out team around midnight; the problem lasted a few hours in total.
  • There have been further problems with the Atlas SRM database this morning: one of the Oracle RAC nodes became unstable, and an Unscheduled Outage has been declared in the GOC DB.
  • A problem has been seen on the RAL site firewall since Saturday (22nd): the logging of a large number of connection requests was causing high load. This was traced to the ALICE file sharing system making many outbound connections (see the connection-count sketch after this list). The number of ALICE jobs is being restricted temporarily, which has eased the problem while a better fix is decided on.
  • The intermittent problems starting LHCb batch jobs reported last week have largely disappeared this week. However, the cause is not fully understood.
  • The issue of LHCb CE tests ending up in the 'whole node' queue (reported last week) is now understood. Discussions with LHCb are ongoing.
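The firewall problem above was pinned down by seeing which hosts were opening the most outbound connections. As a rough illustration of that kind of check, the sketch below tallies connections per source host from a plain-text connection log; the log path and column layout are assumptions, not the actual RAL firewall log format.

    # Hypothetical sketch: count outbound connections per source host from a
    # plain-text connection log. The log path and column layout are assumed,
    # not taken from the real RAL firewall configuration.
    from collections import Counter

    LOG_FILE = "/var/log/firewall/connections.log"   # placeholder path

    counts = Counter()
    with open(LOG_FILE) as log:
        for line in log:
            fields = line.split()
            if len(fields) < 2:
                continue
            counts[fields[0]] += 1   # assumed: first column is the source host

    # Show the ten hosts opening the most outbound connections.
    for host, n in counts.most_common(10):
        print("%8d  %s" % (n, host))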
Resolved Disk Server Issues
  • Disk Server GDSS720 (AtlasDataDisk - D1T0) has been completely drained following its failure a couple of weeks ago. The server is out of production for a hardware fix & testing.
Current operational status and issues
  • The successful UPS/Generator load test yesterday gives us much more confidence that this system would work if there were to be a power failure.
  • The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
  • The problem of LHCb jobs failing due to long job set-up times remains and investigations continue. Recent updates to the CVMFS clients have improved the situation for Atlas.
  • The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
  • We are participating in xrootd federated access tests for Atlas (a simple illustrative read check is sketched after this list).
  • Testing of the proposed new batch system (ARC-CEs, Condor, SL6) is ongoing. Atlas and CMS are running work through it, and ALICE & LHCb are being brought on board with the testing.
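As a rough illustration of the federated access testing mentioned above, the sketch below reads a file through an xrootd redirector with the standard xrdcp client; the redirector hostname and file path are placeholders, not the endpoints actually used in the Atlas tests.

    # Minimal sketch of a federated xrootd read check: copy a file via a
    # redirector using xrdcp. Hostname and path are placeholders.
    import subprocess
    import tempfile

    REDIRECTOR = "redirector.example.org"          # placeholder
    TEST_FILE = "/atlas/path/to/a/test/file.root"  # placeholder

    with tempfile.NamedTemporaryFile() as local_copy:
        status = subprocess.call(
            ["xrdcp", "-f",
             "root://%s/%s" % (REDIRECTOR, TEST_FILE),
             local_copy.name])

    print("federated read %s" % ("OK" if status == 0 else "FAILED"))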
Ongoing Disk Server Issues
  • None
Notable Changes made this last week
  • There was a successful UPS/Generator load test yesterday (25th June).
  • Started to roll out new EMI-3 site and top BDIIs on SL6.4 (a basic publishing check is sketched after this list).
  • A new SL6/EMI-3 UI has been set-up.
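A straightforward way to confirm that a newly deployed BDII is publishing is to query its LDAP endpoint on port 2170. The sketch below wraps the standard ldapsearch query; the hostname is a placeholder rather than one of the new RAL hosts.

    # Minimal sketch of a BDII publishing check: query the LDAP endpoint and
    # count the GlueServiceEndpoint attributes returned. Hostname is a placeholder.
    import subprocess

    BDII_HOST = "bdii.example.org"   # placeholder

    output = subprocess.check_output(
        ["ldapsearch", "-x", "-LLL",
         "-H", "ldap://%s:2170" % BDII_HOST,
         "-b", "o=grid",
         "(objectClass=GlueService)", "GlueServiceEndpoint"]).decode()

    endpoints = output.count("GlueServiceEndpoint:")
    print("BDII %s published %d service endpoints" % (BDII_HOST, endpoints))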
Declared in the GOC DB
  • None.
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • The first part of the Castor 2.1.13 upgrade, updating the Castor Nameserver, is being planned for next Wednesday (3rd July) and will entail a complete Castor stop. The Castor Stagers for the individual instances will be upgraded in the following weeks.
  • The test ARC-CEs will be added into the BDII tomorrow morning (27th June).
  • Re-establishing the paired (2*10Gbit) link to the UKLight router. (Aiming to do this in the next few weeks.)

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.13
  • Networking:
    • Single link to UKLight Router to be restored as paired (2*10Gbit) link.
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
  • Grid Services
    • Testing of alternative batch systems (SLURM, Condor) along with ARC-CEs and SL6 Worker Nodes (a trivial test submission is sketched after this list).
  • Fabric
    • One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required.
  • Infrastructure:
    • A 2-day maintenance is being planned sometime in October or November for the following. This is expected to require around a half-day outage of power to the UPS room, with Castor & batch down for the remaining 1.5 days as equipment is switched off in rotation for the tests.
      • Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
      • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
      • Electrical safety check. This will require significant (most likely 2 days) downtime during which time the above infrastructure issues will also be addressed.
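As a rough illustration of the batch system testing noted under Grid Services, the sketch below submits a trivial job to an HTCondor schedd by writing a minimal submit description and calling condor_submit; it assumes a local HTCondor client installation and is not the actual test workload used by the experiments.

    # Minimal sketch of a trivial HTCondor test submission. Assumes a local
    # HTCondor client; the job simply runs /bin/hostname on a worker node.
    import subprocess
    import tempfile

    SUBMIT_LINES = [
        "universe   = vanilla",
        "executable = /bin/hostname",
        "output     = test_job.out",
        "error      = test_job.err",
        "log        = test_job.log",
        "queue",
    ]
    SUBMIT_DESCRIPTION = "\n".join(SUBMIT_LINES) + "\n"

    with tempfile.NamedTemporaryFile(suffix=".sub", delete=False) as sub_file:
        sub_file.write(SUBMIT_DESCRIPTION.encode())
        submit_path = sub_file.name

    # Hand the job to the local schedd.
    subprocess.check_call(["condor_submit", submit_path])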
Entries in GOC DB starting between 19th and 26th June 2013.

There was one unscheduled entry in the GOC DB - for the Atlas SRM problems this morning.

Service Scheduled? Outage/At Risk Start End Duration Reason
srm-atlas UNSCHEDULED OUTAGE 26/06/2013 10:40 26/06/2013 11:50 2 hours Problem with Database Behind Atlas SRM.
Whole Site SCHEDULED WARNING 25/06/2013 10:00 25/06/2013 11:20 1 hour and 20 minutes At Risk during UPS/Generator load test.
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
95104 Green Less Urgent On Hold 2013-06-26 2013-06-26 CMS glidein Hammer Cloud problem at T1_UK_RAL
91658 Red Less Urgent On Hold 2013-02-20 2013-06-19 LFC webdav support
86152 Red Less Urgent On Hold 2012-09-17 2013-06-17 correlated packet-loss on perfsonar host
Availability Report
Day OPS Alice Atlas CMS LHCb Comment
19/06/13 100 100 99.1 100 100 Single failure on SRM 'GET'. Couldn't contact disk server.
20/06/13 100 100 100 95.9 100 Single failure of SRM PUT test. (Timeout.)
21/06/13 100 100 100 100 100
22/06/13 100 100 85.9 100 100 Problem with database behind the Atlas SRM. (Started late evening).
23/06/13 100 100 97.7 100 100 Tail end of above problem.
24/06/13 100 100 100 100 100
25/06/13 100 100 99.2 100 100 Single SRM test failure - failed to delete a file.
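As a rough guide to reading the availability column, and assuming the daily figure is simply the fraction of the 24-hour day recorded in an OK state (an assumption, not the official availability algorithm), each deduction corresponds to a short period of recorded unavailability, consistent with the single test failures noted in the comments.

    # Rough illustration only: convert a daily availability percentage into the
    # implied number of minutes recorded as unavailable, assuming availability
    # is the OK fraction of a 24-hour day (not the official algorithm).
    MINUTES_PER_DAY = 24 * 60

    for availability in (99.1, 95.9, 85.9, 97.7, 99.2):
        down_minutes = (100.0 - availability) / 100.0 * MINUTES_PER_DAY
        print("%5.1f%% availability ~ %4.0f minutes unavailable"
              % (availability, down_minutes))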