RAL Tier1 Operations Report for 11th September 2013

Review of Issues during the week 4th to 11th September 2013.

On Thursday (5th Sep) there were problems with the CREAM CE in front of the Condor farm (cream-ce01). Following investigations, and after draining of the CE had been started, it was re-installed on Friday morning (6th Sep) and the problems were resolved.
On Saturday morning (7th Sep) there were problems with the Atlas & GEN Castor instances / SRMs. However, it was difficult to investigate as there was a gap in the logging - which was also seen in other Castor headnodes and disk servers. The problem was partly worked around. On Monday a problem was found (and fixed) with one of the system loggers.
On Sunday late afternoon a Stage1 Fire alarm was activated in the R89 machine room. Staff attended and the problem was traced to an Active Harmonic (electrical) Filter in one of the Power Distribution Units. There was no further effect of the failure. On Monday morning the filter was switched off.

Resolved Disk Server Issues

Current operational status and issues

The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
The FTS3 testing has continued very actively. Atlas have moved the UK, German and French clouds to use it. Problems with FTS3 are being uncovered during these tests, and patches are regularly being applied to deal with them. (The FTS3 servers were upgraded to version 3.1.9-1.el6 during this last week.)
We are participating in xrootd federated access tests for Atlas. The server has now been successfully configured to work as an xroot redirector, whereas before it could only serve as a proxy.
The Change Control process has agreed that we will move to a Condor batch farm with (at least initially) both ARC and CREAM CEs. Final testing is ongoing with this farm (ARC-CEs, Condor, SL6). The '08 and '09 batches of worker nodes are currently being drained and moved from the old (Torque/Maui) farm to the Condor farm. However, we have not yet finalised whether the migration of all nodes to SL6 will be done by moving the remaining WNs to the Condor farm or whether a portion of the farm will be upgraded 'in-situ' in the Torque/Maui farm.

Ongoing Disk Server Issues

Notable Changes made this last fortnight.

Both lcgwms04 & lcgwms05 have been upgraded and are now EMI-3 SL6 WMS servers. (The installation is SHA-2 compliant and also includes the Condor release needed for interfacing with ARC-CE nodes.)
On Monday (9th Sep) FTS3 was updated to version 3.1.9-1.el6
Access to the 'whole node' queue on the Torque/Maui farm has been restricted (access via lcgce02 and lcgce04 stopped).

Declared in the GOC DB

LCGCE12 (CE for SL6 test Queue on the production batch farm) is in a long Outage ready for decommissioning.

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced.

The "Whole Node" queue on the Torque/Maui batch service is being terminated. Multi-core jobs and those requiring SL6 can be run on the Condor batch system.
Thursday 26th September is being provisionally allocated for upgrading the Torque/Maui batch farm WNs to SL6.
On Tuesday 1st October RAL network connections will move to SuperJanet 6.
Monday 7th October: Replacement of fans in UPS (UPS not available for 4-5 hours).
Re-establishing the paired (2*10Gbit) link to the UKLight router.

Listing by category:

Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
Castor:
- None
Networking:
- Single link to UKLight Router to be restored as paired (2*10Gbit) link.
- Update core Tier1 network and change connection to site and OPN including:
  - Install new Routing layer for Tier1
  - Change the way the Tier1 connects to the RAL network.
  - These changes will lead to the removal of the UKLight Router.
Grid Services
- Testing of alternative batch systems (Condor) along with ARC & CREAM CEs and SL6 Worker Nodes.
Fabric
- One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required - initially to update the disk array's firmware.
Infrastructure:
- A 2-day maintenance on the UPS along with the safety testing of associated electrical circuits is being planned for the 5th/6th November (TBC). The impact of this on our services is still being worked out. During this the following issues will be addressed:
  - Intervention required on the "Essential Power Board".
  - Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
  - Electrical safety check. This will take place over a couple of days during which time individual UPS circuits will need to be powered down.

Entries in GOC DB starting between the 4th and 11th September 2013.

There were no unscheduled outages in the GOC DB for this period.

Service	Scheduled?	Outage/At Risk	Start	End	Duration	Reason
lcgwms05.gridpp.rl.ac.uk	SCHEDULED	OUTAGE	06/09/2013 13:00	11/09/2013 10:55	6 days, 2 hours	upgrade to EMI-3
lcgce12.gridpp.rl.ac.uk	SCHEDULED	OUTAGE	05/09/2013 13:00	04/10/2013 13:00	29 days	CE (and the SL6 batch queue behind it) being decommissioned.
lcgwms04.gridpp.rl.ac.uk	SCHEDULED	OUTAGE	29/08/2013 17:00	05/09/2013 09:05	6 days, 16 hours and 5 minutes	upgrade to EMI-3
lcgce12.gridpp.rl.ac.uk	SCHEDULED	OUTAGE	06/08/2013 13:00	05/09/2013 13:00	30 days	CE (and the SL6 batch queue behind it) being decommissioned.
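
The Duration column can be cross-checked from the Start and End columns. The following is a minimal sketch, assuming the day-first DD/MM/YYYY HH:MM format shown in the table; the function name outage_duration and the constant GOCDB_FORMAT are hypothetical and not part of any GOC DB tooling.

```python
from datetime import datetime

# Assumption: timestamps are day-first, as they appear in the table above.
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def outage_duration(start: str, end: str) -> str:
    """Return the elapsed time between two table timestamps as text."""
    delta = datetime.strptime(end, GOCDB_FORMAT) - datetime.strptime(start, GOCDB_FORMAT)
    # timedelta stores whole days separately from the leftover seconds.
    hours, remainder = divmod(delta.seconds, 3600)
    minutes = remainder // 60
    return f"{delta.days} days, {hours} hours and {minutes} minutes"


# Start and End values taken from the lcgwms04 row.
print(outage_duration("29/08/2013 17:00", "05/09/2013 09:05"))
```

For the lcgwms04 row this reproduces the "6 days, 16 hours and 5 minutes" recorded in the table.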

Open GGUS Tickets (Snapshot at time of meeting)


GGUS ID	Level	Urgency	State	Creation	Last Update	VO	Subject
97168	Green	Less urgent	In Progress	2013-09-09	2013-09-09	londongrid	LFC problems for londongrid vo (lfc.gridpp.rl.ac.uk)
97025	Red	Less urgent	In Progress	2013-09-03	2013-09-04		Myproxy server certificate does not contain hostname
96235	Red	Less urgent	On Hold	2013-07-29	2013-09-09	hyperk.org	LFC for hyperk.org
96233	Red	Less Urgent	On Hold	2013-07-29	2013-09-09	hyperk.org	WMS for hyperk.org - RAL
95996	Red	Urgent	In Progress	2013-07-22	2013-09-05	OPS	SHA-2 test failing on lcgce01
91658	Red	Less Urgent	On Hold	2013-02-20	2013-09-03		LFC webdav support
86152	Red	Less Urgent	On Hold	2012-09-17	2013-16-17		correlated packet-loss on perfsonar host

Availability Report


Day	OPS	Alice	Atlas	CMS	LHCb	Comment
04/09/13	100	100	96.2	95.5	100	Atlas a few Castor SUM test failures; CMS a single one. (Error reading token data)
05/09/13	100	100	99.4	100	100	Single SRM SUM test failure: Error on Del. "Error reading token data header: Connection closed"
06/09/13	100	100	99.3	100	100	Single SRM SUM test failure on put: ERROR: 'NoneType' object has no attribute 'kill' exceptions.AttributeError
07/09/13	100	100	77.5	100	100	Problem with central loggers may have caused problems (& lack of logging info led to problems with diagnostics).
08/09/13	100	100	94.8	100	100	Problem with central loggers may have caused problems (& lack of logging info led to problems with diagnostics).
09/09/13	100	100	100	100	100
10/09/13	100	100	100	100	100
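
As an illustration of how the daily figures might be summarised, the sketch below takes the Atlas column from the table above and computes a simple unweighted mean and the worst day. The report itself does not publish a weekly average, so the averaging method and the variable names are assumptions made for this example only.

```python
from statistics import mean

# Atlas daily availability (%) copied from the table above.
atlas_daily = {
    "04/09/13": 96.2,
    "05/09/13": 99.4,
    "06/09/13": 99.3,
    "07/09/13": 77.5,
    "08/09/13": 94.8,
    "09/09/13": 100.0,
    "10/09/13": 100.0,
}

# Assumption: an unweighted mean over the seven days is a reasonable summary.
weekly_mean = round(mean(atlas_daily.values()), 1)

# The day with the lowest availability, for follow-up against the comments.
worst_day = min(atlas_daily, key=atlas_daily.get)

print(weekly_mean, worst_day, atlas_daily[worst_day])
```

The worst day found this way is 07/09/13, which matches the Saturday Castor and logging problems described in the review of issues.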
