RAL Tier1 Operations Report for 9th October 2013

Review of Issues during the week 2nd to 9th October 2013.

At the time of the meeting last week (2nd Oct) we had a problem with a network switch. This caused a loss of access to some older batch nodes and the disk servers in AtlasHotDisk. The problem was resolved after about 90 minutes.
There was a network break of about 45 minutes overnight Wed/Thu 2/3 October. This broke our connectivity from RAL to the rest of the world. Staff attended on site to fix it. Following this there were some issues with one of the site routers, which was restarted at 10am that day.
During the upgrade of the Torque/Maui farm to SL6 we suffered a significant loss of availability for Atlas. During this time Atlas were successfully running jobs through the ARC CEs and the Condor batch farm. However, Atlas do not have critical tests on the ARC CEs, so these CEs were not included in their availability calculations.

Resolved Disk Server Issues

GDSS673 (CMSDisk - D1T0) failed on Saturday (28th Sep) - possibly due to a disk failing during a RAID verify. The system was then returned to service on Monday 30th Sep. However, it failed again later that evening. A further two disks failed while it was still rebuilding the RAID array. The system was returned to service on Thursday 3rd Oct. Four CMS data files were declared lost following this incident. These were discovered while performing a routine checksum validation before returning the machine to production. Investigations suggest all four files were in-flight when the machine went down.
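The report does not say which tool or checksum algorithm was used for the validation. As a minimal sketch only, assuming Adler-32 checksums and a hypothetical catalogue that maps file paths to their expected values, a check of this kind could look like:

```python
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Return the Adler-32 checksum of a file as an 8-digit hex string."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # Feed the running value back in so large files are read in pieces.
            value = zlib.adler32(chunk, value)
    return format(value & 0xFFFFFFFF, "08x")


def find_bad_files(expected):
    """Return paths that are unreadable or whose checksum does not match.

    `expected` is a hypothetical dict of path -> expected checksum (hex);
    it stands in for whatever catalogue the real procedure consults.
    """
    bad = []
    for path, want in expected.items():
        try:
            got = adler32_of_file(path)
        except OSError:
            bad.append(path)  # missing or unreadable file
            continue
        if got != want.lower():
            bad.append(path)  # content differs from the catalogue
    return bad
```

Files that were still being written when a server went down would typically show up here either as missing or with a checksum that differs from the catalogue entry.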

Current operational status and issues

The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
The FTS3 testing has continued very actively with Atlas. Problems with FTS3 are being uncovered during these tests. Patches are being regularly applied to FTS3 to deal with issues.
We are participating in xrootd federated access tests for Atlas. The server has now been successfully configured to work as an xroot redirector, whereas before it could only serve as a proxy.
The Condor batch farm has been marked as in production. This contains around 50% of the total batch capacity. All its WNs are running SL6. The remaining nodes are in the Torque/Maui farm and its WNs have been upgraded to SL6 as well. We plan to keep this configuration (both farms running SL6 WNs, each with around 50% of the total capacity) until early November.

Ongoing Disk Server Issues

None

Notable Changes made this last week.

FTS3 was upgraded to version 3.1.22-1.el6 during the afternoon of Wed 2nd October, and then again to 3.1.26-1 on Tuesday 8th Oct.
On Thursday 3rd Oct all SL5 nodes in the Torque/Maui farm were stopped and the farm re-started with worker nodes running SL6. Two batches of SL6 WNs were put into the farm that day and another batch the next day. The two final batches of WNs were re-installed with SL6 at the start of this week and added back into the Torque/Maui farm. Some configuration issues with these last two batches of WNs have since been discovered and are being investigated.
On Monday morning 7th October a set of fans in the UPS was replaced. During this time there was a marginal additional risk to our services as the UPS was bypassed and would not have been available had there been a general power cut.
Tuesday 8th October: Update to Janet6 infrastructure for the primary OPN link to CERN. This was transparent as we switched to the backup link while the work was carried out.

Declared in the GOC DB

None

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced.

Re-establishing the paired (2*10Gbit) link to the UKLight router.
Interruption to some services over Tuesday/Wednesday 5/6 November during work on the UPS and safety testing of its circuits.
Alastair reported verbally at the meeting: We plan to start testing CVMFS 2.1.15 now that the SL6 migration has been completed successfully. We are not aware of any specific VO concerns (e.g. GGUS tickets) with the current release and therefore will be testing gradually. We will keep the VOs informed; please let us know of any problems. If things go well we should be doing large scale testing the week after next (week beginning 21st October).

Listing by category:

Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
Castor:
- None
Networking:
- Single link to UKLight Router to be restored as paired (2*10Gbit) link.
- Update core Tier1 network and change connection to site and OPN including:
  - Install new Routing layer for Tier1
  - Change the way the Tier1 connects to the RAL network.
  - These changes will lead to the removal of the UKLight Router.
Fabric:
- One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required - initially to update the disk array's firmware.
Infrastructure:
- A 2-day maintenance on the UPS, along with the safety testing of associated electrical circuits, is being planned for the 5th/6th November (TBC). The impact of this on our services is still being worked out. During this work the following issues will be addressed:
  - Intervention required on the "Essential Power Board".
  - Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
  - Electrical safety check. This will take place over a couple of days during which time individual UPS circuits will need to be powered down.

Entries in GOC DB starting between the 2nd and 9th October 2013.

There was one unscheduled entry in the GOC DB for this period. This is the Warning for the Janet 6 upgrade of the primary OPN link to CERN, which we advertised late.

Service	Scheduled?	Outage/At Risk	Start	End	Duration	Reason
Whole Site	UNSCHEDULED	WARNING	08/10/2013 10:30	08/10/2013 11:30	1 hour	Primary OPN link to CERN being migrated to new Janet 6 infrastructure.
lcgce01, lcgce02, lcgce04, lcgce10, lcgce11	SCHEDULED	OUTAGE	02/10/2013 11:15	03/10/2013 14:00	1 day, 2 hours and 45 minutes	Upgrading WNs to SL6. Will drain queues out ahead of WN re-installs with the new OS.
Whole Site	SCHEDULED	WARNING	02/10/2013 10:00	02/10/2013 11:30	1 hour and 30 minutes	At Risk during test of generator backup to the main UPS.
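The Duration column can be cross-checked from the Start and End columns. The following is a small sketch, assuming the timestamps are in day/month/year order as they appear above; the function name and the short labels are illustrative and not part of the GOC DB.

```python
from datetime import datetime

GOCDB_FORMAT = "%d/%m/%Y %H:%M"  # day/month/year, as written in the table


def downtime_duration(start, end):
    """Return the length of a downtime as a datetime.timedelta."""
    begin = datetime.strptime(start, GOCDB_FORMAT)
    finish = datetime.strptime(end, GOCDB_FORMAT)
    return finish - begin


# Start and End values copied from the three entries in the table.
entries = [
    ("Whole Site (Janet 6 migration)", "08/10/2013 10:30", "08/10/2013 11:30"),
    ("Torque/Maui CEs (SL6 upgrade)", "02/10/2013 11:15", "03/10/2013 14:00"),
    ("Whole Site (generator test)", "02/10/2013 10:00", "02/10/2013 11:30"),
]

for label, start, end in entries:
    print(label, downtime_duration(start, end))
```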

Open GGUS Tickets (Snapshot at time of meeting)


GGUS ID	Level	Urgency	State	Creation	Last Update	VO	Subject
97868	Green	Less Urgent	In Progress	2013-10-08	2013-10-08	T2K	CVMFS for t2k.org
97759	Yellow	Urgent	On Hold	2013-10-04	2013-10-04	OPS	SHA-2 test failing on lcgce01
97516	Red	Urgent	In Progress	2013-09-23	2013-09-30	T2K	[SE][StatusOfPutRequest][SRM_REQUEST_INPROGRESS] errors.
97479	Red	Very Urgent	On Hold	2013-09-20	2013-09-30	Atlas	RAL-LCG2, high job failure rate
97385	Red	Less Urgent	On Hold	2013-09-17	2013-09-26	HyperK	CVMFS for hyperk.org
97025	Red	Less Urgent	On Hold	2013-09-03	2013-09-12		Myproxy server certificate does not contain hostname
91658	Red	Less Urgent	On Hold	2013-02-20	2013-09-03		LFC webdav support
86152	Red	Less Urgent	On Hold	2012-09-17	2013-06-17		correlated packet-loss on perfsonar host

Availability Report


Day	OPS	Alice	Atlas	CMS	LHCb	Comment
02/10/13	100	100	42.7	95.9	100	Atlas: Drain of Torque/Maui farm left Atlas without working CEs in profile; CMS: Single SRM test failure on GET
03/10/13	98.4	100	25.6	85.3	95.8	Atlas: As for 02/10; Others: Site Network Break
04/10/13	100	100	100	100	100
05/10/13	100	100	100	100	100
06/10/13	100	100	100	100	100
07/10/13	100	100	100	100	100
08/10/13	100	100	100	100	100
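For a rough feel of the week as a whole, the daily figures can be averaged per VO. This is only a sketch: it takes a plain unweighted mean of the seven daily values above, which is an assumption made for illustration and not necessarily how the official availability figures are aggregated.

```python
from statistics import mean

# Daily availability (%) per VO, copied from the table (02/10/13 to 08/10/13).
daily = {
    "OPS":   [100.0, 98.4, 100.0, 100.0, 100.0, 100.0, 100.0],
    "Alice": [100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0],
    "Atlas": [42.7, 25.6, 100.0, 100.0, 100.0, 100.0, 100.0],
    "CMS":   [95.9, 85.3, 100.0, 100.0, 100.0, 100.0, 100.0],
    "LHCb":  [100.0, 95.8, 100.0, 100.0, 100.0, 100.0, 100.0],
}

# Unweighted mean over the seven days, rounded to one decimal place.
weekly_mean = {vo: round(mean(values), 1) for vo, values in daily.items()}

for vo, value in weekly_mean.items():
    print(vo, value)
```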
