
RAL Tier1 Operations Report for 24th July 2013

Review of Issues during the week 17th to 24th July 2013.
  • On Thursday (18th) the main RAL link to Janet was failed over to the alternative route (via London) when one of the multiple connections to Reading failed. This was transparent to us.
  • On Thursday (18th) a configuration error affecting all the CEs caused batch problems for a couple of hours until noticed and corrected.
  • Yesterday (Tuesday 23rd) the primary OPN link to CERN failed and we switched over to the backup route. We ran on the backup link for around 7 hours until the problem (JANET ticket reports a broken fibre) was fixed.

A post mortem report of the Atlas Castor outage on 28-30 June has been prepared and can be seen at:

 https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20130628_Atlas_Castor_Outage
Resolved Disk Server Issues
  • GDSS664 (AtlasDataDisk, D1T0) failed on 11th July. The server was down until Tuesday 16th July and has since been completely drained; all files that were on it are available to Atlas. The server itself is undergoing hardware checks before being returned to service.
Current operational status and issues
  • A problem with the batch server has persisted for a few weeks and investigations are continuing. The problem started at the same time as a batch server update, and it has remained even though that update was rolled back.
  • The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
  • The problem of LHCb jobs failing due to long job set-up times is still under investigation. The recent update of the CVMFS clients to v2.1.12 looks promising; a quick client health check is sketched after this list.
  • The testing of FTS3 is continuing and the service is being put on a more 'production' footing. (This runs in parallel with our existing FTS2 service.) A sketch of a test transfer submission is also given after this list.
  • We are participating in xrootd federated access tests for Atlas.
  • Testing is ongoing with the proposed new batch system (ARC-CEs, Condor, SL6). Atlas and CMS are running work through this; ALICE, LHCb & H1 are being brought on board with the testing.
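
The following is a minimal sketch of how the updated CVMFS client mentioned above can be checked on a worker node. It uses the stock cvmfs_config tools and the standard LHCb repository name rather than any site-specific procedure:

  # confirm the LHCb repository mounts and responds
  cvmfs_config probe lhcb.cern.ch

  # report client version, cache usage and other statistics for the repository
  cvmfs_config stat -v lhcb.cern.ch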
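
As an illustration of the FTS3 testing noted above, a transfer can be submitted and tracked with the standard FTS3 command-line client; the endpoint and SURLs below are placeholders rather than the actual test configuration:

  # submit a single-file transfer to a (hypothetical) FTS3 endpoint; a job ID is returned
  fts-transfer-submit -s https://fts3-test.example.ac.uk:8446 \
      srm://source-se.example.ac.uk/dpm/example/file1 \
      srm://dest-se.example.ac.uk/castor/example/file1

  # poll the state of that job (ACTIVE, FINISHED, FAILED, ...)
  fts-transfer-status -s https://fts3-test.example.ac.uk:8446 <job-id>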
Ongoing Disk Server Issues
  • None
Notable Changes made this last week
  • The CMS & LHCb Castor instances (stagers) were upgraded to version 2.1.13-9 yesterday (Tuesday 23rd).
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • Update the remaining Castor stager (GEN) on Tuesday 30th July.
  • The SL6 and "Whole Node" queues on the production batch service will be terminated. Multi-core jobs and those requiring SL6 can be run on the test Condor batch system (an illustrative submit description follows this list).
  • Wednesday 24th July: Transition of Thames Valley Network to Janet 6.
  • Re-establishing the paired (2*10Gbit) link to the UKLight router.
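
As a sketch of the kind of multi-core, SL6-only job the test Condor batch system is intended to take (the file name, resource request and requirements expression are illustrative assumptions, not the agreed VO configuration):

  # multicore-test.sub -- illustrative HTCondor submit description
  universe      = vanilla
  executable    = /bin/sleep
  arguments     = 300
  request_cpus  = 8
  # restrict the job to SL6 worker nodes via the machine ClassAd
  requirements  = (OpSysAndVer == "SL6")
  output        = test.out
  error         = test.err
  log           = test.log
  queue

  # submit and monitor
  condor_submit multicore-test.sub
  condor_q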

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.13 (ongoing)
  • Networking:
    • Single link to UKLight Router to be restored as paired (2*10Gbit) link.
    • Update the core Tier1 network and change the connection to the site and the OPN, including:
      • Install a new routing layer for the Tier1.
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
  • Grid Services
    • Testing of alternative batch systems (SLURM, Condor) along with ARC-CEs and SL6 Worker Nodes (an illustrative ARC submission is sketched after this list).
  • Fabric
    • One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required.
  • Infrastructure:
    • A 2-day maintenance is being planned for sometime in October or November to cover the following. It is expected to require around a half-day outage of power to the UPS room, with Castor & batch down for the remaining 1.5 days as equipment is switched off in rotation for the tests.
      • Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
      • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
      • Electrical safety check. This will require significant downtime (most likely 2 days), during which the above infrastructure issues will also be addressed.
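
As an illustration of the ARC-CE testing listed under Grid Services, a minimal job can be described in xRSL and submitted with the standard ARC client tools; the CE hostname below is a placeholder and a valid grid proxy is assumed:

  # test.xrsl -- minimal illustrative job description
  &(executable="/bin/hostname")
   (jobname="arc-condor-test")
   (stdout="stdout.txt")
   (stderr="stderr.txt")

  # submit to a (hypothetical) ARC CE and check the status of all known jobs
  arcsub -c arc-ce-test.example.ac.uk test.xrsl
  arcstat -a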
Entries in GOC DB starting between the 17th and 24th July 2013.

There were no unscheduled entries in the GOC DB during this last week.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-cms, srm-lhcb | SCHEDULED | OUTAGE | 23/07/2013 10:00 | 23/07/2013 13:54 | 3 hours and 54 minutes | Upgrade of CMS and LHCb Castor instances to version 2.1.13-9
Whole site | SCHEDULED | WARNING | 23/07/2013 07:45 | 23/07/2013 08:45 | 1 hour | Site warning for one hour around two reboots of the site firewall which will take place within this time window.
lcgwms06 | SCHEDULED | OUTAGE | 19/07/2013 10:00 | 25/07/2013 12:00 | 6 days, 2 hours | Upgrade to EMI-3
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
96102 | Green | Less Urgent | In Progress | 2013-07-24 | 2013-07-24 | CMS | File Read Error: T1_UK_RAL
96079 | Green | Urgent | In Progress | 2013-07-23 | 2013-07-23 | Atlas | Slow deletion rate at RAL
95996 | Green | Urgent | In Progress | 2013-07-22 | 2013-07-22 | OPS | SHA-2 test failing on lcgce01
95904 | Yellow | Very Urgent | In Progress | 2013-07-20 | 2013-07-22 | LHCb | Pilots aborted at RAL-LCG2
95435 | Red | Urgent | In Progress | 2013-07-04 | 2013-07-19 | LHCb | CVMFS problem at RAL-LCG2
91658 | Red | Less Urgent | Waiting Reply | 2013-02-20 | 2013-07-16 | | LFC webdav support
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-06-17 | | correlated packet-loss on perfsonar host
Availability Report
Day | OPS | ALICE | Atlas | CMS | LHCb | Comment
17/07/13 | 100 | 96.9 | 100 | 100 | 96.9 | Batch server problems - CEs then unable to contact it.
18/07/13 | 96.9 | 91.6 | 92.3 | 95.4 | 88.9 | Configuration error on CEs.
19/07/13 | 100 | 96.4 | 100 | 97.5 | 100 | ALICE: Batch server problems - CEs then unable to contact it; CMS: Single failure of SRM Put Test.
20/07/13 | 98.3 | 84.9 | 100 | 100 | 85.7 | Batch server problems - CEs then unable to contact it.
21/07/13 | 100 | 100 | 100 | 100 | 96.9 | Batch server problems - CEs then unable to contact it.
22/07/13 | 100 | 100 | 100 | 100 | 100 |
23/07/13 | 100 | 100 | 100 | 83.8 | 83.8 | Castor 2.1.13 upgrade for CMS & LHCb.