RAL Tier1 Operations Report for 17th July 2013
Review of Issues during the week 10th to 17th July 2013.
- The CVMFS problems, notably affecting CMS, continued while we verified that CVMFS client version 2.1.12 works correctly. It does, and this version has now been rolled out across the batch farm.
- The Atlas Castor 2.1.13-9 upgrade overran significantly (by 4 hours) last Wednesday (10th July). The problems were in updating the configurations and OS of the head nodes and disk servers; the upgrade itself completed successfully.
- The problem reported last week of failing connections to the batch server has continued. It started at the same time as the batch server was updated; that update was rolled back last Thursday (11th) but the problem remains.
- On Wednesday late afternoon monitoring showed unusual activity (or lack of it) on the Castor GEN instance, which was put into a 'warning' state in the GOC DB overnight. No problems were subsequently identified.
Resolved Disk Server Issues
Current operational status and issues
- The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
- The problem of LHCb jobs failing due to long job set-up times is still under investigation. The recent update of the CVMFS clients to v2.1.12 is promising.
- The testing of FTS3 is continuing and the service is being put on a more 'production' footing. (This runs in parallel with our existing FTS2 service).
- We are participating in xrootd federated access tests for Atlas.
- Testing is ongoing with the proposed new batch system (ARC-CEs, Condor, SL6). Atlas and CMS are running work through this; ALICE, LHCb & H1 are being brought on board with the testing.
Ongoing Disk Server Issues
- On Thursday evening, 11th July, GDSS664 (AtlasDataDisk, D1T0) failed. There have been significant problems rebuilding the RAID array containing the data, and at one point Atlas were warned that we might have data loss. However, the server was brought back up on Tuesday (16th) and, following checksumming of a sample of files to validate the data, the server is being drained ahead of further investigations.
Notable Changes made this last week
- Castor Atlas instance (stager) was upgraded to version 2.1.13-9 last Wednesday (10th).
- CVMFS client version 2.1.12 has been rolled out to most of the batch farm.
- Software updates applied to the batch server the week before were rolled back on Thursday 11th July.
- The two ARC-CEs were added to the GOC DB a week ago and were set to 'monitored' this Monday (15th).
Declared in the GOC DB
- Tuesday 23rd July: Upgrade of CMS and LHCb Castor instances to version 2.1.13-9.
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- Two reboots of site firewall between 07:45 and 08:45: Tuesday 23rd July.
- Update the remaining Castor stagers on the following dates: CMS & LHCb: Tuesday 23rd July; GEN Tuesday 30th July.
- Wednesday 24th July: Transition of the Thames Valley Network to Janet 6.
- Re-establishing the paired (2*10Gbit) link to the UKLight router.
Listing by category:
- Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- Upgrade to version 2.1.13 (ongoing)
- Networking:
- Single link to UKLight Router to be restored as paired (2*10Gbit) link.
- Update core Tier1 network and change connection to site and OPN including:
- Install new Routing layer for Tier1
- Change the way the Tier1 connects to the RAL network.
- These changes will lead to the removal of the UKLight Router.
- Grid Services:
- Testing of alternative batch systems (SLURM, Condor) along with ARC-CEs and SL6 Worker Nodes.
- Fabric:
- One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required.
- Infrastructure:
- A 2-day maintenance is being planned for October or November to cover the following items. It is expected to require around a half-day outage of power to the UPS room, with Castor & batch down for the remaining 1.5 days as equipment is switched off in rotation for the tests.
- Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
- Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
- Electrical safety check. This will require significant (most likely 2 days) downtime during which time the above infrastructure issues will also be addressed.
Entries in GOC DB starting between 10th and 17th July 2013.
There were three unscheduled entries in the GOC DB. One was an unscheduled OUTAGE, when the Atlas Castor upgrade overran. The other two were unscheduled WARNINGs: one for the batch system (as a change made earlier was reverted), the other for the Castor 'GEN' instance, which was experiencing problems.
| Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
| All CEs: lcgce01, lcgce02, lcgce04, lcgce10, lcgce11, lcgce12 | UNSCHEDULED | WARNING | 11/07/2013 09:30 | 11/07/2013 10:35 | 1 hour and 5 minutes | Batch service At Risk during work on batch server. |
| Castor GEN instance: srm-alice, srm-biomed, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k | UNSCHEDULED | WARNING | 10/07/2013 18:00 | 11/07/2013 09:25 | 15 hours and 25 minutes | Some problems seen with the Castor GEN instance which are not fully understood. Instance working but put in Warning overnight. |
| srm-atlas | UNSCHEDULED | OUTAGE | 10/07/2013 14:00 | 10/07/2013 18:00 | 4 hours | Extending outage of the Atlas Castor instance as the upgrade overran. |
| srm-atlas | SCHEDULED | OUTAGE | 10/07/2013 09:00 | 10/07/2013 14:00 | 5 hours | Upgrade of Atlas Castor Stager to version 2.1.13-9. |
Open GGUS Tickets (Snapshot at time of meeting)
| GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
| 95820 | Green | Less Urgent | In Progress | 2013-07-17 | 2013-07-17 | CMS | Many errors with file access at RAL today, maybe related to high load (~5000 jobs running) on the file server. |
| 95757 | Green | Less Urgent | In Progress | 2013-07-15 | 2013-07-17 | CMS | Jobs are failing at a particular node. |
| 95671 | Yellow | Less Urgent | In Progress | 2013-07-11 | 2013-07-17 | LHCb | Many jobs are failing at T1_UK_RAL, related to the availability of a CMSSW release. |
| 95435 | Red | Urgent | In Progress | 2013-07-04 | 2013-07-04 | LHCb | CVMFS problem at RAL-LCG2 |
| 91658 | Red | Less Urgent | In Progress | 2013-02-20 | 2013-07-16 | | LFC webdav support |
| 86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-06-17 | | correlated packet-loss on perfsonar host |
Availability Report

| Day | OPS | Alice | Atlas | CMS | LHCb | Comment |
| 10/07/13 | 100 | 97.1 | 67.6 | 100 | 100 | Atlas: Castor upgrade; ALICE: CE test failures (CEs could not contact batch server). |
| 11/07/13 | 100 | 97.5 | 100 | 100 | 96.2 | CE test failures (CEs could not contact batch server). |
| 12/07/13 | 100 | 100 | 100 | 100 | 96.9 | CE test failures (CEs could not contact batch server). |
| 13/07/13 | 100 | 100 | 98.7 | 100 | 100 | SRM test failures (Castor). |
| 14/07/13 | 100 | 100 | 90.4 | 100 | 100 | SRM test failures (Castor). |
| 15/07/13 | 100 | 100 | 94.8 | 91.9 | 100 | SRM test failures (Castor). |
| 16/07/13 | 100 | 96.9 | 100 | 95.9 | 100 | ALICE: CE test failure; CMS: SRM test failures (Castor). |