Tier1 Operations Report 2013-04-17

From GridPP Wiki

RAL Tier1 Operations Report for 17th April 2013

Review of Issues during the week 10th to 17th April 2013.
  • On Thursday (11th) two interventions were made on the Tier1 network to fix links with high error rates. One was traced to a badly seated cable; the other appears to be a faulty fibre cable. This latter cable provides the link to some of the newly installed equipment (including recently deployed Atlas & CMS disk servers). There was a short (10 minute) break in connectivity to these servers while the faulty cable was bypassed.
  • On Thursday (11th) there was a planned intervention on the database behind the FTS & LFC services. These services were stopped for around an hour.
  • On Friday (12th) a configuration error caused an update to the sudoers file that was incompatible with the CEs. For around an hour, until the problem was fixed, we were not able to start batch jobs.
  • On Friday (12th) afternoon there was a problem with the CRLs on a specific batch worker node.
  • On Tuesday (16th) there was a short (15-20 minute) stop of the Atlas Software server, during which we stopped Atlas batch jobs from starting. This node is a 'twin' of another node (one of the BDIIs) that had shown problems and required investigation.
Resolved Disk Server Issues
  • None
Current operational status and issues
  • The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
  • The problem of LHCb and Atlas jobs failing due to long job set-up times remains, and investigations continue.
  • The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
  • We are participating in xrootd federated access tests for Atlas.
  • A test batch queue, with five SL6/EMI-2 worker nodes and its own CE, is in place.
  • There is an outstanding problem (and GGUS ticket) affecting the certificate on the MyProxy server.
Ongoing Disk Server Issues
  • GDSS371 (AtlasTape - D0T1) failed yesterday evening (16th April). Six files awaiting migration to tape are unavailable. The server is expected back by the end of this afternoon.
Notable Changes made this last week
  • On Thursday (11th) the 'Somnus' (LFC & FTS) database was 'rebalanced'. During this, the data which had been spread across two disk arrays was consolidated onto one of them. The other array has been reporting some errors, and this change paves the way for an intervention on the faulty array.
  • Kernel/errata updates and removal of AFS software (as opposed to just disabling) completed across worker nodes.
  • The second of the two batches of new worker nodes (the one from Viglen) has been deployed into production. All of the 2012 CPU purchase is now in service and the batch farm currently has over 10,000 job slots.
  • A dozen batch farm nodes have been reserved for CMS disk/tape separation testing.
  • This morning (17th April) the batch server was upgraded to UMD-2. (Note that this does not alter the versions of torque/maui running on the server.)
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • A program of updating the disk controller firmware in the 2011 Clustervision batch of disk servers is ongoing. (Alice disk servers in this batch remain to be done).

Listing by category:

  • Databases:
    • Apply latest Oracle 'PSU' patches.
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.13
  • Networking:
    • Single link to UKLight Router to be restored as paired (2*10Gbit) link.
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services
    • Testing of alternative batch systems (e.g. SLURM).
    • Upgrade of other EMI-1 components (APEL, UI) under investigation.
  • Fabric
    • One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required.
  • Infrastructure:
    • Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
    • Electrical safety check. This will require significant (maybe 2 days) downtime.
Entries in GOC DB starting between 10th and 17th April 2013.

There were no unscheduled outages during the last week.

Service Scheduled? Outage/At Risk Start End Duration Reason
lcgfts.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk SCHEDULED OUTAGE 11/04/2013 10:00 11/04/2013 11:05 1 hour and 5 minutes Outage of LFC and FTS services. The Oracle database behind these services uses two disk arrays. One of the arrays is reporting errors and the database will be reconfigured (rebalanced) to move the data off the faulty array. FTS transfers will be drained before the outage.
Whole site SCHEDULED WARNING 10/04/2013 18:00 11/04/2013 00:00 6 hours Emergency maintenance was announced for both the main and backup OPN links between RAL and CERN.
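The Duration column in the entries above can be cross-checked directly from the Start and End timestamps. A minimal sketch (timestamp format assumed to match the GOC DB listing, DD/MM/YYYY HH:MM):

```python
from datetime import datetime, timedelta

# Format of the Start/End timestamps as they appear in the GOC DB entries above.
fmt = "%d/%m/%Y %H:%M"

def duration(start: str, end: str) -> timedelta:
    """Elapsed time between two GOC DB timestamps."""
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)

print(duration("11/04/2013 10:00", "11/04/2013 11:05"))  # 1:05:00 -> "1 hour and 5 minutes"
print(duration("10/04/2013 18:00", "11/04/2013 00:00"))  # 6:00:00 -> "6 hours"
```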
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
93315 Green Urgent Waiting Reply 2013-04-13 2013-04-15 Atlas "Checksum mismatch" error at site RAL-LCG2
93149 Red Less Urgent On Hold 2013-04-05 2013-04-08 Atlas RAL-LCG2: jobs failing with "cmtside command was timed out"
93136 Red Less Urgent In Progress 2013-04-05 2013-04-15 EPIC Problems downloading job output using RAL WMS (epic VO)
92266 Red Less Urgent In Progress 2013-03-06 2013-04-16 Certificate for RAL myproxy server
91658 Red Less Urgent On Hold 2013-02-20 2013-04-03 LFC webdav support
91029 Red Very Urgent On Hold 2013-01-30 2013-02-27 Atlas FTS problem in querying jobs
86152 Red Less Urgent On Hold 2012-09-17 2013-03-19 correlated packet-loss on perfsonar host
Availability Report
Day OPS Alice Atlas CMS LHCb Comment
10/04/13 100 100 100 100 100
11/04/13 100 100 100 95.9 100 Single SRM test failure "user timeout"
12/04/13 96.2 91.3 93.9 100 95.5 Failed CE tests after a configuration error led to a problem with sudo.
13/04/13 100 100 100 95.9 100 Single SRM test failure "user timeout"
14/04/13 100 100 100 95.9 100 Single SRM test failure "user timeout"
15/04/13 100 100 100 100 100
16/04/13 100 100 100 95.9 100 Single SRM test failure "user timeout"
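The repeated 95.9% CMS figures are consistent with losing roughly one hour of availability to a single failed SRM test. A hedged sketch, assuming one test per hour and a failed test costing a full hour (the real SAM availability calculation differs in detail):

```python
# Hedged sketch: relating one failed hourly test to the daily availability
# percentages in the table. Assumes 24 hourly tests per day and that each
# failure costs a full hour; the actual SAM algorithm is more involved.
def daily_availability(failed_hours: int, hours_in_day: int = 24) -> float:
    """Percentage of the day counted as available, rounded to one decimal."""
    return round(100.0 * (hours_in_day - failed_hours) / hours_in_day, 1)

print(daily_availability(0))  # 100.0 -- a clean day
print(daily_availability(1))  # 95.8 -- close to the 95.9 reported for a single SRM test failure
```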