RAL Tier1 Operations Report for 20th March 2013

Review of Issues during the week 13th to 20th March 2013.

On Friday afternoon (15th Mar) at approximately 15:40 there was a problem on our Tier1 network that lasted about 6 minutes. This caused a spike in FTS transfer failures as well as some SUM test failures. In general services continued OK, but staff checked round for possible problems.
On Tuesday (19th Mar) at around midday there was a problem on the site network that lasted around 15 minutes. Tier1 services carried on running although there were some test failures at this time.

Resolved Disk Server Issues

GDSS519 (GenTape D0T1) was put into a draining mode following discovery of a single corrupt file last Wed morning (13th Mar). The server was checked out, confirming only one file was bad and revealing a faulty disk drive that was replaced. The server was returned to production later that day.

Current operational status and issues

The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
There have been intermittent problems over the past few weeks with the start rate for batch jobs. A script has been introduced to regularly check for this and take action to minimise its effects.
The problem of LHCb and Atlas jobs failing due to long job set-up times remains. (The change to run jobs re-niced has not resolved the problem.) Investigations continue.
The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
We are participating in xrootd federated access tests for Atlas.
Test batch queue with five SL6/EMI-2 worker nodes and own CE in place.
The change to the certificate used by the MyProxy server announced for Monday 18th Mar had to be backed out. An alternative solution to the MyProxy certificate problem reported in GGUS#92266 is being worked on.

Ongoing Disk Server Issues

Notable Changes made this last week

An analysis of data rates shows that the intervention on Tuesday 12th Mar (replacing the core C300 switch and modifying the link to the UKLight router) has resolved the problem of asymmetric data rates in/out of the Tier1.
The APEL publisher was upgraded from UMD-1 to UMD-2 last Thursday (14th Mar).
The Castor client software has been upgraded to version 2.1.13 on all worker nodes.
Updating disk controller firmware on Clustervision '11 batch of disk servers ongoing.

Declared in the GOC DB

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced.

A program of updating the disk controller firmware in the 2011 Clustervision batch of disk servers is ongoing.

Listing by category:

Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
Castor:
- Upgrade to version 2.1.13
Networking:
- Single link to UKLight Router to be restored as paired (2*10Gbit) link.
- Update core Tier1 network and change connection to site and OPN including:
  - Install new Routing layer for Tier1
  - Change the way the Tier1 connects to the RAL network.
  - These changes will lead to the removal of the UKLight Router.
- Addition of caching DNSs into the Tier1 network.
Grid Services:
- Upgrade of other EMI-1 components (APEL, UI) under investigation.
Infrastructure:
- Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
- Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
- Electrical safety check (will require some downtime).

Entries in GOC DB starting between 13th and 20th March 2013.

There were no unscheduled entries in the GOC DB for the last week.

Service	Scheduled?	Outage/At Risk	Start	End	Duration	Reason
lcgrbp01.gridpp.rl.ac.uk	SCHEDULED	WARNING	18/03/2013 10:00	18/03/2013 11:00	1 hour	Warning for hour following replacement of a certificate on the MyProxy server. (Ref GGUS ticket 92266)

Open GGUS Tickets (Snapshot at time of meeting)


GGUS ID	Level	Urgency	State	Creation	Last Update	VO	Subject
92266	Amber	Less Urgent	In Progress	2013-03-06	2013-03-19		Certificate for RAL myproxy server
91974	Red	Urgent	In Progress	2013-03-04	2013-03-13		NAGIOS eu.egi.sec.EMI-1 failed on lcgwms01.gridpp.rl.ac.uk@RAL-LCG2
91658	Red	Less Urgent	In Progress	2013-02-20	2013-03-13		LFC webdav support
91146	Red	Urgent	In Progress	2013-02-04	2013-03-14	Atlas	RAL input bandwith issues
91029	Red	Very Urgent	On Hold	2013-01-30	2013-02-27	Atlas	FTS problem in queryin jobs
86152	Red	Less Urgent	On Hold	2012-09-17	2013-03-19		correlated packet-loss on perfsonar host

Availability Report


Day	OPS	Alice	Atlas	CMS	LHCb	Comment
13/03/13	100	100	100	100	100
14/03/13	100	100	100	100	100
15/03/13	100	100	99.2	91.8	95.8	Problem with Tier1 Network (packet storm) around 15:40 caused some failures. Also CMS had a single "SRM timeout" earlier in the day.
16/03/13	100	100	100	100	100
17/03/13	100	100	99.1	100	100	Single SRM Test failure "zero number of replicas"
18/03/13	100	100	100	100	100
19/03/13	100	100	96.2	95.9	100	Test failures (SRM & CE) around midday owing to Site Network problem. In addition Atlas suffered a few other (mainly) SRM test failures earlier in the day.
