Tier1 Operations Report 2015-04-08


RAL Tier1 Operations Report for 8th April 2015

Review of Issues during the week 2nd April to 8th April 2015.
  • On Wednesday 1st April there was a problem with the Argus service. The service was restarted before any tickets were raised.
  • On Thursday 2nd April a hypervisor locked up, which made a production Squid server and a test FTS machine unavailable for some hours.
Resolved Disk Server Issues
  • None
Current operational status and issues
  • We are running with a single router connecting the Tier1 network to the site network, rather than a resilient pair.
Ongoing Disk Server Issues
  • None
Notable Changes made this last week.
  • gfal2 and davix rpms are in the process of being updated across the worker nodes.
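For illustration only, a minimal sketch of how the installed gfal2 and davix rpm versions could be checked on a worker node during such a rollout. The package names come from the item above; the target version numbers and the script itself are assumptions made for this sketch, not the actual update mechanism or versions used at RAL.

 #!/usr/bin/env python
 """Sketch: report installed gfal2/davix rpm versions on a worker node.
 The target versions below are hypothetical placeholders."""
 import subprocess
 
 TARGETS = {"gfal2": "2.7.8", "davix": "0.4.0"}  # hypothetical target versions
 
 def installed_version(pkg):
     """Return the installed version of an rpm, or None if it is not installed."""
     try:
         out = subprocess.check_output(["rpm", "-q", "--qf", "%{VERSION}", pkg])
         return out.decode().strip()
     except subprocess.CalledProcessError:
         return None
 
 for pkg, target in sorted(TARGETS.items()):
     ver = installed_version(pkg)
     if ver is None:
         print("%s: not installed" % pkg)
     elif ver == target:
         print("%s: %s (at target version)" % (pkg, ver))
     else:
         print("%s: %s (target %s)" % (pkg, ver, target))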
Declared in the GOC DB
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
All Castor (all SRM endpoints) | SCHEDULED | OUTAGE | 08/04/2015 10:00 | 08/04/2015 14:00 | 4 hours | Upgrade of Castor storage to version 2.1.14-15
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • Move some non-Tier1 services off our network so that the router problems can be investigated more easily.

Listing by category:

  • Databases:
    • Switch LFC/3D to new Database Infrastructure.
    • Update to Oracle 11.2.0.4
  • Castor:
    • Update SRMs to new version (includes updating to SL6).
    • Fix discrepancies found in some of the Castor database tables and columns. (The issue has no operational impact.)
  • Networking:
    • Resolve problems with the primary Tier1 router.
    • Enable the RIP protocol for updating routing tables on the Tier1 routers (requires a patch to the router software).
    • Increase bandwidth of the link from the Tier1 into the RAL internal site network to 40Gbit.
    • Make routing changes to allow the removal of the UKLight Router.
  • Fabric:
    • Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC)
Entries in GOC DB starting since the last report.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
All Castor | SCHEDULED | OUTAGE | 08/04/2015 10:00 | 08/04/2015 14:00 | 4 hours | Upgrade of Castor storage to version 2.1.14-15
Entire site | SCHEDULED | WARNING | 01/04/2015 07:45 | 01/04/2015 11:00 | 3 hours and 15 minutes | Warning on site for network test/reconfiguration and load test of UPS/generator.


Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
112866 | Green | Less Urgent | In Progress | 2015-04-02 | 2015-04-07 | CMS | Many jobs are failed/aborted at T1_UK_RAL
112819 | Green | Less Urgent | In Progress | 2015-04-02 | 2015-04-07 | SNO+ | ArcSync hanging
112721 | Green | Less Urgent | Waiting for Reply | 2015-03-28 | 2015-04-06 | Atlas | RAL-LCG2: SOURCE Failed to get source file size
112713 | Green | Urgent | In Progress | 2015-03-27 | 2015-03-31 | CMS | Please clean up unmerged area - RAL
111699 | Yellow | Less Urgent | In Progress | 2015-02-10 | 2015-03-23 | Atlas | gLExec hammercloud jobs keep failing at RAL-LCG2 & RALPP
109694 | Red | Urgent | In Progress | 2014-11-03 | 2015-03-31 | SNO+ | gfal-copy failing for files at RAL
108944 | Red | Less Urgent | In Progress | 2014-10-01 | 2015-03-30 | CMS | AAA access test failing at T1_UK_RAL
Availability Report

Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud

Availability figures are percentages.

Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment
01/04/15 | 100 | 98 | 97 | 43 | 100 | 99 | 98 | All VOs had failed tests due to the Argus problem.
02/04/15 | 100 | 100 | 100 | 100 | 100 | 98 | n/a |
03/04/15 | 100 | 100 | 100 | 100 | 100 | 98 | 98 |
04/04/15 | 100 | 100 | 100 | 100 | 100 | 100 | n/a |
05/04/15 | 100 | 100 | 84 | 16 | 100 | 100 | 100 | gSOAP errors on the CMS SRM SUM test between approximately 12:00 and 17:00.
06/04/15 | 100 | 100 | 100 | 100 | 100 | 100 | n/a |
07/04/15 | 100 | 100 | 100 | 100 | 100 | 100 | 94 |