Difference between revisions of "Tier1 Operations Report 2015-05-06"

Revision as of 10:09, 6 May 2015

Review of Issues during the three weeks 15th April to 6th May 2015.

On Wednesday 8th April a minor network change triggered a wider problem across the Tier1 network and caused significant disruption of services. The planned Castor upgrade for that day had to be abandoned. A post mortem is being generated for this incident.

Resolved Disk Server Issues

GDSS633 (AtlasTape - D-T1) failed on Sunday 3rd May. It was returned to service late evening the following day (Bank Holiday Monday 4th May).

Current operational status and issues

We are running with a single router connecting the Tier1 network to the site network, rather than a resilient pair.

The bypass link from the Tier1 to the UKLR has been restored to 2 x 10Gb/s but traffic going outbound is not being correctly balanced across the two links. This is still under investigation.

Since 27th March there has been sporadic low level packet loss on the OPN primary link to CERN. This is being investigated.

Ongoing Disk Server Issues

Notable Changes made this last week.

gfal2 and davix rpms are in the process of being updated across the worker nodes.
Yesterday (5th May) the two remaining CREAM CEs were put into draining mode as the next step in their decommissioning.

Declared in the GOC DB

Service	Scheduled?	Outage/At Risk	Start	End	Duration	Reason
lcgfts3.gridpp.rl.ac.uk	SCHEDULED	WARNING	07/05/2015 11:00	07/05/2015 12:00	1 hour	Update of Production FTS3 Server to version 3.2.33
cream-ce01.gridpp.rl.ac.uk, cream-ce02.gridpp.rl.ac.uk	SCHEDULED	OUTAGE	05/05/2015 12:00	02/06/2015 12:00	28 days	Decommissioning of CREAM CEs (cream-ce01.gridpp.rl.ac.uk, cream-ce02.gridpp.rl.ac.uk).

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced.

Turn off CREAM CEs (planned for 5th May).
Turn off ARC-CE05. This will leave four ARC CEs (as planned). The fifth CE was set up as a temporary workaround for a specific problem and is no longer required.

Listing by category:

Databases:
- Switch LFC/3D to new Database Infrastructure.
- Update to Oracle 11.2.0.4. This will affect all services that use Oracle databases (Castor, LFC and Atlas Frontier).
Castor:
- Update SRMs to new version (includes updating to SL6).
- Fix discrepancies found in some of the Castor database tables and columns. (The issue has no operational impact.)
Networking:
- Separate some non-Tier1 services off our network so that the router problems can be investigated more easily.
- Resolve problems with the primary Tier1 router.
- Enable the RIP protocol for updating routing tables on the Tier1 routers. (Install patch to router software.)
- Increase bandwidth of the link from the Tier1 into the RAL internal site network to 40Gb/s.
- Make routing changes to allow the removal of the UKLight Router.
Fabric:
- Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).

Entries in GOC DB starting since the last report.

Service	Scheduled?	Outage/At Risk	Start	End	Duration	Reason
cream-ce01.gridpp.rl.ac.uk, cream-ce02.gridpp.rl.ac.uk	SCHEDULED	OUTAGE	05/05/2015 12:00	02/06/2015 12:00	28 days	Decommissioning of CREAM CEs (cream-ce01.gridpp.rl.ac.uk, cream-ce02.gridpp.rl.ac.uk).
lcgft-atlas.gridpp.rl.ac.uk	SCHEDULED	WARNING	05/05/2015 09:00	05/05/2015 11:55	2 hours and 55 minutes	Regular patching of Oracle Database behind the Atlas Frontier service.
lfc.gridpp.rl.ac.uk	SCHEDULED	OUTAGE	28/04/2015 09:00	28/04/2015 14:03	5 hours and 3 minutes	Update of Oracle Database behind the LFC service. There will be an initial outage of up to 90 minutes. Following this the LFC will be available READ-ONLY for some (up to five) hours. At the end of the upgrade the LFC will again be unavailable for a period of up to 90 minutes.
All Castor (all SRM endpoints)	SCHEDULED	OUTAGE	15/04/2015 09:30	15/04/2015 11:57	2 hours and 27 minutes	Castor upgrade to 2.1.14-15 as this was postponed last week.
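
The Duration column in the table above can be cross-checked against the Start and End columns. The following is a minimal sketch, assuming the timestamps use the day/month/year layout shown in the table; the helper names `downtime_minutes` and `format_duration` are hypothetical and not part of any GOC DB tooling.

```python
from datetime import datetime

# Assumed timestamp layout, inferred from the table (e.g. "05/05/2015 09:00").
GOCDB_TIME_FORMAT = "%d/%m/%Y %H:%M"


def downtime_minutes(start: str, end: str) -> int:
    """Return the length of a downtime in whole minutes."""
    started = datetime.strptime(start, GOCDB_TIME_FORMAT)
    ended = datetime.strptime(end, GOCDB_TIME_FORMAT)
    return int((ended - started).total_seconds() // 60)


def format_duration(minutes: int) -> str:
    """Render a minute count as days, hours and minutes, skipping zero parts."""
    days, remainder = divmod(minutes, 24 * 60)
    hours, mins = divmod(remainder, 60)
    parts = []
    for value, unit in ((days, "day"), (hours, "hour"), (mins, "minute")):
        if value:
            parts.append(f"{value} {unit}{'' if value == 1 else 's'}")
    return " and ".join(parts) if parts else "0 minutes"


if __name__ == "__main__":
    # Start/End pairs copied from the table rows above.
    rows = [
        ("05/05/2015 12:00", "02/06/2015 12:00"),
        ("05/05/2015 09:00", "05/05/2015 11:55"),
        ("28/04/2015 09:00", "28/04/2015 14:03"),
        ("15/04/2015 09:30", "15/04/2015 11:57"),
    ]
    for start, end in rows:
        print(start, "->", end, ":", format_duration(downtime_minutes(start, end)))
```

Run against the four rows, this reproduces the durations listed in the table, which suggests the Duration column was derived directly from the declared start and end times.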

Open GGUS Tickets (Snapshot during morning of meeting)


GGUS ID	Level	Urgency	State	Creation	Last Update	VO	Subject
113010	Green	Very Urgent	Waiting Reply	2015-04-13	2015-04-14	SNO+	Job statuses reported by WMS not updating
112977	Green	Very Urgent	In Progress	2015-04-10	2015-04-13	CMS	High failure rate for RAL!
112896	Green	Urgent	In Progress	2015-04-09	2015-04-09	CMS	Please check this dataset
112866	Green	Less Urgent	In Progress	2015-04-02	2015-04-07	CMS	Many jobs are failed/aborted at T1_UK_RAL
112819	Green	Less Urgent	In Progress	2015-04-02	2015-04-07	SNO+	ArcSync hanging
112721	Green	Less Urgent	Waiting Reply	2015-03-28	2015-04-06	Atlas	RAL-LCG2: SOURCE Failed to get source file size
112713	Green	Urgent	In Progress	2015-03-27	2015-03-31	CMS	Please clean up unmerged area - RAL
111699	Yellow	Less Urgent	In Progress	2015-02-10	2015-03-23	Atlas	gLExec hammercloud jobs keep failing at RAL-LCG2 & RALPP
109694	Red	Urgent	In Progress	2014-11-03	2015-03-31	SNO+	gfal-copy failing for files at RAL
108944	Red	Less Urgent	In Progress	2014-10-01	2015-03-30	CMS	AAA access test failing at T1_UK_RAL

Availability Report

Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud


Day	OPS	Alice	Atlas	CMS	LHCb	Atlas HC	CMS HC	Comment
15/04/15	97.9	100	90.0	90.0	90.0	100	98	Castor 2.1.14-15 upgrade.
16/04/15	100	100	94.0	100	100	100	99	Couple of SRM Test errors around half an hour apart.
17/04/15	100	100	100	100	100	100	100
18/04/15	100	100	100	100	100	100	100
19/04/15	100	100	100	100	100	98	100
20/04/15	100	100	100	100	100	100	n/a
21/04/15	100	100	100	100	100	100	n/a
22/04/15	100	100	100	100	100	100	n/a
23/04/15	100	100	100	100	100	100	n/a
24/04/15	100	100	100	100	100	100	n/a
25/04/15	100	100	100	100	100	100	n/a
26/04/15	100	100	100	100	100	100	n/a
27/04/15	100	100	100	99.0	99.0	100	100	Argus problem - job submissions failed for a while.
28/04/15	100	100	100	100	100	100	100
29/04/15	100	100	100	100	100	100	100
30/04/15	100	100	100	100	100	100	100
01/05/15	100	100	100	100	96.0	95	100	Single SRM test failure. Error listing file.
02/05/15	100	100	100	100	100	96	100
03/05/15	100	100	100	100	100	83	100
04/05/15	100	100	100	100	100	90	99
05/05/15	100	100	100	52.0	100	93	100	CMS CE SAM tests failed at many sites due to an expired proxy.

@@ Line 52: / Line 52: @@
 ====== ======
 <!-- ******************************************************************** ----->
-<!-- *************Start Notable Changes made this last week************** ----->
+<!-- *************Start Notable Changes made since the last meeting************** ----->
 {| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 |-
 | style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Notable Changes made this last week.
 |}
-* Castor was successfully upgraded to version 2.1.14-15 this morning.
 * gfal2 and davix rpms are in the process of being updated across the worker nodes.
+* Yesterday (5th May) the two remaining CREAM CEs were put into draining mode as the next step in their decommissioning.
 <!-- *************End Notable Changes made this last week************** ----->
 <!-- ****************************************************************** ----->