Revision as of 11:05, 2 April 2014

RAL Tier1 Operations Report for 2nd April 2014

Review of Issues during the fortnight 19th March to 2nd April 2014.
  • There was a short (around 5 minute) break in external connectivity to the Tier1 during the morning of Thursday 20th March, with a similar event the following morning.
  • There was a failover of an Atlas Castor database early evening on Tuesday 25th March. The failover triggered a callout and the database was put back onto its allocated node. The cause is a bug that has been reported to Oracle.
  • On Friday 28th March we were not running some of the CE SUM tests in a timely manner. It was found that, owing to a separate change in the Condor configuration, we were no longer prioritising the test jobs. This was fixed.
Resolved Disk Server Issues
  • None
Current operational status and issues
  • There have been problems with the CMS Castor instance through the last week. These are triggered by high load on CMS_Tape, with all the disk servers that provide the cache for this service class running flat out in terms of network connectivity. Work is underway to increase the throughput of this disk cache.
  • The Castor Team are now able to reproduce the intermittent failures of Castor access via the SRM that have been reported in recent weeks. Understanding of the problem is significantly advanced and further investigations are ongoing using the Castor Preprod instance. Ideas for a workaround are being developed.
  • The problem of full Castor disk space for Atlas has been eased. Working with Atlas, the file deletion rate has been somewhat improved. However, there is still a problem that needs to be understood.
  • Around 50 files in tape backed service classes (mainly in GEN) have been found not to have migrated to tape. This is under investigation. The cause for some of these is understood (a bad tape at the time of migration). CERN will provide a script to re-send the remaining ones to tape. (An illustrative check for such files is sketched below.)
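
As a purely illustrative aid, the following Python sketch shows one way files that have not yet reached tape might be spotted by parsing the output of the CASTOR name-server listing command nsls. The directory path, the -lR listing format and the convention that a leading 'm' in the mode string marks a file already on tape are assumptions for the example, not details taken from this report or from the CERN-provided script.

import subprocess

def unmigrated_files(castor_dir):
    """Return paths under castor_dir whose entries do not appear to be on tape.

    Assumes 'nsls -lR' output resembles 'ls -lR': a '<dir>:' header line per
    directory, then one line per entry whose first field is the mode string,
    with a leading 'm' marking files already migrated to tape (an assumption).
    """
    out = subprocess.check_output(["nsls", "-lR", castor_dir],
                                  universal_newlines=True)
    pending = []
    current_dir = castor_dir
    for line in out.splitlines():
        line = line.rstrip()
        if not line:
            continue
        if line.endswith(":"):                # directory header, e.g. "/castor/...:"
            current_dir = line[:-1]
            continue
        fields = line.split()
        mode = fields[0]
        if len(fields) < 2 or mode.startswith("d"):
            continue                          # skip sub-directories and unexpected lines
        if not mode.startswith("m"):          # not yet migrated to tape (assumed convention)
            pending.append(current_dir + "/" + fields[-1])
    return pending

if __name__ == "__main__":
    # The GEN service-class path below is a placeholder, not a path taken from the report.
    for path in unmigrated_files("/castor/ads.rl.ac.uk/prod/gen"):
        print(path)
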
Ongoing Disk Server Issues
  • GDSS239 (Atlas HotDisk) crashed this morning. This is being investigated.
Notable Changes made this last week.
  • Approximately half of the worker nodes have now been updated to the EMI-3 version of the WN software. The rollout is continuing without problems.
  • The EMI3 Argus server is in use for most of the CEs and one batch of worker nodes.
  • The old MyProxy server (lcgrbp01.gridpp.rl.ac.uk) was turned off today. Its replacement (myproxy.gridpp.rl.ac.uk) is in production.
  • The 2013 purchases of worker nodes are being added to the farm this week.
  • Two of the CV2013 disk servers (120TB each) have been added to LHCbDst. A further 9 are being added today. Three more servers are in CMS non-prod and will be moved into production imminently.
Declared in the GOC DB
Service Scheduled? Outage/At Risk Start End Duration Reason
srm-lhcb-tape.gridpp.rl.ac.uk UNSCHEDULED WARNING 03/04/2014 08:00 03/04/2014 09:30 1 hour and 30 minutes Warning during further testing of the new tape interface (ACSLS).
lcgrbp01.gridpp.rl.ac.uk SCHEDULED OUTAGE 02/04/2014 12:00 01/05/2014 12:00 29 days System being decommissioned. (Replaced by myproxy.gridpp.rl.ac.uk.)
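
For reference, downtimes such as those declared above can also be retrieved programmatically from the GOC DB. The short Python sketch below queries what we understand to be the public GOCDB XML interface; the get_downtime method and the element names used here are assumptions to be checked against the GOCDB documentation, not values taken from this report.

import urllib.request
import xml.etree.ElementTree as ET

# Assumed public GOCDB query for downtimes declared against the RAL-LCG2 site.
GOCDB_QUERY = "https://goc.egi.eu/gocdbpi/public/?method=get_downtime&topentity=RAL-LCG2"

def ral_downtimes():
    """Yield one dict per declared downtime; the XML element names are assumptions."""
    with urllib.request.urlopen(GOCDB_QUERY) as resp:
        root = ET.fromstring(resp.read())
    for dt in root.findall("DOWNTIME"):
        yield {
            "class": dt.findtext("CLASSIFICATION"),     # e.g. SCHEDULED / UNSCHEDULED
            "severity": dt.findtext("SEVERITY"),        # e.g. OUTAGE / WARNING
            "start": dt.findtext("FORMATED_START_DATE"),
            "end": dt.findtext("FORMATED_END_DATE"),
            "reason": dt.findtext("DESCRIPTION"),
        }

if __name__ == "__main__":
    for d in ral_downtimes():
        print(d["class"], d["severity"], d["start"], "->", d["end"], "-", d["reason"])
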
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Castor 2.1.14 testing is largely complete. We are starting to look at possible dates for rolling this out (probably around April).
  • Networking:
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1 & change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
  • Fabric:
    • We are phasing out the use of the software server used by the small VOs.
    • Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC)
    • There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 19th March and 2nd April 2014.
Service Scheduled? Outage/At Risk Start End Duration Reason
lcgrbp01.gridpp.rl.ac.uk SCHEDULED OUTAGE 02/04/2014 12:00 01/05/2014 12:00 29 days System being decommissioned. (Replaced by myproxy.gridpp.rl.ac.uk.)
All Castor endpoints (All SRMs) SCHEDULED WARNING 01/04/2014 09:00 01/04/2014 11:00 2 hours Testing of new interface to the tape library. During this time Castor disk services will remain up but there will be no tape access. Tape recalls will stall. Writes to tape backed service classes will carry on, with files flushed from the disk caches to tape once the testing is completed.
srm-lhcb-tape.gridpp.rl.ac.uk, SCHEDULED OUTAGE 01/04/2014 09:00 01/04/2014 11:00 2 hours Testing of new interface to the tape library. During this time Castor disk services will remain up but there will be no tape access. Tape recalls will stall. Writes to tape backed service classes will carry on, with files flushed from the disk caches to tape once the testing is completed.
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
102902 Green Urgent In Progress 2014-04-01 2014-04-02 MICE & NA62 Stale .cvmfswhitelist file MICE VO
102770 Green Urgent On Hold 2014-03-27 2014-04-01 NAGIOS *eu.egi.sec.EMI-2* failed on lcgrbp01.gridpp.rl.ac.uk@RAL-LCG2
102611 Green Urgent In Progress 2014-03-24 2014-03-24 NAGIOS *eu.egi.sec.Argus-EMI-1* failed on argusngi.gridpp.rl.ac.uk@RAL-LCG2
101968 Yellow Less Urgent On Hold 2014-03-11 2014-04-01 Atlas RAL-LCG2_SCRATCHDISK: One dataset to delete is causing 1379 deletion errors
101079 Red Less Urgent In Progress 2014-02-09 2014-04-01 ARC CEs have VOViews with a default SE of "0"
99556 Red Very Urgent On Hold 2013-12-06 2014-03-21 NGI Argus requests for NGI_UK
98249 Red Urgent In Progress 2013-10-21 2014-03-13 SNO+ please configure cvmfs stratum-0 for SNO+ at RAL T1
97025 Red Less urgent On Hold 2013-09-03 2014-03-04 Myproxy server certificate does not contain hostname
Availability Report

Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud

Day OPS Alice Atlas CMS LHCb Atlas HC CMS HC Comment
19/03/14 100 100 100 88.6 100 99 73 Multiple SRM test failures (load problems).
20/03/14 100 100 99.7 99.6 100 100 n/a Atlas: One SRM test failure; CMS: CE test failures on all 3 ARC CEs (no compatible resources).
21/03/14 100 100 100 100 100 100 n/a
22/03/14 100 100 100 100 100 100 n/a
23/03/14 100 100 100 100 100 100 n/a
24/03/14 100 100 100 100 100 100 n/a
25/03/14 100 100 99.0 89.8 100 98 99 Atlas: Castor database problem (Atlas_srm DB moved to another RAC node following a DB crash); CMS: SRM SUM test failures spread through the day.
26/03/14 100 100 100 87.1 100 100 99 Four separate SRM test failures.
27/03/14 100 100 100 96.5 100 97 100 Two test failures of SRM Put test.
28/03/14 100 100 100 100 100 100 100
29/03/14 100 100 100 100 100 99 100
30/03/14 100 100 100 100 100 100 99
31/03/14 100 100 100 100 100 100 99
01/04/14 100 100 100 100 100 100 99
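
The daily percentages in the table above come from the standard SAM/SUM availability calculations. Purely as an illustration of the arithmetic behind such a figure (the input format here is invented for the example, and real availability is computed centrally, not by this sketch):

def daily_availability(samples):
    """samples: list of (timestamp, passed) tuples for one VO on one day.

    Returns the percentage of test samples that passed, rounded to one
    decimal place, or None if there were no samples that day.
    """
    if not samples:
        return None
    passed = sum(1 for _, ok in samples if ok)
    return round(100.0 * passed / len(samples), 1)

# Example: 24 hourly samples with failures at 10:00 and 11:00 gives 91.7%.
hourly = [(hour, hour not in (10, 11)) for hour in range(24)]
print(daily_availability(hourly))   # -> 91.7
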