Latest revision as of 13:20, 5 March 2014

RAL Tier1 Operations Report for 5th March 2014

Review of Issues during the week 26th February to 5th March 2014.

The Atlas disk space in Castor has become full. We are aware of an ongoing problem where file deletions triggered by Atlas' central service are slow. Some 'manual' deletions of files are taking place to speed up the process.
There have been significant problems with part of our Hyper-V infrastructure that runs many production virtual machines. This started on Friday (28th). The more important VMs have been moved elsewhere while the underlying problem is investigated. The problems were only worked around sufficiently for us to report that services were OK on Tuesday (4th). Services impacted included FTS3, which was running a large-scale test for Atlas & CMS. At our request, on Tuesday (4th) Atlas moved the bulk of their file transfers (all except those for the UK) to other FTS3 servers.
Three CMS files have been lost from a tape. The tape monitoring showed a problem when the tape was being read. On investigation a number of bad files were found on the tape. After further work some of the files were recovered. Three files were finally declared lost to CMS.

Resolved Disk Server Issues

Current operational status and issues

The intermittent failures of Castor access via the SRM reported in recent weeks are still present. These have been seen across multiple Castor instances and the Castor team are actively working to understand them. Some changes have been made with the aim of alleviating the problem, but it recurred this morning (Wednesday 5th March).

Ongoing Disk Server Issues

Notable Changes made this last week.

Two updates have been applied to FTS3 (now at 3.1.80-1)
Increased daemon thread counts for transfermanagerd and stagerd rolled out to all Castor instances. This is part of investigations into the Castor problems reported elsewhere in this report.
Reduced number of replicas for atlasHotDisk from 10 to 1
The new MyProxy server (myproxy.gridpp.rl.ac.uk) added to the BDII. UIs changed to use this as their default MyProxy server.

Declared in the GOC DB

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced.

The Tier1 will move to use the new site firewall on Monday 17th March. There will be some interruption to services as seen from outside RAL. Internally, services are expected to continue uninterrupted.

Listing by category:

Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
Castor:
- Castor 2.1.14 testing is largely complete. We are starting to look at possible dates for rolling this out (probably around April).
Networking:
- Implementation of new site firewall.
- Update core Tier1 network and change connection to site and OPN including:
  - Install new Routing layer for Tier1 & change the way the Tier1 connects to the RAL network.
  - These changes will lead to the removal of the UKLight Router.
Fabric:
- We are phasing out the use of the software server used by the small VOs.
- Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC)
- There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.

Entries in GOC DB starting between the 26th February and 5th March 2014.

Service	Scheduled?	Outage/At Risk	Start	End	Duration	Reason
arc-ce01.gridpp.rl.ac.uk	SCHEDULED	WARNING	26/02/2014 10:00	26/02/2014 12:00	2 hours	At Risk during software upgrade to version 13.11 / 4.0.0.

Open GGUS Tickets (Snapshot during morning of meeting)


GGUS ID	Level	Urgency	State	Creation	Last Update	VO	Subject
101729	Green	Top Priority	Waiting Reply	2014-03-01	2014-03-05	LHCb	Pilots failed at cream-ce02.gridpp.rl.ac.uk RAL-LCG2
101701	Green	Less Urgent	In Progress	2014-02-28	2014-02-28	ILC	Pilots aborted on ARC CEs
101557	Green	Less Urgent	In Progress	2014-02-25	2014-03-04	SNO+	Unable to delegate proxy to fts
101532	Green	Less Urgent	In Progress	2014-02-25	2014-02-25		Publishing default value for Max CPU Time
101079	Red	Urgent	In Progress	2014-02-09	2014-02-25		ARC CEs have VOViews with a default SE of "0"
101052	Red	Urgent	In Progress	2014-02-06	2014-02-26	Biomed	Can't retrieve job result file from cream-ce02.gridpp.rl.ac.uk
100114	Red	Less Urgent	In Progress	2014-01-08	2014-03-04		Jobs failing to get from RAL WMS to Imperial
99556	Red	Very Urgent	In Progress	2013-12-06	2014-02-13		NGI Argus requests for NGI_UK
98249	Red	Urgent	On Hold	2013-10-21	2014-01-29	SNO+	please configure cvmfs stratum-0 for SNO+ at RAL T1
97025	Red	Less Urgent	On Hold	2013-09-03	2014-03-04		Myproxy server certificate does not contain hostname

Availability Report


Day	OPS	Alice	Atlas	CMS	LHCb	Comment
26/02/14	100	100	100	100	100
27/02/14	100	100	100	100	100
28/02/14	100	100	100	100	100
01/03/14	100	100	100	100	100
02/03/14	100	100	100	100	100
03/03/14	100	100	94.0	100	100	Multiple SRM test failures. (4 * "User timeout"; 1 * "SRM_FILE_BUSY")
04/03/14	100	100	99.3	100	100	Single SRM test failure ("Invalid argument")