RAL Tier1 Incident 20130626 Failure of RAL CVMFS Stratum1 Triggered Batch Farm Problems


Description

The CVMFS Stratum 1 server at RAL developed hardware problems and had to be stopped. In principle CVMFS fails over to use other replicas. However, this did not happen across the Tier1 batch farm, where many nodes were running a test version of the CVMFS client in which the failover was broken. Unfortunately this problem coincided with a broken version of the CMS software in the repository. CMS fixed this, but many of the RAL worker nodes were unable to pick up the corrected version. Furthermore, three other UK sites also reported problems with CVMFS following the failure of the Stratum 1 server.

The RAL server also provides a CVMFS repository for some non-LHC VOs. A mirror at CERN had recently been set up but had not been announced at the time of the failure. As part of this incident, information was distributed to the UK sites on how to reconfigure their clients to use this mirror.

Impact

The main impact was that CMS were unable to run any jobs successfully for 24 - 36 hours. A secondary impact was that, in applying the fix, the nodes running client version 2.1.10 (a sizable part of the farm) lost access to CVMFS, causing some running jobs to crash.

Timeline of the Incident

When What
 ?? June 2013 Hardware failure on RAID array behind the Stratum 1 server (webfs) not noticed
26th June 2013 09:51 CMS request that the CMS CVMFS repository be fixed.
26th June 2013 10:15 Started to respond to the webfs hardware failure following notification from CMS. At this point we first became aware of a performance problem on the Stratum 1.
26th June 2013 15:00 (approx) Phone call received from P.Grobech (Oxford) asking if we had problems with our CVMFS repository. First indication of the problem being seen by external sites.
26th June 2013 15:30 Problem on CVMFS Stratum 1 announced (on TB-Support list).
26th June 2013 16:56 Reconfiguration of batch farm (using Quattor) not to use the RAL stratum 1.
26th June 2013 17:15 Messages sent to the phys-ibergrid and enmr.eu VOs that their CVMFS repositories were not available.
26th June 2013 17:28 Announced availability of CERN mirror of CVMFS repositories for mice, na62 and hone (via TBSupport e-mail list).
27th June 2013 10:00 Stratum 1 server (webfs) and the reverse proxy squids which sit in front of it turned off.
27th June 2013 11:35 CVMFS configuration on batch farm updated with Quattor to remove references to RAL Stratum 1.
27th June 2013 15:00 Meeting at the Tier 1 to discuss the situation. Farm nodes with v2.0.18: seem to be slowly propagating the change/update. Farm nodes with v2.1.10: CVMFS broken (presumably after the config update) until a reload is done. Farm nodes with v2.1.11: known that a CVMFS reload picks up the new version OK. All of these have been reloaded (as have the 2.1.10 nodes).
27th June 2013 16:52 Summary of current situation sent to TB-Support list.
30th June 2013 Stratum 1 server (webfs) back up.
1st July 2013 Batch farm reconfigured to again use the RAL stratum1 server.
4th July 2013 16:00 Roll out newer CERN-provided Nagios test for CVMFS.
5th July 2013 09:30 Event handler enabled on the new Nagios CVMFS test to disable failing nodes in the batch system.
5th July 2013 10:45 Request received from PIC about a possible problem with the lhcb-conddb repository on the RAL stratum 1.
5th July 2013 13:30 Problem with lhcb-conddb snapshots found and fixed on stratum 1 server.
10th July 2013 14:20 Roll out newly available updated CERN-provided Nagios test for CVMFS.
17th July 2013 11:00 Completed roll-out of CVMFS client version 2.1.12.

Incident details

We first noticed a problem when we were notified of the urgent need to update the CMS repository in an email from Christoph Wissing at 09:51 on 26th June. It then became clear there was a problem with the backend storage of the Stratum 1 server, blocking the update. The relevant staff discussed the problems and how best to resolve them.

At this point we mistakenly believed that:

  1. clients would correctly fail over if we turned off the Stratum 1 and
  2. the only concern was the small VOs whose repositories had not yet been replicated to CERN.

In response, a new CVMFS configuration was rolled out for the non-LHC VOs to point them to the CERN replica. The web server on webfs and the reverse proxy squids which sit in front of it were turned off.
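
Before repointing clients at a replica it is worth confirming that the replica is actually serving the repositories in question. A Stratum 1 publishes each repository's manifest (.cvmfspublished) over plain HTTP, so a simple probe is enough. The sketch below is illustrative only: the server URLs and repository name are placeholders, not the real RAL or CERN endpoints, and it is not the check performed at the time.

  # Illustrative sketch: check whether each Stratum 1 / mirror answers with the
  # repository manifest over HTTP. All URLs and names below are placeholders.
  import urllib.request

  REPLICAS = [
      "http://ral-stratum1.example.org/cvmfs",     # hypothetical RAL-style URL
      "http://cern-mirror.example.org/cvmfs",      # hypothetical CERN mirror URL
  ]
  REPOSITORY = "mice.example.org"                  # placeholder repository name

  def replica_ok(base_url, repo, timeout=10):
      """Return True if the replica serves the repository's .cvmfspublished manifest."""
      url = "%s/%s/.cvmfspublished" % (base_url, repo)
      try:
          with urllib.request.urlopen(url, timeout=timeout) as response:
              return response.getcode() == 200
      except OSError:
          return False    # unreachable, timed out, or HTTP error

  for base in REPLICAS:
      print(base, "OK" if replica_ok(base, REPOSITORY) else "UNAVAILABLE")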

At the end of the working day on the 26th June staff believed the local problems with CVMFS were resolved (apart from the lack of availability of the repository for those smaller VOs for which there was no replica elsewhere.)

During the afternoon of 26th June the Tier1 was contacted by one of the UK Tier2 sites (Oxford), who were also seeing CVMFS problems. A notification was sent to the 'TB Support' list to warn UK Tier2 managers that they might also experience problems.

However, the following morning (27th June) it was clear that some batch jobs were still failing. During the day a number of staff worked on investigating the problem, which proved complex: the failures varied between worker nodes and were not solely dependent on the CVMFS version running. Not all the staff working on the problem at this time were in the office at RAL, and the investigations were further complicated by local and remote staff not being fully aware of what the others were doing.

On Sunday 30th June the RAL Stratum 1 server was put back into service, and the following day (Monday 1st July) a CVMFS client re-configuration was rolled out across the batch farm to make use of it. The server was also reconfigured to remove some networking bottlenecks and to improve the performance of the storage backend.

On Thursday the 4th July the Nagios test provided by the CERN CVMFS developers was rolled out. This enabled problems with specific worker nodes to be more easily identified. It was then possible to track down issues specific to individual nodes and either disable them from the batch system or fix their CVMFS repositories in order to re-establish successful batch working.
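
The sketch below illustrates the kind of per-node check and event handler described above; it is not the CERN-provided Nagios probe itself. It assumes the standard cvmfs_config probe client command, and the pbsnodes call is only an example of taking a node out of the batch system, which would need adapting to whichever batch system is actually in use.

  # Minimal Nagios-style sketch (assumptions noted above): probe CVMFS on this
  # worker node and, on failure, mark the node offline in the batch system.
  import socket
  import subprocess
  import sys

  NAGIOS_OK, NAGIOS_CRITICAL = 0, 2

  def cvmfs_healthy():
      """'cvmfs_config probe' tries to mount and stat each configured repository."""
      try:
          result = subprocess.run(["cvmfs_config", "probe"],
                                  capture_output=True, text=True)
      except FileNotFoundError:
          return False    # client not installed: treat as unhealthy
      return result.returncode == 0

  def offline_node():
      """Example only: mark this node offline in a Torque/PBS-style farm."""
      subprocess.run(["pbsnodes", "-o", socket.gethostname(),
                      "-N", "CVMFS probe failed"])

  if cvmfs_healthy():
      print("OK: all CVMFS repositories respond")
      sys.exit(NAGIOS_OK)
  else:
      offline_node()
      print("CRITICAL: CVMFS probe failed, node offlined")
      sys.exit(NAGIOS_CRITICAL)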

Analysis

For some months prior to this incident the Tier1 had been tracking a problem of timeouts ("cmtside time out errors") in batch job set-up that was seen by both ATLAS and LHCb. This problem was linked to CVMFS, and as part of the investigation different versions of the CVMFS client had been tried across the Tier1 batch farm. Tier1 staff were working closely with the CVMFS developers on the problem.

At the time of the incident the batch farm was running three different versions of the CVMFS client. A few clusters were still running 2.0.18; this was the release that was causing the cmtside time out errors for LHCb and ATLAS, but it was otherwise a very stable release and was deployed at most Grid sites. The majority of the farm was using 2.1.10, the first release in the 2.1.x branch that we had sufficiently validated and shown to reduce the LHCb and ATLAS error rates. After we had pushed out this release, a bug was found in the failover mechanism, which had not been exercised during the validation period. The remainder of the farm was running 2.1.11, a new test release intended to fix the failover problem in 2.1.10.

A problem on the RAL stratum 1 server ("webfs") was discovered following notification from CMS of the need to update the CMS CVMFS repository. It was initially thought that the problem with webfs was to do with updating files rather than serving files out. Subsequent investigation revealed a problem with the disk array behind the server; however, a lack of monitoring meant that no early warning of this had been given.

One of the features of CVMFS is its ability to fail over to use other servers. The initial response to the problem on webfs worked on this assumption, i.e. that both our batch farm and those at other sites would be largely unaffected by the failure. Initial concern focused on those VOs for which RAL is either the only repository or for which the repository at CERN had only recently (earlier that week, by chance) been set up (MICE, NA62 and H1).
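
For context, the failover being relied on here is driven by the client setting CVMFS_SERVER_URL, an ordered, semicolon-separated list of Stratum 1 URLs that the client works through when a download fails. The sketch below only illustrates that intended ordering with placeholder URLs; the real client does this internally and also handles proxies, host reordering and retries.

  # Sketch of the intended fail-over order: try each replica listed in
  # CVMFS_SERVER_URL in turn, falling back to the next when a download fails.
  # URLs and repository names are placeholders.
  import urllib.request

  # Semicolon-separated replica list, as it would appear in the client config.
  CVMFS_SERVER_URL = ("http://ral-stratum1.example.org/cvmfs/@fqrn@;"
                      "http://cern-stratum1.example.org/cvmfs/@fqrn@")

  def fetch_with_failover(fqrn, path, timeout=10):
      """Try each replica in order; return the first successful response body."""
      for template in CVMFS_SERVER_URL.split(";"):
          url = template.replace("@fqrn@", fqrn) + path
          try:
              with urllib.request.urlopen(url, timeout=timeout) as response:
                  return response.read()
          except OSError:
              continue    # replica down or unreachable: try the next one
      raise RuntimeError("all replicas failed for %s%s" % (fqrn, path))

  # Usage (with real server URLs substituted):
  #   manifest = fetch_with_failover("cms.example.org", "/.cvmfspublished")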

The configuration update rolled out across the whole farm to use the CERN CVMFS server, so as to have access to the repositories for MICE, NA62 and H1, was also expected to fix the failover problem known in version 2.1.10. At this point it was believed the problem was contained to those repositories for which there was no alternative server (phys-ibergrid, enmr.eu).

The problem was complex as the symptoms seen depended on the version of the CVMFS client on the particular node.

The problem was particularly severe for CMS as, just before the failure of the RAL stratum 1 server, they had distributed a version of their CVMFS repository with a broken link. This broken link stopped CMS batch jobs from running. During this problem it was not possible to update the CVMFS repository on the worker nodes.

During the incident, testing whether a worker node had successfully failed over to another server gave misleading results.
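
One way to see what a node is really doing is to read the extended attributes the CVMFS client exposes on each repository mount point (for example host, revision and nioerr). The sketch below is a minimal illustration, assuming a mounted repository and the user.* attribute names used by recent clients; it is not the diagnostic procedure followed during the incident.

  # Minimal sketch (assumed xattr names): report which server a mounted CVMFS
  # repository is currently using, which catalog revision it has, and how many
  # I/O errors it has seen. The mount point is a placeholder.
  import os

  MOUNTPOINT = "/cvmfs/cms.example.org"    # placeholder repository mount

  def cvmfs_xattr(name):
      """Read one of the client's 'magic' extended attributes from the mount."""
      try:
          return os.getxattr(MOUNTPOINT, "user." + name).decode().strip()
      except OSError:
          return "unavailable"    # not mounted, or attribute not exposed

  for attr in ("host", "revision", "nioerr"):
      print("%-8s %s" % (attr, cvmfs_xattr(attr)))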

At the time of the problem a Nagios test for CVMFS was running on all worker nodes. However, this test was written for the 2.0.x branch of the CVMFS client and did not work for nodes running versions 2.1.x. Updating the test to use the one provided by the CVMFS developers enabled a much better diagnostic of problems on individual worker nodes.

The failure of the Stratum 1 server also affected other sites and three UK Tier2s reported resulting CVMFS problems.

In summary the problem was triggered by a hardware problem on the RAL Stratum 1 server. The effect of this was severe owing to the failover functionality within CVMFS not working.

Follow Up

Issue: No early warning of hardware failures that would affect the RAL Stratum 1 server. Response: Set up Nagios tests for all appropriate items (disk failures etc.). Done: Yes
Issue: The investigation of the job set-up timeouts had led to multiple versions of the CVMFS client running at the same time on the batch farm, complicating the picture when a problem occurred. Response: Set up guidelines for the amount of testing required before rolling out changes to the CVMFS client (and any other similar software) across the batch farm. Done: Yes
Issue: During the incident communication with the CVMFS developer(s) was routed through one individual. As RAL is testing CVMFS client versions, there could be communication difficulties should CERN support be urgently needed and the relevant staff be absent. Response: Links with the CVMFS developers at CERN to be formalised (e.g. ensuring relevant staff are aware of the cvmfs mailing list) so that there is a contact method known to other RAL staff for requesting assistance and providing feedback on CVMFS. Done: Yes
Issue: During the investigations into this incident not all staff working on the problems were present at RAL, and there was some confusion as to who was doing, or had done, what and when. Response: Clarify working procedures when some staff involved in an incident are working remotely. Done: No

Related issues

Following this incident, problems were discovered in CVMFS client 2.1.11 which caused a significant fraction of CMS jobs to fail. The Tier1 batch farm manager decided it was safest to validate the 2.1.12 release on a cluster for a week before rolling it out across the farm. This allowed both ATLAS and LHCb to confirm that they were not experiencing problems.

Reported by: Gareth Smith. 17th July 2013

Summary Table

Start Date 27 June 2013
Impact >20% (all CMS jobs on the batch farm were affected)
Duration of Outage 24 - 36 hours
Status Open
Root Cause Hardware failure of the Stratum 1 server; software bug in the CVMFS client
Data Loss No