RAL Tier1 Incident 20101026 LHCb SRM Bad TURL and Outage


RAL Tier1 Incident 26th October 2010: LHCb SRM returns Bad TURL and Subsequently Suffers Outage

Description:

On Tuesday 26th October LHCb raised a GGUS ticket reporting that the SRM endpoint (srm-lhcb) was returning badly formed TURLs. Investigations took place but the cause was not found. On Friday morning (29th October) a restart of the Castor head nodes was carried out in order to resolve the TURL problem. However, significant problems were then encountered with time-outs being returned by the LHCb SRMs. Restarting the SRM servers at lunchtime that day appeared to fix the time-out problem, but the time-outs returned later that evening/night (29/30 October). On Monday (1st November) some clean-up and re-configuration of the SRMs was done and the service was declared back in production. It was noted that the badly formed TURL problem had not been seen since the restarts on Friday morning.

Impact

The initial problem of badly formed TURLs was sufficient to cause LHCb to stop using the RAL Tier1.

The time-out problems effectively took srm-lhcb off air for the morning of Friday 29th October and over the weekend (30/31 October).

Timeline of the Incident

When What
26th October 2010 14:00 GGUS ticket (63468) from LHCb indicating a malformed TURL returned by the SRM. Followed by an initial response to the ticket.
26th October 2010 17:00 GGUS ticket update from RAL. Cause not understood. Estimated to affect around 10% of returned TURLs.
27th October 2010 14:30 GGUS ticket update from RAL. Cause still not understood. Background information provided.
28th October 2010 Unable to advance investigations into the bad TURL problem owing to a lack of available expertise that day. Decision taken to restart the LHCb Castor instance the following morning.
29th October 2010 08:00 LHCb (R. Nandakumar) confirm that LHCb are happy with the proposal to restart the LHCb Castor instance.
29th October 2010 09:30 Restarted Castor LHCb instance to try to resolve the TURL problem. (Outage declared in GOCDB 09:30 to 10:00)
29th October 2010 10:00 Internal (rfio-based) Castor tests show all OK. Return LHCb Castor instance to production.
29th October 2010 10:30 Failing SAM tests on srm-lhcb. Declared outage in GOCDB, initially from 10:30 to 12:30; this was subsequently extended to 13:30.
29th October 2010 13:20 Rebooted the LHCb SRM machines. SAM tests then passing. A subsequent analysis of the logs shows the problem was due to the back-end SRM processes not connecting to the database after the morning restart. The logs also suggest the problem had resolved itself immediately before these lunchtime reboots.
30th October 2010 01:10 SAM tests again failing. On-call staff notified by the automated system. Investigations failed to resolve the problem. Declared srm-lhcb down in GOCDB until Monday (i.e. from 02:30 on Saturday 30th October to 17:00 on Monday 1st November). However, at times over the weekend srm-lhcb did respond correctly. Subsequent analysis of SRM response times and the SRM request rate indicates a large increase in requests on Friday evening.
1st November 2010 12:44 Sequence of events over Friday and weekend better understood. End GOCDB outage for srm-lhcb.
1st November 2010 12:50 We note that there have not been any cases of the bad TURL problem since Friday morning, 29th October.

Incident details

On Tuesday 26th October LHCb raised a GGUS ticket to report the LHCb Castor SRM was returning badly formed TURLs. Investigations showed this affected around 10% of requests. Work was carried out to try and understand the underlying cause of this problem.

On Friday morning the above problem was still unresolved. To try and make progress a reboot of the LHCb Castor instance 'head nodes' (i.e. all LHCb Castor systems except the disk servers) was carried out on Friday morning, 29th October. However, problems were subsequently found with the srm-lhcb end-point failing SAM tests with time-out errors. These in turn were resolved by restarting the LHCb SRM nodes on Friday lunchtime.

During the evening of Friday 29th October srm-lhcb again started to give time-out errors. Attempts to resolve the matter failed and srm-lhcb was put into a GOCDB outage until Monday (1st Nov.). Further work on Monday morning enabled the service to return to production with the downtime ended at lunch-time that day.

Analysis

The initial problem was the LHCb Castor SRM returning badly formed TURLs. This appeared on Tuesday 26th October and affected around 10% of requests. The cause of this was not understood at the time and has not (as yet) been understood. The problem was followed up during the 26th and 27th October. However, there were no staff available with sufficient expertise in the SRM to follow up on Thursday and Friday, 28/29 October.
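No tool reproduced the malformed TURLs on demand, but a simple client-side check could have quantified the failure rate reported in the ticket. The following is a minimal sketch in Python; the accepted schemes and the notion of "malformed" are assumptions, since the report does not say exactly how the TURLs were broken.

    from urllib.parse import urlparse

    # Assumption: the protocols served by this Castor instance. The report
    # does not say which protocols the bad TURLs used.
    EXPECTED_SCHEMES = {"rfio", "gsiftp", "root"}

    def turl_looks_valid(turl):
        """Rough sanity check on a TURL returned by the SRM.

        Only checks the generic shape: a known scheme, a non-empty host
        and a non-trivial path.
        """
        parsed = urlparse(turl)
        return (parsed.scheme in EXPECTED_SCHEMES
                and bool(parsed.hostname)
                and parsed.path not in ("", "/"))

    def malformed_fraction(turls):
        """Fraction of a sample of returned TURLs that fail the sanity check."""
        turls = list(turls)
        if not turls:
            return 0.0
        return sum(1 for t in turls if not turl_looks_valid(t)) / len(turls)

Run against a sample of TURLs returned by prepare-to-get requests, a check like this would have given a measured failure rate to compare against the roughly 10% figure in the ticket.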

A reboot of the LHCb Castor instance 'head nodes' (i.e. all LHCb Castor systems except the disk servers) was carried out on Friday morning, 29th October. Owing to a known bug, the SRM back-end processes did not re-connect to the database after this reboot, and this was not picked up; as a result, srm-lhcb returned time-out errors throughout Friday morning. A further reboot of the SRMs around lunchtime resolved the database connection issue.
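The follow-up below includes a procedural fix for this: acceptance testing after an intervention should exercise the SRM front-end as well as the internal Castor path. A minimal sketch of such a combined check is given here; the two test commands, host names and file paths are placeholders for whatever the site procedure actually uses, not the real RAL tests.

    import subprocess
    import sys

    # Placeholder commands -- substitute the site's real acceptance tests.
    # The point is that BOTH layers are exercised: the internal (rfio) path,
    # which passed on the morning of 29th October, and the SRM front-end,
    # which at that point could not reach its database.
    INTERNAL_TEST = ["rfcp",
                     "/castor/example.domain/test/lhcb/acceptance_file",  # hypothetical test file
                     "/tmp/acceptance_copy"]
    SRM_TEST = ["srmls",
                "srm://srm-lhcb.example.domain/castor/example.domain/test/lhcb/acceptance_file"]

    def run_test(name, cmd, timeout=120):
        """Run one acceptance test, treating any error or time-out as a failure."""
        try:
            subprocess.run(cmd, check=True, timeout=timeout,
                           stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
            print(f"{name}: OK")
            return True
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired,
                FileNotFoundError) as exc:
            print(f"{name}: FAILED ({exc})")
            return False

    if __name__ == "__main__":
        ok = run_test("internal rfio test", INTERNAL_TEST)
        ok = run_test("SRM front-end test", SRM_TEST) and ok
        sys.exit(0 if ok else 1)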

During the evening of Friday 29th October srm-lhcb again returned time-outs. Attempts to resolve the matter during the night failed and srm-lhcb was put into a GOCDB outage until Monday (1st November). Subsequent analysis showed that the trigger for this problem was a significant increase in the rate of requests to srm-lhcb on Friday evening (29th). The request rate to the SRMs went from 30,000 in the hour from 19:00 to more than 100,000 in the next hour, and then to more than 200,000 in the following hour (from 21:00). 72% of these SRM requests originated from the RAL FTS, 27% from CERN and 1% from the batch farm. FTS monitoring indicated that the activity was transfers to LHCb Tier-1 sites. The bulk of the SRM transactions were status requests (one per file per second). It is likely that, as the performance of the SRMs degraded, these status requests made the situation much worse, leading to meltdown. On Monday some stuck requests in the SRM database were cleaned up and the number of daemon threads was increased from 15 to 25. These actions are expected to improve, but not completely resolve, this problem.
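The request-rate and origin figures above come from aggregating the SRM logs. A sketch of the kind of aggregation involved is shown below; the log format (one line per request with a timestamp, client host and request type) is an assumption, not the real Castor SRM log layout.

    from collections import Counter
    from datetime import datetime

    def hourly_rates(log_lines):
        """Count SRM requests per hour and per originating client.

        Assumed simplified log format, one request per line:
            "<ISO timestamp> <client host> <request type>"
        e.g. "2010-10-29T21:03:12 ftsnode.example srmStatusOfGetRequest"
        The real SRM log format will differ; this only illustrates the
        aggregation behind the 30,000 -> >100,000 -> >200,000
        requests-per-hour figures and the per-origin percentages.
        """
        per_hour = Counter()
        per_client = Counter()
        for line in log_lines:
            parts = line.split(None, 2)
            if len(parts) < 3:
                continue  # skip lines that do not match the assumed format
            stamp, client, _request = parts
            hour = datetime.fromisoformat(stamp).strftime("%Y-%m-%d %H:00")
            per_hour[hour] += 1
            per_client[client] += 1
        return per_hour, per_client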

Follow Up

Issue: Testing Castor after the restart did not pick up the SRM problem.
Response: Modify procedures to ensure that both Castor and the SRM are tested when doing acceptance testing of Castor following an intervention.
Done: Yes

Issue: The SRM daemon did not reconnect to the database once restarted.
Response: The bug that prevented reconnection to the database will require a change within the Castor code that is outside the remit of the RAL team.
Done: N/A

Issue: Lack of SRM expertise across the Castor team makes SRM investigations difficult when a key member of staff is absent.
Response: Spread SRM knowledge better across the Castor team. Update June 2013: the need for the SRM is expected to diminish in future and sufficient expertise has been available from CERN. This is now believed sufficient and this action will no longer be followed up.
Done: N/A

Issue: The throughput capacity of the SRMs for LHCb is not sufficient for the loads seen.
Response: Upgrade the machines (number of machines / performance of machines) used to host the LHCb SRMs.
Done: Yes

Issue: The FTS polls status requests every second, potentially causing meltdown (see the polling sketch after this table).
Response: Investigate why the FTS appears not to have honoured the SRM's response of estimatedWaitTime. Note added following review on 28/06/11: this is being tracked in a Savannah ticket and will no longer be tracked here.
Done: N/A
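For reference, the intended client behaviour in the last item is that the interval between status polls follows the SRM's estimatedWaitTime hint rather than a fixed one second. The sketch below illustrates that behaviour only; it is not FTS code, and the check_status callback and default values are assumptions.

    import time

    def poll_until_ready(check_status, max_wait=3600, default_interval=30):
        """Poll an asynchronous SRM-style request until it is ready.

        `check_status` is assumed to return (done, estimated_wait_seconds),
        where the second value corresponds to the SRM's estimatedWaitTime
        hint. Unlike a fixed one-second poll (the behaviour observed during
        the incident), the wait between polls follows the server's hint,
        falling back to a default when no hint is given.
        """
        waited = 0
        while waited < max_wait:
            done, hint = check_status()
            if done:
                return True
            interval = hint if hint and hint > 0 else default_interval
            time.sleep(interval)
            waited += interval
        return False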


Reported by: Gareth Smith 1st November 2010 13:00

Summary Table

Start Date: 26th October 2010
Impact: >50%
Duration of Outage: 3 days
Status: Closed
Root Cause: Multiple. Unknown (bad TURLs), Software Bug (DB reconnects / SRM status requests), Limited Capacity & User Load
Data Loss: No