Tier1 Operations Report 2010-11-03

From GridPP Wiki

RAL Tier1 Operations Report for 3rd November 2010

Review of Issues during the week from 27th October to 3rd November 2010.

  • On Tuesday (26th Oct.) LHCb reported problems with malformed TURLs being returned by the LHCb SRMs. This was worked on through Tuesday and Wednesday; it was not possible to progress the issue on Thursday. On Friday morning (29th October) the Castor LHCb head nodes were rebooted, which appeared to fix the bad-TURL problem. However, there was a further problem with the LHCb SRMs during the morning: they had not reconnected to the back-end database following the reboot. This was resolved at lunchtime that day.
  • On the evening of Friday 29th October there were again problems with the LHCb SRMs. This continued through the weekend and appears related to very high request rates on the SRMs. The LHCb Castor instance (srm-lhcb) was put in an outage over the weekend. Replacement SRM machines (each more powerful, and increasing the count from two to three nodes) are being prepared in order to boost this aspect of the LHCb storage service.
    • A Post Mortem has been created covering the above two issues. This is at:
 https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20101026_LHCb_SRM_Bad_TURL_and_Outage
  • On Sunday evening we encountered problems with the batch server and were failing CE SAM tests. This was finally resolved at lunchtime on Monday, when the batch server was rebooted.
  • A long-standing permissions problem for Alice was resolved following the upgrade of the Castor GEN instance to version 2.1.9.
  • The migration of all CMS tape data onto T10KB media has been completed. During the operation a number of bad tapes (i.e. lost data) were uncovered. CMS have been informed of this and we have been working with them to carry out any necessary clean-up.
  • On Tuesday 2nd November the MyProxy server was successfully migrated to a system built with Quattor.
  • During this morning (3rd November) the pair of nodes behind the site-bdii have undergone a rolling update to glite 3.2 and SL5.
  • Atlas have started doing a lot of recalls from tape. Yesterday (2nd Nov.) we had a problem reading 18 data files from tape for them: the first 230 files on the tape were read OK, but the last 18 failed. Subsequent work to rebuild the MIR (Media Information Record) on the tape enabled these files to be read.
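For context on the bad-TURL incident above: a TURL (Transfer URL) is the transfer-protocol URL an SRM returns to a client, and a "malformed" TURL is one the client cannot parse or use. The following is a minimal sketch of the kind of structural sanity check a client might apply; the scheme list, hostname and path are illustrative assumptions, not the actual checks LHCb's tools perform.

```python
from urllib.parse import urlparse

# Transfer schemes a Castor SRM might hand back; illustrative, not exhaustive.
ALLOWED_SCHEMES = {"gsiftp", "rfio", "root"}

def turl_is_well_formed(turl: str) -> bool:
    """Sanity-check a TURL: it needs a recognised transfer scheme,
    a host component, and a non-empty path."""
    parsed = urlparse(turl)
    return (parsed.scheme in ALLOWED_SCHEMES
            and bool(parsed.hostname)
            and parsed.path not in ("", "/"))

# A well-formed TURL (hypothetical host and path):
print(turl_is_well_formed("gsiftp://gftp01.example.org:2811/castor/prod/lhcb/file.dst"))  # True
# A malformed one, here missing the host component entirely:
print(turl_is_well_formed("gsiftp:///castor/prod/lhcb/file.dst"))  # False
```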

Current operational status and issues.

  • On Monday (1st November) the Atlas SRMs crashed repeatedly. This was triggered by the use of a particular SRM command that checked the status of files recalled from tape. (There had been a change in the Atlas software that exposed this problem.) On Tuesday morning the Atlas SRMs were upgraded to fix the problem. So far this is looking good.
  • During the night 26-27 October disk server GDSS117 (CMSWanIn) failed with a read only filesystem. It was removed from production. Since then it has been re-running the acceptance tests before being returned to production.
  • Gdss280 (CMSFarmRead) which had reported FSProbe errors twice, is still out of production and is undergoing acceptance testing.
  • Performance issues on Castor disk servers for LHCb: This is being kept under observation. Investigations were suspended during the Castor 2.1.9 upgrade but are being resumed now that LHCb have re-started running batch work here. A limit of 800 simultaneous LHCb batch jobs was in place during this period; it was reduced to 500 at the start of this week. The limit will be increased in a controlled manner once there is a waiting job queue, with Castor/disk performance monitored.
  • Testing of an EMC disk array with one of its power supplies connected to the UPS supply continues. Further discussions on removing the electrical noise are taking place.
  • Transformer TX2 in R89 is still out of use. Work carried out on 18th Oct. on TX4 indicates that the cause of the TX2 problem relates to over-sensitive earth-leakage detection. Plans are being made to resolve this.
  • Atlas are now running some user jobs at RAL. In order to ensure these do not cause problems (i.e. excessive load) for the existing Atlas software server, these jobs make use of a pilot CVMFS-based solution for delivering Atlas software to the worker nodes. This did expose a permissions problem whereby any Atlas user had overly broad access to all Atlas data.
  • The problem with the cooling of one of the power supplies on the tape robot was investigated during a downtime of the tape system yesterday (2nd Nov). However, the problem is not resolved and a further intervention will be required at a later date.
  • There was a problem reported last night (2-3 Nov) with CE SAM tests timing out when trying to use the CMS Castor instance. This appears to be a recurrence of a problem whereby CASTOR is very busy doing Disk-to-Disk copies. CMS have further limited PhEDEx from staging too many files too quickly.
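The mitigation in the last item above amounts to rate-limiting stage requests: PhEDEx is configured not to stage too many files too quickly. The principle can be sketched generically as a token bucket; the class, names and limits below are hypothetical illustrations, not PhEDEx's actual configuration mechanism.

```python
import time

class StagingThrottle:
    """Token-bucket limiter: allow at most `rate` stage requests per
    second, with bursts of up to `burst` requests.
    (Illustrative sketch only; PhEDEx's real throttling is configured
    differently.)"""

    def __init__(self, rate: float, burst: int):
        self.rate = rate          # tokens replenished per second
        self.burst = burst        # maximum bucket size
        self.tokens = float(burst)
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        """Consume one token if available; refuse the request otherwise."""
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

throttle = StagingThrottle(rate=2.0, burst=5)
# The first `burst` requests pass immediately; subsequent ones are
# refused until tokens have refilled.
results = [throttle.try_acquire() for _ in range(8)]
print(results)
```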

Declared in the GOC DB

  • None.

Advance warning:

The following items remain to be scheduled/announced:

  • Monday 13th December (just after LHC 2010 run ends): UPS test.
  • Castor Upgrade to 2.1.9.
    • Upgrade CMS - Tuesday to Thursday 16-18 November.
    • Upgrade ATLAS - Monday to Wednesday 6-8 December.
  • Upgrade to 64-bit OS on Castor disk servers to resolve checksumming problem.
  • Upgrade to the LHCb SRMs to provide greater capacity.
  • Note added after the meeting: Planned power outage of Atlas building weekend 11/12 December. Whole Tier1 at risk.

Entries in GOC DB starting between 27th October and 3rd November 2010.

There were a total of five unscheduled entries in the GOC DB for this last week. All relate to problems with the LHCb SRM/Castor.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
site-bdii | SCHEDULED | WARNING | 03/11/2010 09:00 | 03/11/2010 13:00 | 4 hours | Rolling update to glite 3.2 and SL5.
lcgrbp01.gridpp.rl.ac.uk | SCHEDULED | WARNING | 02/11/2010 09:30 | 02/11/2010 13:00 | 3 hours 30 minutes | At Risk while system migrated to alternative server built using Quattor.
All SRM end points | SCHEDULED | WARNING | 02/11/2010 09:00 | 02/11/2010 11:37 | 2 hours 37 minutes | Tape system unavailable. Work on tape robot to resolve problem with power supply cooling.
lcgce03, lcgce05, lcgce08, lcgce09 | UNSCHEDULED | OUTAGE | 30/10/2010 04:30 | 01/11/2010 12:00 | 2 days, 7 hours 30 minutes | Problems on LHCb SRM.
srm-lhcb | UNSCHEDULED | OUTAGE | 30/10/2010 02:30 | 01/11/2010 12:44 | 2 days, 10 hours 14 minutes | More SRM problems on LHCb.
srm-lhcb | UNSCHEDULED | OUTAGE | 29/10/2010 12:25 | 29/10/2010 13:29 | 1 hour 4 minutes | Extending outage while we investigate LHCb SRM errors.
srm-lhcb | UNSCHEDULED | OUTAGE | 29/10/2010 10:30 | 29/10/2010 12:30 | 2 hours | Outage while we investigate LHCb SRM errors.
srm-lhcb | UNSCHEDULED | OUTAGE | 29/10/2010 08:30 | 29/10/2010 09:00 | 30 minutes | Restart of LHCb CASTOR instance to fix SRM bug.