RAL Tier1 Operations Report for 24th November 2010

Review of Issues during the week from 17th to 24th November 2010.

  • On Thursday 18th November disk server gdss221 (LHCbUser) had a problem and was taken out of production. Following an intervention and verification of the RAID array it was returned to production later that day.
  • Also on Thursday we saw a very high number of connections to OGMA, the Atlas 3D database. This is related to a bug in Atlas software, which will be fixed. In addition, a 'sniper script' has been deployed to clear up stale database sessions (see the sketch after this list).
  • On Friday 19th November we saw some high loads on the LHCbUser area in Castor for a while.
  • During the weekend of 13/14 November there had been high load on Atlas storage, with the limitation seen in the SRMs. Since then we had been running with some FTS channels turned down and a reduced batch capacity for Atlas. Yesterday (23rd November) two new ATLAS SRM servers were deployed. These run the back-end daemon, decoupling it from the front end, and are not included in the srm-atlas alias.
  • The problem with the cooling of one of the power supplies on the tape robot was fixed yesterday (23rd November).
  • The upgrade of the CMS Castor instance was completed within the scheduled time last week. During the upgrade a configuration problem was found with migrations to tape; this was resolved before the outage ended.
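
The 'sniper script' mentioned above is not reproduced here, but the general technique on an Oracle database such as OGMA is to query the v$session view for sessions that have been idle too long and kill them. Below is a minimal sketch in Python, assuming a cx_Oracle client plus a hypothetical connection string, account name and idle threshold; it is illustrative only and not the script actually deployed at RAL.

  # Illustrative sketch only: kill sessions for one application account that
  # have been idle for more than an hour. The connection details, account name
  # and threshold are assumptions, not taken from the RAL configuration.
  import cx_Oracle

  IDLE_LIMIT_SECONDS = 3600           # assumed idle threshold
  TARGET_USER = "ATLAS_COOL_READER"   # hypothetical application account

  conn = cx_Oracle.connect("admin_user", "password", "ogma.example.ac.uk:1521/OGMA")
  cur = conn.cursor()

  # Find sessions for the application account that have been inactive too long.
  cur.execute(
      """SELECT sid, serial# FROM v$session
         WHERE username = :usr AND status = 'INACTIVE' AND last_call_et > :idle""",
      usr=TARGET_USER, idle=IDLE_LIMIT_SECONDS,
  )

  # ALTER SYSTEM KILL SESSION takes the identifiers inside the quoted string,
  # so they cannot be passed as bind variables.
  for sid, serial in cur.fetchall():
      cur.execute("ALTER SYSTEM KILL SESSION '%d,%d' IMMEDIATE" % (sid, serial))

  conn.close()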

Current operational status and issues.

  • Over the weekend Atlas disk server gdss391 (AtlasDataDisk) failed with FSProbe errors. The resulting data loss has been reported to Atlas. This follows closely on the failure of gdss398 (also in AtlasDataDisk) on 8th November, which also led to data loss. Both of these servers are from the same batch. Problems with these servers are being followed up with the vendor, and plans are being made to withdraw the batch from production while the issue is resolved. A Post Mortem for the first incident is being prepared at:
 https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20101108_Atlas_Disk_Server_GDSS398_Data_Loss
  • During the night of 26-27 October disk server GDSS117 (CMSWanIn) failed with a read-only filesystem and was removed from production. This server will be replaced with a spare.
  • Since the Castor upgrade there have been problems with CMS transfers from RAL: most have been timing out because they were running so slowly. The cause of this has not been fully understood. On Monday afternoon (22nd) CMS was switched over to internal GridFTP, and some cmsWanOut disk servers were partially drained in order to spread the load more evenly. The number of disk-to-disk (D2D) copies was also reduced from 5 to 1 per source disk server, and the CMS file limit on each CLOUD channel from RAL was halved. On Tuesday morning 5 extra disk servers were put into cmsWanOut (moved from cmsTemp). From Monday afternoon things started to improve, although between 9pm and 10pm on Monday all CMS transfers to/from RAL failed with SRM errors; the cause of this is also not known. Since that incident everything has been much better.
  • Testing of an EMC disk array with one of its power supplies connected to the UPS supply continues. Further discussions on removing the electrical noise have taken place and a solution is being prepared.
  • Transformer TX2 in R89 is still out of use. Following work carried out on TX4 on 18th October, the indication is that the cause of the TX2 problem relates to over-sensitive earth-leakage detection. Plans are being made to resolve this.

Declared in the GOC DB

  • "Outage" on CE08 - Drain and reinstall as CREAM CE
  • "Warning" for Switch over to using the gLite3.2 Web Service Wednesday 24th November 10-12.
  • "Outage" for upgrade of Atlas Castor instance - Monday to Wednesday 6-8 December.

Advance warning:

The following items remain to be scheduled/announced:

  • Next week - rolling update of microcode on half of the tape drives.
  • Power outage of the Atlas building, weekend of 11/12 December. Whole Tier1 at risk.
  • Monday 13th December (just after LHC 2010 run ends): UPS test.
  • Upgrade to 64-bit OS on Castor disk servers to resolve checksumming problem.
  • Increase shared memory for OGMA, LUGH & SOMNUS (rolling change).
  • Address permissions problem regarding Atlas User access to all Atlas data.

Entries in GOC DB starting between 17th and 24th November 2010.

There were no unscheduled entries in the GOC DB in the last week.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgfts | SCHEDULED | WARNING | 24/11/2010 10:00 | 24/11/2010 12:00 | 2 hours | Switch over to using the gLite3.2 Web Service
All Castor (storage) | SCHEDULED | WARNING | 23/11/2010 10:00 | 23/11/2010 14:00 | 4 hours | Tape system unavailable. Work on tape robot to resolve problem with power supply cooling.
lcgce08 | SCHEDULED | OUTAGE | 22/11/2010 10:00 | 25/11/2010 17:00 | 3 days, 7 hours | Drain and reinstall as CREAM CE
srm-cms | SCHEDULED | OUTAGE | 16/11/2010 08:00 | 18/11/2010 10:16 | 2 days, 2 hours and 16 minutes | Upgrade of CMS Castor instance to version 2.1.9.