RAL Tier1 Operations Report for 24th November 2010

Review of Issues during the week from 17th to 24th November 2010.

  • On Thursday 18th November disk server gdss221 (LHCbUser) had a problem and was taken out of production. Following an intervention and verification of the RAID array it was returned to production later that day.
  • Also on Thursday we saw a very high number of connections to OGMA, the Atlas 3D database. This is related to a bug in Atlas software, which will be fixed. In addition, a 'sniper script' has been deployed to clear up stale database sessions (see the sketch after this list).
  • On Friday 19th November we saw some high loads on the LHCbUser area in Castor for a while.
  • During the weekend of 13/14 November there had been high load on Atlas storage, with the limitation seen in the SRMs. Since then we had been running with some FTS channels turned down and a reduced batch capacity for Atlas. Yesterday (23rd November) two new ATLAS SRM servers were deployed. These run the back-end daemon, decoupling it from the front end, and are not included in the srm-atlas alias.
  • The problem with the cooling of one of the power supplies on the tape robot was fixed yesterday (23rd November).
  • The upgrade of the CMS Castor instance was completed within the scheduled time last week. During the upgrade a configuration problem was found with migrations to tape; this was resolved before the outage ended.
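
The 'sniper script' mentioned above is not reproduced here, but the general technique on an Oracle database such as OGMA is to query the v$session view for sessions that have been idle too long and kill them. Below is a minimal sketch in Python, assuming a cx_Oracle client plus a hypothetical connection string, account name and idle threshold; it is illustrative only and not the script actually deployed at RAL.

  # Illustrative sketch only: kill sessions for one application account that
  # have been idle for more than an hour. The connection details, account name
  # and threshold are assumptions, not taken from the RAL configuration.
  import cx_Oracle

  IDLE_LIMIT_SECONDS = 3600           # assumed idle threshold
  TARGET_USER = "ATLAS_COOL_READER"   # hypothetical application account

  conn = cx_Oracle.connect("admin_user", "password", "ogma.example.ac.uk:1521/OGMA")
  cur = conn.cursor()

  # Find sessions for the application account that have been inactive too long.
  cur.execute(
      """SELECT sid, serial# FROM v$session
         WHERE username = :usr AND status = 'INACTIVE' AND last_call_et > :idle""",
      usr=TARGET_USER, idle=IDLE_LIMIT_SECONDS,
  )

  # ALTER SYSTEM KILL SESSION takes the identifiers inside the quoted string,
  # so they cannot be passed as bind variables.
  for sid, serial in cur.fetchall():
      cur.execute("ALTER SYSTEM KILL SESSION '%d,%d' IMMEDIATE" % (sid, serial))

  conn.close()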

Current operational status and issues.

  • Over the weekend Atlas disk server gdss391 (AtlasDataDisk) failed with FSProbe errors. The resulting data loss has been reported to Atlas. This follows closely on the failure of gdss398 (also in AtlasDataDisk) on 8th November, which also led to data loss. Both of these servers are from the same batch. Problems with these servers are being followed up with the vendor, and plans are being made to withdraw the batch from production while the issue is resolved. A Post Mortem for the first incident is being prepared at:
 https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20101108_Atlas_Disk_Server_GDSS398_Data_Loss
  • During the night of 26-27 October disk server GDSS117 (CMSWanIn) failed with a read-only filesystem and was removed from production. This server will be replaced with a spare.
  • Since the Castor upgrade there have been problems with CMS transfers from RAL: most have been timing out because they were running so slowly. The cause of this has not been fully understood. On Monday afternoon (22nd) CMS was switched over to internal GridFTP, and some cmsWanOut disk servers were partially drained in order to spread the load more evenly. The number of disk-to-disk (D2D) copies was also reduced from 5 to 1 per source disk server, and the CMS file limit on each CLOUD channel from RAL was halved. On Tuesday morning 5 extra disk servers were put into cmsWanOut (moved from cmsTemp). From Monday afternoon things started to improve, although between 9pm and 10pm on Monday all CMS transfers to/from RAL failed with SRM errors; the cause of this is also not known. Since that incident everything has been much better.
  • Testing of an EMC disk array with one of its power supplies connected to the UPS supply continues. Further discussions on removing the electrical noise have taken place and a solution is being prepared.
  • Transformer TX2 in R89 is still out of use. Following work carried out on TX4 on 18th October, the indication is that the cause of the TX2 problem relates to over-sensitive earth-leakage detection. Plans are being made to resolve this.

Declared in the GOC DB

  • "Outage" on CE08 - Drain and reinstall as CREAM CE
  • "Warning" for Switch over to using the gLite3.2 Web Service Wednesday 24th November 10-12.
  • "Outage" for upgrade of Atlas Castor instance - Monday to Wednesday 6-8 December.

Advance warning:

The following items remain to be scheduled/announced:

  • Next week - rolling update of microcode on half of the tape drives.
  • Power outage of the Atlas building, weekend of 11/12 December. Whole Tier1 at risk.
  • Monday 13th December (just after LHC 2010 run ends): UPS test.
  • Upgrade to 64-bit OS on Castor disk servers to resolve checksumming problem.
  • Increase shared memory for OGMA, LUGH & SOMNUS (rolling change).
  • Address permissions problem regarding Atlas User access to all Atlas data.

Entries in GOC DB starting between 17th and 24th November 2010.

There were no unscheduled entries in the GOC DB in the last week.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgfts | SCHEDULED | WARNING | 24/11/2010 10:00 | 24/11/2010 12:00 | 2 hours | Switch over to using the gLite3.2 Web Service
All Castor (storage) | SCHEDULED | WARNING | 23/11/2010 10:00 | 23/11/2010 14:00 | 4 hours | Tape system unavailable. Work on tape robot to resolve problem with power supply cooling.
lcgce08 | SCHEDULED | OUTAGE | 22/11/2010 10:00 | 25/11/2010 17:00 | 3 days, 7 hours | Drain and reinstall as CREAM CE
srm-cms | SCHEDULED | OUTAGE | 16/11/2010 08:00 | 18/11/2010 10:16 | 2 days, 2 hours and 16 minutes | Upgrade of CMS Castor instance to version 2.1.9.