Tier1 Operations Report 2010-12-01

RAL Tier1 Operations Report for 1st December 2010

Review of Issues during the week from 24th November to 1st December 2010.

  • On Wednesday (24th Nov) an error was reported of too many sessions into the Oracle Atlas 3D database. This was resolved that afternoon.
  • We have been seeing problems with CMS transfers from RAL for some time (weeks); most have been timing out because they were so slow. We have also seen slow transfers for other VOs (e.g. Atlas), and the problem appeared to be getting worse. On Tuesday (30th Nov) a faulty transceiver was found on one of the uplinks from a switch stack. Replacing it has fixed these slow transfer problems.
  • On Friday morning (26th Nov) gdss120 (lhcbRawDst) was out of production for around 40 minutes. It was rebooted to resolve a problem where it did not detect a replacement disk drive.
  • Over the weekend there was a problem with one of the Top-BDII systems: the bdii service failed on the node and was restarted early on Saturday morning (27th). (A basic responsiveness check for a Top-BDII is sketched after this list.)
  • On Saturday afternoon there was a problem with disk server GDSS90 (cmsWanOut), which was taken out of service. Following checks, which could not find any fault, it was put back in service this morning (Wed. 1st Dec.)
  • There have been problems on WMS03 (the non-LHC WMS). It has locked up and needed restarting on both Monday and Tuesday mornings this week (29th & 30th Nov). This is under investigation.
  • There was a problem overnight (30 Nov - 1 Dec) with the Atlas SRMs which had lost their connections to the database. This was fixed around lunchtime today (1st Dec.)
  • The FTS front-end update to gLite 3.2 was completed successfully.
  • The conversion of CE08 to a CREAM CE was completed successfully.
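
The Top-BDII restart mentioned above was noticed when the service stopped answering queries. As a purely illustrative sketch (not the monitoring actually used at RAL), the following Python script checks whether a top-level BDII is responding; the host name is hypothetical, while port 2170 and the "o=grid" base are the standard settings for a Top-BDII, and the script assumes the ldapsearch client is installed.

    import socket
    import subprocess

    BDII_HOST = "top-bdii.example.org"   # hypothetical host name - substitute the real alias
    BDII_PORT = 2170                     # standard LDAP port for a top-level BDII

    def port_open(host, port, timeout=5):
        """Return True if a plain TCP connection to host:port succeeds."""
        try:
            socket.create_connection((host, port), timeout).close()
            return True
        except (socket.error, socket.timeout):
            return False

    def ldap_answers(host, port):
        """Run a trivial ldapsearch against the BDII and report whether it returns output."""
        cmd = ["ldapsearch", "-x", "-LLL",
               "-H", "ldap://%s:%d" % (host, port),
               "-b", "o=grid", "-s", "base", "objectClass"]
        try:
            out = subprocess.check_output(cmd)
        except (OSError, subprocess.CalledProcessError):
            return False
        return bool(out.strip())

    if __name__ == "__main__":
        if not port_open(BDII_HOST, BDII_PORT):
            print("LDAP port closed - the bdii service is probably down")
        elif not ldap_answers(BDII_HOST, BDII_PORT):
            print("Port open but the query failed - the service may need a restart")
        else:
            print("Top-BDII is responding to queries")

Run regularly (e.g. from cron), a check of this kind flags a silent bdii failure rather than waiting for user reports.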

Current operational status and issues.

  • Problems with a particular batch of disk servers: This has been responsible for a significant number of disk server failures, including the two Atlas ones during the last few weeks. This batch of disk servers will be removed from service. The recent availability of the large batch from the '09 purchase makes this possible. It is anticipated this will take place over the next week or two. In the meantime the problematic servers have been marked read-only. For D1T0 servers this has been done by setting the scheduling within Castor to not allocate writes to the relevant disk servers. Files on these servers are still available (for batch work and via FTS).
  • LHCb disk servers have been closely monitored for performance as the amount of batch work done by LHCb has been increased. The max. number of LHCb batch jobs was increased from 800 to 1000 on Monday morning (29 Nov), and then to 1200 this morning (1st Dec.)
  • During the night of 26-27 October disk server GDSS117 (CMSWanIn) failed with a read-only file system. It was removed from production and will be replaced with a spare.
  • Testing of an EMC disk array with one of its power supplies connected to the UPS supply continues. Further discussions on removing the electrical noise have taken place and a solution is being prepared that makes use of one (or more) small isolating transformers.
  • Transformer TX2 in R89 is still out of use. Following work carried out on TX4 on 18th Oct., the indication is that the TX2 problem was caused by over-sensitive earth leakage detection. Plans are being made to resolve this. TX2 will initially be run for a period with no load before being brought back into use.

Declared in the GOC DB

  • "Outage" for upgrade of Atlas Castor instance - Monday to Wednesday 6-8 December.

Advanced warning:

The following items remain to be scheduled/announced:

  • Today (Wednesday 1st Dec) - rolling update of microcode on half of tape drives.
  • Power outage of the Atlas building over the weekend of 11/12 December. The whole Tier1 will be at risk.
  • Monday 13th December (just after LHC 2010 run ends): UPS test.
  • Upgrade to 64-bit OS on Castor disk servers to resolve checksumming problem.
  • Increase shared memory for OGMA, LUGH & SOMNUS (rolling change) - see the sketch after this list.
  • Address permissions problem regarding Atlas User access to all Atlas data.
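
Regarding the shared-memory increase for OGMA, LUGH & SOMNUS noted above: on Linux the limits that normally matter for growing an Oracle SGA are the kernel.shmmax and kernel.shmall settings. The sketch below is illustrative only - the target value is an assumption, not the planned figure - and simply reads the current limits from /proc so they can be compared with what the databases will need; the permanent change would go in /etc/sysctl.conf and be applied with 'sysctl -p'.

    import os

    TARGET_SHMMAX = 16 * 1024 ** 3   # hypothetical target in bytes - not the planned value

    def read_kernel_param(name):
        """Return the integer value of /proc/sys/kernel/<name>."""
        with open("/proc/sys/kernel/" + name) as f:
            return int(f.read().strip())

    if __name__ == "__main__":
        shmmax = read_kernel_param("shmmax")   # largest single shared-memory segment, in bytes
        shmall = read_kernel_param("shmall")   # total shared memory allowed, in pages
        page = os.sysconf("SC_PAGE_SIZE")      # page size in bytes
        print("shmmax = %d bytes, shmall = %d bytes" % (shmmax, shmall * page))
        if shmmax < TARGET_SHMMAX:
            print("shmmax is below the illustrative target and would need raising "
                  "via kernel.shmmax before the SGA could grow")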

Entries in GOC DB starting between 24th November and 1st December 2010.

There was one unscheduled entry in the GOC DB for this last week. This was for the unresponsive WMS.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgwms03 | UNSCHEDULED | OUTAGE | 29/11/2010 20:00 | 30/11/2010 09:00 | 13 hours | Host is not responsive
lcgfts | SCHEDULED | WARNING | 24/11/2010 10:00 | 24/11/2010 12:00 | 2 hours | Switch over to using the gLite 3.2 Web Service
lcgce08 | SCHEDULED | OUTAGE | 22/11/2010 10:00 | 25/11/2010 14:00 | 3 days, 4 hours | Drain and reinstall as CREAM CE