Tier1 Operations Report 2011-01-05

RAL Tier1 Operations Report for 5th January 2011

Review of Issues during the fortnight from 22nd December 2010 to 5th January 2011.

  • On Wednesday 22nd December disk servers GDSS70 (LHCbMDST, out since 17th December) and GDSS357 (CMSFarmRead, out since 18/19 December) were returned to production.
  • GDSS117 (CMSWanIn) was out of production overnight on 21/22 December to resolve a problem where it did not recognise a replacement disk drive.
  • On Thursday 23rd December GDSS337 (GenTape) failed. There was only one un-migrated file (for T2K) on it. This server is still out of production and undergoing hardware tests.
  • On Saturday 25th December GDSS283 (CMSFarmRead) reported file system and fsprobe errors and was removed from production. Investigations into a possible hardware fault are ongoing.
  • On Tuesday 28th December there were some problems reported (by the regional Nagios) on both the site and top BDII services. Restarting the site BDIIs resolved the problem, although the underlying cause has not been found.
  • There were problems caused by load on the Atlas Castor instance between 27th and 30th December. The cause was a very full service class, which concentrated a lot of access on a small number of nodes (at one point almost entirely on a single server, gdss488). This was compounded by our draining of the faulty batch of disk servers, although the I/O rates for the draining were low compared with the load from Atlas. Because the servers being drained had previously been marked read-only, the free space published to Atlas did not reflect the space actually available for writing, which caused some confusion. Over the holiday the number of servers being drained simultaneously was reduced from three to one per service class, although it has since been increased back to three. Here is a brief timeline of this issue:
    • Mon 27th: High load on AtlasMCDisk (AtlasSimStrip). Disk server gdss488 was put into draining, which helped somewhat.
    • Tue 28th: Increased the number of gsiftp job slots per disk server from 32 to 50 (for all Atlas disk servers).
    • Tue 28th: Reduced the number of Atlas production jobs to 500 to try to reduce the load on the disk servers.
    • Tue 28th (evening): Six disk servers deployed to AtlasMCDisk (AtlasSimStrip).
    • Thu 30th: Atlas FTS channels reduced to 25% of their normal values; file transfers immediately started succeeding. Attempts had been made to reduce the FTS channels earlier, but a problem had been encountered with the procedure.
    • Fri 31st: Limit on Atlas batch jobs raised to 2000.
    • Wed 5th Jan: Six disk servers moved from AtlasStripInput (AtlasDataDisk) to AtlasSimStrip (AtlasMCDisk).
  • Atlas tape migration problem. During the week before the holiday and over the holiday itself, Atlas tape migrations were not working normally. A workaround was put in place in which the tape migration was periodically forced so as to reduce the backlog (a sketch of this style of workaround is given after this list). The problem, traced to a configuration error relating to tape pools, was found and fixed on Tuesday 4th January.
  • Friday 31st December (late afternoon): Emergency shutdown of PDU G6 in the LPD room. A member of staff making regular checks noticed something wrong and, on inspecting the unit, found it to be seriously over temperature; the PDU was shut down. An initial assessment showed that all production systems stayed up, as they are dual powered, and we continued running. However, a review on Saturday morning raised concern over the disk arrays holding the Oracle databases. These are dual powered, with UPS and mains (LPD) feeds, and it was not clear whether they were still dual powered following the PDU shutdown or were now on UPS power only. As a precaution, services (Castor, LFC, FTS, 3D) were stopped while staff attended on site. The disk arrays hosting the LFC, FTS & 3D databases were found to be on UPS power only, a configuration known to potentially cause problems. These arrays were re-powered so that all arrays were again dual powered (UPS & mains) and services were restarted. This resulted in an outage of 3.5 hours to the Castor, FTS, LFC & 3D services.
  • Saturday 1st January, 21:00: Disk server GDSS364 (CMSTemp) crashed. It was taken out of production and is undergoing tests.
  • Sunday 2nd January: We were not starting Atlas batch jobs. This was traced to a large number of stuck jobs dating from 27th December. Once these were cleared out, Atlas batch jobs started running normally again.
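
The periodic "forced migration" workaround mentioned above is, in essence, a scheduled loop around whatever command flushes the queued files to tape. The following is a minimal sketch only: the command path, pool name and interval are hypothetical placeholders for illustration, not the actual CASTOR tooling or settings used at RAL.

    #!/usr/bin/env python
    """Sketch of a periodic 'forced migration' workaround loop.
    The command and pool name below are hypothetical placeholders,
    not the actual CASTOR tooling used at RAL."""

    import subprocess
    import time
    import logging

    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")

    # Hypothetical command that forces queued files in a tape pool to migrate.
    FORCE_MIGRATION_CMD = ["/usr/local/bin/force_tape_migration", "--pool", "atlasTape"]
    INTERVAL_SECONDS = 30 * 60  # illustrative interval: every 30 minutes

    def force_migration_once():
        """Run the (hypothetical) forced-migration command and log the outcome."""
        try:
            subprocess.check_call(FORCE_MIGRATION_CMD)
            logging.info("Forced migration triggered successfully")
        except subprocess.CalledProcessError as err:
            logging.error("Forced migration failed with exit code %s", err.returncode)

    if __name__ == "__main__":
        while True:
            force_migration_once()
            time.sleep(INTERVAL_SECONDS)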

Current operational status and issues.

  • Problems with a particular batch of disk servers that has been responsible for a significant number of disk server failures: These servers are being drained and taken out of use. There are a total of eight disk servers remaining to be removed, of which three are being drained at the moment.
  • Problem with disk servers becoming unresponsive: We have had a number of cases where disk servers fail to respond (Nagios tests time out). The problem is under investigation. In some cases the servers recover by themselves. It appears to be related to rsyslog; at least, restarting rsyslog seems to have resolved the problem in a couple of cases (a simple check-and-restart sketch is given after this list). Incidents in the last fortnight were:
    • Thursday 23rd December: GDSS464 (LHCbDst). This was resolved about 90 minutes later.
    • Friday 24th December: GDSS375 (LHCbDst). A restart of rsyslog fixed it.
    • Saturday 25th December: GDSS117 (CMSWanIn) failed to respond in a similar way to the others. This system recovered overnight.
    • Saturday 1st January: GDSS97 (CMSWanIn). Resolved when staff attended the site later that day.
  • Transformer TX2 in R89 is still out of use.
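
Since restarting rsyslog has cleared the symptom in a couple of cases, a simple check-and-restart helper could be run on the affected servers while the underlying cause is investigated. This is a minimal sketch only, assuming an SL5-style init system where "service rsyslog status" and "service rsyslog restart" are available; treating any non-zero status as "hung" is a deliberate simplification for illustration.

    #!/usr/bin/env python
    """Sketch of a check-and-restart helper for the rsyslog symptom described
    above. Assumes an init system providing 'service rsyslog status/restart';
    restarting on any failed status check is a simplification."""

    import subprocess
    import logging

    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")

    def rsyslog_healthy():
        """Return True if 'service rsyslog status' reports the daemon as running."""
        return subprocess.call(["service", "rsyslog", "status"]) == 0

    def restart_rsyslog():
        """Restart rsyslog and report whether the restart command succeeded."""
        rc = subprocess.call(["service", "rsyslog", "restart"])
        if rc == 0:
            logging.info("rsyslog restarted")
        else:
            logging.error("rsyslog restart failed (exit code %s)", rc)

    if __name__ == "__main__":
        if not rsyslog_healthy():
            logging.warning("rsyslog appears to be down or hung; restarting")
            restart_rsyslog()
        else:
            logging.info("rsyslog looks healthy; no action taken")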

Declared in the GOC DB

  • 17/18 January: Upgrade to 64-bit OS on Castor disk servers for Atlas.

Advanced warning:

The following items are being discussed and are still to be formally scheduled:

  • Application of kernel update to batch server (some small risk to batch services).
  • Upgrade to 64-bit OS on Castor disk servers to resolve checksumming problem. Possible dates for this are:
    • CMS - Late January or Early February 2011.
  • 22/23 January. Weekend power outage in Atlas building (leading to "At Risk").
  • Increase shared memory for OGMA, LUGH & SOMNUS (rolling change); see the sketch after this list.
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Upgrade all Oracle databases from version 10.2.0.4 to 10.2.0.5 (assuming this upgrade goes OK at CERN).
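
The shared-memory change for OGMA, LUGH & SOMNUS will involve raising kernel shared-memory limits on the database hosts. As a minimal sketch (not the actual procedure), the current limit can be compared against an intended value by reading /proc/sys/kernel/shmmax on a Linux host; the 8 GB target below is purely an illustrative assumption.

    #!/usr/bin/env python
    """Sketch: report whether a Linux host's shared-memory limit
    (kernel.shmmax) already meets an intended value. The 8 GB target
    is an illustrative assumption, not the value planned for the
    OGMA/LUGH/SOMNUS databases."""

    TARGET_SHMMAX = 8 * 1024 ** 3  # assumed target: 8 GB

    def current_shmmax():
        """Read the current kernel.shmmax value from /proc."""
        with open("/proc/sys/kernel/shmmax") as f:
            return int(f.read().strip())

    if __name__ == "__main__":
        current = current_shmmax()
        if current >= TARGET_SHMMAX:
            print("kernel.shmmax is %d bytes; already at or above the target" % current)
        else:
            print("kernel.shmmax is %d bytes; would need raising to %d" % (current, TARGET_SHMMAX))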

Entries in GOC DB starting between 22nd December 2010 and 5th January 2011.

There were two unscheduled entries, one an "Outage" and the other an "At Risk". Both relate to the power (PDU) problem that occurred on 31st December.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
All Castor, FTS & 3D | UNSCHEDULED | OUTAGE | 01/01/2011 11:30 | 01/01/2011 15:00 | 3 hours and 30 minutes | Following a power issue yesterday we have stopped services while checks are made.
All Castor | UNSCHEDULED | AT_RISK | 31/12/2010 08:00 | 04/01/2011 16:00 | 4 days, 8 hours | Castor services At Risk owing to power supply problem.
srm-cert | SCHEDULED | OUTAGE | 24/12/2010 15:00 | 04/01/2011 10:00 | 10 days, 19 hours | This is a test SRM endpoint that will not have a guaranteed service over the holiday.