Tier1 Operations Report 2011-04-06


RAL Tier1 Operations Report for 6th April 2011

Review of Issues during the fortnight from 23rd March to 6th April 2011.

  • On Sunday 13th March one of the three LHCb SRM systems (lcgsrm0660) failed and was removed from the SRM triplet. On Thursday 31st March a replacement system was brought into use and the LHCb SRMs are now back at full strength with three nodes in production.
  • On Thursday 14th March one of the top-bdii nodes failed to boot correctly after it was accidentally power cycled. Some lookups would have failed until the DNS was updated to remove this node from the top-bdii set. The cause of this is now understood and a fix has been rolled out. (A reachability check of the sort sketched after this list can confirm which nodes in the set are responding.)
  • Summary of changes made during the last fortnight:
    • Castor upgrade to version 2.1.10 was carried out on Monday 28th March (CMS instance) and 30th March (Atlas, LHCb & GEN instances).
    • On Wednesday 30th March the CVMFS client on the Worker nodes was upgraded to version 0.2.61.
    • One tranche of the 2010 purchase of Worker nodes has been added to the batch capacity since the last meeting.
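
The top-level BDII sits behind a round-robin DNS alias, so a node that fails to boot keeps receiving a share of lookups until the DNS is updated. A minimal reachability check across such an alias, written as a Python sketch, might look like the following; the alias name is hypothetical and port 2170 is the usual BDII LDAP port:

  import socket

  ALIAS = "topbdii.example.org"   # hypothetical round-robin alias for the top-bdii set
  PORT = 2170                     # standard BDII LDAP port

  def check_alias(alias, port, timeout=5.0):
      """Resolve every address behind the alias and test a TCP connect to each."""
      addresses = sorted({info[4][0] for info in
                          socket.getaddrinfo(alias, port, socket.AF_INET,
                                             socket.SOCK_STREAM)})
      results = {}
      for addr in addresses:
          s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
          s.settimeout(timeout)
          try:
              s.connect((addr, port))
              results[addr] = "responding"
          except (socket.timeout, socket.error) as exc:
              # A node that has failed to boot (as in the incident above) shows up here.
              results[addr] = "unreachable (%s)" % exc
          finally:
              s.close()
      return results

  if __name__ == "__main__":
      for addr, state in sorted(check_alias(ALIAS, PORT).items()):
          print("%s:%d %s" % (addr, PORT, state))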

Current operational status and issues.

  • On the evening of Tuesday 15th March a read-only file system was reported on GDSS426 (AtlasDataDisk - D1T0). The server was put back into production in read-only mode during the afternoon of Wednesday 16th March. On 23rd March it was put into draining mode. The drain completed on 28th March and since then the system has been out of production while it undergoes tests.
  • On Monday morning 28th March GDSS502 (GEN_TAPE) failed and was withdrawn from production. Following investigations one T2K file was declared lost.
  • On Friday 4th April GDSS481 & GDSS488 (both AtlasDataDisk) were put into 'draining'. This is a preventative measure, as these two nodes were showing an anomaly on their RAID arrays similar to that exhibited by GDSS502 before its failure.
  • There were significant problems with the site BDIIs over the period 3rd to 5th April. The cause is understood and a fix is being worked on.
  • Following a report from Atlas of failures accessing the LFC at RAL, we have been investigating intermittent network issues at RAL that are believed to be the underlying cause. These problems are being tracked by the site networking team. A test running over the backup SuperJanet link to London was carried out this morning (6th April) as part of this investigation.
  • Atlas have reported slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetric performance). The investigation into this is ongoing.
  • We are aware of a problem with the Castor Job Manager that can occasionally hang. The service recovers by itself after around 30 minutes. This problem may have been fixed by the Castor 2.1.10 update. However, the automated capture of additional diagnostic information remains in place and we await more information from it should the problem recur (see the sketch after this list).
  • We note that the introduction of isolating transformers into the power feeds to the disk arrays for the Oracle Databases several weeks ago has been successful, with no errors reported since.
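
The automated diagnostic capture mentioned above is, in outline, a watchdog that notices when the service stops answering and records machine state before it recovers. The Python sketch below illustrates the idea only; the host name, port and snapshot command are illustrative assumptions, not the actual RAL monitoring:

  import socket
  import subprocess
  import time

  HOST = "castor-jobmanager.example.org"  # hypothetical host; not the actual RAL node name
  PORT = 15011                            # hypothetical port for the job manager service
  CHECK_INTERVAL = 60                     # seconds between probes
  DUMP_DIR = "/tmp"                       # where diagnostic snapshots are written

  def port_open(host, port, timeout=10.0):
      """Return True if a TCP connection to host:port succeeds within the timeout."""
      s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      s.settimeout(timeout)
      try:
          s.connect((host, port))
          return True
      except (socket.timeout, socket.error):
          return False
      finally:
          s.close()

  def capture_diagnostics():
      """Write a timestamped process listing to disk for later analysis."""
      stamp = time.strftime("%Y%m%d-%H%M%S")
      path = "%s/jobmanager-hang-%s.txt" % (DUMP_DIR, stamp)
      with open(path, "w") as out:
          subprocess.call(["ps", "auxww"], stdout=out)
      return path

  if __name__ == "__main__":
      while True:
          if not port_open(HOST, PORT):
              print("job manager not answering; snapshot written to %s" % capture_diagnostics())
          time.sleep(CHECK_INTERVAL)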

Declared in the GOC DB

  • Warning on the whole site during the power work in the Atlas building this coming weekend (9/10 April).

Advance warning:

The following items are being discussed and are still to be formally scheduled:

  • Updates to Site Routers (the Site Access Router and the UKLight router) are required.
  • Castor SRM upgrades to version 2.10
  • Upgrade of Castor clients on the Worker Nodes to version 2.1.10, and the Atlas request to add xrootd libraries to the Worker Nodes.
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Minor Castor update to enable access to T10KC tapes.
  • Networking upgrade to provide sufficient bandwidth for T10KC tapes.
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure.

Entries in GOC DB starting between 23rd March and 6th April 2011.

There was one unscheduled entry in this period.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-alice, srm-atlas, srm-cert, srm-dteam, srm-hone, srm-ilc, srm-lhcb, srm-mice, srm-minos, srm-preprod, srm-superb, srm-t2k | SCHEDULED | OUTAGE | 30/03/2011 09:00 | 30/03/2011 14:15 | 5 hours and 15 minutes | Upgrade of Atlas, LHCb and GEN Castor instances to version 2.1.10-0
srm-alice, srm-atlas, srm-cert, srm-dteam, srm-hone, srm-ilc, srm-lhcb, srm-mice, srm-minos, srm-preprod, srm-superb, srm-t2k | UNSCHEDULED | OUTAGE | 30/03/2011 08:00 | 30/03/2011 09:00 | 1 hour | Upgrade of Atlas, LHCb and GEN Castor instances to version 2.1.10-0. Adding short downtime to compensate for summer time.
srm-cms | SCHEDULED | OUTAGE | 28/03/2011 09:00 | 28/03/2011 17:00 | 8 hours | Upgrade of CMS Castor instance to version 2.1.10-0