Tier1 Operations Report 2011-04-13


RAL Tier1 Operations Report for 13th April 2011

Review of Issues during the week from 6th to 13th April 2011.

  • On Friday 4th April both GDSS481 & GDSS488 (both Atlas Datadisk) were put into ‘draining’ as a preventative measure. Following the drain the RAID arrays were rebuilt and the two servers were returned to production on Thursday 7th April.
  • One of the five nodes in the Top-BDII set failed on Thursday morning (7th April). The system was restarted and was available again later that morning.
  • GDSS426 (AtlasDataDisk - D1T0) was returned to production on Thursday 7th April. It had failed with a Read Only file system on 15th March. The server had been run in read-only mode for a while, then drained before having the RAID array rebuilt.
  • Overnight Thursday-Friday (7/8 April) GDSS103 (AtlasFarm - D0T1) crashed. There were no files on it awaiting migration to tape. As this server is part of a batch of systems due for withdrawal from service, it has been replaced and retired from production.
  • On Friday evening, 8th April, there were problems with the Atlas Software Server which was struggling under load. The Atlas batch load was throttled back temporarily.
  • On Saturday evening, 9th April, one of the Network Stacks (Stack 15) encountered a problem; this is the second such occurrence on this stack. The problem was resolved at around midday on Sunday (10th). During this time a number of services were degraded (e.g. a reduced number of LFC front ends) and the LHCb SRM was not available.
  • On Monday 11th April GDSS502 (GEN_TAPE) was replaced by another server. GDSS502 had failed on 28th March. All but one of the un-migrated files were copied off. As announced at the last meeting, one T2K file was lost from this server. A Post Mortem (currently in draft) for this data loss is being prepared and can be found at:
 https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20110330_Disk_Server_GDSS502_Data_Loss_T2K
  • Problems with the site BDIIs were reported at the last meeting. The cause of these, an interaction between a change in the username used to run the services and Quattor, is now understood and has been resolved.
  • Summary of changes made during the last fortnight:
    • The second tranche of the 2010 purchase of Worker nodes was added to the batch capacity on the 7th April.
    • On Tuesday 12th April the LHCb SRM was upgraded to version 2.10-2.
    • On Wednesday 13th April the CMS SRM was upgraded to version 2.10-2.
    • LHCb batch work has been switched to use CVMFS for obtaining the LHCb software. This was initially done as a test at the end of last week and over the weekend, but has now been confirmed as a production service.

Current operational status and issues.

  • Some short breaks in connectivity within the Tier1 network have been seen on recent nights. These have mainly affected Tier1 internal processes. The cause of these is not yet understood.
  • Following a report from Atlas of failures accessing the LFC at RAL, we have been investigating network issues at RAL that are believed to be the underlying cause. These problems are intermittent and continue to be tracked by the site networking team.
  • Atlas have reported slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetric performance). The investigation into this is ongoing.
  • The long-standing problem with the Castor Job Manager occasionally hanging has not been seen since the Castor 2.1.10 update.

Declared in the GOC DB

  • SRM 2.10-2 update for Atlas - Thursday 14th April
  • SRM 2.10-2 update for GEN - Friday 15th April

Advance warning:

The following items are being discussed and are still to be formally scheduled:

  • Updates to Site Routers (the Site Access Router and the UKLight router) are required.
  • Upgrade of the Castor clients on the Worker Nodes to version 2.1.10, and the Atlas request to add xrootd libraries to the worker nodes.
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Minor Castor update to enable access to T10KC tapes.
  • Networking upgrade to provide sufficient bandwidth for T10KC tapes.
  • Microcode updates for the tape libraries are due.
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure.

Entries in GOC DB starting between 6th and 13th April 2011.

There were two unscheduled entries during this period:

  • The failure of the network switch stack at the weekend caused the lhcb-srm to be unavailable.
  • The SRM update for srm-lhcb overran.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-cms | SCHEDULED | OUTAGE | 13/04/2011 11:00 | 13/04/2011 13:00 | 2 hours | Upgrade of CMS SRM to version 2.10-2
srm-lhcb | UNSCHEDULED | OUTAGE | 12/04/2011 13:00 | 12/04/2011 14:00 | 1 hour | Extending outage for upgrade of LHCb SRM to version 2.10-2. Upgrade largely done but need to verify all is OK.
srm-lhcb | SCHEDULED | OUTAGE | 12/04/2011 11:00 | 12/04/2011 13:00 | 2 hours | Upgrade of LHCb SRM to version 2.10-2
srm-lhcb | UNSCHEDULED | OUTAGE | 09/04/2011 20:00 | 10/04/2011 13:00 | 17 hours | Added retrospectively. srm-lhcb was unavailable following the failure of a network switch stack.
Whole site | SCHEDULED | WARNING | 09/04/2011 09:00 | 10/04/2011 18:00 | 1 day, 9 hours | Systems at risk during power work in the building hosting networking equipment.