Tier1 Operations Report 2011-12-21


RAL Tier1 Operations Report for 21st December 2011

Review of Issues during the week 14th to 21st December 2011.

  • On the evening of Wednesday 14th Dec. there were two outages of the site network: one from around 21:00 to 22:05, the second from 23:15 to 00:15. Network staff were called into site both times. Following these breaks, one of the DNS servers used by the Tier1 was showing very slow response times. The Atlas SRM did not recover from these problems, which also appeared to trigger a further problem with the Atlas SRM database. Following investigations on Thursday morning the Atlas SRM database was migrated to new hardware. (This is hardware that had previously been prepared and to which we are planning to migrate all Castor databases on 5th January.) The service was finally restored early in the afternoon. Including the network breaks, the Atlas SRM was unavailable for a total of 18 hours. A Post Mortem is being prepared for this incident and can be seen at:

https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20111215_Network_BReak_Atlas_SRM_DB

  • On Friday morning (16th Dec) there was a hang of the PLUTO database, which resulted in an outage of around three hours for the CMS and GEN Castor instances. This was resolved by the database team around 09:30.
  • On Saturday morning (17th Dec) there was a further outage of the site network link between around 04:15 and 05:00. This was fixed by networking staff who were called in. A board in the Site Access Router was replaced. This incident did not cause knock-on problems for the Tier1 (as compared with the outages on 14/15 Dec.)
  • On Monday (19th) it was found that the CIP was reporting stale information. This was a consequence of the move of the Atlas SRM database on the 15th and was immediately fixed.

Resolved Disk Server Issues

  • Early in the afternoon of Thursday 15th Dec. GDSS266 (LHCbDst – D1T0) was taken out of production with a double disk failure, following which the rebuilds were going very slowly. It was returned to production around 09:30 the following morning.
  • Around 05:00 on Sunday (18th Dec) GDSS568 (AtlasDataDisk - D1T0) failed with a read-only filesystem on its system drives. Following a disk replacement it was returned to service around 10:00 on Monday (19th).

Current operational status and issues.

  • None, but note the Tier1 Plans for the Christmas & New Year Holidays as detailed in the blog post at:

http://www.gridpp.rl.ac.uk/blog/2011/12/20/ral-tier1-%E2%80%93-plans-for-christmas-new-year-holiday/

Ongoing Disk Server Issues

  • GDSS332 (LHCbDst) failed with a read-only filesystem around 10:10 this morning. It is currently under investigation.

Notable Changes made this last week

  • None

Forthcoming Work & Interventions

  • Thursday 5th January. First step of the migration of the Oracle database infrastructure. This is the move of the Castor databases and will cause an outage of Castor for some hours (with a batch drain beforehand). The opportunity will also be taken to move some racks of older nodes (required to create space for new deliveries).
  • Monday 9th January. Lcgui01 will be updated to the UMD version of the UI.
  • Tuesday 10th January: Update of final pair of DNS servers at RAL to new hardware.

Declared in the GOC DB

  • None

Advanced warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced. There is a significant amount of work that will need to be done during the LHC stop at the start of 2012.

  • Infrastructure:
    • Move part of the cooling system onto the UPS supply. (Should not require service interruption.)
  • Databases:
    • Regular Oracle "PSU" patches are pending.
    • Switch Castor and LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Castor 2.1.11 upgrade.
    • SRM 2.11 upgrade.
    • Replace hardware running Castor Head Nodes.
    • Move to use Oracle 11g.
    • Update to the Castor Information Provider (CIP).
  • Networking:
    • Changes required to extend range of addresses that route over the OPN.
    • Re-configure networking in UPS room.
    • Install new Routing & Spine layers.
    • Final updates to the RAL DNS infrastructure (two DNS servers still to be replaced).
  • Fabric:
    • BIOS/firmware updates; equipment moves in the machine room (consolidation of equipment in racks; some rack moves); other re-configurations (adding IPMI cards, etc.).
  • Grid Services:
    • Updates of Grid Services (including LB, APEL, batch server) to UMD versions (mainly in the new year).
  • VO:
    • Address permissions problem regarding Atlas User access to all Atlas data.
    • Migrate Atlas LFC to CERN.

Entries in GOC DB starting between 14th and 21st December 2011.

  • Castor CMS & GEN instances (srm-alice, srm-cms, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k): UNSCHEDULED OUTAGE, 16/12/2011 06:00 to 16/12/2011 09:00 (3 hours). Backdated entry to track an outage now fixed: a database hang-up caused an outage of some Castor instances.
  • srm-atlas: UNSCHEDULED WARNING, 15/12/2011 15:15 to 16/12/2011 12:00 (20 hours and 45 minutes). To fix the problem causing the outage of srm-atlas we have moved the database to new hardware. Declaring an At Risk (Warning) overnight on this service.
  • srm-atlas: UNSCHEDULED OUTAGE, 15/12/2011 11:00 to 15/12/2011 15:15 (4 hours and 15 minutes). Following network problems overnight we have a problem with srm-atlas. Although the network issues are resolved, a subsequent database issue is stopping srm-atlas working.
  • srm-atlas: UNSCHEDULED OUTAGE, 14/12/2011 22:05 to 15/12/2011 11:00 (12 hours and 55 minutes). Following two site outages, problems are persisting on srm-atlas.
  • Whole site: UNSCHEDULED OUTAGE, 14/12/2011 21:00 to 14/12/2011 22:05 (1 hour and 5 minutes). Networking failure caused a site outage. Staff called out to fix.

Open GGUS Tickets

  • 77528 (Green, Less urgent, In Progress; created 2011-12-16, last updated 2011-12-16; VO: H1): hone jobs cannot be submitted through lcgwms03.gridpp.rl.ac.uk wms-server.
  • 77026 (Red, Less urgent, On Hold; created 2011-11-29, last updated 2011-12-15): BDII.
  • 74353 (Red, Very urgent, In Progress; created 2011-09-16, last updated 2011-12-07; VO: Pheno): Proxy not renewing properly from WMS.
  • 68853 (Red, Less urgent, On Hold; created 2011-03-22, last updated 2011-12-15): Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s).
  • 68077 (Red, Less urgent, In Progress; created 2011-02-28, last updated 2011-09-20): Mandatory WLCG InstalledOnlineCapacity not published.
  • 64995 (Red, Less urgent, In Progress; created 2010-12-03, last updated 2011-09-20): No GlueSACapability defined for WLCG Storage Areas.