Tier1 Operations Report 2012-05-09

From GridPP Wiki

RAL Tier1 Operations Report for 9th May 2012

Review of Issues during the week 2nd to 9th May 2012
  • On Wednesday (2nd May) LHCb reported a problem with file transfers that was traced to the way a user was mapped to a VO by some of the FTS web front ends.
  • During the early hours of this morning (Wednesday 9th May) there was a problem with a network switch within the Tier1 network. This in turn caused a problem for the whole of the switch stack that it is in. Staff attended on-site. The batch system was particularly affected and a four-hour outage was declared on the CEs.
Resolved Disk Server Issues
  • GDSS596 (AtlasDataDisk - D1T0) was out of production for a few hours on Thursday (3rd May) to enable a disk to start rebuilding.
  • GDSS607 (LHCbDst - D1T0) failed with FSProbe errors on Friday evening (4th May). It was left in "readonly" mode for the weekend and since then a start has been made on draining the disk server ahead of further work.
Current operational status and issues
  • Investigations into an ongoing communications problem between the CEs and the batch server continue.
  • There have been no further problems in the last week on the UKLight-SAR link although we will continue to track this here.
  • There is a known problem with the handling of some certificates within FTS that is currently causing problems for LHCb FTS transfers.
  • Two worker nodes with the newer EMI release of the software have been tested although some problems were uncovered.
Ongoing Disk Server Issues
  • None
Notable Changes made this last week
  • Five disk servers deployed to Alice (AliceDisk) (will replace five older, smaller capacity servers).
  • Ten disk servers have been deployed for LHCb (LHCbDst).
Forthcoming Work & Interventions
  • Castor will move to use the new "Tape Gateway" and "Transfer Manager" features.
  • Wednesday 16th May: Update LFC front ends (except LHCb) to glite version 1.8.2
  • Some modified WAN tuning settings are being rolled out across disk servers.
  • The ganglia server will be replaced.
Declared in the GOC DB
  • Thursday 10th May - Short interruption to the Castor GEN instance as it is reconfigured to use the newer Castor Tape Gateway.
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • Databases:
    • Regular Oracle "PSU" patches are pending.
    • Switch LFC/FTS/3D to new Database Infrastructure.
    • Update LFC/FTS databases to Oracle 11.
  • Castor:
    • Update the Castor Information Provider (CIP) (Need to re-schedule.)
    • Move to use Oracle 11g (requires a minor Castor update to version 2.1.11-9).
    • Upgrade to version 2.1.12.
  • Networking:
    • Install new Routing & Spine layers for Tier1 network.
    • Main RAL network updates - early summer.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • Updates of Grid Services (including WMS, LFC front ends) to EMI/UMD versions.
  • Infrastructure:
    • The electricity supply company plans to work on the main site power supply for 6 months commencing 14th May. This involves powering off one half of the resilient supply for 3 months while it is overhauled, then repeating the process with the other half.
Entries in GOC DB starting between 2nd and 9th May 2012

There was one unscheduled entry in the GOC DB during this period. This was for the network switch problem in the early hours of Wednesday 9th May.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
All CEs | UNSCHEDULED | OUTAGE | 09/05/2012 05:00 | 09/05/2012 09:01 | 4 hours and 1 minute | Outage on the CEs due to a network switch failure affecting parts of the batch system
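The duration quoted for the entry above can be cross-checked from its start and end timestamps. The following is a minimal sketch, not part of the GOC DB tooling: the function name outage_duration and the assumed DD/MM/YYYY HH:MM timestamp format are illustrative choices based on how the entry is printed here.

```python
from datetime import datetime


def outage_duration(start: str, end: str, fmt: str = "%d/%m/%Y %H:%M") -> str:
    """Return the elapsed time between two timestamps as 'H hours and M minutes'.

    The default format (day/month/year hour:minute) is an assumption taken
    from how the entry is displayed in this report.
    """
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)

    def plural(count: int, unit: str) -> str:
        # Use the singular form only when the count is exactly one.
        return f"{count} {unit}" + ("" if count == 1 else "s")

    return f"{plural(hours, 'hour')} and {plural(minutes, 'minute')}"


if __name__ == "__main__":
    # Timestamps copied from the CE outage entry.
    print(outage_duration("09/05/2012 05:00", "09/05/2012 09:01"))
```

For the CE outage this gives 4 hours and 1 minute, which agrees with the four-hour outage described in the review of issues.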
Open GGUS Tickets
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
81669 | Yellow | Less Urgent | Assigned | 2011-04-27 | 2012-05-09 | NA62 | FTS channel for na62.vo.gridpp.ac.uk
68853 | Red | Less Urgent | On hold | 2011-03-22 | 2012-04-20 | | Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)