Tier1 Operations Report 2012-06-06

RAL Tier1 Operations Report for 6th June 2012

Review of Issues during the week 30th May to 6th June 2012
  • Note that we have just had a four-day 'weekend' to celebrate the Queen's Jubilee. Overall operations continued OK through the weekend. There was a problem with one of the Top BDIIs on Sunday (3rd June), which was resolved by the on-call person. (Also note one disk server issue reported below.)
  • Last Wednesday (30th May) inaccessible files reported by ALICE were found to be caused by timeouts within the xrootd manager. This was resolved by increasing the timeout threshold.
  • There was a problem with Castor overnight Thursday-Friday (31st May - 1st June) caused by some systems running out of memory. This was triggered by the Castor DLF database being taken down earlier in the week, after which the relevant daemon started to consume memory. This particularly affected LHCb.
  • There was a problem with the CMS tape migrations at the end of last week. A significant backlog (around 12k files) built up. The problem was understood and fixed on Friday (1st June) and the backlog processed by the end of Saturday.
Resolved Disk Server Issues
  • GDSS496 (CMSTape - D0T1) had a problem in the early hours of Saturday 2nd June. The problem was traced to the RAID interface card hanging up. The server was returned to production later that morning.
Current operational status and issues
  • Investigations into an ongoing communications problem between the CEs and the batch server continue.
  • There is a known problem with the handling of some certificates within FTS that is currently causing problems for outgoing CMS FTS transfers.
  • There are problems with the Top BDIIs that are being investigated and worked around (an illustrative responsiveness check is sketched after this list).
  • A regular test this morning (Wed. 6th June) failed to start the backup diesel generator. Specialists are being called in to investigate. This means that, should there be a general power failure in the meantime, we would not have diesel generator backup power.
  • WMS03 is currently out of service for database maintenance and service re-configuration.
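The Top BDII issue above is being chased by hand; for illustration, the sketch below shows the kind of basic responsiveness check that can be run against a Top BDII. It is not the actual procedure used at RAL: it assumes the Python ldap3 library and a placeholder hostname, and simply performs an anonymous query of the standard BDII LDAP endpoint (port 2170, base o=grid) to count the GlueService entries published.

    from ldap3 import Server, Connection

    # Placeholder hostname for illustration only - not the real RAL Top BDII alias.
    TOP_BDII_HOST = "top-bdii.example.org"

    def check_top_bdii(host=TOP_BDII_HOST):
        """Anonymously query a Top BDII and report how many GlueService entries it publishes."""
        server = Server(host, port=2170)           # BDIIs publish GLUE data over LDAP on port 2170
        conn = Connection(server, auto_bind=True)  # anonymous bind is enough for read-only BDII queries
        conn.search(search_base="o=grid",          # standard GLUE 1.3 base DN
                    search_filter="(objectClass=GlueService)",
                    attributes=["GlueServiceEndpoint"])
        print("%s published %d GlueService entries" % (host, len(conn.entries)))
        conn.unbind()

    if __name__ == "__main__":
        check_top_bdii()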
Ongoing Disk Server Issues
  • GDSS374 (AtlasTape - D0T1) and GDSS607 (LHCbDst - D1T0) are both drained and undergoing re-acceptance testing following earlier failures.
Notable Changes made this last week
  • On Wednesday (30th May) all BDIIs (Site & Top) were upgraded to the latest EMI version (EMI-1 update 15).
  • On Thursday (31st May) the CMS Castor instance was successfully changed to use the new "Transfer Manager" scheduler. (The LHCb & GEN instances were done earlier in the week.)
  • This morning (Wednesday 6th June) a new version of the Castor Information Provider (CIP) was brought into service.
  • Errata and kernel updates are being deployed on worker nodes.
  • On Friday (1st June) ALICE were given access to the grid3000M queue.
Declared in the GOC DB
  • Deploy Transfer Manager for the Atlas Castor instance on Thursday 7th June 2012, 09:00-11:00 (affects ATLAS).
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

The following items will particularly affect services:

  • Castor 2.1.11-9 update (provisionally - Wed 13th June).
  • Update LFC/FTS databases to Oracle 11 (provisionally - Wed 13th June).
  • Replacement of site access router on Tuesday 19th June.
  • Castor Oracle 11 update. (provisionally Wed 27th June).

Listing by category:

  • Databases:
    • Regular Oracle "PSU" patches are pending.
    • Switch LFC/FTS/3D to new Database Infrastructure.
    • Update LFC/FTS databases to Oracle 11.
  • Castor:
    • Move to use Oracle 11g (requires a minor Castor update to version 2.1.11-9).
    • Upgrade to version 2.1.12.
  • Networking:
    • One step still remains in extending the IP address range used for disk servers that will use the OPN.
    • Install new Routing & Spine layers for Tier1 network.
    • Main RAL network updates - early summer. There is now a firm date of 19th June for upgrading the Site Access Router. The replacement of the UKLight Router will follow.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • A reconfiguration and maintenance operation is scheduled for lcgwms03 (non-LHC WMS) from 1st to 7th June.
    • Updates of Grid Services (including WMS, LFC front ends) to EMI/UMD versions.
  • Infrastructure:
    • The electricity supply company plan to work on the main site power supply for 6 months, commencing 18th June. This involves powering off one half of the resilient supply for 3 months while it is overhauled, then repeating with the other half.


Entries in GOC DB starting between 30th May and 6th June 2012

There were no unscheduled outages during the last week.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgwms03 | SCHEDULED | OUTAGE | 01/06/2012 12:00 | 07/06/2012 14:00 | 6 days, 2 hours | database maintenance and service re-configuration
Castor CMS instance: srm-cms | SCHEDULED | OUTAGE | 31/05/2012 09:00 | 31/05/2012 11:00 | 2 hours | downtime to upgrade CMS castor instance to use Transfer Manager
Castor GEN instance: srm-alice, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k | SCHEDULED | OUTAGE | 30/05/2012 09:00 | 30/05/2012 11:00 | 2 hours | downtime to upgrade GEN castor instance to use Transfer Manager
Open GGUS Tickets
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
68853 | Red | Less Urgent | On hold | 2011-03-22 | 2012-04-20 | | Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)
82740 | Amber | Less Urgent | Waiting Reply | 2012-05-31 | 2012-05-31 | Biomed | CREAM CE lcgce05.gridpp.rl.ac.uk is not working for VO biomed