Tier1 Operations Report 2012-06-13

RAL Tier1 Operations Report for 13th June 2012

Review of Issues during the week 6th to 13th June 2012
  • The backup diesel generator failed its regular test on Wednesday 6th June (it would not start). Specialists were called in the next day and the problem was resolved.
  • Following the update of the CIP on 6th June we started failing the OPS VO test for srm-cms (other CMS tests ran OK and there was no operational effect). The new CIP declared that srm-cms supported gridftp and rfio, but rfio was not actually enabled in the SRM, nor is it used by CMS. Enabling rfio resolved the problem on Friday 8th June. (An illustrative sketch of checking the protocols published for an SRM endpoint is given after this list.)
  • During the morning of Friday 8th June a problem with one of the FTS agent nodes caused file transfer failures, particularly to/from the RAL Tier1. This was initially thought to be a more general problem and an outage was incorrectly declared for the SRM endpoints. Further investigation traced the problem to the FTS, which was then put into a downtime for a while.
  • A problem with DNS lookups for nodes at Fermilab was investigated and fixed by the central networking team during the morning of Friday 8th June. DNSSEC data presented by Fermilab were causing recursive DNS queries to fail. (A diagnostic sketch is given after this list.)
  • There was a break in our network connectivity of around 15 minutes at about 17:30 on Friday 8th June.
  • There was a short networking break to the Tier1 offices from around 14:30 to 14:50 on Tuesday 12th June. Whilst the main Tier1 links stayed up, there was some knock-on effect: a number of FTS transfers failed.
  • The reported problems with the Top BDIIs have been effectively worked around.
  • The known problem with the handling of some certificates/proxies within the FTS is now understood by the developers. It is not causing significant problems for us at the moment.
  • The communications problem between the CEs and the batch server remains but at a lower level and will no longer be tracked here.
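A minimal sketch of the kind of check relevant to the srm-cms item above: querying a BDII for the access protocols a CIP publishes for an SRM endpoint, so that a mismatch with what the SRM actually has enabled can be spotted. This is illustrative only; the hostname, port and base DN are assumptions, not the procedure actually used.

```python
# Query a BDII over LDAP (GLUE 1.3 schema) and list the access protocols
# published for a given storage element. Hostname, port and base DN are
# assumptions for illustration.
from ldap3 import ALL, Connection, Server

BDII_HOST = "site-bdii.gridpp.rl.ac.uk"   # assumed site BDII hostname
BDII_PORT = 2170                          # standard BDII LDAP port
BASE_DN = "mds-vo-name=resource,o=grid"   # assumed GLUE 1.3 base DN
SE_ID = "srm-cms.gridpp.rl.ac.uk"         # storage element of interest

server = Server(BDII_HOST, port=BDII_PORT, get_info=ALL)
conn = Connection(server, auto_bind=True)  # anonymous bind, as is usual for a BDII

# GlueSEAccessProtocol entries are tied to their storage element via GlueChunkKey.
conn.search(
    search_base=BASE_DN,
    search_filter=(
        "(&(objectClass=GlueSEAccessProtocol)"
        f"(GlueChunkKey=GlueSEUniqueID={SE_ID}))"
    ),
    attributes=["GlueSEAccessProtocolType", "GlueSEAccessProtocolVersion"],
)

for entry in conn.entries:
    print(entry.GlueSEAccessProtocolType, entry.GlueSEAccessProtocolVersion)
```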
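Similarly, for the Fermilab DNS item, a diagnostic along the following lines can distinguish a failure in the local validating resolvers from a problem with the name itself: resolve the same name via the local recursive resolver and via an independent one, then compare. The hostname and resolver addresses below are assumptions for illustration.

```python
# Compare a DNS lookup through the local recursive resolver with one through an
# independent resolver. If the local (DNSSEC-validating) path fails, e.g. with
# SERVFAIL, while the other succeeds, bad DNSSEC data upstream is a likely cause.
import dns.exception
import dns.resolver

HOSTNAME = "cmssrm.fnal.gov"  # example Fermilab hostname (assumption)
RESOLVERS = {
    "local recursive resolver": ["130.246.0.1"],  # assumed site resolver address
    "independent resolver": ["8.8.8.8"],
}

for label, nameservers in RESOLVERS.items():
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = nameservers
    resolver.lifetime = 5.0
    try:
        answer = resolver.resolve(HOSTNAME, "A")
        print(f"{label}: {[rr.address for rr in answer]}")
    except dns.exception.DNSException as exc:
        print(f"{label}: lookup failed ({exc.__class__.__name__})")
```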
Resolved Disk Server Issues
  • GDSS431 (AtlasDataDisk - D1T0) failed on Friday 8th June. The disk controller had a problem after a disk drive went faulty. It was returned to production the following day (Sat. 9th June).
  • GDSS496 (CMSTape - D0T1) failed on Sunday 10th June. It was returned to production yesterday (12th June) after its disk controller was replaced.
  • GDSS374 (AtlasTape - D0T1) passed its acceptance testing following its earlier failure and was returned to production yesterday (12th June).
Current operational status and issues
  • On Tuesday 12th June the first stage of the electrical switching in preparation for the work on the main site power supply took place. This initial switching should be completed today (13th June). The work on the two transformers is expected to last until 18th December.
  • There is still a problem with the reporting of disk capacity; this is being followed up.
Ongoing Disk Server Issues
  • GDSS607 (LHCbDst - D1T0) has been drained and, after further problems, is still undergoing re-acceptance testing following its earlier failures.
Notable Changes made this last week
  • The Atlas Castor instance was switched to use the Transfer Manager on Thursday 7th June. All four Castor instances are now running the Transfer Manager.
  • Errata and kernel updates are being deployed on worker nodes.
  • WMS03 has been returned to service following database maintenance and service re-configuration.
  • Castor upgrade to version 2.1.11-9 completed this morning (13th June).
  • A missing step required to extend the IP address range used for disk servers that will use the OPN has been completed.
Declared in the GOC DB
  • Castor 2.1.11-9 update (Wed 13th June).
  • Update LFC/FTS databases to Oracle 11 (Wed 13th June).
  • Replacement of site access router and add two new routers into the Tier1 network (Tues 19th June).
  • Drain and re-install of WMS01 (Thu 14 - Wed 20 June).
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

The following items will particularly affect services:

  • Castor Oracle 11 update. (provisionally Wed 27th June).

Listing by category:

  • Databases:
    • Regular Oracle "PSU" patches are pending.
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Move to use Oracle 11g (requires a minor Castor update to version 2.1.11-9).
    • Upgrade to version 2.1.12.
  • Networking:
    • Install new Routing & Spine layers for Tier1 network.
    • Main RAL network updates - early summer. There is now a firm date of 19th June for upgrading the Site Access Router. The replacement of the UKLight Router will follow.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • Reconfiguration and maintenance operation will be scheduled for lcgwms03 (non-LHC WMS) from 1-7 June.
    • Updates of Grid Services (including WMS, LFC front ends) to EMI/UMD versions.
  • Infrastructure:
    • The electricity supply company plan to work on the main site power supply for 6 months commencing 18th June. This involves powering off one half of the resilient supply for three months while it is overhauled, then repeating the process with the other half.


Entries in GOC DB starting between 6th and 13th June 2012

There were two unscheduled outages during the last week. Both relate to the problems on Friday morning (8th June) which were traced to the FTS.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgfts.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 13/06/2012 08:00 | 13/06/2012 17:00 | 9 hours | FTS and LFC services down for update of back-end database to Oracle 11.
All CEs and all Castor end points | SCHEDULED | OUTAGE | 13/06/2012 08:00 | 13/06/2012 16:00 | 8 hours | Castor and Batch (CEs) unavailable during Castor update to version 2.1.11-9.
All Castor (SRM) end points | UNSCHEDULED | OUTAGE | 08/06/2012 08:00 | 08/06/2012 10:40 | 2 hours and 40 minutes | We are investigating a problem that is causing a very high rate of file transfer failures to and from the RAL Tier1.
lcgfts | UNSCHEDULED | OUTAGE | 08/06/2012 08:00 | 08/06/2012 13:00 | 5 hours | Investigating file transfer problems that appear more related to the FTS than the SRMs.
srm-atlas.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 07/06/2012 09:00 | 07/06/2012 11:00 | 2 hours | Downtime to upgrade the Atlas Castor instance to use the Transfer Manager.
lcgwms03 | SCHEDULED | OUTAGE | 01/06/2012 12:00 | 07/06/2012 14:00 | 6 days, 2 hours | Database maintenance and service re-configuration.


Open GGUS Tickets
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
68853 | Red | Less Urgent | On hold | 2011-03-22 | 2012-04-20 | | Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)