Tier1 Operations Report 2014-05-28

From GridPP Wiki

RAL Tier1 Operations Report for 28th May 2014

Review of Issues during the week 21st to 28th May 2014.
  • Maintenance on the diesel generator was carried out as planned on the morning of Thursday 22nd May.
  • There were problems with tape access late on Tuesday and on Wednesday (20/21 May). A new tape controller server (ACSLS) had been put into operation on the Tuesday morning. The change was reverted on Wednesday afternoon by putting the old server back into service.
  • This morning's planned network reconfiguration, carried out with an 'at risk' on Castor, ran into some problems that caused a break in access to some disk servers for around 20 minutes. The network change itself was carried out to completion.
Resolved Disk Server Issues
  • None.
Current operational status and issues
  • The grumbling problems with the WMSs reported last week are ongoing. The developers have been contacted.
Ongoing Disk Server Issues
  • None.
Notable Changes made this last week.
  • Completed the rollout of CVMFS client version 2.1.19 to the whole farm.
  • This morning (28th May) arc-ce01 was updated to version 4.1.0-1.
  • This morning (28th May) two network switches that provide connectivity to some Castor disk servers were moved to the mesh network.
Declared in the GOC DB
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
All Castor (SRMs) and batch (CEs) | SCHEDULED | OUTAGE | 10/06/2014 08:50 | 10/06/2014 15:00 | 6 hours and 10 minutes | Castor and batch services down during upgrade of Castor Nameserver to version 2.1.14.
arc-ce02.gridpp.rl.ac.uk, arc-ce03.gridpp.rl.ac.uk | SCHEDULED | WARNING | 02/06/2014 10:00 | 02/06/2014 12:00 | 2 hours | Upgrade arc-ce02 and arc-ce03 to v. 4.1.0.
arc-ce01.gridpp.rl.ac.uk | SCHEDULED | WARNING | 28/05/2014 10:00 | 28/05/2014 12:00 | 2 hours | Upgrade of ARC CE to version 4.1.0.
All Castor (All SRM endpoints) | SCHEDULED | WARNING | 28/05/2014 09:30 | 28/05/2014 11:30 | 2 hours | At Risk on Castor (All SRM endpoints) during small internal network change.
lcgui02.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 30/04/2014 14:00 | 29/05/2014 13:00 | 28 days, 23 hours | Service being decommissioned.
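The Duration values in the entries above follow directly from the Start and End timestamps. As a minimal sketch for cross-checking them, assuming the timestamps are in day/month/year order as they appear here, the arithmetic can be reproduced as below. The helper names `downtime_duration` and `describe` are hypothetical and not part of any GOC DB tooling.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout: day/month/year hour:minute, as shown in the entries above.
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> timedelta:
    """Return the elapsed time between two timestamps in the assumed layout."""
    return datetime.strptime(end, GOCDB_FORMAT) - datetime.strptime(start, GOCDB_FORMAT)


def describe(delta: timedelta) -> str:
    """Render a timedelta as days, hours and minutes, skipping zero parts."""
    total_minutes = int(delta.total_seconds()) // 60
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    parts = []
    for value, unit in ((days, "day"), (hours, "hour"), (minutes, "minute")):
        if value:
            parts.append(f"{value} {unit}{'s' if value != 1 else ''}")
    return ", ".join(parts) if parts else "0 minutes"


if __name__ == "__main__":
    # Castor Nameserver upgrade outage
    print(describe(downtime_duration("10/06/2014 08:50", "10/06/2014 15:00")))
    # lcgui02 decommissioning outage
    print(describe(downtime_duration("30/04/2014 14:00", "29/05/2014 13:00")))
```

For the two outages above this gives 6 hours, 10 minutes and 28 days, 23 hours, in agreement with the declared durations.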
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • Dates for the Castor 2.1.14 upgrade: Nameserver: Tuesday 10th June (now in GOC DB); Stagers: CMS - Tue 17th June; LHCb - Thu 19th June; GEN - Tue 24th June; Atlas - Thu 26th June.
  • We are starting to plan the termination of the FTS2 service now that almost all use is on FTS3.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Castor 2.1.14 testing is largely complete, although a new minor version (2.1.14-12) will be released soon.
  • Networking:
    • Move the switches connecting recent disk server batches ('11, '12) onto the Tier1 mesh network.
    • Make routing changes to allow the removal of the UKLight Router.
  • Fabric:
    • We are phasing out the use of the software server used by the small VOs.
    • Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
    • There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 21st and 28th May 2014.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
arc-ce01.gridpp.rl.ac.uk | SCHEDULED | WARNING | 28/05/2014 10:00 | 28/05/2014 12:00 | 2 hours | Upgrade of ARC CE to version 4.1.0.
All SRMs (All Castor) | SCHEDULED | WARNING | 28/05/2014 09:30 | 28/05/2014 11:30 | 2 hours | At Risk on Castor (All SRM endpoints) during small internal network change.
lcgui02.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 30/04/2014 14:00 | 29/05/2014 13:00 | 28 days, 23 hours | Service being decommissioned.
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
105571 | Green | Less Urgent | In Progress | 2014-05-21 | 2014-05-27 | LHCb | BDII and SRM publish inconsistent storage capacity numbers
105405 | Yellow | Urgent | In Progress | 2014-05-14 | 2014-05-15 | | please check your Vidyo router firewall configuration
105308 | Yellow | Less Urgent | On Hold | 2014-05-11 | 2014-05-27 | Atlas | Jobs at RAL-LCG2_MCORE are failing with "Failed to open shared memory object: Permission denied"
105161 | Amber | Less Urgent | In Progress | 2014-05-05 | 2014-05-16 | H1 | hone jobs submitted into CREAM queues through lcgwms05.gridpp.rl.ac.uk & lcgwms06.gridpp.rl.ac.uk WMSs are are Ready status long time (more as 5 hours)
105100 | Red | Urgent | In Progress | 2014-05-02 | 2014-05-12 | CMS | T1_UK_RAL Consistency Check (May14)
98249 | Red | Urgent | Waiting Reply | 2013-10-21 | 2014-05-21 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1
Availability Report

Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud

Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment
21/05/14 | 100 | 100 | 100 | 100 | 100 | 98 | 99 |
22/05/14 | 100 | 100 | 100 | 100 | 100 | 99 | 100 |
23/05/14 | 100 | 100 | 100 | 100 | 100 | 97 | 99 |
24/05/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
25/05/14 | 100 | 100 | 100 | 100 | 100 | 99 | 100 |
26/05/14 | 100 | 100 | 100 | 100 | 100 | 98 | 100 |
27/05/14 | 100 | 100 | 99.1 | 100 | 100 | 96 | 100 | Single SRM Get test failure.