Tier1 Operations Report 2013-11-27

From GridPP Wiki
Revision as of 13:53, 27 November 2013 by Gareth smith

RAL Tier1 Operations Report for 27th November 2013

Review of Issues during the week 20th to 27th November 2013.
  • The problem with one batch of worker nodes that has been reported in previous weeks has been solved. These systems were put back in production on Thursday (21st Nov).
  • On Monday (25th November) the primary OPN link to CERN failed. The failover was not clean: the router at the CERN end switched to the backup link, but the router at the RAL end did not. Once the problem was identified, the primary link was forced down at the RAL end and all traffic ran over the backup link. The primary link was fixed the following morning and traffic was switched back to it.
  • On Monday (25th November) there was a problem with one of the hypervisor clusters that led to problems on some service machines that run as VMs there (FTS, Alice VO box, arc-ce03).
  • On Tuesday evening (26th November) there was a problem with one of the WMS systems, WMS05, caused by a user job filling up the available space.
Resolved Disk Server Issues
  • None.
Current operational status and issues
  • The Condor batch farm is running fine. Some scheduling tweaks have been applied in the light of experience (e.g. the Condor priority half-life was increased from 1 to 3 days).
  • The FTS3 testing continues. Two updates have been applied in this last week.
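To illustrate the effect of the half-life change mentioned above, the sketch below models recorded usage decaying exponentially with a given half-life. It is a simplified model, not HTCondor's actual priority calculation; the function name `decayed_usage` and the sample figures are assumptions made for illustration. In HTCondor the corresponding configuration setting is `PRIORITY_HALFLIFE`, expressed in seconds.

```python
def decayed_usage(initial_usage: float, elapsed_days: float, halflife_days: float) -> float:
    """Return the usage still 'remembered' after exponential decay.

    Simplified illustration only: usage halves every `halflife_days`.
    """
    if halflife_days <= 0:
        raise ValueError("halflife_days must be positive")
    return initial_usage * 0.5 ** (elapsed_days / halflife_days)


if __name__ == "__main__":
    # Compare the old (1 day) and new (3 day) half-life after three idle days.
    for halflife in (1, 3):
        remaining = decayed_usage(100.0, elapsed_days=3, halflife_days=halflife)
        print(f"half-life {halflife} day(s): {remaining:.1f} of 100 units remain")
```

Under this model, a longer half-life means that heavy past usage continues to count against a user's priority for longer.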
Ongoing Disk Server Issues
  • None
Notable Changes made this last week.
  • On Tuesday (26th Nov) the firmware was updated on one of the disk arrays used by the LFC/FTS2/Atlas3D databases. The array had been showing a fault, and the firmware upgrade was required in order to investigate it.
Declared in the GOC DB
  • None
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Castor 2.1.14 testing is starting. It is expected to be a few months before deployment.
  • Networking:
    • Possible move of Tier1 core network switch in January (TBC).
    • Implementation of new site firewall.
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new routing layer for Tier1.
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
  • Fabric:
    • Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
Entries in GOC DB starting between the 20th and 27th November 2013.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgfts (FTS2) | UNSCHEDULED | OUTAGE | 26/11/2013 15:00 | 26/11/2013 15:15 | 15 minutes | Investigating problems with restarting FTS2 service after intervention earlier today
lcgft-atlas, lcgfts (FTS2), lfc.gridpp | SCHEDULED | OUTAGE | 26/11/2013 09:30 | 26/11/2013 15:00 | 5 hours and 30 minutes | Outage of LFC, FTS2 and Atlas 3D/Frontier during work on disk array used by back end database.
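As a cross-check of the Duration column above, the following minimal sketch recomputes each duration from the Start and End timestamps. The helper name `downtime_duration` is hypothetical, and the day/month/year format string is an assumption based on how the dates are printed in this report.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout, matching the entries as printed: DD/MM/YYYY HH:MM
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> timedelta:
    """Return the elapsed time between two timestamps in GOCDB_FORMAT."""
    start_dt = datetime.strptime(start, GOCDB_FORMAT)
    end_dt = datetime.strptime(end, GOCDB_FORMAT)
    if end_dt < start_dt:
        raise ValueError("end is earlier than start")
    return end_dt - start_dt


if __name__ == "__main__":
    # The two entries listed for this week.
    print(downtime_duration("26/11/2013 15:00", "26/11/2013 15:15"))
    print(downtime_duration("26/11/2013 09:30", "26/11/2013 15:00"))
```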
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
99162 | Green | Less Urgent | In Progress | 2013-11-25 | 2013-11-25 | | Publishing default values
99161 | Green | Less Urgent | In Progress | 2013-11-25 | 2013-11-25 | | GLUE 2 obsolete entries
98249 | Red | Urgent | Waiting Reply | 2013-10-21 | 2013-11-18 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1
98122 | Red | Less Urgent | Waiting Reply | 2013-10-17 | 2013-11-18 | cernatschool | CVMFS access for the cernatschool.org VO
97868 | Red | Less Urgent | In Progress | 2013-10-08 | 2013-11-18 | T2K | CVMFS for t2k.org
97385 | Red | Less Urgent | In Progress | 2013-09-17 | 2013-11-18 | HyperK | CVMFS for hyperk.org
97025 | Red | Less Urgent | On Hold | 2013-09-03 | 2013-11-05 | | Myproxy server certificate does not contain hostname
91658 | Red | Less Urgent | In Progress | 2013-02-20 | 2013-11-15 | | LFC webdav support
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-10-18 | | correlated packet-loss on perfsonar host
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
20/11/13 | 100 | 100 | 100 | 100 | 100 |
21/11/13 | 100 | 100 | 100 | 100 | 100 |
22/11/13 | 100 | 100 | 100 | 100 | 100 |
23/11/13 | 100 | 100 | 100 | 100 | 100 |
24/11/13 | 100 | 100 | 100 | 100 | 100 |
25/11/13 | 100 | 100 | 82.4 | 83.0 | 84.9 | CERN primary link failed but failover didn't work correctly
26/11/13 | 100 | 95.8 | 100 | 89.7 | 87.8 | BDII problem at CERN affected many sites.
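To put the percentages above into more tangible terms, the sketch below converts a daily availability figure into an approximate number of unavailable minutes. It assumes that availability is simply the fraction of a 24 hour day during which the tests passed; the report does not state how the figures are calculated, so treat the result as a rough guide. The function name `unavailable_minutes` is hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def unavailable_minutes(availability_percent: float,
                        period_minutes: int = MINUTES_PER_DAY) -> float:
    """Approximate unavailable time for a given availability percentage.

    Assumes availability is a plain fraction of the period (an assumption,
    not something stated in the report).
    """
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return (100.0 - availability_percent) / 100.0 * period_minutes


if __name__ == "__main__":
    # Figures for 25/11/13 taken from the table above.
    for vo, pct in (("Atlas", 82.4), ("CMS", 83.0), ("LHCb", 84.9)):
        print(f"{vo}: about {unavailable_minutes(pct):.0f} minutes unavailable")
```

On this assumption, the Atlas figure of 82.4 for 25/11/13 corresponds to a little over four hours of unavailability.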