Tier1 Operations Report 2014-01-15

From GridPP Wiki

RAL Tier1 Operations Report for 15th January 2014

Review of Issues during the week 8th to 15th January 2014.
  • On Monday afternoon (13th Jan) a minor operational problem led to the Castor Atlas instance being down for around 15 minutes (15:00 to 15:15).
  • The Atlas file renaming in Castor has been completed, with around 17 million files renamed. We are still checking the missing files. However, the total number of lost files found in this process is believed to be in line with that seen at other sites.
Resolved Disk Server Issues
  • None.
Current operational status and issues
  • None
Ongoing Disk Server Issues
  • None
Notable Changes made this last week.
  • Changed garbage collection threshold on all Castor D1T0 disk servers from 95% to 99%. This should lead to a 4% increase in usable space for the instance. The change was made for AtlasDataDisk on Monday (13th Jan) and all other instances on Tuesday (14th Jan).
  • New CernVM-FS Stratum-0 and Stratum-1 services for the non-LHC VOs have been deployed and announced.
  • The second (and final) tranche of disk servers in this year's purchase is currently being delivered.
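The effect of the garbage collection threshold change noted above can be illustrated with a short calculation. This is only a sketch under a simplified assumption: that the threshold is the fraction of a server's capacity that can be filled before garbage collection starts. The function name and the 1000 TB capacity are hypothetical and are not figures from this report.

```python
def usable_space_gain(capacity_tb, old_threshold=0.95, new_threshold=0.99):
    """Return (extra_tb, relative_gain) when the GC threshold is raised.

    Assumes usable space is simply capacity * threshold (a simplification).
    """
    old_usable = capacity_tb * old_threshold
    new_usable = capacity_tb * new_threshold
    extra_tb = new_usable - old_usable
    # Gain expressed relative to the space that was usable before the change.
    relative_gain = extra_tb / old_usable
    return extra_tb, relative_gain


# Hypothetical 1000 TB instance, chosen only to make the numbers easy to read.
extra, gain = usable_space_gain(1000.0)
print(f"extra usable space: {extra:.1f} TB ({gain:.1%} relative to before)")
```

Under this model the gain is 4 percentage points of total capacity, which is slightly more than 4% when measured against the space that was usable before the change.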
Declared in the GOC DB
  • There is an entry for the retirement of two old (and replaced) Logging & Bookkeeping servers.
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • On Thursday 16th January the disk caches in front of the Alice-Tape and Gen-tape pools will be merged.
  • On the morning (09:00 - 13:00) of Tuesday 21st January there will be an upgrade to the microcode in the tape libraries. There will be no tape access during this time.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Castor 2.1.14 testing is ongoing. A date for deployments awaits successful completion of this testing.
  • Networking:
    • Implementation of new site firewall. (Date for Tier1 traffic to start using this is not yet agreed. Initial changes for links that do not affect the Tier1 commence on 20th January.)
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
  • Fabric
    • Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC)
    • There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 8th and 15th January 2014.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcglb03.gridpp.rl.ac.uk, lcglb04.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 18/12/2013 11:00 | 31/01/2014 00:00 | 43 days, 13 hours | Old EMI-2 hosts to be retired
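The duration quoted for the entry above can be checked from its start and end times. The following is a minimal sketch; the helper name is hypothetical, and the day/month/year format string is an assumption based on how the dates appear in this listing.

```python
from datetime import datetime

# Day/month/year with 24-hour time, as the dates appear in the listing above.
FMT = "%d/%m/%Y %H:%M"


def downtime_duration(start, end):
    """Return (days, hours) between two timestamps given in FMT."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    # timedelta stores whole days plus leftover seconds; convert those to hours.
    return delta.days, delta.seconds // 3600


days, hours = downtime_duration("18/12/2013 11:00", "31/01/2014 00:00")
print(f"{days} days, {hours} hours")  # prints "43 days, 13 hours"
```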
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO Subject
100180 | Green | Less Urgent | Waiting Reply | 2014-01-10 | 2014-01-10 | Hone hone jobs submitted through lcgwms05.gridpp.rl.ac.uk & lcgwms06.gridpp.rl.ac.uk into all Lyon's, all Imperial College's, 3 from 5 DESY-HH's, EFDA's and ITEP's cream queues are aborted immediately
100114 | Amber | Less Urgent | Waiting Reply | 2014-01-08 | 2014-01-10 | Jobs failing to get from RAL WMS to Imperial
100086 | Red | Less Urgent | In Progress | 2014-01-07 | 2014-01-13 | T2K WMS jobs cleared too rapidly
99768 | Red | Less Urgent | Waiting Reply | 2013-12-13 | 2014-01-07 | Atlas RAL-LCG2_DATADISK: transfer failures with "source file doesn't exist"
99647 | Red | Less Urgent | In Progress | 2013-12-12 | 2013-12-17 | SNO+ lcg-cp connection timeouts
99556 | Red | Very Urgent | In Progress | 2013-12-06 | 2014-01-07 | NGI Argus requests for NGI_UK
98249 | Red | Urgent | In Progress | 2013-10-21 | 2014-01-14 | SNO+ please configure cvmfs stratum-0 for SNO+ at RAL T1
98122 | Red | Less Urgent | Waiting Reply | 2013-10-17 | 2014-01-14 | cernatschool CVMFS access for the cernatschool.org VO
97025 | Red | Less Urgent | On Hold | 2013-09-03 | 2014-01-06 | Myproxy server certificate does not contain hostname
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-10-18 | correlated packet-loss on perfsonar host
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
08/01/14 | 100 | 100 | 100 | 100 | 100 |
09/01/14 | 100 | 100 | 100 | 100 | 100 |
10/01/14 | 100 | 100 | 100 | 100 | 100 |
11/01/14 | 100 | 100 | 100 | 95.5 | 100 | WMS at CERN found "no compatible resources".
12/01/14 | 100 | 100 | 100 | 100 | 100 |
13/01/14 | 100 | 100 | 99.2 | 100 | 100 | Short outage of a Castor daemon.
14/01/14 | 100 | 100 | 100 | 95.9 | 100 | SRM Put test failure "Invalid argument".