Tier1 Operations Report 2013-11-20

RAL Tier1 Operations Report for 20th November 2013

Review of Issues during the week 13th to 20th November 2013.
  • One batch of worker nodes is still under investigation. A BIOS/Firmware update has been applied to the nodes. At present around 50% of the batch are back in production and these are being monitored ahead of putting the remainder back.
  • Following a report by LHCb, a number of LHCb files in the D1T0 service class have been found to be problematic. Castor records copies of each file on both tape and disk, but the disk copy does not exist. These files can be fixed, and the problem does not imply any data loss. Work is ongoing to establish the full extent of the problem and to understand its cause.
Resolved Disk Server Issues
  • GDSS720 (AtlasDataDisk - D1T0) was returned to service yesterday morning (Tuesday 19th). The system crashed on 22nd October and was then drained. Following a firmware update to the RAID controller, it underwent two weeks of acceptance testing.
Current operational status and issues
  • Nothing to report.
Ongoing Disk Server Issues
  • None
Notable Changes made this last week.
  • A modification was made to increase the published job time limit on ARC CEs for LHCb.
  • The size of the CASTOR overhead has been reduced from 5% to 1% on a small number of disk servers (two for CMS; three for Atlas). The impact of this will be evaluated before a wider roll-out of this change.
Declared in the GOC DB
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
CEs for Torque/Maui farm. (lcgce01, lcgce02, lcgce04, lcgce10, lcgce11) | SCHEDULED | OUTAGE | 05/11/2013 07:00 | 30/11/2013 23:59 | 25 days, 16 hours and 59 minutes | Service being decommissioned.
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • Tuesday 26th November: Upgrading the firmware in a disk array. This will interrupt the LFC, Atlas 3D and FTS2 services for a few hours. FTS3 is unaffected.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Castor 2.1.14 testing is starting. It is expected to be a few months before deployment.
  • Networking:
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
  • Fabric:
    • One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required - initially to update the disk array's firmware.
Entries in GOC DB starting between the 13th and 20th November 2013.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole site | SCHEDULED | WARNING | 13/11/2013 10:00 | 13/11/2013 12:00 | 2 hours | RAL site in warning state due to power generator test.
lcgce01, lcgce02, lcgce04, lcgce10, lcgce11 | SCHEDULED | OUTAGE | 05/11/2013 07:00 | 30/11/2013 23:59 | 25 days, 16 hours and 59 minutes | Service being decommissioned.
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
98764 | Red | Less Urgent | Waiting Reply | 2013-11-08 | 2013-11-11 | SNO+ | Storage request
98625 | Red | Urgent | Waiting Reply | 2013-11-04 | 2013-11-15 | LHCb | Data unavailable for Brazilian proxies at RAL-LCG2
98249 | Red | Urgent | Waiting Reply | 2013-10-21 | 2013-11-18 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1
98122 | Red | Less Urgent | Waiting Reply | 2013-10-17 | 2013-11-18 | cernatschool | CVMFS access for the cernatschool.org VO
97868 | Red | Less Urgent | In Progress | 2013-10-08 | 2013-11-18 | T2K | CVMFS for t2k.org
97385 | Red | Less Urgent | In Progress | 2013-09-17 | 2013-11-18 | HyperK | CVMFS for hyperk.org
97025 | Red | Less Urgent | On Hold | 2013-09-03 | 2013-11-05 | | Myproxy server certificate does not contain hostname
91658 | Red | Less Urgent | In Progress | 2013-02-20 | 2013-11-15 | | LFC webdav support
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-10-18 | | correlated packet-loss on perfsonar host
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
13/11/13 | 100 | 100 | 100 | 100 | 100 |
14/11/13 | 100 | 100 | 100 | 100 | 100 |
15/11/13 | 100 | 100 | 100 | 100 | 100 |
16/11/13 | 100 | 100 | 100 | 100 | 100 |
17/11/13 | 100 | 100 | 100 | 100 | 100 |
18/11/13 | 100 | 100 | 100 | 100 | 100 |
19/11/13 | 100 | 100 | 100 | 100 | 100 |