Tier1 Operations Report 2013-11-13

From GridPP Wiki

Latest revision as of 12:07, 13 November 2013

RAL Tier1 Operations Report for 13th November 2013

Review of Issues during the week 6th to 13th November 2013.
  • Services were watched closely following the work on the UPS on Tuesday and Wednesday of last week. A UPS/Generator load test was carried out successfully this morning.
  • One batch of worker nodes has continued to give problems and has not been in production.
  • One file has been reported lost to ILC. The file was found to be corrupt when investigating why it would not migrate to tape.
Resolved Disk Server Issues
  • None
Current operational status and issues
  • The FTS3 testing has continued very actively with Atlas. Problems with FTS3 are being uncovered during these tests. Patches are being regularly applied to FTS3 to deal with issues.
  • We are participating in xrootd federated access tests for Atlas. The server has now been successfully configured to work as an xroot redirector, whereas before it could only serve as a proxy.
Ongoing Disk Server Issues
  • GDSS720 (AtlasDataDisk - D1T0) crashed during the evening of 22nd October. It has been drained. Following a firmware update to the RAID controller it is undergoing two weeks of acceptance testing before being returned to production.
Notable Changes made this last week.
  • We are now running a single (Condor) batch farm. Nodes that were in the Torque/Maui farm when it was stopped last week have been re-configured and added to the Condor farm. The CEs that fronted the old Torque/Maui farm (lcgce01,02,04,10,11) have been set as not in production in the GOC DB.
  • A UPS/generator load test was successfully carried out this morning (Wed 13th Nov). This test was scheduled following the work on the UPS last week.
Declared in the GOC DB
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
CEs for Torque/Maui farm. (lcgce01, lcgce02, lcgce04, lcgce10, lcgce11) | SCHEDULED | OUTAGE | 05/11/2013 07:00 | 30/11/2013 23:59 | 25 days, 16 hours and 59 minutes | Service being decommissioned.
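The Duration figure in the entry above follows directly from the Start and End timestamps. As a minimal sketch of that arithmetic, assuming the timestamps are in day/month/year format with no timezone handling, the calculation could look like the following. The names `GOCDB_FORMAT`, `outage_duration` and `describe` are hypothetical helpers introduced only for this illustration and are not part of any GOC DB tooling.

```python
from datetime import datetime

# Assumed timestamp layout, matching how the dates are written in this report.
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def outage_duration(start, end):
    """Return (days, hours, minutes) between two timestamps.

    Assumes end is not earlier than start and that both are naive
    (timezone-free) local times.
    """
    begin = datetime.strptime(start, GOCDB_FORMAT)
    finish = datetime.strptime(end, GOCDB_FORMAT)
    delta = finish - begin
    # timedelta stores whole days separately from the leftover seconds.
    hours, remainder = divmod(delta.seconds, 3600)
    minutes = remainder // 60
    return delta.days, hours, minutes


def describe(start, end):
    """Format the duration in the style used by the tables in this report."""
    days, hours, minutes = outage_duration(start, end)
    return f"{days} days, {hours} hours and {minutes} minutes"


if __name__ == "__main__":
    # Torque/Maui CE decommissioning outage from the table above.
    print(describe("05/11/2013 07:00", "30/11/2013 23:59"))
```

Running the sketch on the decommissioning outage reproduces the duration listed in the table.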
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Castor 2.1.14 testing is starting. It is expected to be a few months before deployment.
  • Networking:
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new routing layer for Tier1.
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
  • Fabric:
    • One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required - initially to update the disk array's firmware.
Entries in GOC DB starting between the 6th and 13th November 2013.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole Site. | SCHEDULED | WARNING | 13/11/2013 10:00 | 13/11/2013 12:00 | 2 hours | RAL site in warning state due to power generator test.
CEs for Torque/Maui farm. (lcgce01, lcgce02, lcgce04, lcgce10, lcgce11) | SCHEDULED | OUTAGE | 05/11/2013 07:00 | 30/11/2013 23:59 | 25 days, 16 hours and 59 minutes | Service being decommissioned.
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
98838 | Green | Urgent | In Progress | 2013-11-13 | 2013-11-13 | T2K | no jobs delegated to cream-ce0*
98833 | Green | Less Urgent | In Progress | 2013-11-12 | 2013-11-13 | SNO+ | Adoption of backup GridPP VOMS servers: lcglb03.gridpp.rl.ac.uk
98764 | Green | Less Urgent | Waiting Reply | 2013-11-08 | 2013-11-11 | SNO+ | Storage request
98625 | Red | Urgent | In Progress | 2013-11-04 | 2013-11-12 | LHCb | Data unavailable for Brazilian proxies at RAL-LCG2
98249 | Red | Urgent | In Progress | 2013-10-21 | 2013-10-30 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1
98122 | Red | Less Urgent | In Progress | 2013-10-17 | 2013-10-30 | cernatschool | CVMFS access for the cernatschool.org VO
97868 | Red | Less Urgent | Waiting Reply | 2013-10-08 | 2013-10-30 | T2K | CVMFS for t2k.org
97759 | Red | Urgent | On Hold | 2013-10-04 | 2013-11-07 | OPS | SHA-2 test failing on lcgce01
97385 | Red | Less Urgent | In Progress | 2013-09-17 | 2013-10-14 | HyperK | CVMFS for hyperk.org
97025 | Red | Less Urgent | On Hold | 2013-09-03 | 2013-05-11 | | Myproxy server certificate does not contain hostname
91658 | Red | Less Urgent | On Hold | 2013-02-20 | 2013-11-13 | | LFC webdav support
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-10-18 | | correlated packet-loss on perfsonar host
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
06/11/13 | 46.2 | 46.2 | 0 | 100 | 46.2 | Batch not restarted until the middle of the day owing to the UPS intervention.
07/11/13 | 100 | 100 | 62.3 | 100 | 100 | Atlas remained "not available" until the 'old' CEs for the Torque/Maui batch farm were marked out of production in the GOC DB.
08/11/13 | 100 | 100 | 100 | 100 | 100 |
09/11/13 | 100 | 100 | 100 | 100 | 100 |
10/11/13 | 100 | 100 | 100 | 100 | 100 |
11/11/13 | 100 | 100 | 99.1 | 100 | 100 | Single SRM test failure "could not open connection to srm-atlas.gridpp.rl.ac.uk"
12/11/13 | 100 | 100 | 100 | 100 | 100 |
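
The daily figures in the availability table above can be condensed into one number per VO for the week. The sketch below takes an unweighted mean of the seven daily percentages. This is an assumption made for illustration and is not necessarily how official availability figures are aggregated. The names `VOS`, `DAILY` and `weekly_mean` are hypothetical.

```python
from statistics import mean

# Column order of the availability table in this report.
VOS = ["OPS", "Alice", "Atlas", "CMS", "LHCb"]

# Daily availability percentages copied from the table, one list per day.
DAILY = {
    "06/11/13": [46.2, 46.2, 0.0, 100.0, 46.2],
    "07/11/13": [100.0, 100.0, 62.3, 100.0, 100.0],
    "08/11/13": [100.0, 100.0, 100.0, 100.0, 100.0],
    "09/11/13": [100.0, 100.0, 100.0, 100.0, 100.0],
    "10/11/13": [100.0, 100.0, 100.0, 100.0, 100.0],
    "11/11/13": [100.0, 100.0, 99.1, 100.0, 100.0],
    "12/11/13": [100.0, 100.0, 100.0, 100.0, 100.0],
}


def weekly_mean(daily, vos):
    """Return the unweighted mean of the daily percentages for each VO.

    This is a simple average over days (an illustrative assumption),
    not an official availability calculation.
    """
    # Transpose rows (days) into columns (one sequence per VO).
    columns = zip(*daily.values())
    return {vo: mean(column) for vo, column in zip(vos, columns)}


if __name__ == "__main__":
    for vo, value in weekly_mean(DAILY, VOS).items():
        print(f"{vo}: {value:.1f}")
```

A simple mean treats every day equally, so the partial-day outage on 06/11/13 dominates the weekly figure for the VOs it affected.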