Tier1 Operations Report 2014-07-09

RAL Tier1 Operations Report for 9th July 2014

Review of Issues during the week 2nd to 9th July 2014.
  • There were problems with the SRM (not Castor) for the GEN instance on Thursday and Friday of last week (3/4 July). details....
  • Problems with Atlas multicore jobs on Friday 4th July....
Resolved Disk Server Issues
  • None
Current operational status and issues
  • None
Ongoing Disk Server Issues
  • None
Notable Changes made this last week.
  • Tuesday (8th July) Atlas Castor instance upgraded to version 2.1.14-13. (to be confirmed....)
Declared in the GOC DB
  • None
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • We are planning the termination of the FTS2 service (announced for 2nd September) now that almost all use is on FTS3.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • None.
  • Networking:
    • Move switches connecting the 2011 disk server batches onto the Tier1 mesh network.
    • Make routing changes to allow the removal of the UKLight Router.
  • Fabric:
    • We are phasing out the use of the software server used by the small VOs.
    • Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC)
    • There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 2nd and 9th July 2014.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-atlas | SCHEDULED | OUTAGE | 08/07/2014 06:00 | 09/07/2014 12:00 | 1 day, 6 hours | Atlas Castor instance down for Castor 2.1.14 Stager Update
Castor GEN: srm-alice, srm-biomed, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k | UNSCHEDULED | WARNING | 03/07/2014 07:45 | 03/07/2014 13:00 | 5 hours and 15 minutes | Problem with SRMs for Castor GEN instance. (However Castor itself - e.g. xroot access - working OK).
Whole site | SCHEDULED | WARNING | 02/07/2014 10:00 | 02/07/2014 11:00 | 1 hour | RAL Tier1 site in warning state due to UPS/generator test.
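The Duration column above follows directly from the Start and End stamps. As a minimal sketch (not part of the GOC DB tooling; the function name `downtime_length` and the `entries` list are illustrative assumptions), the arithmetic can be checked like this, reading the stamps as day/month/year:

```python
from datetime import datetime, timedelta

# Assumed stamp layout: day/month/year hour:minute, as shown in the listing.
FMT = "%d/%m/%Y %H:%M"

def downtime_length(start: str, end: str) -> timedelta:
    """Return the elapsed time between a downtime's start and end stamps."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)

# (service, start, end) copied from the entries above; the service
# labels are shortened here for readability.
entries = [
    ("srm-atlas", "08/07/2014 06:00", "09/07/2014 12:00"),
    ("Castor GEN SRMs", "03/07/2014 07:45", "03/07/2014 13:00"),
    ("Whole site", "02/07/2014 10:00", "02/07/2014 11:00"),
]

for service, start, end in entries:
    print(service, downtime_length(start, end))
```

The three results correspond to the listed durations of 1 day and 6 hours, 5 hours and 15 minutes, and 1 hour.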
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
106640 | Green | Less Urgent | In Progress | 2014-07-04 | 2014-07-04 | ILC | Failure to submit jobs to RAL-LCG2 CEs
106610 | Green | Less Urgent | In Progress | 2014-07-02 | 2014-07-02 | HyperK | HyperK support
106480 | Green | Less Urgent | Waiting Reply | 2014-06-25 | 2014-06-30 | dteam | Publishing meaningful Castor version
106324 | Yellow | Urgent | In Progress | 2014-06-18 | 2014-07-01 | CMS | pilots losing network connections at T1_UK_RAL
105571 | Red | Less Urgent | In Progress | 2014-05-21 | 2014-06-30 | LHCb | BDII and SRM publish inconsistent storage capacity numbers
105405 | Red | Urgent | On Hold | 2014-05-14 | 2014-07-01 | | please check your Vidyo router firewall configuration
Availability Report

Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud

Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment
25/06/14 | 100 | 100 | 94.8 | 100 | 100 | 96 | 98 | Several SUM test failures (Invalid Argument).
26/06/14 | 100 | 100 | 90.6 | 95.8 | 92.6 | 90 | 100 | LHCb Castor Stager 2.1.14 upgrade; Atlas: Several SRM test failures; CMS: Single SRM Put test failure.
02/07/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
03/07/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
04/07/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
05/07/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
06/07/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
07/07/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
08/07/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
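For anyone scripting against these figures, the following is a rough sketch of how the days with less than full availability could be picked out. The `rows` list is typed in by hand from the table above (HammerCloud columns omitted), and the names `VOS` and `below_full` are illustrative assumptions rather than anything used by the Tier1.

```python
# Column order for the availability figures in each row.
VOS = ("OPS", "Alice", "Atlas", "CMS", "LHCb")

# (day, OPS, Alice, Atlas, CMS, LHCb) copied from the table above.
rows = [
    ("25/06/14", 100, 100, 94.8, 100, 100),
    ("26/06/14", 100, 100, 90.6, 95.8, 92.6),
    ("02/07/14", 100, 100, 100, 100, 100),
    ("03/07/14", 100, 100, 100, 100, 100),
    ("04/07/14", 100, 100, 100, 100, 100),
    ("05/07/14", 100, 100, 100, 100, 100),
    ("06/07/14", 100, 100, 100, 100, 100),
    ("07/07/14", 100, 100, 100, 100, 100),
    ("08/07/14", 100, 100, 100, 100, 100),
]

def below_full(table):
    """Return {day: {vo: availability}} for every figure under 100%."""
    flagged = {}
    for day, *values in table:
        low = {vo: v for vo, v in zip(VOS, values) if v < 100}
        if low:
            flagged[day] = low
    return flagged

for day, low in below_full(rows).items():
    print(day, low)
```

With the rows as listed, only 25/06/14 and 26/06/14 are flagged, which matches the two days that carry a comment.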