Tier1 Operations Report 2014-07-09

From GridPP Wiki

RAL Tier1 Operations Report for 9th July 2014

Review of Issues during the week 2nd to 9th July 2014.
  • There were problems with the SRM (not Castor) for the GEN instance on Thursday and Friday of last week (3/4 July). It was fixed by a database edit.
  • There were problems with Atlas multicore jobs on Friday 4th July. We believe this is an Atlas issue.
Resolved Disk Server Issues
  • None
Current operational status and issues
  • We are still investigating xroot access to CMS Castor following the upgrade on the 17th June.
  • There is a problem with the dteam SRM regional nagios tests, which may be caused by how dteam is published by the CIP.
Ongoing Disk Server Issues
  • None
Notable Changes made this last week.
  • On Tuesday and Wednesday (8th and 9th July) the Atlas Castor instance was upgraded to version 2.1.14-13. Castor Atlas was returned to production at 10:40 this morning.
Declared in the GOC DB
  • None
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • We are planning the termination of the FTS2 service (announced for 2nd September) now that almost all use is on FTS3.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • None.
  • Networking:
    • Move the switches connecting the 2011 disk server batches onto the Tier1 mesh network.
    • Make routing changes to allow the removal of the UKLight Router.
  • Fabric:
    • We are phasing out the use of the software server used by the small VOs.
    • Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
    • There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 2nd and 9th July 2014.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-atlas | SCHEDULED | OUTAGE | 08/07/2014 06:00 | 09/07/2014 10:40 | 1 day, 6 hours | Atlas Castor instance down for Castor 2.1.14 Stager Update
Castor GEN: srm-alice, srm-biomed, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k | UNSCHEDULED | WARNING | 03/07/2014 07:45 | 03/07/2014 13:00 | 5 hours and 15 minutes | Problem with SRMs for Castor GEN instance. (However Castor itself - e.g. xroot access - working OK).
Whole site | SCHEDULED | WARNING | 02/07/2014 10:00 | 02/07/2014 11:00 | 1 hour | RAL Tier1 site in warning state due to UPS/generator test.
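
The Duration column can be cross-checked against the Start and End timestamps. The following is a minimal Python sketch, assuming the timestamps are day/month/year and share one time zone; the helper name `elapsed` is illustrative and not part of any GOC DB tooling.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout: day/month/year hour:minute, as in the entries above.
FMT = "%d/%m/%Y %H:%M"


def elapsed(start: str, end: str) -> timedelta:
    """Return the elapsed time between two timestamps in the assumed format."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


# Castor GEN SRM warning and whole-site warning, copied from the entries above.
gen_warning = elapsed("03/07/2014 07:45", "03/07/2014 13:00")
site_warning = elapsed("02/07/2014 10:00", "02/07/2014 11:00")

print(gen_warning)   # 5:15:00
print(site_warning)  # 1:00:00
```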
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
106753 | Green | Less Urgent | In Progress | 2014-07-09 | 2014-07-09 | Atlas | Errors in transfers to RAL-LCG2
106695 | Green | Less Urgent | In Progress | 2014-07-08 | 2014-07-08 | Ops | [Rod Dashboard] Issues detected at RAL-LCG2
106655 | Green | Less Urgent | In Progress | 2014-07-04 | 2014-07-04 | Ops | [Rod Dashboard] Issues detected at RAL-LCG2 (srm-dteam)
106640 | Green | Less Urgent | In Progress | 2014-07-04 | 2014-07-04 | ILC | Failure to submit jobs to RAL-LCG2 CEs
106610 | Green | Less Urgent | In Progress | 2014-07-02 | 2014-07-02 | HyperK | HyperK support
106324 | Yellow | Urgent | In Progress | 2014-06-18 | 2014-07-01 | CMS | pilots losing network connections at T1_UK_RAL
105405 | Red | Urgent | On Hold | 2014-05-14 | 2014-07-01 | | please check your Vidyo router firewall configuration
Availability Report

Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud

Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment
02/07/14 | 100 | 100 | 100 | 100 | 100 | 98 | 99 |
03/07/14 | 100 | 100 | 100 | 100 | 100 | 99 | 100 |
04/07/14 | 100 | 100 | 100 | 100 | 100 | 97 | 100 |
05/07/14 | 100 | 100 | 100 | 100 | 100 | 92 | 100 |
06/07/14 | 100 | 100 | 100 | 100 | 100 | 99 | 100 |
07/07/14 | 100 | 100 | 100 | 100 | 100 | 97 | 100 |
08/07/14 | 100 | 100 | 41 | 100 | 100 | 100 | 99 | Atlas Castor upgrade.
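
To put the single low Atlas figure in context, the daily values can be averaged over the week. The sketch below is illustrative only: it assumes the figures are daily availability percentages and uses a plain unweighted mean, which may not match how official availability summaries are calculated. Only the three columns that dip below 100 are included.

```python
from statistics import mean

# Daily values copied from the availability table, 02/07/14 to 08/07/14.
# Assumption: each figure is a percentage for that day.
availability = {
    "Atlas": [100, 100, 100, 100, 100, 100, 41],
    "Atlas HC": [98, 99, 97, 92, 99, 97, 100],
    "CMS HC": [99, 100, 100, 100, 100, 100, 99],
}

# Unweighted mean over the seven days, rounded to one decimal place.
weekly_mean = {name: round(mean(values), 1) for name, values in availability.items()}

for name, value in weekly_mean.items():
    print(f"{name}: {value}%")
# Atlas: 91.6%
# Atlas HC: 97.4%
# CMS HC: 99.7%
```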