Tier1 Operations Report 2014-07-23

From GridPP Wiki

RAL Tier1 Operations Report for 23rd July 2014

Review of Issues during the week 16th to 23rd July 2014.
  • The recurring problems with the SRM processes for the Castor GEN instance crashing have been solved. The problem started on Friday 11th July and was fixed on Friday 18th. The cause was a malformed file name being sent that was not trapped by the relevant SRM code.
  • On Thursday (17th) the Castor disk cache for AtlasTape filled up. This was traced to the garbage collector not running and was immediately fixed.
  • Yesterday (Tuesday 22nd) there was a problem with the site network that effectively took the Tier1 off-air from 10:40 to 11:25. This was coincident with, but not caused by, an ongoing network update (to use the "RIP" protocol).
Resolved Disk Server Issues
  • None
Current operational status and issues
  • We are still investigating xroot access to CMS Castor following the upgrade on the 17th June. The initial problems are understood but we still need to investigate further optimisations and there have been hot data sets.
  • There is a problem with the dteam SRM regional nagios tests, which may be caused by how dteam is published by the CIP.
  • There have been problems with Atlas FAX since the Atlas Castor upgrade (9th July). The cause is now understood and is just awaiting the application of the fix.
Ongoing Disk Server Issues
  • None
Notable Changes made this last week.
  • Hyper-K VO enabled on ARC CEs
  • Increased the maximum number of allowed multicore jobs from 400 to 600
  • FTS3 was upgraded to v3.2.26 on Monday (21st July)
  • The core RAL site network was updated to use RIP for network routing on Tuesday 22nd July.
Declared in the GOC DB
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
perfsonar-ps01.gridpp.rl.ac.uk, perfsonar-ps02.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 14/07/2014 11:00 | 14/08/2014 11:00 | 31 days | Systems being decommissioned. They have been replaced by lcgps01.gridpp.rl.ac.uk and lcgps02.gridpp.rl.ac.uk
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • We are planning the termination of the FTS2 service (announced for 2nd September) now that almost all use is on FTS3.
  • The removal of the (NFS) software server is scheduled for the 2nd September.
  • Although the core RAL site network has been updated to use RIP for network routing, this still needs to be applied to the Tier1 failover pair of routers. We are planning (subject to confirmation) to do this on Tuesday 29th July.
  • We are planning to stop access to the CREAM CEs, although possibly leaving them available to ALICE for some time. No date has yet been specified for this.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • None.
  • Networking:
    • Move switches connecting the 2011 disk server batches onto the Tier1 mesh network.
    • Make routing changes to allow the removal of the UKLight Router.
  • Fabric:
    • We are phasing out the use of the software server used by the small VOs.
    • Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC)
    • There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 16th and 23rd July 2014.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole Site | UNSCHEDULED | WARNING | 22/07/2014 12:00 | 22/07/2014 15:00 | 3 hours | Extending the earlier warning during update to site network routing.
Whole Site | UNSCHEDULED | OUTAGE | 22/07/2014 10:40 | 22/07/2014 11:25 | 45 minutes | Unexpected outage during update to site network routing.
Whole Site | UNSCHEDULED | WARNING | 22/07/2014 10:00 | 22/07/2014 11:00 | 1 hour | Extending the earlier warning during update to site network routing.
Whole Site | SCHEDULED | WARNING | 22/07/2014 07:00 | 22/07/2014 10:00 | 3 hours | Warning during update to site network routing.
lcgfts3.gridpp.rl.ac.uk | SCHEDULED | WARNING | 21/07/2014 11:00 | 21/07/2014 13:00 | 2 hours | Update of FTS3 service to version v3.2.26
perfsonar-ps01.gridpp.rl.ac.uk, perfsonar-ps02.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 14/07/2014 11:00 | 14/08/2014 11:00 | 31 days | Systems being decommissioned. They have been replaced by lcgps01.gridpp.rl.ac.uk and lcgps02.gridpp.rl.ac.uk
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
106655 | Yellow | Less Urgent | In Progress | 2014-07-04 | 2014-07-16 | Ops | [Rod Dashboard] Issues detected at RAL-LCG2 (srm-dteam)
106324 | Red | Urgent | In Progress | 2014-06-18 | 2014-07-07 | CMS | pilots losing network connections at T1_UK_RAL
105405 | Red | Urgent | On Hold | 2014-05-14 | 2014-07-01 | | please check your Vidyo router firewall configuration
Availability Report

Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud

Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment
16/07/14 | 100 | 100 | 100 | 100 | 100 | 100 | 98 |
17/07/14 | 100 | 100 | 100 | 100 | 100 | 97 | 100 |
18/07/14 | 100 | 100 | 100 | 100 | 100 | 97 | 99 |
19/07/14 | 100 | 100 | 100 | 100 | 100 | 98 | 100 |
20/07/14 | 100 | 100 | 100 | 100 | 100 | 96 | 100 |
21/07/14 | 100 | 100 | 100 | 100 | 100 | 100 | 96 |
22/07/14 | 100 | 95.6 | 97.4 | 95.6 | 100 | 96 | 96 | Site networking problem.