Tier1 Operations Report 2014-09-17

From GridPP Wiki

RAL Tier1 Operations Report for 17th September 2014

Review of Issues during the week 10th to 17th September 2014.
  • On Saturday (13th Sep) there was a problem with the Atlas Castor instance that persisted into the beginning of Sunday. A number of measures were taken to improve it, although the root cause remains unknown.
  • For the second half of last week there were problems with cream-ce02.
  • This morning (Wednesday 17th Sep) there was a problem with some machines that run as VMs - the symptom was that their networking stopped. Restarting the network fixed the problem. This is similar to a problem seen on the 30th August. The configuration of the network interface on these systems has been changed to work around this.
Resolved Disk Server Issues
  • None
Current operational status and issues
  • None.
Ongoing Disk Server Issues
  • None.
Notable Changes made during the last week.
  • VO londongrid enabled on LFC.
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • The rollout of the RIP protocol to the Tier1 routers still has to be completed.
  • Access to the Cream CEs will be withdrawn for all VOs except ALICE. The proposed date for this is Tuesday 30th September.

Listing by category:

  • Databases:
    • Apply latest Oracle patches (PSU) to the production database systems (Castor, LFC).
    • A new database (Oracle RAC) is being set up that will host the Atlas3D database and be updated from CERN via Oracle GoldenGate.
    • Switch LFC/3D to new Database Infrastructure.
  • Castor:
    • Update Castor headnodes to SL6.
    • Fix discrepancies found in some of the Castor database tables and columns. (The issue has no operational impact.)
  • Networking:
    • Move switches connecting the 2011 disk server batches onto the Tier1 mesh network.
    • Make routing changes to allow the removal of the UKLight Router.
    • Enable the RIP protocol for updating routing tables on the Tier1 routers.
  • Fabric:
    • Migration of data to new T10KD tapes. (Migration of CMS from 'B' to 'D' tapes underway; migration of GEN from 'A' to 'D' tapes to follow.)
    • Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
    • There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room (Expected first quarter 2015).
Entries in GOC DB starting between the 10th and 17th September 2014.
  • None
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
108546 | Green | Less Urgent | In Progress | 2014-09-16 | 2014-09-16 | Atlas | RAL-LCG2_HIMEM_SL6: production jobs failed
107935 | Yellow | Less Urgent | In Progress | 2014-08-27 | 2014-09-02 | Atlas | BDII vs SRM inconsistent storage capacity numbers
107880 | Amber | Less Urgent | In Progress | 2014-08-26 | 2014-09-02 | SNO+ | srmcp failure
106324 | Red | Urgent | On Hold | 2014-06-18 | 2014-08-14 | CMS | pilots losing network connections at T1_UK_RAL
105405 | Red | Urgent | On Hold | 2014-05-14 | 2014-09-12 | | Please check your Vidyo router firewall configuration
Availability Report

Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud

Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment
10/09/14 | 100 | 100 | 99.2 | 100 | 100 | 96 | 97 | Single SRM test failure on GET - [SRM_FILE_BUSY]
11/09/14 | 100 | 100 | 100 | 100 | 100 | 100 | 99 |
12/09/14 | 100 | 100 | 100 | 100 | 100 | 100 | 96 |
13/09/14 | 100 | 100 | 82.2 | 100 | 100 | 54 | 99 | Problems with Atlas Castor instance
14/09/14 | 100 | 100 | 91.8 | 100 | 100 | 84 | 98 | Problems with Atlas Castor instance (continued)
15/09/14 | 100 | 100 | 100 | 100 | 100 | 99 | 98 |
16/09/14 | 100 | 100 | 100 | 100 | 100 | 98 | 99 |