Tier1 Operations Report 2014-12-03

RAL Tier1 Operations Report for 3rd December 2014

Review of Issues during the week 26th November to 3rd December 2014.
  • On the evening of Tuesday 18th, the CMS transfer manager machine (lcgclsf02) failed. The services failed over to the backup. A replacement machine was prepared and put into service that afternoon. The following day the original system was fixed and returned to service.
  • In the early hours of Friday 21st Nov. a problem with locking sessions in the Castor database affected CMS & LHCb. Although the problem was transitory, its cause has been understood and a fix will be provided in a future version of Castor.
Resolved Disk Server Issues
  • GDSS673 (LhcbRawDst - D0T1) had failed during the evening of Friday 14th November. The server was returned to production around midday on Thursday 20th Nov.
Current operational status and issues
  • Some problems on the Atlas Castor instance. At various times in recent weeks the Atlas workload has led to differing groups of disk servers spending a lot of time in a "wait i/o" state. This is triggered by the number of reads using xroot and has led to some SAM test failures.
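As an illustration of what the "wait i/o" state means on a Linux disk server, the sketch below estimates the fraction of CPU time spent waiting on I/O between two samples of the aggregate `cpu` line of `/proc/stat`. It is not the monitoring used at the Tier1; the function names and the sample counter values are hypothetical and chosen only for the example.

```python
def parse_cpu_line(line):
    """Return the first eight counters of the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Counter order: user, nice, system, idle, iowait, irq, softirq, steal
    return [int(v) for v in fields[1:9]]


def iowait_fraction(before, after):
    """Fraction of CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return deltas[4] / total  # index 4 is the iowait counter


# Hypothetical samples, taken some seconds apart on a busy server
before = "cpu 1000 0 500 8000 500 0 0 0 0 0"
after = "cpu 1100 0 550 8200 1150 0 0 0 0 0"
print(f"iowait fraction: {iowait_fraction(before, after):.2f}")
```

On a real server the two strings would be read from the first line of `/proc/stat` at the start and end of the sampling interval.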
Ongoing Disk Server Issues
  • None.
Notable Changes made this last week.
  • Latest WMS updates (EMI 3 update 22) applied to WMSs.
  • FTS3 upgraded to 3.2.30
  • OS Errata updates applied to all Castor systems (apart from GEN instance for which this had already been done).
Declared in the GOC DB
  • None.
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • The rollout of the RIP protocol to the Tier1 routers still has to be completed.
  • First quarter 2015: Circuit testing of the remaining (i.e. non-UPS) circuits in the machine room. (Provisional dates: week of 12-16 January).
  • Castor headnode upgrades to SL6: (Assume 4 hour outage of Castor instance in each case for stager updates).
    • Tuesday 2nd Dec - LHCb; Tues 9th Dec - CMS; Wed 10th Dec - Atlas; Wednesday 7th Jan - GEN; Thursday 8th Jan - Nameserver (transparent - at risk)

Listing by category:

  • Databases:
    • A new database (Oracle RAC) has been set up to host the Atlas 3D database. This is updated from CERN via Oracle GoldenGate. This system is yet to be brought into use. (Currently Atlas 3D/Frontier still uses the OGMA database system, although this was also changed to update from CERN using Oracle GoldenGate.)
    • Switch LFC/3D to new Database Infrastructure.
  • Castor:
    • Update Castor headnodes to SL6.
    • Fix discrepancies that were found in some of the Castor database tables and columns. (The issue has no operational impact.)
  • Networking:
    • Move switches connecting the 2011 disk server batches onto the Tier1 mesh network.
    • Make routing changes to allow the removal of the UKLight Router.
    • Enable the RIP protocol for updating routing tables on the Tier1 routers.
  • Fabric:
    • Migration of data to new T10KD tapes. (Migration of CMS from 'B' to 'D' tapes underway; migration of GEN from 'A' to 'D' tapes to follow.)
    • Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC)
    • There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room (Expected first quarter 2015).
Entries in GOC DB starting between the 26th November and 3rd December 2014.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Castor (srm endpoints) | SCHEDULED | WARNING | 25/11/2014 11:00 | 25/11/2014 12:00 | 1 hour | At risk on some castor instances while we deploy errata updates
Whole site | UNSCHEDULED | WARNING | 25/11/2014 07:00 | 25/11/2014 08:00 | 1 hour | Site warning during firewall configuration change.
lcgfts3.gridpp.rl.ac.uk | SCHEDULED | WARNING | 24/11/2014 10:00 | 24/11/2014 12:00 | 2 hours | At risk for FTS3 upgrade to 3.2.30
srm-cms-disk.gridpp, srm-cms.gridpp.rl.ac.uk | UNSCHEDULED | WARNING | 20/11/2014 14:00 | 20/11/2014 15:00 | 1 hour | At risk while we return a CMS Castor headnode to production


Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
110497 | Green | Less Urgent | In Progress | 2014-12-02 | 2014-11-02 | OPS | [Rod Dashboard] Issues detected at RAL-LCG2
110397 | Green | Less Urgent | In Progress | 2014-11-26 | 2014-11-27 | dteam | Unable to access LFC webdav interface via browser
110382 | Green | Less Urgent | In Progress | 2014-11-26 | 2014-11-26 | N/A | RAL-LCG2: please reinstall your perfsonar hosts(s)
109712 | Green | Urgent | In Progress | 2014-10-29 | 2014-11-27 | CMS | Glexec exited with status 203; ...
109694 | Green | Urgent | On Hold | 2014-11-03 | 2014-11-26 | SNO+ | gfal-copy failing for files at RAL
108944 | Amber | Urgent | In Progress | 2014-10-01 | 2014-11-26 | CMS | AAA access test failing at T1_UK_RAL
107935 | Red | Less Urgent | On Hold | 2014-08-27 | 2014-11-03 | Atlas | BDII vs SRM inconsistent storage capacity numbers
106324 | Red | Urgent | On Hold | 2014-06-18 | 2014-11-27 | CMS | pilots losing network connections at T1_UK_RAL
Availability Report

Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud

Day      | OPS | Alice | Atlas | CMS  | LHCb | Atlas HC | CMS HC | Comment
26/11/14 | 100 | 100   | 100   | 100  | 100  | 99       | n/a    |
27/11/14 | 100 | 100   | 100   | 100  | 100  | 98       | n/a    |
28/11/14 | 100 | 100   | 98.1  | 100  | 100  | 100      | n/a    |
29/11/14 | 100 | 100   | 100   | 98.5 | 100  | 100      | n/a    | Start of problems with CMS Castor scheduler headnode.
30/11/14 | 100 | 100   | 100   | 73.0 | 100  | 100      | n/a    | Problems with CMS Castor scheduler headnode.
01/12/14 | 100 | 100   | 99.2  | 95.9 | 91.9 | 100      | n/a    |
02/12/14 | 100 | 100   | 100   | 100  | 100  | 100      | 100    |
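For a quick summary of the week, the daily SAM availability figures in the table above can be averaged per VO. The sketch below is only illustrative: it assumes a simple unweighted mean over the seven days, which is not necessarily how official availability figures are calculated, and the names `rows` and `means` are made up for the example. The HammerCloud columns are left out because most CMS HC entries are n/a.

```python
# Daily availability (%) transcribed from the table above.
VOS = ("OPS", "Alice", "Atlas", "CMS", "LHCb")
rows = [
    ("26/11/14", 100, 100, 100, 100, 100),
    ("27/11/14", 100, 100, 100, 100, 100),
    ("28/11/14", 100, 100, 98.1, 100, 100),
    ("29/11/14", 100, 100, 100, 98.5, 100),
    ("30/11/14", 100, 100, 100, 73.0, 100),
    ("01/12/14", 100, 100, 99.2, 95.9, 91.9),
    ("02/12/14", 100, 100, 100, 100, 100),
]

# Unweighted mean of the daily values for each VO (an assumption, see above).
means = {}
for column, vo in enumerate(VOS, start=1):
    values = [row[column] for row in rows]
    means[vo] = sum(values) / len(values)

for vo in VOS:
    print(f"{vo:5s} {means[vo]:6.2f}")
```

The CMS average is pulled down mainly by the 73.0% recorded on 30/11/14, the day of the problems with the CMS Castor scheduler headnode.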