Tier1 Operations Report 2015-10-28

From GridPP Wiki

DRAFT - DO NOT USE - RAL Tier1 Operations Report for 28th October 2015

Review of Issues during the week 14th to 21st October 2015.
  • item
  • item
Resolved Disk Server Issues
  • item
Current operational status and issues
  • LHCb continues to see a low but persistent rate of failure when copying the results of batch jobs to Castor. A further problem sometimes occurs when these failed writes are then attempted to storage at other sites.
  • The intermittent, low-level, load-related packet loss seen over external connections is still being tracked. Likewise we have been working to understand some remaining low level of packet loss seen within a part of our Tier1 network.
  • Long-standing CMS issues. The two items that remain are CMS Xroot (AAA) redirection and file open times. Work is ongoing into the Xroot redirection with a new server having been added in recent weeks. File open times using Xroot remain slow but this is a less significant problem.
Ongoing Disk Server Issues
  • GDSS663
  • GDSS665
  • GDSS644
Notable Changes made since the last meeting.
  • item
  • item
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • Upgrade of remaining Castor disk servers (those in tape-backed service classes) to SL6. This will be transparent to users.
  • Some detailed internal network re-configurations to enable the removal of the old 'core' switch from our network. This includes changing the way the UKLIGHT router connects into the Tier1 network.

Listing by category:

  • Databases:
    • Switch LFC/3D to new Database Infrastructure.
  • Castor:
    • Update SRMs to new version (includes updating to SL6).
    • Update disk servers to SL6 (ongoing).
    • Update to Castor version 2.1.15.
  • Networking:
    • Complete changes needed to remove the old core switch from the Tier1 network.
    • Make routing changes to allow the removal of the UKLight Router.
  • Fabric:
    • Firmware updates on remaining EMC disk arrays (Castor, LFC).
Entries in GOC DB starting since the last report.
  • None
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
117171 | | Very Urgent | Waiting for Reply | 2015-10-24 | 2015-10-27 | LHCb | Aborted pilots on arc-ce02.gridpp.rl.ac.uk
116866 | Green | Less Urgent | On Hold | 2015-10-12 | 2015-10-19 | SNO+ | snoplus support at RAL-LCG2 (pilot role)
116864 | Green | Urgent | In Progress | 2015-10-12 | 2015-10-26 | CMS | T1_UK_RAL AAA opening and reading test failing again...
Availability Report

Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud

Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment
14/10/15 | 100 | 100 | 100 | 100 | 100 | 91 | n/a |
15/10/15 | 100 | 100 | 98 | 100 | 100 | 85 | 100 | Single SRM test failure (could not open connection to srm-atlas.gridpp.rl.ac.uk:8443)
16/10/15 | 100 | 100 | 100 | 98 | 100 | 89 | 100 | Short problem with glexec in the early hours of the morning.
17/10/15 | 100 | 100 | 100 | 100 | 100 | 95 | 100 |
18/10/15 | 100 | 100 | 100 | 100 | 100 | 92 | n/a |
19/10/15 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
20/10/15 | 100 | 100 | 100 | 100 | 100 | 93 | 100 |
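
The daily figures above can be summarised as a per-column mean for the week. The following is a minimal sketch only, not part of the Tier1 reporting tooling: the names `COLUMNS`, `ROWS` and `weekly_means` are hypothetical, the values are copied by hand from the table, and "n/a" entries are assumed to be excluded from the mean rather than counted as zero.

```python
from statistics import mean

# Daily availability figures (percent) copied from the table above.
# None marks an "n/a" entry; the assumption here is that such days
# are skipped rather than counted as 0.
COLUMNS = ["OPS", "Alice", "Atlas", "CMS", "LHCb", "Atlas HC", "CMS HC"]
ROWS = {
    "14/10/15": [100, 100, 100, 100, 100, 91, None],
    "15/10/15": [100, 100, 98, 100, 100, 85, 100],
    "16/10/15": [100, 100, 100, 98, 100, 89, 100],
    "17/10/15": [100, 100, 100, 100, 100, 95, 100],
    "18/10/15": [100, 100, 100, 100, 100, 92, None],
    "19/10/15": [100, 100, 100, 100, 100, 100, 100],
    "20/10/15": [100, 100, 100, 100, 100, 93, 100],
}


def weekly_means(rows, columns):
    """Return {column name: mean percentage}, skipping None entries."""
    result = {}
    for index, name in enumerate(columns):
        values = [row[index] for row in rows.values() if row[index] is not None]
        # A column with no usable values has no defined mean.
        result[name] = round(mean(values), 1) if values else None
    return result


if __name__ == "__main__":
    for name, value in weekly_means(ROWS, COLUMNS).items():
        print(f"{name:9s} {value}")
```

Skipping the "n/a" days means the CMS HC mean is taken over five days instead of seven; treating them as zero would give a very different and probably misleading figure.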