Tier1 Operations Report 2013-10-16


Latest revision as of 09:02, 16 October 2013

RAL Tier1 Operations Report for 16th October 2013

Review of Issues during the week 9th to 16th October 2013.
  • The Torque/Maui farm has been problematic during this last week. It is currently running with 50% of our total batch capacity.
  • A single file was reported lost to CMS. It was on CMSDisk. This was uncovered by the checksum checker. A copy was still available on CMSTape here at RAL and was copied across.
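The kind of check a checksum checker performs can be sketched as follows. This is only an illustration, not the tool used at RAL: the choice of Adler-32, the function names `adler32_of_file` and `find_mismatched_files`, and the path-to-checksum mapping are all assumptions made for the example.

```python
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Return the Adler-32 checksum of a file as 8 lowercase hex digits."""
    checksum = 1  # Adler-32 starts from 1
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # Feed the running value back in so the file is never held in memory.
            checksum = zlib.adler32(chunk, checksum)
    return f"{checksum & 0xFFFFFFFF:08x}"


def find_mismatched_files(expected):
    """Return the paths whose on-disk checksum differs from the recorded one.

    `expected` is a hypothetical mapping of file path -> recorded hex checksum.
    Files that cannot be read are reported as mismatches too.
    """
    mismatched = []
    for path, recorded in expected.items():
        try:
            actual = adler32_of_file(path)
        except OSError:
            actual = None
        if actual != recorded.lower():
            mismatched.append(path)
    return mismatched
```

A file flagged by such a check would then be restored from another copy, as was done here from CMSTape.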
Resolved Disk Server Issues
  • None
Current operational status and issues
  • The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
  • The FTS3 testing has continued very actively with Atlas. Problems with FTS3 are being uncovered during these tests. Patches are being regularly applied to FTS3 to deal with issues.
  • We are participating in xrootd federated access tests for Atlas. The server has now been successfully configured to work as an xroot redirector, whereas before it could only serve as a proxy.
  • The Condor batch farm has been marked as in production. It contains around 50% of the total batch capacity, and all its WNs are running SL6. The remaining nodes are in the Torque/Maui farm, whose WNs have been upgraded to SL6 as well. We plan to keep this configuration (both farms running SL6 WNs, each with 50% of the total capacity) until early November.
Ongoing Disk Server Issues
  • None
Notable Changes made this last week.
  • None
Declared in the GOC DB
  • None
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • Re-establishing the paired (2*10Gbit) link to the UKLight router. This is proposed to take place next Wednesday morning, 23rd October, during which Castor will be stopped and batch paused.
  • Interruption to services over Tuesday/Wednesday 5/6 November during work on the UPS and safety testing of its circuits. Initial plans propose Castor down for the day on Tuesday 5th.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • None
  • Networking:
    • Single link to UKLight Router to be restored as paired (2*10Gbit) link.
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
  • Fabric
    • One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required - initially to update the disk array's firmware.
  • Infrastructure:
    • A 2-day maintenance on the UPS along with the safety testing of associated electrical circuits is being planned for the 5th/6th November (TBC). The impact of this on our services is still being worked out. During this the following issues will be addressed:
      • Intervention required on the "Essential Power Board".
      • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
      • Electrical safety check. This will take place over a couple of days during which time individual UPS circuits will need to be powered down.
Entries in GOC DB starting between the 9th and 16th October 2013.

There were no entries in the GOC DB for the last week.

Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency     | State       | Creation   | Last Update | VO     | Subject
97908   | Amber | Less Urgent | In Progress | 2013-10-09 | 2013-10-09  |        | Backup UK VOMS servers
97868   | Red   | Less Urgent | In Progress | 2013-10-08 | 2013-10-14  | T2K    | CVMFS for t2k.org
97759   | Red   | Urgent      | On Hold     | 2013-10-04 | 2013-10-04  | OPS    | SHA-2 test failing on lcgce01
97385   | Red   | Less Urgent | In Progress | 2013-09-17 | 2013-10-14  | HyperK | CVMFS for hyperk.org
97025   | Red   | Less Urgent | On Hold     | 2013-09-03 | 2013-09-12  |        | Myproxy server certificate does not contain hostname
91658   | Red   | Less Urgent | On Hold     | 2013-02-20 | 2013-09-03  |        | LFC webdav support
86152   | Red   | Less Urgent | On Hold     | 2012-09-17 | 2013-06-17  |        | correlated packet-loss on perfsonar host
Availability Report
Day      | OPS | Alice | Atlas | CMS  | LHCb | Comment
09/10/13 | 100 | 100   | 97.1  | 96.0 | 100  | Atlas: SRM test failure (Invalid argument); CMS: SRM test failure (Error reading token data header)
10/10/13 | 100 | 100   | 100   | 100  | 100  |
11/10/13 | 100 | 100   | 100   | 100  | 100  |
12/10/13 | 100 | 100   | 100   | 100  | 100  |
13/10/13 | 100 | 100   | 97.8  | 100  | 100  | Problem with Torque/Maui batch server.
14/10/13 | 100 | 100   | 100   | 100  | 100  |
15/10/13 | 100 | 100   | 99.2  | 100  | 100  | SRM test failure (Too many threads busy with Castor)
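
For a quick week-level view, the daily figures above can be averaged per VO. The sketch below assumes a simple unweighted mean over the seven days, which is not necessarily how official availability figures are aggregated; the name `weekly_mean` and the dictionary layout are illustrative only.

```python
# Daily availability (percent) per VO, transcribed from the table above.
availability = {
    "09/10/13": {"OPS": 100, "Alice": 100, "Atlas": 97.1, "CMS": 96.0, "LHCb": 100},
    "10/10/13": {"OPS": 100, "Alice": 100, "Atlas": 100, "CMS": 100, "LHCb": 100},
    "11/10/13": {"OPS": 100, "Alice": 100, "Atlas": 100, "CMS": 100, "LHCb": 100},
    "12/10/13": {"OPS": 100, "Alice": 100, "Atlas": 100, "CMS": 100, "LHCb": 100},
    "13/10/13": {"OPS": 100, "Alice": 100, "Atlas": 97.8, "CMS": 100, "LHCb": 100},
    "14/10/13": {"OPS": 100, "Alice": 100, "Atlas": 100, "CMS": 100, "LHCb": 100},
    "15/10/13": {"OPS": 100, "Alice": 100, "Atlas": 99.2, "CMS": 100, "LHCb": 100},
}


def weekly_mean(table, vo):
    """Unweighted mean of the daily availability figures for one VO."""
    values = [day[vo] for day in table.values()]
    return sum(values) / len(values)


if __name__ == "__main__":
    for vo in ("OPS", "Alice", "Atlas", "CMS", "LHCb"):
        print(f"{vo}: {weekly_mean(availability, vo):.2f}%")
```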