Tier1 Operations Report 2013-02-13
RAL Tier1 Operations Report for 13th February 2013
Review of Issues during the week 6th to 13th February 2013.
- There was a low-level SRM problem that caused Atlas to put RAL offline for brief periods.
- Maintenance of the R89 machine room air conditioning was completed on 07/02/2013.
Resolved Disk Server Issues
- None
Current operational status and issues
- There have been intermittent problems over the past week with the start rate for batch jobs. This is being investigated.
- There is a GGUS ticket for a problem seen by the FTS that is caused by a fault within the Castor SRM.
- The batch server process sometimes consumes excessive memory, which is normally triggered by a network/communication problem with worker nodes. A test for this is in place.
- High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3PB of storage).
- Investigations are ongoing into asymmetric bandwidth to/from other sites. We are seeing some poor outbound rates - a problem which disappears when we are not loading the network.
- The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
- System set-up for participation in xrootd federated access tests for Atlas.
- A test batch queue with five SL6/EMI-2 worker nodes and its own CE is in place. It is currently being tested by Atlas.
Ongoing Disk Server Issues
- gdss594 (GenTape) suffered a double drive failure last night (12/02/2013). Fabric are currently investigating.
Notable Changes made this last week
- Today, 13th February: Stopping AFS client on Worker Nodes.
Declared in the GOC DB
- None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Listing by category:
- Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- Upgrade to version 2.1.13
- Networking:
- Replace central switch (C300). (Tentative date 5th March, but Atlas would like it earlier). This will:
- Improve the stack 13 uplink.
- Change the network trunking as part of investigation (& possible fix) into asymmetric data rates.
- Update core Tier1 network and change connection to site and OPN including:
- Core networking has informed us that they need to re-configure a core switch on 26/02/2013 between 07:30 and 08:30
- Install new Routing layer for Tier1
- Change the way the Tier1 connects to the RAL network.
- These changes will lead to the removal of the UKLight Router.
- Addition of caching DNSs into the Tier1 network.
- Grid Services:
- Removal of AFS clients from Worker Nodes.
- Infrastructure:
- Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
- Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
Entries in GOC DB starting between 6th and 13th February 2013.
None
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
91251 | Red | top priority | In Progress | 2013-02-07 | 2013-02-07 | lhcb | CEs don't seem to be running jobs |
91146 | Red | Urgent | In Progress | 2013-02-04 | 2013-02-12 | Atlas | RAL input bandwith issues |
91029 | Red | Very Urgent | In Progress | 2013-01-30 | 2013-02-11 | Atlas | FTS problem in queryin jobs |
90528 | Red | Less Urgent | In Progress | 2013-01-17 | 2013-02-04 | SNO+ | WMS not assiging jobs to sheffield |
90151 | Red | Less Urgent | Waiting Reply | 2013-01-08 | 2013-02-04 | NEISS | Support for NEISS VO on WMS |
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-01-16 | | correlated packet-loss on perfsonar host |
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment |
---|---|---|---|---|---|---|
06/02/13 | 100 | 100 | 100 | 100 | 100 | |
07/02/13 | 100 | 100 | 100 | 100 | 100 | |
08/02/13 | 100 | 100 | 100 | 100 | 100 | |
09/02/13 | 100 | 100 | 99.2 | 100 | 100 | User timeout, failure to put a file and subsequent failure to delete it. |
10/02/13 | 100 | 100 | 100 | 100 | 100 | |
11/02/13 | 100 | 100 | 100 | 100 | 100 | |
12/02/13 | 100 | 100 | 100 | 100 | 100 | |
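
For a quick weekly summary, the daily figures in the table above can be averaged per VO. The following is a minimal sketch only: the names `DAILY`, `weekly_mean` and `summarise` are illustrative, and the simple unweighted mean over seven days is an assumption, not necessarily how any official weekly availability figure is derived.

```python
# Daily availability (percent) for 06/02/13 to 12/02/13, copied from the table above.
DAILY = {
    "OPS":   [100, 100, 100, 100, 100, 100, 100],
    "Alice": [100, 100, 100, 100, 100, 100, 100],
    "Atlas": [100, 100, 100, 99.2, 100, 100, 100],
    "CMS":   [100, 100, 100, 100, 100, 100, 100],
    "LHCb":  [100, 100, 100, 100, 100, 100, 100],
}


def weekly_mean(values):
    """Unweighted mean of the daily percentages (an assumed summary method)."""
    return sum(values) / len(values)


def summarise(daily):
    """Return the weekly mean per VO, rounded to two decimal places."""
    return {vo: round(weekly_mean(values), 2) for vo, values in daily.items()}


if __name__ == "__main__":
    for vo, mean in summarise(DAILY).items():
        print(f"{vo}: {mean:.2f}%")
```

Under this assumption, only the Atlas figure differs from 100%, reflecting the single test failure on 09/02/13.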