Tier1 Operations Report 2013-02-20
RAL Tier1 Operations Report for 20th February 2013
Review of Issues during the week 13th to 20th February 2013. |
- It was not possible to recover the RAID array on GDSS594 (GenTape) following the double drive failure on 12th Feb. 68 files which had not been migrated to tape at the time of the problem have been declared lost. All of these belonged to T2K.
- There was a power shutdown in the Atlas Building over the last weekend (16/17 Feb) for safety checks. This had no effect on the Tier1.
Resolved Disk Server Issues |
- None
Current operational status and issues |
- There have been intermittent problems over the past fortnight with the start rate for batch jobs. These are still being investigated.
- Atlas has intermittently set our batch system offline over the last week following test failures. We are also investigating a higher job failure rate for LHCb. (These problems may be related.)
- High load observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3PB of storage).
- Investigations are ongoing into asymmetric bandwidth to/from other sites. We are seeing some poor outbound rates - a problem which disappears when we are not loading the network.
- The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
- We are participating in xrootd federated access tests for Atlas.
- Test batch queue with five SL6/EMI-2 worker nodes and own CE in place. Currently being tested by Atlas.
Ongoing Disk Server Issues |
- Following the loss of data from GDSS594 (GenTape) referred to above, the server's RAID array is being rebuilt. Acceptance testing will then be re-run before the server is considered for return to service.
Notable Changes made this last week |
- Wed 13th Feb: AFS clients stopped on Worker Nodes.
Declared in the GOC DB |
- Tuesday 26th February: Warning on Site for an hour during a central network intervention that will cause two short breaks in external connectivity via the firewall. (Will drain FTS ahead of this.)
Advance warning for other interventions |
The following items are being discussed and are still to be formally scheduled and announced. |
Listing by category:
- Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- Upgrade to version 2.1.13
- Networking:
- Replace central switch (C300). (Anticipated for a Tuesday during March). This will:
- Improve the stack 13 uplink.
- Change the network trunking as part of the investigation into (and possible fix for) asymmetric data rates.
- Update core Tier1 network and change connection to site and OPN including:
- Install new Routing layer for Tier1
- Change the way the Tier1 connects to the RAL network.
- These changes will lead to the removal of the UKLight Router.
- Addition of caching DNSs into the Tier1 network.
- Infrastructure:
- Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
- Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
Entries in GOC DB starting between 13th and 20th February 2013. |
None
Open GGUS Tickets (Snapshot at time of meeting) |
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
91146 | Red | Urgent | In Progress | 2013-02-04 | 2013-02-12 | Atlas | RAL input bandwith issues |
91029 | Red | Very Urgent | In Progress | 2013-01-30 | 2013-02-18 | Atlas | FTS problem in queryin jobs |
90528 | Red | Less Urgent | Waiting Reply | 2013-01-17 | 2013-02-19 | SNO+ | WMS not assiging jobs to sheffield |
90151 | Red | Less Urgent | In Progress | 2013-01-08 | 2013-02-04 | NEISS | Support for NEISS VO on WMS |
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-01-16 | | correlated packet-loss on perfsonar host |
Availability Report |
Day | OPS | Alice | Atlas | CMS | LHCb | Comment |
---|---|---|---|---|---|---|
13/02/13 | 100 | 100 | 100 | 100 | 100 | |
14/02/13 | 100 | 100 | 100 | 100 | 100 | |
15/02/13 | 100 | 100 | 100 | 100 | 100 | |
16/02/13 | 100 | 100 | 100 | 100 | 100 | |
17/02/13 | 100 | -100 | 100 | 100 | 100 | Problem with ALICE's monitoring. |
18/02/13 | 100 | 100 | 100 | 100 | 100 | |
19/02/13 | 100 | 100 | 100 | 100 | 100 | |