From GridPP Wiki
RAL Tier1 Operations Report for 13th February 2013
Review of Issues during the week 6th to 13th February 2013.
- There was a low-level SRM problem that caused Atlas to put RAL offline for brief periods.
- Maintenance of the R89 machine room air conditioning was completed on 07/02/2013.
Resolved Disk Server Issues
Current operational status and issues
- There have been intermittent problems over the past week with the start rate for batch jobs. This is being investigated.
- There is a GGUS ticket for a problem seen by the FTS that is caused by a problem within the Castor SRM.
- The batch server process sometimes consumes an excessive amount of memory, which is normally triggered by a network/communication problem with worker nodes. A test for this is in place.
- High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3 PB of storage).
- Investigations are ongoing into asymmetric bandwidth to/from other sites. We are seeing some poor outbound rates - a problem which disappears when we are not loading the network.
- The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service.)
- A system has been set up for participation in xrootd federated access tests for Atlas.
- A test batch queue with five SL6/EMI-2 worker nodes and its own CE is in place. It is currently being tested by Atlas.
Ongoing Disk Server Issues
- gdss594 (GenTape) suffered a double drive failure last night (12/02/2013). Fabric are currently investigating.
Notable Changes made this last week
- Today, 13th February: Stopping AFS client on Worker Nodes.
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Listing by category:
- Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- Upgrade to version 2.1.13
- Networking:
- Replace central switch (C300). (Tentative date 5th March, but Atlas would like it earlier.) This will:
- Improve the stack 13 uplink.
- Change the network trunking as part of the investigation into (and possible fix for) asymmetric data rates.
- Update core Tier1 network and change connection to site and OPN including:
- Core networking has informed us that they need to re-configure a core switch on 26/02/2013 between 07:30 and 08:30.
- Install new routing layer for Tier1.
- Change the way the Tier1 connects to the RAL network.
- These changes will lead to the removal of the UKLight Router.
- Addition of caching DNSs into the Tier1 network.
- Grid Services:
- Removal of AFS clients from Worker Nodes.
- Infrastructure:
- Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
- Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
Entries in GOC DB starting between 6th and 13th February 2013.
None
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
91251 | Red | top priority | In Progress | 2013-02-07 | 2013-02-07 | lhcb | CEs don't seem to be running jobs
91146 | Red | Urgent | In Progress | 2013-02-04 | 2013-02-12 | Atlas | RAL input bandwith issues
91029 | Red | Very Urgent | In Progress | 2013-01-30 | 2013-02-11 | Atlas | FTS problem in queryin jobs
90528 | Red | Less Urgent | In Progress | 2013-01-17 | 2013-02-04 | SNO+ | WMS not assiging jobs to sheffield
90151 | Red | Less Urgent | Waiting Reply | 2013-01-08 | 2013-02-04 | NEISS | Support for NEISS VO on WMS
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-01-16 | | correlated packet-loss on perfsonar host
|
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
06/02/13 | 100 | 100 | 100 | 100 | 100 |
07/02/13 | 100 | 100 | 100 | 100 | 100 |
08/02/13 | 100 | 100 | 100 | 100 | 100 |
09/02/13 | 100 | 100 | 99.2 | 100 | 100 | User timeout, failure to put a file and subsequent failure to delete it.
10/02/13 | 100 | 100 | 100 | 100 | 100 |
11/02/13 | 100 | 100 | 100 | 100 | 100 |
12/02/13 | 100 | 100 | 100 | 100 | 100 |