RAL Tier1 Operations Report for 16th January 2013
Review of Issues during the week 9th to 16th January 2013.
- Although not part of the Tier1 service, one of the AFS servers has had problems, resulting in some AFS files being unavailable since Monday (14th Jan). The system is currently unavailable but is expected back today.
Resolved Disk Server Issues
- GDSS579 (AtlasDataDisk) Out of production for a few hours on Monday (14th Jan). Read-only file system triggered when a disk was being replaced.
- GDSS658 (AtlasScratchDisk) Taken out of production on Tuesday (15th Jan) after it was found to be responding slowly. Returned to production this morning (16th Jan) following a disk controller firmware update.
Current operational status and issues
- The batch server process sometimes consumes excessive memory, which is normally triggered by a network/communication problem with worker nodes. A test for this (with a re-starter) is in place.
- On 12th/13th June the first stage of switching in preparation for the work on the main site power supply took place. One half of the new switchboard has been refurbished and was brought into service on 17th September. The work on the second half is over-running slightly, with an estimated completion date of 13th January. (Original date was 18th Dec.)
- High load observed on the uplink to one of the network stacks (stack 13), serving SL09 disk servers (~3 PB of storage).
- Investigations are ongoing into asymmetric bandwidth to/from other sites. We are seeing some poor outbound rates - a problem which disappears when we are not loading the network.
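The memory test with a re-starter mentioned for the batch server above could be sketched roughly as follows. This is an illustration only, not the check actually deployed at the Tier1: the function names, the memory limit and the restart command are all hypothetical, and it assumes a Linux host where `/proc/<pid>/status` reports `VmRSS`.

```python
import re
import subprocess
from pathlib import Path


def parse_vmrss_kib(status_text):
    """Return the resident set size in KiB from /proc/<pid>/status text, or None if absent."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def should_restart(rss_kib, limit_kib):
    """True when a memory reading exists and exceeds the (hypothetical) limit."""
    return rss_kib is not None and rss_kib > limit_kib


def check_and_restart(pid, limit_kib, restart_cmd):
    """Read the process memory and run restart_cmd if it is over the limit.

    restart_cmd is a placeholder argument list supplied by the caller;
    the real restart procedure is not described in this report.
    """
    status_text = Path(f"/proc/{pid}/status").read_text()
    if should_restart(parse_vmrss_kib(status_text), limit_kib):
        subprocess.run(restart_cmd, check=True)
        return True
    return False
```

Such a check would typically be run periodically (for example from cron), with the limit set well above the normal working size of the process so that only the runaway case triggers a restart.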
Ongoing Disk Server Issues
Notable Changes made this last week
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
  - Upgrade to version 2.1.13.
- Networking:
  - Update core Tier1 network and change connection to site and OPN, including:
    - Install new Routing layer for Tier1 and update the core to remove the C300 central switch.
    - Change the way the Tier1 connects to the RAL network.
    - The above changes will lead to the removal of the UKLight Router.
  - Network trunking change as part of the investigation into (and possible fix for) asymmetric data rates. The major network change (above) is expected to resolve this, although a separate change may still be done earlier.
  - Improve the stack 13 uplink. The major network change (above) will resolve this, although a separate change may still be done earlier.
  - Addition of caching DNSs into the Tier1 network.
- Grid Services:
  - Checking VO usage of, and requirement for, AFS clients from Worker Nodes.
  - Upgrades to BDIIs to latest version on SL6.
- Infrastructure:
  - Intervention required on the "Essential Power Board" and remedial work on three (out of four) transformers.
  - Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
Entries in GOC DB starting between 9th and 16th January 2013.
There were no entries in the GOC DB for this period.
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
90235 | Amber | Urgent | In Progress | 2013-01-09 | 2013-01-15 | T2K | lcgwms03 not renewing proxies
90151 | Green | Less Urgent | In Progress | 2013-01-08 | 2013-01-14 | NEISS | Support for NEISS VO on WMS
89733 | Red | Urgent | In Progress | 2012-12-17 | 2013-01-09 | | RAL bdii giving out incorrect information
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-01-16 | | correlated packet-loss on perfsonar host

Day | OPS | Alice | Atlas | CMS | LHCb | Comment
09/01/13 | 100 | 100 | 99.2 | 85.9 | 100 | Atlas: Single SRM test failure - unable to delete file from SRM; CMS: Single SRM test failure "user timeout".
10/01/13 | 100 | 100 | 99.2 | 100 | 100 |
11/01/13 | 100 | 100 | 100 | 100 | 100 |
12/01/13 | 100 | 100 | 100 | 100 | 100 |
13/01/13 | 100 | 100 | 99.2 | 100 | 100 | Single SRM test failure - unable to delete file from SRM.
14/01/13 | 100 | 100 | 99.5 | 100 | 100 | Single SRM test failure - unable to delete file from SRM. Traced to problematic disk server.
15/01/13 | 100 | 100 | 96.9 | 100 | 100 | Three SRM test failures - unable to delete file from SRM. Traced to problematic disk server.
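
As a rough cross-check of the daily figures in the table above, the mean availability per VO over the seven days can be computed directly. This is only a sketch: the dictionary layout and the `weekly_mean` helper are illustrative names, not part of any Tier1 tooling, and the values are simply copied from the table (assumed to be percentages).

```python
# Daily availability per VO, copied from the table above (assumed to be percentages).
availability = {
    "09/01/13": {"OPS": 100, "Alice": 100, "Atlas": 99.2, "CMS": 85.9, "LHCb": 100},
    "10/01/13": {"OPS": 100, "Alice": 100, "Atlas": 99.2, "CMS": 100, "LHCb": 100},
    "11/01/13": {"OPS": 100, "Alice": 100, "Atlas": 100, "CMS": 100, "LHCb": 100},
    "12/01/13": {"OPS": 100, "Alice": 100, "Atlas": 100, "CMS": 100, "LHCb": 100},
    "13/01/13": {"OPS": 100, "Alice": 100, "Atlas": 99.2, "CMS": 100, "LHCb": 100},
    "14/01/13": {"OPS": 100, "Alice": 100, "Atlas": 99.5, "CMS": 100, "LHCb": 100},
    "15/01/13": {"OPS": 100, "Alice": 100, "Atlas": 96.9, "CMS": 100, "LHCb": 100},
}


def weekly_mean(table, vo):
    """Unweighted mean of the daily availability figures for one VO."""
    values = [day[vo] for day in table.values()]
    return sum(values) / len(values)


for vo in ("OPS", "Alice", "Atlas", "CMS", "LHCb"):
    print(f"{vo}: {weekly_mean(availability, vo):.1f}")
```

The mean is unweighted, so each day counts equally regardless of how many tests ran on that day.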