Latest revision as of 13:02, 12 December 2012
RAL Tier1 Operations Report for 12th December 2012
Review of Issues during the week 5th to 12th December 2012
- On Thursday (6th Dec) the batch system had problems and was unable to fill all the available job slots. This was traced to a limit in the configuration of the MAUI scheduler. As a temporary measure the over-commit of jobs (used to exploit hyperthreading) was scaled back. On Monday (10th) an updated MAUI with the relevant parameter increased was rolled out. The following day the over-commit was increased again, bringing the total number of job slots back up to the expected level.
- On Tuesday (11th Dec) there was a successful load test of the UPS & diesel generator.
- The problem with the Castor Atlas and GEN stager daemons consuming excessive memory has gone away. The cause is unknown, although there is a weak correlation with changes made in the databases to stop bad execution plans.
- This morning, Wednesday 12th December, there have been some problems with the Atlas SRM. These were most likely caused by a networking problem which was fixed around midday.
Resolved Disk Server Issues
Current operational status and issues
- The batch server process sometimes consumes excessive memory, which is normally triggered by a network/communication problem with worker nodes. A test for this (with a re-starter) is in place.
- On 12th/13th June the first stage of switching in preparation for the work on the main site power supply took place. One half of the new switchboard has been refurbished and was brought into service on 17 September. The work on the second half is over-running slightly, with an estimated completion date of 13th January. (The original date was 18th Dec.)
- High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3PB of storage).
- Investigations are ongoing into asymmetric bandwidth to/from other sites. In particular we are seeing some poor outbound rates, a problem which disappears when we are not loading the network.
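The memory test with re-starter mentioned for the batch server above is not described in this report. As a rough illustration only, such a check could read the resident memory of the process from the Linux /proc/<pid>/status file and compare it with a limit. The function names, the sample text and the 4 GiB limit below are all assumptions made for the sketch, not details of the check actually deployed.

```python
import re


def parse_vmrss_kb(status_text):
    """Return the VmRSS value (in kB) from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def needs_restart(rss_kb, limit_kb):
    """True when a memory reading exists and exceeds the limit."""
    return rss_kb is not None and rss_kb > limit_kb


# Hypothetical sample of /proc/<pid>/status content and a hypothetical limit.
sample_status = "Name:\tbatch_server\nVmRSS:\t 5242880 kB\n"
LIMIT_KB = 4 * 1024 * 1024  # assumed 4 GiB threshold

if needs_restart(parse_vmrss_kb(sample_status), LIMIT_KB):
    # A real re-starter would restart the service here; this sketch only reports.
    print("memory limit exceeded: restart would be triggered")
```

Keeping the parsing and the decision in separate functions lets the threshold logic be tested without a running batch server.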
Ongoing Disk Server Issues
Notable Changes made this last week
- Test of UPS/Diesel generator carried out successfully on Tuesday 11th Dec.
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Listing by category:
- Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- Upgrade to version 2.1.13
- Networking:
- Network trunking change as part of investigation (& possible fix) into asymmetric data rates.
- Improve the stack 13 uplink
- Install new Routing layer for Tier1 and update the way the Tier1 connects to the RAL network.
- Update Spine layer for Tier1 network.
- Replacement of UKLight Router.
- Addition of caching DNSs into the Tier1 network.
- Grid Services:
- Checking VO usage of, and requirement for, AFS clients from Worker Nodes.
- Infrastructure:
- Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
- Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
Entries in GOC DB starting between 5th and 12th December 2012
There were two unscheduled outages in the GOC DB for this period. Both are for this morning's problem with the Atlas SRMs. There was one unscheduled 'warning' for the investigations into the Castor stager memory leak.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-atlas.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 12/12/2012 11:45 | 12/12/2012 13:30 | 1 hour and 45 minutes | Ongoing problems with Atlas SRM.
srm-atlas.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 12/12/2012 10:30 | 12/12/2012 11:45 | 1 hour and 15 minutes | There are problems with the Atlas srm Database.
Whole site. | SCHEDULED | WARNING | 11/12/2012 10:00 | 11/12/2012 11:30 | 1 hour and 30 minutes | At Risk for test of UPS.
Castor 'GEN' srm-alice, srm-biomed, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k | UNSCHEDULED | WARNING | 06/12/2012 10:00 | 06/12/2012 12:00 | 2 hours | Possible degradation in performance of Castor 'GEN' instance while investigating cause of memory leak.
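The Duration column can be cross-checked against the Start and End columns. The following is a minimal sketch of that arithmetic, assuming the DD/MM/YYYY HH:MM timestamp format shown in the table; the helper name `duration_text` is made up for this illustration and is not part of any GOC DB tooling.

```python
from datetime import datetime

FMT = "%d/%m/%Y %H:%M"  # timestamp format as it appears in the table


def duration_text(start, end):
    """Format the interval between two timestamps as hours and minutes."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    hours, mins = divmod(int(delta.total_seconds() // 60), 60)
    parts = []
    if hours:
        parts.append(f"{hours} hour" + ("s" if hours != 1 else ""))
    if mins:
        parts.append(f"{mins} minute" + ("s" if mins != 1 else ""))
    return " and ".join(parts) or "0 minutes"


# Start/End pairs copied from the four entries above.
entries = [
    ("12/12/2012 11:45", "12/12/2012 13:30"),
    ("12/12/2012 10:30", "12/12/2012 11:45"),
    ("11/12/2012 10:00", "11/12/2012 11:30"),
    ("06/12/2012 10:00", "06/12/2012 12:00"),
]
for start, end in entries:
    print(start, "to", end, "=", duration_text(start, end))
```

Run against the four entries, this reproduces the durations listed in the table.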
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2012-10-31 | | correlated packet-loss on perfsonar host
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
05/12/12 | 100 | 100 | 100 | 100 | 100 |
06/12/12 | 100 | 100 | 100 | 100 | 100 |
07/12/12 | 100 | 100 | 100 | 100 | 100 |
08/12/12 | 100 | 100 | 100 | 100 | 100 |
09/12/12 | 100 | 100 | 100 | 95.9 | 100 | Single SRM test failure "user timeout"
10/12/12 | 100 | 100 | 100 | 100 | 100 |
11/12/12 | 100 | 100 | 100 | 100 | 100 |