RAL Tier1 Operations Report for 6th March 2013
Review of Issues during the week 27th February to 6th March 2013.
- A quiet week operationally (although note the current issues section below). An emergency reboot of one of the site routers on Tuesday late afternoon (5th March) did not cause any operational problem.
Resolved Disk Server Issues
- GDSS648 (LHCbDst) failed in the early hours of Sunday morning (3rd March). A faulty network card was replaced and the system returned to production around midday on Monday (4th March).
Current operational status and issues
- This morning (Wednesday 6th March): intermittent network connectivity problems are being investigated.
- There have been intermittent problems over the past few weeks with the start rate for batch jobs. A script has been introduced to regularly check for this and take action to minimise its effects.
- We are investigating a higher job failure rate for LHCb and Atlas. This appears to be caused by job set-ups taking a long time; running the jobs re-niced is being investigated as a mitigation.
- High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3PB of storage). (Resolution of this is anticipated during the intervention on 12th March.)
- Investigations are ongoing into asymmetric bandwidth to/from other sites. We are seeing some poor outbound rates - a problem which disappears when we are not loading the network. (Scheduled intervention on 12th March will progress this.)
- The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
- We are participating in xrootd federated access tests for Atlas.
- A test batch queue with five SL6/EMI-2 worker nodes and its own CE is in place.
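The re-nice approach mentioned above can be sketched as a small shell helper. This is a hypothetical illustration only, not RAL's actual mechanism; the pool-account name and niceness value are assumptions.

```shell
#!/bin/sh
# Hypothetical sketch (not the production script): raise the nice value of
# all processes owned by a given pool account, so that long-running job
# set-ups compete less aggressively for CPU with other work on the node.
# The default niceness of 10 is an illustrative assumption.

renice_user_jobs() {
    user="$1"
    niceness="${2:-10}"
    # pgrep lists PIDs owned by the user; renice raises their nice value.
    for pid in $(pgrep -u "$user"); do
        renice -n "$niceness" -p "$pid" >/dev/null 2>&1 || true
    done
}
```

Raising (though not lowering) the nice value of a process needs only the privileges of its owner, so a helper like this could run from the pool account itself or from a root cron job.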
Ongoing Disk Server Issues
- GDSS594 (GenTape) is still unavailable; it will re-run acceptance testing before being considered for return to service.
Notable Changes made this last week
- All remaining tape servers have now been upgraded to Castor 2.1.13-9.
- The number of nodes behind the SL6 trial batch queue has been increased with a few hundred job slots now available.
- Disk controller firmware updates in the 2011 Clustervision batch of disk servers (ongoing).
- A site outage has been declared for Tuesday 12th March for the replacement of the core network switch (C300).
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- A program of updating the disk controller firmware in the 2011 Clustervision batch of disk servers is ongoing.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to the new database infrastructure.
- Castor:
  - Upgrade to version 2.1.13.
- Networking:
  - Replace the central switch (C300). (Planned for a Tuesday during March.) This will:
    - Improve the stack 13 uplink.
    - Change the network trunking as part of the investigation (and possible fix) of the asymmetric data rates.
  - Update the core Tier1 network and change the connection to the site and OPN, including:
    - Installing a new routing layer for the Tier1.
    - Changing the way the Tier1 connects to the RAL network.
    - These changes will lead to the removal of the UKLight Router.
  - Addition of caching DNSs into the Tier1 network.
- Grid Services:
  - Upgrade of the Site-BDII & WMS from EMI-1 to EMI-2 by the end of March.
  - Upgrade of other EMI-1 components (APEL, UI) under investigation.
- Infrastructure:
  - Intervention required on the "Essential Power Board" and remedial work on three (out of four) transformers.
  - Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
  - Electrical safety check (will require some downtime).
Entries in the GOC DB starting between 27th February and 6th March 2013.

There were no entries in the GOC DB for last week.
Open GGUS Tickets (Snapshot at time of meeting)

GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
91974 | Green | Urgent | In Progress | 2013-03-04 | 2013-03-04 | | NAGIOS *eu.egi.sec.EMI-1* failed on lcgwms01.gridpp.rl.ac.uk@RAL-LCG2
91687 | Red | Less Urgent | Waiting Reply | 2013-02-21 | 2013-03-06 | epic | Support for epic.vo.gridpp.ac.uk VO on WMS
91658 | Red | Less Urgent | In Progress | 2013-02-20 | 2013-02-22 | | LFC webdav support
91146 | Red | Urgent | In Progress | 2013-02-04 | 2013-03-05 | Atlas | RAL input bandwidth issues
91029 | Red | Very Urgent | On Hold | 2013-01-30 | 2013-02-27 | Atlas | FTS problem in querying jobs
90528 | Red | Less Urgent | In Progress | 2013-01-17 | 2013-02-19 | SNO+ | WMS not assigning jobs to Sheffield
86152 | Red | Less Urgent | In Progress | 2012-09-17 | 2013-03-06 | | correlated packet-loss on perfsonar host
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
27/02/13 | 100 | 100 | 97.5 | 95.9 | 100 | Atlas: few SRM test timeouts; CMS: single SRM test timeout.
28/02/13 | 100 | -100 | 100 | 100 | 100 | Problem with ALICE monitoring.
01/03/13 | 100 | -100 | 100 | 95.9 | 100 | Problem with ALICE monitoring; CMS: single SRM test timeout.
02/03/13 | 100 | -100 | 100 | 100 | 100 | Problem with ALICE monitoring.
03/03/13 | 100 | -100 | 100 | 100 | 100 | Problem with ALICE monitoring.
04/03/13 | 100 | -100 | 100 | 100 | 100 | Problem with ALICE monitoring.
05/03/13 | 100 | 100 | 100 | 95.8 | 100 | Single SRM test failure coincident with network router reboot.