Latest revision as of 09:18, 9 May 2012
RAL Tier1 Operations Report for 9th May 2012
Review of Issues during the week 2nd to 9th May 2012
|
- On Wednesday (2nd May) LHCb reported a problem with file transfers that was traced to the way a user was mapped to a VO by some of the FTS web front ends.
- During the early hours of this morning (Wednesday 9th May) there was a problem with a network switch within the Tier1 network. This in turn caused a problem for the whole of the switch stack that it is part of. Staff attended on-site. The batch system was particularly affected and a four-hour outage was declared on the CEs.
Resolved Disk Server Issues
|
- GDSS596 (AtlasDataDisk - D1T0) was out of production for a few hours on Thursday (3rd May) to enable a disk to start rebuilding.
- GDSS607 (LHCbDst - D1T0) failed with FSProbe errors on Friday evening (4th May). It was left in "readonly" mode for the weekend and since then a start has been made on draining the disk server ahead of further work.
Current operational status and issues
|
- Investigations into an ongoing communications problem between the CEs and the batch server continue.
- There have been no further problems in the last week on the UKLight-SAR link although we will continue to track this here.
- There is a known problem with the handling of some certificates within FTS that is currently causing problems for LHCb FTS transfers.
- Two worker nodes with the newer EMI release of the software have been tested although some problems were uncovered.
Ongoing Disk Server Issues
|
Notable Changes made this last week
|
- Five disk servers have been deployed for Alice (AliceDisk). They will replace five older, smaller-capacity servers.
- Ten disk servers have been deployed for LHCb (LHCbDst).
Forthcoming Work & Interventions
|
- Castor will move to use the new "Tape Gateway" and "Transfer Manager" features.
- Wednesday 16th May: Update LFC front ends (except LHCb) to gLite version 1.8.2.
- Some modified WAN tuning settings are being rolled out across disk servers.
- The ganglia server will be replaced.
- Thursday 10th May: Short interruption to the Castor GEN instance as it is reconfigured to use the newer Castor Tape Gateway.
Advance warning for other interventions
|
The following items are being discussed and are still to be formally scheduled and announced.
|
- Databases:
  - Regular Oracle "PSU" patches are pending.
  - Switch LFC/FTS/3D to new Database Infrastructure.
  - Update LFC/FTS databases to Oracle 11.
- Castor:
  - Update the Castor Information Provider (CIP). This needs to be rescheduled.
  - Move to use Oracle 11g (requires a minor Castor update to version 2.1.11-9).
  - Upgrade to version 2.1.12.
- Networking:
  - Install new Routing & Spine layers for Tier1 network.
  - Main RAL network updates - early summer.
  - Addition of caching DNS servers into the Tier1 network.
- Grid Services:
  - Updates of Grid Services (including WMS, LFC front ends) to EMI/UMD versions.
- Infrastructure:
  - The electricity supply company plans to work on the main site power supply for 6 months commencing 14th May. This involves powering off one half of the resilient supply for 3 months while it is overhauled, then repeating the process with the other half.
Entries in GOC DB starting between 2nd and 9th May 2012
|
There was one unscheduled entry in the GOC DB during this period. This was for the network switch problem last night (Wed 9th May).
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
All CEs | UNSCHEDULED | OUTAGE | 09/05/2012 05:00 | 09/05/2012 09:01 | 4 hours and 1 minute | Outage on the CEs due to a network switch failure affecting parts of the batch system
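The Duration value in the entry above follows directly from its Start and End times. As a minimal sketch of that arithmetic (the function name `outage_duration` and the format constant are illustrative assumptions, not part of any GOC DB tooling, and the timestamps are assumed to be day/month/year as shown in the table):

```python
from datetime import datetime
from typing import Tuple

# Assumed timestamp layout: day/month/year hour:minute, as in the table above.
TIMESTAMP_FORMAT = "%d/%m/%Y %H:%M"


def outage_duration(start: str, end: str) -> Tuple[int, int]:
    """Return the (hours, minutes) elapsed between two timestamps."""
    delta = datetime.strptime(end, TIMESTAMP_FORMAT) - datetime.strptime(
        start, TIMESTAMP_FORMAT
    )
    total_minutes = int(delta.total_seconds()) // 60
    return divmod(total_minutes, 60)


hours, minutes = outage_duration("09/05/2012 05:00", "09/05/2012 09:01")
print(f"{hours} hours and {minutes} minute{'s' if minutes != 1 else ''}")
```

For the CE outage this gives 4 hours and 1 minute, matching the declared entry.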
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
81669 | yellow | Less Urgent | Assigned | 2011-04-27 | 2012-05-09 | NA62 | FTS channel for na62.vo.gridpp.ac.uk
68853 | Red | Less Urgent | On hold | 2011-03-22 | 2012-04-20 | | Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)