RAL Tier1 Operations Report for 15th August 2012
Review of Issues during the week 8th to 15th August 2012
- The primary link to CERN failed yesterday morning (around 10:45) and traffic failed over to the backup link. Traffic switched back to the primary link at around 12:30 the same day. The fault lay between the Janet and RAL equipment at RAL, but the root cause has not been found.
Resolved Disk Server Issues
Current operational status and issues
- On 12th/13th June the first stage of switching took place in preparation for the work on the main site power supply. The work on the two transformers is expected to take until 18th December. It involves powering off one half of the resilient supply for 3 months while it is overhauled, then repeating the process with the other half.
Ongoing Disk Server Issues
Notable Changes made this last week
- As part of the continuing hyperthreading test, the number of jobs on one batch of worker nodes (the Dell 2011 batch) was increased further (from 18 to 20) on Thursday 9th August.
- As stated before, CVMFS is available for testing by non-LHC VOs (including "stratum 0" facilities).
- A test queue ("gridTest") is available with (currently) four worker nodes running EMI2/SL5.
- Site "Warning" (At Risk) for a couple of hours during the morning of Tuesday 21st August for a site firewall re-configuration. FTS will also be drained and stopped during this period.
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
  - Upgrade to version 2.1.12.
- Networking:
  - Install new Routing layer for Tier1 and update the way the Tier1 connects to the RAL network. (Plan to co-locate with replacement of UKLight network.)
  - Update Spine layer for Tier1 network.
  - Replacement of UKLight Router.
  - Addition of caching DNSs into the Tier1 network.
- Grid Services:
  - Updates of Grid Services as appropriate. (Services are now on EMI/UMD versions unless there is a specific reason not to be.)
- Infrastructure:
  - Intervention required on the "Essential Power Board". Will most likely require a power outage in the UPS room (TBC).
  - Remedial work on three (out of four) transformers. Will require two "At Risk" periods.
Entries in GOC DB starting between 8th and 15th August 2012
There were no entries in the GOC DB for this period.
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
85077 | Green | Less Urgent | On Hold | 2012-08-13 | 2012-08-14 | BIOMED | CE lcgce05.gridpp.rl.ac.uk job cannot register file on SE srm-biomed.gridpp.rl.ac.uk
84492 | Red | Urgent | In Progress | 2012-07-24 | 2012-08-10 | snoplus | Job time/memory requirements not provided
84408 | Red | Very Urgent | Waiting Reply | 2012-07-20 | 2012-08-13 | neurogrid | Enable neurogrid.incf.org on WMS and LFC
68853 | Red | Less Urgent | On Hold | 2011-03-22 | 2012-07-30 | N/A | Retirement of SL4 and 32bit DPM Head nodes and Servers