RAL Tier1 Operations Report for 3rd April 2013
Review of Issues during the fortnight 20th March to 3rd April 2013.
- On Thursday morning, 21st March, between around 08:40 and 09:00, there was a networking problem that caused some transitory problems for the Tier1. The effect was seen in a single SUM test failure for Atlas.
- Overnight Monday/Tuesday 25/26 March there was a failure of one of the three top BDII nodes. It was removed from the alias the following morning.
- On Tuesday 26th March high CPU usage on the site firewall caused intermittent network problems affecting the Tier1 from around 08:00 to 09:45. Services again experienced transitory failures, including FTS failures and SAM test failures.
- On Thursday afternoon, around 16:00, an operational error led to a networking break that caused some transitory problems for the Tier1.
- Services ran well over the Easter weekend. There were a couple of problems that did not affect front-line services, although one of the perfSONAR network monitoring systems went down.
Resolved Disk Server Issues
- GDSS446 (AtlasDataDisk D1T0) was taken out of service after reporting FSProbe errors yesterday evening (2nd April). A disk drive in the server has been replaced and it was returned to service around 12:25 today.
Current operational status and issues
- The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
- There have been intermittent problems over the past few weeks with the start rate for batch jobs. A script has been introduced to check regularly for this condition and take action to minimise its effects.
- The problem of LHCb and Atlas jobs failing due to long job set-up times remains. A different version of CVMFS has been installed as a test and investigations continue.
- The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
- We are participating in xrootd federated access tests for Atlas.
- A test batch queue with five SL6/EMI-2 worker nodes and its own CE is in place.
- There is an outstanding problem (and GGUS ticket) affecting the certificate on the MyProxy server.
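The batch start-rate check mentioned above is not detailed in the report; as a purely illustrative sketch (the interval, threshold and function names are all assumptions, not the actual RAL script), the core condition a watchdog of this kind tests might look like:

```python
# Illustrative sketch only: the real RAL script, its threshold and its
# remedial action are not described in the report. All names and numbers
# here are assumptions.

CHECK_INTERVAL = 600   # seconds between checks (assumed)
MIN_STARTS = 10        # jobs expected to start per interval (assumed)

def starts_in_window(start_times, now, window=CHECK_INTERVAL):
    """Count job start timestamps (Unix seconds) within the last `window` seconds."""
    return sum(1 for t in start_times if now - window <= t <= now)

def needs_action(start_times, now, window=CHECK_INTERVAL, minimum=MIN_STARTS):
    """True when too few jobs started recently, i.e. the condition under
    which such a watchdog would take its remedial action."""
    return starts_in_window(start_times, now, window) < minimum
```

In practice the start timestamps would come from the batch system's accounting logs, and the remedial action would be whatever operator intervention minimises the effect.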
Ongoing Disk Server Issues
- None.
Notable Changes made this last week
- Updating of the disk controller firmware on the Clustervision '11 batch of disk servers is ongoing.
- Kernel/errata updates and removal of AFS software (as opposed to just disabling) being done across worker nodes.
- Removal of EMI-1 WMS systems (WMS01,02,03). (Disabled in GOC DB).
- This evening (Wed 3rd April 18:00 - 23:59 BST) there is emergency maintenance in Geneva affecting both the main and backup links to CERN. No outage is expected during this maintenance; services are considered at risk only.
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- Next Tuesday (9th April) morning a networking update will cause a couple of short breaks in external connectivity as switches are rebooted.
- One of the disk arrays hosting the LFC/FTS/3D databases has given some errors and an intervention will be necessary requiring a stop of these services. We are checking how long this will take ahead of making an announcement.
- A program of updating the disk controller firmware in the 2011 Clustervision batch of disk servers is ongoing.
Listing by category:
- Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- Upgrade to version 2.1.13
- Networking:
- Single link to UKLight Router to be restored as paired (2*10Gbit) link.
- Update core Tier1 network and change connection to site and OPN including:
- Install new Routing layer for Tier1
- Change the way the Tier1 connects to the RAL network.
- These changes will lead to the removal of the UKLight Router.
- Addition of caching DNSs into the Tier1 network.
- Grid Services
- Upgrade of other EMI-1 components (APEL, UI) under investigation.
- Infrastructure:
- Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
- Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
- Electrical safety check (will require some downtime).
Entries in GOC DB starting between 20th March and 3rd April 2013.
There were no unscheduled entries in the GOC DB for the last fortnight.
{| border=1
|-
! Service !! Scheduled? !! Outage/At Risk !! Start !! End !! Duration !! Reason
|-
| lcgwms01, lcgwms02, lcgwms03 || SCHEDULED || OUTAGE || 22-03-2013 11:00 || 15-04-2013 12:00 || 24 days || EMI-1 WMS service retirement
|}
Open GGUS Tickets (Snapshot at time of meeting)
{| border=1
|-
! GGUS ID !! Level !! Urgency !! State !! Creation !! Last Update !! VO !! Subject
|-
| 92266 || Amber || Less Urgent || In Progress || 2013-03-06 || 2013-03-28 || || Certificate for RAL myproxy server
|-
| 91974 || Red || Urgent || In Progress || 2013-03-04 || 2013-04-03 || || NAGIOS *eu.egi.sec.EMI-1* failed on lcgwms01.gridpp.rl.ac.uk@RAL-LCG2
|-
| 91658 || Red || Less Urgent || On Hold || 2013-02-20 || 2013-04-03 || || LFC webdav support
|-
| 91029 || Red || Very Urgent || On Hold || 2013-01-30 || 2013-02-27 || Atlas || FTS problem in queryin jobs
|-
| 86152 || Red || Less Urgent || On Hold || 2012-09-17 || 2013-03-19 || || correlated packet-loss on perfsonar host
|}
Daily availability (%) by VO:
{| border=1
|-
! Day !! OPS !! Alice !! Atlas !! CMS !! LHCb !! Comment
|-
| 20/03/13 || 100 || 100 || 100 || 100 || 100 ||
|-
| 21/03/13 || 100 || 100 || 98.7 || 100 || 100 || Single SRM test failure at time of network problem.
|-
| 22/03/13 || 100 || 100 || 100 || 100 || 100 ||
|-
| 23/03/13 || 100 || 100 || 100 || 100 || 100 ||
|-
| 24/03/13 || 100 || 100 || 100 || 100 || 100 ||
|-
| 25/03/13 || 100 || 100 || 100 || 100 || 100 ||
|-
| 26/03/13 || 100 || 100 || 94.2 || 100 || 100 || Network problem triggered by site firewall overload.
|-
| 27/03/13 || 100 || 100 || 100 || 100 || 100 ||
|-
| 28/03/13 || 100 || 100 || 100 || 100 || 100 ||
|-
| 29/03/13 || 100 || 100 || 100 || 95.9 || 100 || Single SRM Put failure (user timeout).
|-
| 30/03/13 || 100 || 100 || 100 || 100 || 100 ||
|-
| 31/03/13 || 100 || 100 || 98.1 || 100 || 100 || Two consecutive failures of SRM Put test. "could not open connection to srm-atlas.gridpp.rl.ac.uk"
|-
| 01/04/13 || 100 || 100 || 100 || 100 || 100 ||
|-
| 02/04/13 || 100 || 100 || 99.1 || 100 || 100 || Single SUM test failure of SRM delete.
|}