RAL Tier1 Operations Report for 12th February 2014
Review of Issues during the week 5th to 12th February 2014.
- There was a successful UPS/Generator load test this morning.
- There was a problem with updating grid-mapfiles in Castor, caused by a certificate problem. It was first seen on Friday (7th) and resolved on Tuesday (11th).
Resolved Disk Server Issues
- GDSS653 (LHCbDst - D1T0) had a problem around 06:00 on Monday morning (10th Feb). The on-call person worked on the system and it was unavailable for less than an hour. One file was lost and this has been declared to LHCb.
Current operational status and issues
- The intermittent failures of Castor access via the SRM (as seen in the availability tests) reported last week are still present. They have been seen across multiple Castor instances. The Castor team are actively working on this and have been in contact with the Castor developers at CERN to try to find a solution.
- We are participating in an extensive FTS3 test with Atlas and CMS.
- There has been a problem over the last couple of days with LHCb jobs aborting.
Ongoing Disk Server Issues
Notable Changes made this last week.
- CVMFS client version 2.1.17 continues to be tested on one batch of worker nodes (approx 10% of the batch farm).
- On Thursday (6th Feb) all remaining worker nodes were configured to access the new CernVM-FS Stratum-1 service at RAL (cvmfs-wlcg.gridpp.rl.ac.uk).
- There have been updates to the WMSs to resolve the proxy renewal problems.
- There was a successful intervention on the Tier1 network yesterday morning (Tuesday 11th February) to add equipment that will form the new 'mesh' network.
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
(Proposed) Tuesday 25th February: Change Tier1 connection to site network (expect around 1 day outage).
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
  - Castor 2.1.14 testing is ongoing. A date for deployment awaits successful completion of this testing.
- Networking:
  - Implementation of new site firewall. Date for Tier1 proposed to be 11th March. (Initial changes for links that do not affect the Tier1 commenced this week.)
  - Update core Tier1 network and change connection to site and OPN including:
    - Install new Routing layer for Tier1 & change the way the Tier1 connects to the RAL network. (Proposed for Tuesday 25th February).
    - These changes will lead to the removal of the UKLight Router.
- Fabric:
  - We are phasing out the use of the software server used by the small VOs.
  - Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
  - There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
  - The floor in the machine room in the Atlas building is being replaced. We currently run some production services on hypervisors located there. These will be moved ahead of the first part of this work (re-routing some networking) on the morning of Wednesday 19th February. We are experiencing some problems with the hypervisors, which means this move may not be transparent.
Entries in GOC DB starting between the 5th and 12th February 2014.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole Site | SCHEDULED | WARNING | 12/02/2014 10:00 | 12/02/2014 12:00 | 2 hours | RAL Tier1 site in warning state due to UPS/generator test.
Whole Site | SCHEDULED | WARNING | 11/02/2014 09:30 | 11/02/2014 11:30 | 2 hours | Site services at risk as additional equipment added to the internal network.
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
101164 | Green | Less Urgent | In Progress | 2014-02-12 | 2014-02-12 | Atlas | Fair amount of "file not found" srm-atlas.gridpp.rl.ac.uk
101079 | Green | Urgent | In Progress | 2014-02-09 | 2014-02-10 | | ARC CEs have VOViews with a default SE of "0"
101068 | Green | Less Urgent | In Progress | 2014-02-07 | 2014-02-10 | CMS | [sr #141938] fts problem
101052 | Green | Urgent | In Progress | 2014-02-06 | 2014-02-11 | Biomed | Can't retrieve job result file from cream-ce02.gridpp.rl.ac.uk
101015 | Green | Less Urgent | In Progress | 2014-02-05 | 2014-02-06 | CMS | [sr #141890] Failed PhEDEx transfers between T3_US_Minnesota and T1_UK_RAL_Buffer
100887 | Green | Less Urgent | In Progress | 2014-01-31 | 2014-02-07 | | Please update gridsite on WebDAV LFC
100343 | Red | Less Urgent | In Progress | 2014-01-16 | 2014-02-12 | | RAL WMS still generating 512 proxies
100114 | Red | Waiting Reply | On Hold | 2014-01-08 | 2014-02-11 | | Jobs failing to get from RAL WMS to Imperial
99556 | Red | Very Urgent | In Progress | 2013-12-06 | 2014-01-30 | | NGI Argus requests for NGI_UK
98249 | Red | Urgent | On Hold | 2013-10-21 | 2014-01-29 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1
97025 | Red | Less urgent | On Hold | 2013-09-03 | 2014-02-05 | | Myproxy server certificate does not contain hostname
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
05/02/14 | 100 | 100 | 96.8 | 96.1 | 91.6 | Various SRM test failures.
06/02/14 | 100 | 100 | 100 | 96.1 | 100 | Single SRM test failure (Error reading token data header)
07/02/14 | 100 | 100 | 100 | 100 | 95.9 | Single SRM test failure (User timeout)
08/02/14 | 100 | 100 | 100 | 100 | 95.8 | Single SRM test failure (SRM_FILE_BUSY)
09/02/14 | 100 | 100 | 100 | 100 | 95.8 | Single SRM test failure (SRM_FILE_BUSY)
10/02/14 | 100 | 100 | 99.5 | 88.4 | 95.7 | Various SRM test failures.
11/02/14 | 100 | 100 | 100 | 100 | 91.7 | 2 SRM test failures (both with SRM_FILE_BUSY)
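The daily availability figures above can be summarised per VO with a short script. This is a minimal sketch and not part of any Tier1 tooling: the percentages are transcribed by hand from the table, and the names `AVAILABILITY`, `weekly_mean` and `worst_day` are illustrative assumptions.

```python
from statistics import mean

# Daily availability (%) per VO, transcribed from the table above.
# Keys are the dates as written in the report (DD/MM/YY).
AVAILABILITY = {
    "05/02/14": {"OPS": 100, "Alice": 100, "Atlas": 96.8, "CMS": 96.1, "LHCb": 91.6},
    "06/02/14": {"OPS": 100, "Alice": 100, "Atlas": 100, "CMS": 96.1, "LHCb": 100},
    "07/02/14": {"OPS": 100, "Alice": 100, "Atlas": 100, "CMS": 100, "LHCb": 95.9},
    "08/02/14": {"OPS": 100, "Alice": 100, "Atlas": 100, "CMS": 100, "LHCb": 95.8},
    "09/02/14": {"OPS": 100, "Alice": 100, "Atlas": 100, "CMS": 100, "LHCb": 95.8},
    "10/02/14": {"OPS": 100, "Alice": 100, "Atlas": 99.5, "CMS": 88.4, "LHCb": 95.7},
    "11/02/14": {"OPS": 100, "Alice": 100, "Atlas": 100, "CMS": 100, "LHCb": 91.7},
}


def weekly_mean(data):
    """Return the unweighted mean availability per VO, rounded to 0.1%."""
    vos = list(next(iter(data.values())))
    return {vo: round(mean(day[vo] for day in data.values()), 1) for vo in vos}


def worst_day(data, vo):
    """Return the date with the lowest availability for the given VO.

    If several days tie, the earliest one listed is returned.
    """
    return min(data, key=lambda day: data[day][vo])


if __name__ == "__main__":
    for vo, value in weekly_mean(AVAILABILITY).items():
        print(f"{vo}: {value}% (worst day: {worst_day(AVAILABILITY, vo)})")
```

The mean here is a simple average of the seven daily values, so it may differ slightly from any officially published weekly figure that weights days differently.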