Revision as of 09:11, 2 April 2014
RAL Tier1 Operations Report for 2nd April 2014
Review of Issues during the fortnight 19th March to 2nd April 2014.
- On Wednesday early evening (12th March) there was a failure of the primary link to CERN between 17:00 and 19:00. Traffic flowed over the backup link. However, the failover was not clean and during this time we were failing the VO SUM tests.
- There was a problem with one of the FTS2 agent systems in the early hours of Thursday 13th March. Owing to a configuration error the hypervisor hosting this virtual machine rebooted and this particular system was not configured to re-start. This was resolved by the primary on-call.
Resolved Disk Server Issues
Current operational status and issues
- There have been problems with the CMS Castor instance throughout the last week. These are triggered by high load on CMS_Tape, with all the disk servers that provide the cache for this service class running flat out (as far as network connectivity goes). Work is underway to increase the throughput of this disk cache.
- The Castor Team are now able to reproduce the intermittent failures of Castor access via the SRM that have been reported in recent weeks. Understanding of the problem is significantly advanced and further investigations are ongoing using the Castor Preprod instance. Ideas for a workaround are being developed.
- The problem of full Castor disk space for Atlas has been eased. Working with Atlas, we have somewhat improved the file deletion rate. However, there is still a problem that needs to be understood.
- Around 50 files in tape backed service classes (mainly in GEN) have been found not to have migrated to tape. This is under investigation. The cause for some of these is understood (a bad tape at time of migration). CERN will provide a script to re-send the remaining ones to tape.
Ongoing Disk Server Issues
Notable Changes made this last week.
- The move of the Tier1 to use the new site firewall took place on Monday 17th March between 07:00 and 07:30. FTS (2 & 3) services were drained and stopped during the change. The batch system was also reconfigured such that new batch jobs would not start during this period. The change was successful. There was a routing problem that affected the LFC in particular and external access from many worker nodes, but that was fixed in around an hour.
- One batch of WNs was updated to the EMI-3 WN version a week ago. So far so good.
- The EMI-3 Argus server is in use for most of the CEs and one batch of worker nodes.
- The planned and announced UPS/Generator load test scheduled for this morning (19th March) was cancelled.
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
  - Castor 2.1.14 testing is largely complete. We are starting to look at possible dates for rolling this out (probably around April).
- Networking:
  - Update core Tier1 network and change connection to site and OPN including:
    - Install new Routing layer for Tier1 & change the way the Tier1 connects to the RAL network.
    - These changes will lead to the removal of the UKLight Router.
- Fabric:
  - We are phasing out the use of the software server used by the small VOs.
  - Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC)
  - There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 19th March and 2nd April 2014.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole Site | SCHEDULED | WARNING | 17/03/2014 07:00 | 17/03/2014 17:00 | 10 hours | Site At Risk during and following change to use new firewall.
lcgfts.gridpp.rl.ac.uk, lcgfts3.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 17/03/2014 06:00 | 17/03/2014 09:00 | 3 hours | Drain and stop of FTS services during update to new site firewall.
srm-cms.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 14/03/2014 09:40 | 14/03/2014 10:26 | 46 minutes | Problem with CMS Castor instance being investigated.
srm-cms.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 14/03/2014 04:15 | 14/03/2014 07:15 | 3 hours | Currently investigating problems with Oracle DB behind Castor CMS
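The Duration column of the GOC DB entries above can be cross-checked against the Start and End timestamps. The following is a minimal sketch, not part of the original report: the helper names `duration_minutes` and `describe` are hypothetical, and the timestamps are assumed to be day/month/year in a single time zone.

```python
from datetime import datetime

# Start/end timestamps copied from the GOC DB entries above.
# Assumption: dates are day/month/year and all times share one time zone.
ENTRIES = [
    ("Whole Site", "17/03/2014 07:00", "17/03/2014 17:00"),
    ("lcgfts.gridpp.rl.ac.uk, lcgfts3.gridpp.rl.ac.uk", "17/03/2014 06:00", "17/03/2014 09:00"),
    ("srm-cms.gridpp.rl.ac.uk", "14/03/2014 09:40", "14/03/2014 10:26"),
    ("srm-cms.gridpp.rl.ac.uk", "14/03/2014 04:15", "14/03/2014 07:15"),
]

FMT = "%d/%m/%Y %H:%M"


def duration_minutes(start, end):
    """Return the length of a downtime in whole minutes (hypothetical helper)."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


def describe(minutes):
    """Format a minute count in the style of the Duration column."""
    hours, mins = divmod(minutes, 60)
    if mins == 0:
        return f"{hours} hours"
    if hours == 0:
        return f"{mins} minutes"
    return f"{hours} hours {mins} minutes"


durations = [duration_minutes(start, end) for _, start, end in ENTRIES]

for (service, _, _), minutes in zip(ENTRIES, durations):
    print(f"{service}: {describe(minutes)}")
```

Run as written, this prints one line per entry with the recomputed duration, which can be compared by eye with the Duration column.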
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
101968 | Green | Less Urgent | On Hold | 2014-03-11 | 2014-03-12 | Atlas | RAL-LCG2_SCRATCHDISK: One dataset to delete is causing 1379 deletion errors
101079 | Red | Urgent | In Progress | 2014-02-09 | 2014-03-17 | | ARC CEs have VOViews with a default SE of "0"
101052 | Red | Urgent | In Progress | 2014-02-06 | 2014-03-17 | Biomed | Can't retrieve job result file from cream-ce02.gridpp.rl.ac.uk
99556 | Red | Very Urgent | In Progress | 2013-12-06 | 2014-03-06 | | NGI Argus requests for NGI_UK
98249 | Red | Urgent | Waiting Reply | 2013-10-21 | 2014-03-13 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1
97025 | Red | Less Urgent | On Hold | 2013-09-03 | 2014-03-04 | | Myproxy server certificate does not contain hostname
Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment
19/03/14 | 100 | 100 | 100 | 88.6 | 100 | 99 | 73 | Multiple SRM test failures (load problems).
20/03/14 | 100 | 100 | 99.7 | 99.6 | 100 | 100 | n/a | Atlas: One SRM test failure; CMS: CE test failures on all 3 ARC CEs (no compatible resources).
21/03/14 | 100 | 100 | 100 | 100 | 100 | 100 | n/a |
22/03/14 | 100 | 100 | 100 | 100 | 100 | 100 | n/a |
23/03/14 | 100 | 100 | 100 | 100 | 100 | 100 | n/a |
24/03/14 | 100 | 100 | 100 | 100 | 100 | 100 | n/a |
25/03/14 | 100 | 100 | 99.0 | 89.8 | 100 | 98 | 99 | Atlas: Castor database problem (Atlas_srm DB moved to another RAC node following a DB crash); CMS: SRM SUM test failures spread through the day.
26/03/14 | 100 | 100 | 100 | 87.1 | 100 | 100 | 99 | Four separate SRM test failures.
27/03/14 | 100 | 100 | 100 | 96.5 | 100 | 97 | 100 | Two failures of the SRM Put test.
28/03/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
29/03/14 | 100 | 100 | 100 | 100 | 100 | 99 | 100 |
30/03/14 | 100 | 100 | 100 | 100 | 100 | 100 | 99 |
31/03/14 | 100 | 100 | 100 | 100 | 100 | 100 | 99 |
01/04/14 | 100 | 100 | 100 | 100 | 100 | 100 | 99 |
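As a rough summary of the daily figures above, the CMS column can be averaged over the fortnight and the days below 100% listed. This is an illustrative sketch only: the unweighted mean of daily percentages is an assumption made here, not necessarily how the official availability figures are aggregated.

```python
# Daily CMS SUM availability (%) for 19/03/14 to 01/04/14, copied from the table above.
CMS = {
    "19/03/14": 88.6, "20/03/14": 99.6, "21/03/14": 100, "22/03/14": 100,
    "23/03/14": 100, "24/03/14": 100, "25/03/14": 89.8, "26/03/14": 87.1,
    "27/03/14": 96.5, "28/03/14": 100, "29/03/14": 100, "30/03/14": 100,
    "31/03/14": 100, "01/04/14": 100,
}

# Assumption: a plain unweighted mean of the daily values.
mean_cms = sum(CMS.values()) / len(CMS)

# Days on which CMS availability dropped below 100%, in table order.
days_below = [day for day, value in CMS.items() if value < 100]

print(f"Mean CMS availability: {mean_cms:.1f}%")
print("Days below 100%:", ", ".join(days_below))
```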