Latest revision as of 13:08, 17 May 2017
RAL Tier1 Operations Report for 17th May 2017
Review of Issues during the week 10th to 17th May 2017.
- Following the failure of the UPS in building R89 on Friday 28th April, a replacement UPS was installed at the end of last week. It was brought into use yesterday (16th May), and a UPS/generator load test was successfully carried out this morning.
- There was a significant problem with the CMS Castor instance over the weekend that severely affected availabilities. Space was only available on a small number of disk servers and these became heavily overloaded.
- Atlas and CMS were affected for a couple of hours yesterday when Castor reported disk pools as full. An update had unexpectedly caused processes on the disk servers to restart, and this had a knock-on effect.
- There have been some problems with the ECHO CEPH xrootd gateways. An xrootd proxy cache has been installed on these gateways, which has resolved the issue. However, the root cause is still being investigated.
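A gateway-side xrootd proxy cache of the kind described above is typically configured with the XRootD proxy storage service (XCache) directives. The fragment below is a minimal illustrative sketch, not the actual RAL configuration; the hostname, export path, and cache directory are placeholders:

```
# /etc/xrootd/xrootd-proxy.cfg -- illustrative sketch only.
# Hostnames and paths are placeholders, not the RAL production values.
all.export /atlas

# Run this xrootd instance as a proxy in front of the Ceph-backed origin.
ofs.osslib libXrdPss.so
pss.origin ceph-gw.example.ac.uk:1094

# Enable the disk caching proxy (XCache) with a local cache directory,
# so repeated reads are served from the gateway rather than the origin.
pss.cachelib libXrdFileCache.so
oss.localroot /var/cache/xrootd
```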
Resolved Disk Server Issues
- GDSS744 (AtlasDataDisk - D1T0) crashed on Monday morning (15th May). Two disk drives were replaced and it was returned to service (initially read-only) at the end of Tuesday afternoon (16th).
Current operational status and issues
- We are still seeing failures of the CMS SAM tests against the SRM. These are affecting our CMS availabilities. CMS are also looking at file access performance and have turned off "lazy-download". This will be re-addressed once we have upgraded to Castor 2.1.16.
- LHCb Castor performance. I have left this item in place although there has now been a Castor update for LHCb, and testing has been carried out this week.
Ongoing Disk Server Issues
- None
Limits on concurrent batch system jobs.
- CMS Multicore 550
Notable Changes made since the last meeting.
- New UPS installed in R89 and tested.
- LHCb Castor instance updated to Castor version 2.1.16-13.
- Edinburgh Dirac site now moving 'production' files to Castor.
- Support for the following VOs removed from batch (as no longer supported by GridPP): "hone" "fusion" "superbvo.org"
Declared in the GOC DB
None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Pending - but not yet formally announced:
- Update Castor (including SRMs) to version 2.1.16. Central nameserver done. Current plan: LHCb stager on Thursday 11th May. Others to follow.
- Update Castor SRMs - CMS & GEN still to do. Problems seen with the SRM update mean these will wait until Castor 2.1.16 is rolled out.
Listing by category:
- Castor:
- Update SRMs to new version, including updating to SL6.
- Update Castor to version 2.1.16 (ongoing)
- Merge AtlasScratchDisk into larger Atlas disk pool.
- Networking
- Increase OPN link to CERN from 2*10Gbit to 3*10Gbit links.
- Enable first services on production network with IPv6 now that the addressing scheme has been agreed. (Perfsonar already working over IPv6).
- Services
- Put argus systems behind a load balancer to improve resilience.
- The production FTS needs updating; the new version will no longer support the SOAP interface. (The "test" FTS, used by Atlas, has already been upgraded.)
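The Argus resilience item above would typically be realised by placing a TCP load balancer in front of the Argus PEP daemons. The HAProxy fragment below is an illustrative sketch only: the hostnames are placeholders, and port 8154 is assumed here as the usual Argus PEP daemon port rather than taken from the RAL setup:

```
# haproxy.cfg fragment -- illustrative sketch; hosts are placeholders.
frontend argus_pep
    bind *:8154
    mode tcp
    default_backend argus_pep_nodes

backend argus_pep_nodes
    mode tcp
    balance roundrobin
    # 'check' removes a node from rotation if its PEP daemon stops responding,
    # giving the resilience the item above is after.
    server argus01 argus01.example.ac.uk:8154 check
    server argus02 argus02.example.ac.uk:8154 check
```

TCP-mode balancing keeps the TLS session end-to-end between client and Argus node, so no certificates need to live on the balancer itself.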
Entries in GOC DB starting since the last report.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
Whole site | UNSCHEDULED | WARNING | 16/05/2017 14:00 | 16/05/2017 17:00 | 3 hours | Emergency warning while bringing UPS supply back online. |
srm-lhcb.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 11/05/2017 10:00 | 11/05/2017 12:50 | 2 hours and 50 minutes | Downtime while upgrading LHCb Castor instance to 2.1.16 |
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
128350 | Green | Less Urgent | In Progress | 2017-05-16 | 2017-05-16 | Atlas | RAL-LCG2_DATADISK: transfer errors |
128308 | Green | Urgent | In Progress | 2017-05-14 | 2017-05-15 | CMS | Description: T1_UK_RAL in error for about 6 hours |
128180 | Green | Urgent | In Progress | 2017-05-05 | 2017-05-08 | | WLCG-IPv6 Tier-1 readiness |
127968 | Green | Less Urgent | In Progress | 2017-04-27 | 2017-04-27 | MICE | RAL castor: not able to list directories and copy to |
127967 | Green | Less Urgent | On Hold | 2017-04-27 | 2017-04-28 | MICE | Enabling pilot role for mice VO at RAL-LCG2 |
127612 | Red | Alarm | In Progress | 2017-04-08 | 2017-05-09 | LHCb | CEs at RAL not responding |
127598 | Green | Urgent | In Progress | 2017-04-07 | 2017-05-12 | CMS | UK XRootD Redirector |
127597 | Yellow | Urgent | Waiting for Reply | 2017-04-07 | 2017-05-16 | CMS | Check networking and xrootd RAL-CERN performance |
127240 | Amber | Urgent | Waiting for Reply | 2017-03-21 | 2017-05-15 | CMS | Staging Test at UK_RAL for Run2 |
124876 | Red | Less Urgent | On Hold | 2016-11-07 | 2017-01-01 | OPS | [Rod Dashboard] Issue detected : hr.srce.GridFTP-Transfer-ops@gridftp.echo.stfc.ac.uk |
117683 | Red | Less Urgent | On Hold | 2015-11-18 | 2017-05-10 | | CASTOR at RAL not publishing GLUE 2. |
Availability Report
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 845); Atlas HC ECHO = Atlas ECHO (Template 841); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas ECHO | Atlas HC | Atlas HC ECHO | CMS HC | Comment |
---|---|---|---|---|---|---|---|---|---|---|
10/05/17 | 100 | 100 | 85 | 100 | 100 | 100 | 100 | 100 | 99 | CMS: Mainly CE tests failed owing to xroot redirection failing. Atlas: intermittent SRM test failures. |
11/05/17 | 100 | 100 | 96 | 83 | 87 | 100 | 100 | 100 | 96 | Atlas: Single SRM test failure; CMS: 83%, solid block of failures for CE & SRM tests; LHCb: Castor Stager 2.1.16 update. |
12/05/17 | 100 | 100 | 94 | 99 | 100 | 100 | 100 | 100 | 100 | Intermittent SRM test failures. |
13/05/17 | 100 | 100 | 94 | 39 | 100 | 100 | 100 | 100 | 100 | Atlas: Intermittent SRM test failures; CMS: SRM and CE tests failing with problems writing into Castor. |
14/05/17 | 100 | 100 | 90 | 35 | 100 | 100 | 98 | 100 | 100 | Atlas: Intermittent SRM test failures; CMS: SRM and CE tests failing with problems writing into Castor. |
15/05/17 | 100 | 100 | 90 | 99 | 100 | 100 | 100 | 100 | 100 | Intermittent SRM test failures. |
16/05/17 | 100 | 100 | 90 | 91 | 100 | 100 | 91 | 100 | 94 | Both Atlas and CMS had a block of failures reporting disk pool full. |
Notes from Meeting.
- Catalin reported on the ALICE T1/T2 workshop. There is a request for some space in ECHO to be allocated to ALICE so that they can test access.
- Brian reported a problem with putting files into Castor from the Durham Dirac site.