Difference between revisions of "Tier1 Operations Report 2017-05-24"
From GridPP Wiki
Revision as of 07:06, 24 May 2017
RAL Tier1 Operations Report for 24th May 2017
Review of Issues during the week 17th to 24th May 2017.
- Following the failure of the UPS in building R89 on Friday 28th April, a replacement UPS was installed at the end of last week. It was brought into use on Tuesday 16th May, and a UPS/generator load test was then carried out successfully.
- There was a significant problem with the CMS Castor instance over the weekend that severely affected availabilities. Space was only available on a small number of disk servers and these became heavily overloaded.
- Atlas and CMS were affected for a couple of hours yesterday when Castor reported disk pools as full. An update had unexpectedly caused processes on disk servers to restart, and this had a knock-on effect.
- There have been some problems with the ECHO CEPH xrootd gateways. An xrootd proxy cache has been installed on these gateways and has resolved the issue; however, the root cause is still being investigated.
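For illustration only, an xrootd proxy cache of the kind mentioned above is typically enabled with configuration directives along the following lines. This is a minimal sketch, not the actual RAL gateway configuration; the hostname, port, and paths are placeholders.

```
# Illustrative xrootd proxy-cache (XCache-style) configuration.
# Hostname, port, and paths below are placeholders, not the RAL setup.
ofs.osslib   libXrdPss.so               # run this xrootd instance as a proxy
pss.origin   ceph-gw.example.org:1094   # origin data server behind the proxy
pss.cachelib libXrdFileCache.so         # enable the disk-based file cache
oss.localroot /var/xcache               # local directory holding cached data
all.export   /atlas                     # namespace exported through the proxy
```

With a cache like this in place, repeated reads of the same files are served from local disk on the gateway rather than being fetched from the backend storage each time.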
Resolved Disk Server Issues
- GDSS744 (AtlasDataDisk - D1T0) crashed on Monday morning (15th May). Two disk drives were replaced and it was returned to service (initially read-only) at the end of Tuesday afternoon (16th).
Current operational status and issues
- We are still seeing failures of the CMS SAM tests against the SRM. These are affecting our CMS availabilities. CMS are also looking at file access performance and have turned off "lazy-download". This will be re-addressed once we have upgraded to Castor 2.1.16.
- LHCb Castor performance. I have left this item in place although there has been a Castor update for LHCb and testing has been carried out this week.
Ongoing Disk Server Issues
- None
Limits on concurrent batch system jobs.
- CMS Multicore 550
Notable Changes made since the last meeting.
- New UPS installed in R89 and tested.
- LHCb Castor instance updated to Castor version 2.1.16-13.
- Edinburgh Dirac site now moving 'production' files to Castor.
- Support for the following VOs removed from batch (as no longer supported by GridPP): "hone", "fusion", "superbvo.org"
Declared in the GOC DB
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
srm-cms-disk.gridpp.rl.ac.uk, srm-cms.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 25/05/2017 10:00 | 25/05/2017 16:00 | 6 hours | Upgrade of CMS Castor instance to version 2.1.16. |
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Pending - but not yet formally announced:
- Update Castor (including SRMs) to version 2.1.16. Central nameserver and LHCb stager done. Others to follow.
- Update Castor SRMs - CMS & GEN still to do. Problems seen with the SRM update mean these will wait until Castor 2.1.16 is rolled out.
Listing by category:
- Castor:
- Update SRMs to new version, including updating to SL6.
- Update Castor to version 2.1.16 (ongoing)
- Merge AtlasScratchDisk into larger Atlas disk pool.
- Networking
- Increase OPN link to CERN from 2*10Gbit to 3*10Gbit links.
- Enable first services on production network with IPv6 now that the addressing scheme has been agreed. (Perfsonar already working over IPv6).
- Services
- Put argus systems behind a load balancer to improve resilience.
- The production FTS needs updating; the new version will no longer support the SOAP interface. (The "test" FTS, used by Atlas, has already been upgraded.)
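Placing services behind a load balancer, as proposed for the argus systems above, can be sketched with a generic TCP load-balancer configuration. The fragment below is an assumption-laden illustration using HAProxy: the hostnames are invented, and the port shown (8154, the conventional Argus PEP daemon port) is assumed rather than taken from the RAL setup.

```
# Illustrative HAProxy fragment: TCP load balancing across two Argus
# authorization servers for resilience. Hostnames are placeholders;
# port 8154 (conventional Argus PEPd port) is an assumption.
frontend argus_front
    bind *:8154
    mode tcp
    default_backend argus_pool

backend argus_pool
    mode tcp
    balance roundrobin
    server argus01 argus01.example.org:8154 check
    server argus02 argus02.example.org:8154 check
```

With health checks enabled (`check`), a failed Argus node is taken out of rotation automatically, which is the resilience gain the item above is after.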
Entries in GOC DB starting since the last report.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
srm-atlas.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 23/05/2017 10:00 | 23/05/2017 11:34 | 1 hour and 34 minutes | Upgrade of Atlas Castor instance to version 2.1.16. |
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
128350 | Green | Less Urgent | In Progress | 2017-05-16 | 2017-05-16 | Atlas | RAL-LCG2_DATADISK: transfer errors |
128308 | Green | Urgent | In Progress | 2017-05-14 | 2017-05-15 | CMS | Description: T1_UK_RAL in error for about 6 hours |
128180 | Green | Urgent | In Progress | 2017-05-05 | 2017-05-08 | | WLCG-IPv6 Tier-1 readiness |
127968 | Green | Less Urgent | In Progress | 2017-04-27 | 2017-04-27 | MICE | RAL castor: not able to list directories and copy to |
127967 | Green | Less Urgent | On Hold | 2017-04-27 | 2017-04-28 | MICE | Enabling pilot role for mice VO at RAL-LCG2 |
127612 | Red | Alarm | In Progress | 2017-04-08 | 2017-05-09 | LHCb | CEs at RAL not responding |
127598 | Green | Urgent | In Progress | 2017-04-07 | 2017-05-12 | CMS | UK XRootD Redirector |
127597 | Yellow | Urgent | Waiting for Reply | 2017-04-07 | 2017-05-16 | CMS | Check networking and xrootd RAL-CERN performance |
127240 | Amber | Urgent | Waiting for Reply | 2017-03-21 | 2017-05-15 | CMS | Staging Test at UK_RAL for Run2 |
124876 | Red | Less Urgent | On Hold | 2016-11-07 | 2017-01-01 | OPS | [Rod Dashboard] Issue detected : hr.srce.GridFTP-Transfer-ops@gridftp.echo.stfc.ac.uk |
117683 | Red | Less Urgent | On Hold | 2015-11-18 | 2017-05-10 | | CASTOR at RAL not publishing GLUE 2. |
Availability Report
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 845); Atlas HC ECHO = Atlas ECHO (Template 841); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas ECHO | Atlas HC | Atlas HC ECHO | CMS HC | Comment |
---|---|---|---|---|---|---|---|---|---|---|
17/05/17 | 100 | 100 | 100 | 91 | 100 | 100 | 96 | 98 | 100 | Intermittent SRM test failures. (User timeout) |
18/05/17 | 100 | 100 | 98 | 79 | 100 | 100 | 94 | 100 | 100 | Atlas: One SRM test failure; CMS: Intermittent SRM test failures. (timeout) |
19/05/17 | 100 | 100 | 100 | 78 | 100 | 100 | 100 | 100 | 100 | Intermittent SRM test failures. (User) |
20/05/17 | 100 | 100 | 100 | 83 | 100 | 100 | 95 | 100 | 100 | Intermittent SRM test failures. (User) |
21/05/17 | 100 | 100 | 100 | 80 | 100 | 100 | 100 | 100 | 100 | Intermittent SRM test failures. (User) |
22/05/17 | 100 | 100 | 100 | 83 | 100 | 100 | 100 | 100 | 100 | Intermittent SRM test failures. (User) |
23/05/17 | 100 | 100 | 92 | 96 | 100 | 100 | 100 | 100 | 100 | Atlas Castor 2.1.16 update; CMS: Intermittent SRM test failures. (timeout) |
Notes from Meeting.
- None yet