Difference between revisions of "Tier1 Operations Report 2016-10-26"
Latest revision as of 13:14, 26 October 2016
RAL Tier1 Operations Report for 26th October 2016
Review of Issues during the week 19th to 26th October 2016.
- The main issue this week has been the security announcement CVE-2016-5195. In response a number of services were stopped. In essence we stopped the batch system on Monday (24th Oct). Storage (Castor) was able to continue running. At the time of the meeting we are testing the patch.
- There was a recurrence of the problem with the squids on Wednesday evening (19th Oct). This had a knock-on effect on CVMFS clients on the batch worker nodes and for some hours reduced the number of worker nodes available to start jobs.
Resolved Disk Server Issues
- GDSS648 (LHCbUser - D1T0) failed on Saturday evening (15th Oct). It showed both disk and network problems and was finally returned to service on Tuesday 25th Oct.
- GDSS699 (LHCbDst - D1T0) failed on Saturday (22nd Oct). It was returned to service read-only later that day. On Monday (24th) it was taken down for further investigation and was put back in service the following day.
Current operational status and issues
- LHCb sees a low but persistent rate of failure when copying the results of batch jobs to Castor. A further problem sometimes occurs when these failed writes are then attempted to storage at other sites.
- The intermittent, low-level, load-related packet loss that has been seen over external connections is still being tracked. The replacement of the UKLight router appears to have reduced this, but we are allowing more time to pass before drawing any conclusions.
Ongoing Disk Server Issues
- GDSS896 (CMSTape - D0T1) was taken out of service yesterday (25th Oct) to investigate memory errors.
Notable Changes made since the last meeting.
- Ongoing security updates.
- While some services (mainly batch-related) were down, some more VMs were migrated to the Windows 2012 Hyper-V infrastructure.
Declared in the GOC DB
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
All Castor Tape. | SCHEDULED | WARNING | 01/11/2016 07:00 | 01/11/2016 16:00 | 9 hours | Tape Library not available during work on the mechanics. Tape access for read will stop. Writes will be buffered on disk and flushed to tape after the work has completed. |
arc-ce01, arc-ce02, arc-ce03, arc-ce04, lcgvo07, lcgvo08, lcgwms04, lcgwms05 | UNSCHEDULED | OUTAGE | 24/10/2016 15:32 | 28/10/2016 13:00 | 3 days, 21 hours and 28 minutes | EGI-SVG-CVE-2016-5195, vulnerability handling in progress |
arc-ce01, gridftp.echo.stfc.ac.uk, ip6tb-ps01, ip6tb-ps01, lcgps01, lcgps02, s3.echo.stfc.ac.uk, vacuum.gridpp.rl.ac.uk, xrootd.echo.stfc.ac.uk | UNSCHEDULED | OUTAGE | 24/10/2016 15:00 | 28/10/2016 12:00 | 3 days, 21 hours | EGI-SVG-CVE-2016-5195, vulnerability handling in progress |
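
The Duration column above can be cross-checked from the Start and End columns. The following is a minimal sketch, assuming the timestamps are day/month/year with a 24-hour clock as printed in the table; the helper names `downtime_duration` and `describe` are hypothetical and are not part of any GOC DB tooling.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout, matching the Start/End columns (e.g. "24/10/2016 15:32").
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> timedelta:
    """Return the elapsed time between two table timestamps."""
    return datetime.strptime(end, GOCDB_FORMAT) - datetime.strptime(start, GOCDB_FORMAT)


def describe(delta: timedelta) -> str:
    """Format a duration as days, hours and minutes."""
    hours, remainder = divmod(delta.seconds, 3600)
    minutes = remainder // 60
    return f"{delta.days} days, {hours} hours, {minutes} minutes"


# The three downtimes listed in the table above.
rows = [
    ("01/11/2016 07:00", "01/11/2016 16:00"),
    ("24/10/2016 15:32", "28/10/2016 13:00"),
    ("24/10/2016 15:00", "28/10/2016 12:00"),
]
for start, end in rows:
    print(start, "->", end, ":", describe(downtime_duration(start, end)))
```

The sketch treats the timestamps as naive local times, so it does not account for a clock change falling inside a downtime window.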
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Listing by category:
- Castor:
- Update to Castor version 2.1.15. Planning to roll out January 2017. (Proposed dates: 10th Jan: Nameserver; 17th Jan: First stager (LHCb); 24th Jan: Stager (Atlas); 26th Jan: Stager (GEN); 31st Jan: Final stager (CMS)).
- Update SRMs to new version, including updating to SL6. This will be done after the Castor 2.1.15 update.
- Migration of LHCb data from T10KC to T10KD tapes. The additional 'D' tape drives have now been installed. Plan to start migration after next week's intervention on the tape libraries.
- Fabric
- Firmware updates on older disk servers.
Entries in GOC DB starting since the last report.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
arc-ce01, arc-ce02, arc-ce03, arc-ce04, lcgvo07, lcgvo08, lcgwms04, lcgwms05 | UNSCHEDULED | OUTAGE | 24/10/2016 15:32 | 28/10/2016 13:00 | 3 days, 21 hours and 28 minutes | EGI-SVG-CVE-2016-5195, vulnerability handling in progress |
arc-ce01, gridftp.echo.stfc.ac.uk, ip6tb-ps01, ip6tb-ps01, lcgps01, lcgps02, s3.echo.stfc.ac.uk, vacuum.gridpp.rl.ac.uk, xrootd.echo.stfc.ac.uk | UNSCHEDULED | OUTAGE | 24/10/2016 15:00 | 28/10/2016 12:00 | 3 days, 21 hours | EGI-SVG-CVE-2016-5195, vulnerability handling in progress |
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
124606 | Green | Urgent | In Progress | 2016-10-24 | 2016-10-24 | CMS | Consistency Check for T1_UK_RAL |
124478 | Green | Urgent | In Progress | 2016-10-17 | 2016-10-25 | | Jobs submitted via RAL WMS stuck in state READY forever and ever and ever |
124244 | Green | Very Urgent | In Progress | 2016-10-05 | 2016-10-05 | LHCb | Jobs can not connect to sqlDB at CVMFS at RAL-LCG2 |
123504 | Yellow | Less Urgent | Waiting for Reply | 2016-08-19 | 2016-09-20 | T2K | proxy expiration |
122827 | Green | Less Urgent | Waiting for Reply | 2016-07-12 | 2016-10-11 | SNO+ | Disk area at RAL |
121687 | Red | Less Urgent | In Progress | 2016-05-20 | 2016-10-10 | | packet loss problems seen on RAL-LCG perfsonar |
120350 | Yellow | Less Urgent | On Hold | 2016-03-22 | 2016-08-09 | LSST | Enable LSST at RAL |
117683 | Amber | Less Urgent | On Hold | 2015-11-18 | 2016-10-05 | | CASTOR at RAL not publishing GLUE 2 (Updated. There are ongoing discussions with GLUE & WLCG) |
Availability Report
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 729); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment |
---|---|---|---|---|---|---|---|---|
19/10/16 | 100 | 100 | 100 | 98 | 100 | N/A | N/A | Single SRM test failure because of a user timeout error |
20/10/16 | 100 | 100 | 100 | 98 | 100 | N/A | N/A | Single SRM test failure because of a user timeout error |
21/10/16 | 100 | 100 | 100 | 98 | 100 | N/A | N/A | Single SRM test failure because of a user timeout error |
22/10/16 | 100 | 100 | 100 | 100 | 100 | N/A | 100 | |
23/10/16 | 100 | 100 | 100 | 100 | 100 | N/A | N/A | |
24/10/16 | 0 | 100 | 100 | 63 | 100 | N/A | N/A | Systems (especially batch) down for CVE-2016-5195 |
25/10/16 | 0 | 57 | 56 | 0 | 57 | N/A | N/A | Systems (especially batch) down for CVE-2016-5195 |