Difference between revisions of "Tier1 Operations Report 2017-11-29"
From GridPP Wiki
Latest revision as of 15:36, 12 December 2017
RAL Tier1 Operations Report for 29th November 2017
Review of Issues during the week 23rd to 29th November 2017. |
- IPv6 issues have now been resolved. [Tier1] Unit 2 is the master for IPv6, but there are no physical connections to that router from the switch core, so the fail-over did not complete successfully. Once this was understood, the problem was resolved on 23/11/17.
Current operational status and issues |
- Certificate deployment issues with UKeScience 2B ICA 1.88-1 and SL6. Possible SHA-1/SHA-2 incompatibility.
Resolved Disk Server Issues |
Ongoing Disk Server Issues |
- GDSS818: RAID partially degraded, faulty drive on port 1.
Limits on concurrent batch system jobs. |
- CMS Multicore 550
Notable Changes made since the last meeting. |
- None.
Entries in GOC DB starting since the last report. |
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
lcgfts3.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 21/11/2017 07:00 | 21/11/2017 15:00 | 8 hours | MySQL backend consolidation |
srm-alice.gridpp.rl.ac.uk, srm-atlas.gridpp.rl.ac.uk, srm-biomed.gridpp.rl.ac.uk, srm-cert.gridpp.rl.ac.uk, srm-cms-disk.gridpp.rl.ac.uk, srm-cms.gridpp.rl.ac.uk, srm-dteam.gridpp.rl.ac.uk, srm-ilc.gridpp.rl.ac.uk, srm-lhcb.gridpp.rl.ac.uk, srm-lhcb.gridpp.rl.ac.uk, srm-mice.gridpp.rl.ac.uk, srm-minos.gridpp.rl.ac.uk, srm-na62.gridpp.rl.ac.uk, srm-pheno.gridpp.rl.ac.uk, srm-preprod.gridpp.rl.ac.uk, srm-snoplus.gridpp.rl.ac.uk, srm-solid.gridpp.rl.ac.uk, srm-t2k.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 20/11/2017 15:00 | 20/11/2017 18:01 | 3 hours and 1 minute | RAL Tier1 power outage update: all systems now back with the exception of CASTOR. |
arc-ce01.gridpp.rl.ac.uk, arc-ce01.gridpp.rl.ac.uk, arc-ce01.gridpp.rl.ac.uk, arc-ce02.gridpp.rl.ac.uk, arc-ce02.gridpp.rl.ac.uk, arc-ce02.gridpp.rl.ac.uk, arc-ce03.gridpp.rl.ac.uk, arc-ce03.gridpp.rl.ac.uk, arc-ce03.gridpp.rl.ac.uk, arc-ce04.gridpp.rl.ac.uk, arc-ce04.gridpp.rl.ac.uk, arc-ce04.gridpp.rl.ac.uk, argusngi.gridpp.rl.ac.uk, atlas-squid.gridpp.rl.ac.uk, cms-squid.gridpp.rl.ac.uk, gridftp.echo.stfc.ac.uk, ip6tb-ps01.gridpp.rl.ac.uk, ip6tb-ps01.gridpp.rl.ac.uk, lcgargus01.gridpp.rl.ac.uk, lcgbdii.gridpp.rl.ac.uk, lcgft-atlas.gridpp.rl.ac.uk, lcgfts3.gridpp.rl.ac.uk, lcglb01.gridpp.rl.ac.uk, lcglb02.gridpp.rl.ac.uk, lcgps01.gridpp.rl.ac.uk, lcgps02.gridpp.rl.ac.uk, lcgvo07.gridpp.rl.ac.uk, lcgvo08.gridpp.rl.ac.uk, lcgwms04.gridpp.rl.ac.uk, lcgwms05.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk, myproxy.gridpp.rl.ac.uk, openstack.stfc.ac.uk, s3.echo.stfc.ac.uk, s3.echo.stfc.ac.uk, site-bdii.gridpp.rl.ac.uk, srm-alice.gridpp.rl.ac.uk, srm-atlas.gridpp.rl.ac.uk, srm-biomed.gridpp.rl.ac.uk, srm-cert.gridpp.rl.ac.uk, srm-cms-disk.gridpp.rl.ac.uk, srm-cms.gridpp.rl.ac.uk, srm-dteam.gridpp.rl.ac.uk, srm-ilc.gridpp.rl.ac.uk, srm-lhcb.gridpp.rl.ac.uk, srm-lhcb.gridpp.rl.ac.uk, srm-mice.gridpp.rl.ac.uk, srm-minos.gridpp.rl.ac.uk, srm-na62.gridpp.rl.ac.uk, srm-pheno.gridpp.rl.ac.uk, srm-preprod.gridpp.rl.ac.uk, srm-snoplus.gridpp.rl.ac.uk, srm-solid.gridpp.rl.ac.uk, srm-t2k.gridpp.rl.ac.uk, vacuum.gridpp.rl.ac.uk, xrootd-cms-uk.gridpp.rl.ac.uk, xrootd.echo.stfc.ac.uk | UNSCHEDULED | OUTAGE | 20/11/2017 12:05 | 20/11/2017 15:00 | 2 hours and 55 minutes | Site power cut. Power restored, investigations ongoing. |
srm-lhcb.gridpp.rl.ac.uk, srm-lhcb.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 15/11/2017 10:00 | 15/11/2017 10:40 | 40 minutes | LHCb CASTOR SRM Update to 2.1.16-18 |
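
The Duration column can be cross-checked against the Start and End columns. The following is a minimal Python sketch that does so; the helper names `outage_minutes` and `format_duration` are illustrative and not part of any GOC DB tooling, and the timestamps are copied from the table above on the assumption that they are day/month/year with a 24-hour clock.

```python
from datetime import datetime

# Start/end pairs copied from the table above (assumed day/month/year, 24-hour clock).
ENTRIES = [
    ("21/11/2017 07:00", "21/11/2017 15:00"),
    ("20/11/2017 15:00", "20/11/2017 18:01"),
    ("20/11/2017 12:05", "20/11/2017 15:00"),
    ("15/11/2017 10:00", "15/11/2017 10:40"),
]

FMT = "%d/%m/%Y %H:%M"


def outage_minutes(start, end):
    """Return the length of an outage in whole minutes."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


def format_duration(minutes):
    """Render a minute count in an 'N hours and M minutes' style."""
    hours, mins = divmod(minutes, 60)
    parts = []
    if hours:
        parts.append(f"{hours} hour{'s' if hours != 1 else ''}")
    if mins:
        parts.append(f"{mins} minute{'s' if mins != 1 else ''}")
    return " and ".join(parts) or "0 minutes"


if __name__ == "__main__":
    for start, end in ENTRIES:
        print(f"{start} -> {end}: {format_duration(outage_minutes(start, end))}")
```

Summing `outage_minutes` over the two unscheduled entries gives the total unscheduled downtime declared for the power cut.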
Declared in the GOC DB |
- None
Advanced warning for other interventions |
The following items are being discussed and are still to be formally scheduled and announced. |
Ongoing or Pending - but not yet formally announced:
Listing by category:
- Castor:
- Update systems (initially tape servers) to use SL7 and configured by Quattor/Aquilon.
- Move to generic Castor headnodes.
- Echo:
- Update to the next Ceph version ("Luminous").
- Networking
- Extend the number of services on the production network with IPv6 dual stack. (Done for perfSONAR, FTS3, all squids and the CVMFS Stratum-1 servers.)
- Services
- Internal
- DNS servers will be rolled out within the Tier1 network.
Open GGUS Tickets (Snapshot during morning of meeting) |
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
132094 | Green | Top Priority | In Progress | 2017-11-28 | 2017-11-28 | LHCB | Connection timeout srm-lhcb |
131840 | Green | Urgent | Waiting for reply | 2017-11-14 | 2017-11-15 | Other | solidexperiment.org CASTOR tape copy fails |
131815 | Green | Less Urgent | In Progress | 2017-11-13 | 2017-11-20 | T2K.Org | Extremely long download times for T2K files on tape at RAL |
130207 | Red | Urgent | On Hold | 2017-08-24 | 2017-11-13 | MICE | Timeouts when copying MICE reco data to CASTOR |
127597 | Red | Urgent | On Hold | 2017-04-07 | 2017-10-05 | CMS | Check networking and xrootd RAL-CERN performance |
124876 | Red | Less Urgent | On Hold | 2016-11-07 | 2017-11-13 | Ops | [Rod Dashboard] Issue detected : hr.srce.GridFTP-Transfer-ops@gridftp.echo.stfc.ac.uk |
117683 | Red | Less Urgent | On Hold | 2015-11-18 | 2017-11-06 | None | CASTOR at RAL not publishing GLUE 2 |
Availability Report |
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas Echo | Comment |
---|---|---|---|---|---|---|---|
23/11/17 | 100 | 100 | 100 | 99 | 100 | 100 | |
24/11/17 | 100 | 100 | 100 | 100 | 100 | 100 | |
25/11/17 | 100 | 100 | 100 | 100 | 100 | 100 | |
26/11/17 | 100 | 100 | 100 | 100 | 100 | 100 | |
27/11/17 | 100 | 100 | 100 | 100 | 100 | 100 | |
28/11/17 | 100 | 100 | 98 | 100 | 100 | 100 | |
29/11/17 | 100 | 100 | 199 | 100 | 100 | 100 | |
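
Availability figures are percentages, so each value should lie between 0 and 100. The following is a minimal Python sketch of a range check over the table above; the function name `out_of_range` is illustrative, and the values are transcribed from the table as published.

```python
# Column order follows the table header above.
COLUMNS = ["OPS", "Alice", "Atlas", "CMS", "LHCb", "Atlas Echo"]

# Values transcribed from the availability table, keyed by day.
ROWS = {
    "23/11/17": [100, 100, 100, 99, 100, 100],
    "24/11/17": [100, 100, 100, 100, 100, 100],
    "25/11/17": [100, 100, 100, 100, 100, 100],
    "26/11/17": [100, 100, 100, 100, 100, 100],
    "27/11/17": [100, 100, 100, 100, 100, 100],
    "28/11/17": [100, 100, 98, 100, 100, 100],
    "29/11/17": [100, 100, 199, 100, 100, 100],
}


def out_of_range(rows, low=0, high=100):
    """Return (day, column, value) for every entry outside [low, high]."""
    return [
        (day, column, value)
        for day, values in rows.items()
        for column, value in zip(COLUMNS, values)
        if not low <= value <= high
    ]


if __name__ == "__main__":
    for day, column, value in out_of_range(ROWS):
        print(f"{day} {column}: {value} is not a valid percentage")
```

Any entry this reports should be checked against the monitoring source before the figures are reused.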
Hammercloud Test Report |
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 845); Atlas HC Echo = Atlas Echo (Template 841); CMS HC = CMS HammerCloud
Day | Atlas HC | Atlas HC Echo | CMS HC | Comment |
---|---|---|---|---|
22/11/17 | 69 | 100 | 100 | |
23/11/17 | 99 | 100 | 99 | |
24/11/17 | 93 | 100 | 100 | |
25/11/17 | 100 | 100 | 100 | |
26/11/17 | 100 | 100 | 100 | |
27/11/17 | 100 | 99 | 100 | |
28/11/17 | 100 | 100 | 100 | |
Notes from Meeting. |
- None yet