Revision as of 14:53, 12 December 2017
RAL Tier1 Operations Report for 6th December 2017
Review of Issues during the week 30th November to 6th December 2017.
Castor:
- On Wednesday (6th Dec) all the SRM systems (except LHCb, which had already been done) were successfully upgraded to the latest version (2.1.16-18).
- Three disk servers (old ones from 2012) have been added to the LHCb disk-only space in Castor to alleviate the problem of this area being too full.
Echo:
- The maximum number of gridftp connections to each Echo gateway has been increased to 200 (from 100).
- Echo is running normally. Background scrubbing is in progress. This is flushing out bad disks, and the rate at which it finds them is expected to drop over the next week or two. The plan is to run like this through the holiday period.
Services:
- EGI will withdraw support for the WMS from the start of 2018. Our WMS service will be stopped on this timescale.
Network:
- There was high packet loss on Monday (4th) for traffic to/from the Tier1 that passed through the RAL core network (and firewall). The problem started at midnight and was fixed around 15:30.
Infrastructure:
- Following the failure of the generator to start during the power outage a couple of weeks ago, a faulty emergency power-off switch was found and has been replaced. Plans are being made for a generator load test, hopefully on Wednesday (13th Dec).
Certificates:
- Following problems with the updated UK CA certificate in the IGTF 1.88 rollout, we had updated and then rolled back. This had left us with some issues in our configuration/deployment system (Quattor/Aquilon), but those were resolved quickly. We made a plan to roll forward again tomorrow (12th Dec), and that is still the plan.
Christmas Plans:
- We will follow the same pattern as in previous years. The on-call team will be in place as usual. Some additional checks will be made by those on-call. RAL is closed after Friday afternoon 22nd December and will re-open on Tuesday 2nd January.
Current operational status and issues
Resolved Disk Server Issues
- None
Ongoing Disk Server Issues
- None
Limits on concurrent batch system jobs.
- CMS Multicore 550
Notable Changes made since the last meeting.
- Allocation in Echo for ATLAS increased to 4.1PB. They now have 4PB in datadisk and 100TB in scratchdisk. This is part of the gradual increase of their usage to 5.1PB.
- The maximum number of gridftp connections to each Echo gateway has been increased to 200 (from 100).
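
The ATLAS allocation figures above can be checked with a little arithmetic. The following is a minimal sketch, assuming decimal units (1 PB = 1000 TB); the variable names are illustrative and not taken from any Tier1 tooling.

```python
# Figures are taken from the ATLAS allocation bullet above.
# Assumption: decimal units, i.e. 1 PB = 1000 TB.
TB_PER_PB = 1000

datadisk_tb = 4 * TB_PER_PB   # 4 PB in datadisk
scratchdisk_tb = 100          # 100 TB in scratchdisk
target_tb = 5100              # planned eventual usage of 5.1 PB

allocated_tb = datadisk_tb + scratchdisk_tb
remaining_tb = target_tb - allocated_tb

print(f"Allocated so far: {allocated_tb / TB_PER_PB} PB")
print(f"Still to be added: {remaining_tb / TB_PER_PB} PB")
```

Under that assumption the two spaces sum to the quoted 4.1PB, leaving 1PB of the planned increase still to come.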
Entries in GOC DB starting since the last report.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
srm-alice.gridpp.rl.ac.uk, srm-atlas.gridpp.rl.ac.uk, srm-biomed.gridpp.rl.ac.uk, srm-cert.gridpp.rl.ac.uk, srm-cms-disk.gridpp.rl.ac.uk, srm-cms.gridpp.rl.ac.uk, srm-dteam.gridpp.rl.ac.uk, srm-ilc.gridpp.rl.ac.uk, srm-mice.gridpp.rl.ac.uk, srm-minos.gridpp.rl.ac.uk, srm-na62.gridpp.rl.ac.uk, srm-pheno.gridpp.rl.ac.uk, srm-preprod.gridpp.rl.ac.uk, srm-snoplus.gridpp.rl.ac.uk, srm-solid.gridpp.rl.ac.uk, srm-t2k.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 06/12/2017 13:00 | 06/12/2017 15:00 | 2 hours | Upgrade of non-LHCb SRM to version 2.1.16-18 |
lcgfts3.gridpp.rl.ac.uk | SCHEDULED | WARNING | 05/12/2017 11:00 | 05/12/2017 13:00 | 2 hours | FTS update to v3.7.7 |
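
The Duration column in the table above follows directly from the Start and End timestamps. A minimal sketch of that calculation is given below; the day/month/year timestamp format is inferred from the table, and the function name and dictionary keys are hypothetical rather than part of any GOC DB interface.

```python
from datetime import datetime

# Assumption: timestamps are day/month/year with a 24-hour clock, as in the table above.
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def downtime_duration(start, end):
    """Return the length of a downtime as a timedelta (hypothetical helper)."""
    return datetime.strptime(end, GOCDB_FORMAT) - datetime.strptime(start, GOCDB_FORMAT)


# Start and end times copied from the two table rows above.
entries = {
    "non-LHCb SRM upgrade": ("06/12/2017 13:00", "06/12/2017 15:00"),
    "FTS update": ("05/12/2017 11:00", "05/12/2017 13:00"),
}

durations = {name: downtime_duration(start, end) for name, (start, end) in entries.items()}

for name, length in durations.items():
    print(f"{name}: {length.total_seconds() / 3600:.0f} hours")
```

Both entries come out at two hours, which agrees with the Duration column.
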
Declared in the GOC DB
- None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Ongoing or Pending - but not yet formally announced:
Listing by category:
- Castor:
  - Update systems (initially tape servers) to use SL7 and be configured by Quattor/Aquilon.
  - Move to generic Castor headnodes.
- Echo:
  - Update to next Ceph version ("Luminous").
- Networking:
  - Extend the number of services on the production network with IPv6 dual stack. (Done for Perfsonar, FTS3, all squids and the CVMFS Stratum-1 servers).
- Services:
- Internal:
  - DNS servers will be rolled out within the Tier1 network.
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
132356 | Green | Very Urgent | Waiting for Reply | 2017-12-07 | 2017-12-11 | Ops | [Rod Dashboard] Issue detected : org.nagios.GLUE2-Check@site-bdii.gridpp.rl.ac.uk |
132336 | Green | Less Urgent | In Progress | 2017-12-05 | 2017-12-06 | Ops | [Rod Dashboard] Issue detected : org.nagios.GLUE2-Check@site-bdii.gridpp.rl.ac.uk |
132314 | Green | Less Urgent | In Progress | 2017-12-05 | 2017-12-11 | Ops | [Rod Dashboard] Issue detected : org.nordugrid.ARC-CE-SRM-result-ops@arc-ce02.gridpp.rl.ac.uk |
132222 | Green | Urgent | In Progress | 2017-11-30 | 2017-12-05 | CMS | Transfers failing to T1_UK_RAL_Disk |
131840 | Green | Urgent | Waiting for reply | 2017-11-14 | 2017-12-05 | Other | solidexperiment.org CASTOR tape copy fails |
131815 | Green | Less Urgent | In Progress | 2017-11-13 | 2017-12-01 | T2K.Org | Extremely long download times for T2K files on tape at RAL |
130207 | Red | Urgent | On Hold | 2017-08-24 | 2017-11-13 | MICE | Timeouts when copyiing MICE reco data to CASTOR |
127597 | Red | Urgent | On Hold | 2017-04-07 | 2017-10-05 | CMS | Check networking and xrootd RAL-CERN performance |
124876 | Red | Less Urgent | On Hold | 2016-11-07 | 2017-11-13 | Ops | [Rod Dashboard] Issue detected : hr.srce.GridFTP-Transfer-ops@gridftp.echo.stfc.ac.uk |
117683 | Red | Less Urgent | On Hold | 2015-11-18 | 2017-11-06 | None | CASTOR at RAL not publishing GLUE 2 |
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas Echo | Comment |
---|---|---|---|---|---|---|---|
6/12/17 | 100 | 100 | 83 | 81 | 100 | 100 | |
7/12/17 | 100 | 100 | 100 | 100 | 100 | 100 | |
8/12/17 | 100 | 100 | 100 | 100 | 100 | 100 | |
9/12/17 | 100 | 100 | 100 | 100 | 100 | 100 | |
10/12/17 | 100 | 100 | 100 | 100 | 100 | 100 | |
11/12/17 | 100 | 100 | 100 | 100 | 100 | 100 | |
12/12/17 | 100 | 100 | 100 | 100 | 100 | 100 | |
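
For a quick summary of the table above, the daily figures can be averaged per column. This is a minimal sketch only: the simple unweighted mean is an assumption and not necessarily how the official availability figures are aggregated, and the names used are illustrative.

```python
# Daily availability (percent) for 6/12/17 to 12/12/17, copied from the table above.
availability = {
    "OPS":        [100, 100, 100, 100, 100, 100, 100],
    "Alice":      [100, 100, 100, 100, 100, 100, 100],
    "Atlas":      [83, 100, 100, 100, 100, 100, 100],
    "CMS":        [81, 100, 100, 100, 100, 100, 100],
    "LHCb":       [100, 100, 100, 100, 100, 100, 100],
    "Atlas Echo": [100, 100, 100, 100, 100, 100, 100],
}


def weekly_mean(values):
    """Unweighted mean of the daily figures, rounded to one decimal place."""
    return round(sum(values) / len(values), 1)


summary = {vo: weekly_mean(values) for vo, values in availability.items()}

for vo, value in summary.items():
    print(f"{vo}: {value}%")
```

Only the Atlas and CMS columns fall below 100 over the week, both because of the dip on 6/12/17.
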
Hammercloud Test Report
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 845); Atlas HC Echo = Atlas Echo (Template 841); CMS HC = CMS HammerCloud
Day | Atlas HC | Atlas HC Echo | CMS HC | Comment |
---|---|---|---|---|
6/12/17 | 99 | 99 | 81 | Atlas HC Echo - No test run in time bin |
7/12/17 | 89 | 100 | 100 | Atlas HC Echo - No test run in time bin |
8/12/17 | 100 | 100 | 100 | Atlas HC Echo - No test run in time bin |
9/12/17 | 100 | 0 | 100 | Atlas HC Echo - No test run in time bin |
10/12/17 | 100 | 0 | 100 | Atlas HC Echo - No test run in time bin |
11/12/17 | 100 | 0 | 100 | Atlas HC Echo - No test run in time bin |
12/12/17 | 99 | 0 | 100 | Atlas HC Echo - No test run in time bin |
Notes from Meeting.
- EGI will withdraw support for the WMS from the end of 2017. Our WMS service will be stopped on this timescale.
- There is a problem with Perfsonar measurements using IPv6 to nodes accessed via JANET.
- There was a discussion about how best to bring files back online from tape. The MICE VO needs a better (bulk) solution than they are using at the moment.