Difference between revisions of "Tier1 Operations Report 2017-03-22"
From GridPP Wiki
Revision as of 13:05, 22 March 2017
RAL Tier1 Operations Report for 22nd March 2017
Review of Issues during the week 15th to 22nd March 2017.
- There was a problem with the Atlas Castor instance on the evening of Wednesday 15th March. The on-call engineer was contacted and the Castor services were restarted to fix it. The cause was a known bug that exhausts a particular database resource.
- A crash of one of the five hypervisors in the Microsoft Hyper-V high availability cluster caused a number of VMs to reboot overnight Thursday-Friday (16-17 Mar).
- We have an ongoing problem with the SRM SAM tests for Atlas, which are failing frequently. We have confirmed that this is not affecting Atlas operationally; it is only the tests that fail. We still have a GGUS ticket open with Atlas as the test appears to be problematic.
- On Friday (17th March) there was a problem with the Argus server when one of its disk partitions filled up. This affected CMS glexec tests.
Resolved Disk Server Issues
- None
Current operational status and issues
- We are still seeing failures of the CMS SAM tests against the SRM. These are affecting our (CMS) availabilities, but the failure rate is lower than it was a few weeks ago.
Ongoing Disk Server Issues
- None
Limits on concurrent batch system jobs.
- Atlas Pilot (Analysis) 1500
- CMS Multicore 460
Notable Changes made since the last meeting.
- Last week nine of the '14 generation disk servers (100TB each) were deployed into AtlasDataDisk. (These are from the batch that was used as CEPH test servers.)
- Work ongoing on replacing two of the chillers.
Declared in the GOC DB
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
srm-lhcb.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 23/03/2017 11:00 | 23/03/2017 17:00 | 6 hours | Upgrade of Castor SRMs for LHCb to version 2.1.16-10 |
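The Duration column can be cross-checked against the Start and End timestamps. The following is a minimal sketch, assuming the dates are written day/month/year with a 24-hour time exactly as displayed in the table above; the function name `downtime_hours` and the format constant are illustrative and are not part of any GOC DB tooling.

```python
from datetime import datetime

# Assumed layout of the Start/End cells as shown in the table (DD/MM/YYYY HH:MM).
TABLE_TIME_FORMAT = "%d/%m/%Y %H:%M"


def downtime_hours(start: str, end: str) -> float:
    """Return the length of a declared downtime in hours."""
    begin = datetime.strptime(start, TABLE_TIME_FORMAT)
    finish = datetime.strptime(end, TABLE_TIME_FORMAT)
    return (finish - begin).total_seconds() / 3600


# Values copied from the srm-lhcb.gridpp.rl.ac.uk row above.
hours = downtime_hours("23/03/2017 11:00", "23/03/2017 17:00")
print(f"Declared outage lasts {hours:g} hours")
```

For the LHCb SRM upgrade this agrees with the 6 hours stated in the table.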
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Pending - but not yet formally announced:
- Update Castor SRMs starting with LHCb. (Announced for 23rd March.)
- Chiller replacement - work ongoing.
- Merge AtlasScratchDisk into larger Atlas disk pool.
Listing by category:
- Castor:
- Update SRMs to new version, including updating to SL6.
- Bring some newer disk servers ('14 generation) into service, replacing some older ('12 generation) servers.
- Databases
- Removal of "asmlib" layer on Oracle database nodes. (Ongoing)
- Networking
- Enable first services on production network with IPv6 once addressing scheme agreed.
- Infrastructure:
- Two of the chillers supplying the air-conditioning for the R89 machine room will be replaced.
Entries in GOC DB starting since the last report.
None
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
127240 | Green | Urgent | In Progress | 2017-03-21 | 2017-03-21 | CMS | Staging Test at UK_RAL for Run2 |
127185 | Green | Urgent | In Progress | 2017-03-17 | 2017-03-17 | | WLGC-IPv6 readiness |
126905 | Green | Less Urgent | Waiting Reply | 2017-03-02 | 2017-03-21 | solid | finish commissioning cvmfs server for solidexperiment.org |
126184 | Yellow | Less Urgent | In Progress | 2017-01-26 | 2017-02-07 | Atlas | Request of inputs for new sites monitoring |
124876 | Red | Less Urgent | On Hold | 2016-11-07 | 2017-01-01 | OPS | [Rod Dashboard] Issue detected : hr.srce.GridFTP-Transfer-ops@gridftp.echo.stfc.ac.uk |
117683 | Red | Less Urgent | On Hold | 2015-11-18 | 2017-03-02 | | CASTOR at RAL not publishing GLUE 2. |
Availability Report
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 845); Atlas HC ECHO = Atlas ECHO (Template 842); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | Atlas HC ECHO | CMS HC | Comment |
---|---|---|---|---|---|---|---|---|---|
15/03/17 | 100 | 100 | 100 | 99 | 100 | 99 | 98 | 99 | Single SRM test failure (timeout) |
16/03/17 | 100 | 99 | 100 | 99 | 100 | 79 | 72 | 99 | ALICE: Test failed with 'no compatible resources found in BDII'. CMS: Single SRM test failure |
17/03/17 | 100 | 100 | 100 | 85 | 100 | 100 | 100 | 97 | Problem with glexec tests from CMS. |
18/03/17 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
19/03/17 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
20/03/17 | 100 | 100 | 100 | 98 | 100 | 100 | 100 | 100 | SRM test failures on GET (User timeout). |
21/03/17 | 100 | 100 | 100 | 98 | 100 | 100 | 100 | 99 | SRM test failures on GET (User timeout). |
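To summarise the week at a glance, the daily figures above can be averaged per column. The sketch below is a rough illustration only: it takes a simple unweighted mean of the seven daily percentages, which is an assumption and not necessarily how official availability figures are calculated. The names `COLUMNS`, `ROWS` and `weekly_means` are hypothetical, and the comment column is omitted from the copied rows.

```python
from statistics import mean

# Column names in the order used by the availability table.
COLUMNS = ["OPS", "Alice", "Atlas", "CMS", "LHCb",
           "Atlas HC", "Atlas HC ECHO", "CMS HC"]

# Daily figures copied from the table above (comment column dropped).
ROWS = """\
15/03/17 | 100 | 100 | 100 | 99 | 100 | 99 | 98 | 99
16/03/17 | 100 | 99 | 100 | 99 | 100 | 79 | 72 | 99
17/03/17 | 100 | 100 | 100 | 85 | 100 | 100 | 100 | 97
18/03/17 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
19/03/17 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
20/03/17 | 100 | 100 | 100 | 98 | 100 | 100 | 100 | 100
21/03/17 | 100 | 100 | 100 | 98 | 100 | 100 | 100 | 99
"""


def weekly_means(rows_text: str) -> dict:
    """Return the unweighted mean of each column, rounded to one decimal."""
    per_column = {name: [] for name in COLUMNS}
    for line in rows_text.strip().splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        # cells[0] is the date; the remaining cells follow COLUMNS order.
        for name, value in zip(COLUMNS, cells[1:]):
            per_column[name].append(int(value))
    return {name: round(mean(values), 1) for name, values in per_column.items()}


means = weekly_means(ROWS)
for name in COLUMNS:
    print(f"{name}: {means[name]}")
```

On this reading, the lowest weekly averages are in the Atlas HC ECHO and Atlas HC columns, driven by the figures for 16/03/17, while the CMS column is pulled down mainly by the glexec test problem on 17/03/17.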
Notes from Meeting.
- None yet