RAL Tier1 Operations Report for 10th February 2016
Review of Issues during the fortnight 27th January to 10th February 2016.
- On Wednesday afternoon (27th Jan) one of the Windows hypervisors crashed. This was a local storage node - with no failover to other hypervisors. A number of services were unavailable or degraded during the afternoon and an outage for those affected services in the GOC DB was declared.
- The OPN link to CERN has been very busy during this last fortnight.
- We have seen a lot of failures of the Atlas HammerCloud tests and of Atlas analysis jobs. Initially this was thought to be due to a missing disk server (GDSS677). However, it subsequently became clear that the cause was the CVMFS inode count on the Worker Nodes exceeding 2**32 - a problem we had seen some months ago. (A rough check for this is sketched after this list.)
- During last week we had a problem with a particular Atlas Tape. We were unable to read most of the files off the tape. We have declared 5898 files lost from this tape. This was a T10KC tape and it is being sent off for analysis to see if we can understand what happened.
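As an illustration of the CVMFS issue above, the sketch below samples inode numbers under an assumed /cvmfs repository path on a worker node and flags any at or above 2**32. The repository path, sample bound and output format are assumptions for illustration only, not the actual RAL tooling.

```python
# Minimal sketch (assumed paths, not the RAL production tooling): sample inode
# numbers under /cvmfs and flag any at or above 2**32, the condition behind the
# HammerCloud / analysis job failures described above.
import os

LIMIT = 2 ** 32                                   # 32-bit inode boundary
REPOS = ["/cvmfs/atlas.cern.ch"]                  # assumed repository mount

def highest_inode(path, max_entries=1000):
    """Return the largest inode number seen in a bounded walk of `path`."""
    largest, seen = 0, 0
    for root, dirs, files in os.walk(path):
        for name in dirs + files:
            try:
                largest = max(largest, os.lstat(os.path.join(root, name)).st_ino)
            except OSError:
                continue                          # entry vanished or unreadable
            seen += 1
            if seen >= max_entries:
                return largest
    return largest

if __name__ == "__main__":
    for repo in REPOS:
        if os.path.isdir(repo):
            ino = highest_inode(repo)
            flag = "exceeds 2**32" if ino >= LIMIT else "ok"
            print(f"{repo}: highest sampled inode {ino} ({flag})")
```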
Resolved Disk Server Issues
- GDSS620 (GenTape - D0T1) failed on the 1st January and was returned to service on the 1st February. This server had failed on repeated occasions before; two disks were replaced in it.
- GDSS682 (AtlasDataDisk - D1T0) failed on Sunday evening (31st Jan). It was returned to service at the end of the afternoon the following day, although investigations had failed to find a cause. One corrupt file was found on GDSS682 when it returned to production; this file was declared lost to Atlas.
- GDSS667 (AtlasScratchDisk - D1T0) failed on Monday 18th Jan with a read-only file system. On investigation, three disks in the RAID set had problems. Following a lot of work a small number of files were recovered from the server; however, as reported a fortnight ago, the large majority of the files were declared lost to Atlas. The server re-ran the acceptance tests before being put back into service on Friday 5th Feb.
- GDSS744 (AtlasDataDisk - D1T0) failed on Sunday evening (7th Feb). A RAID consistency check threw out a bad drive. Following the replacement of that drive the server was returned to service yesterday afternoon (9th Feb). It is initially in read-only mode - and will be put back in read/write once the RAID consistency check has been re-run.
- GDSS706 (AtlasDataDisk - D1T0) was taken out of service for a couple of hours yesterday (9th Feb) for the battery on the RAID card to be replaced.
Current operational status and issues
- There is a problem, seen by LHCb, of a low but persistent rate of failures when copying the results of batch jobs to Castor. There is also a further problem that sometimes occurs when these (failed) writes are then attempted to storage at other sites. A recent modification has improved, but not completely fixed, this.
- The intermittent, low-level, load-related packet loss seen over external connections is still being tracked. Likewise we have been working to understand a remaining low level of packet loss seen within part of our Tier1 network.
Ongoing Disk Server Issues
- None
Notable Changes made since the last meeting.
- ARC-CE02, 03 & 04 have been upgraded to version 5.0.5 (ARC-CE01 was done last week).
- Batches of worker nodes are being drained to reset the CVMFS inode count.
- Condor is being updated (to version 8.4.4) on a couple of batches of worker nodes. These nodes are being drained and rebooted, which also resolves the CVMFS inode problem on them. (A rough way to check roll-out progress is sketched below.)
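As a rough illustration of tracking this roll-out (not the actual RAL procedure), the sketch below queries the batch system with the standard condor_status tool and lists machines still reporting a Condor version below 8.4.4. The pool details and output handling are assumptions; Machine and CondorVersion are standard HTCondor machine ClassAd attributes.

```python
# Minimal sketch (pool and naming details assumed): list worker nodes whose
# HTCondor version is still below 8.4.4 and therefore still awaiting the
# drain/reboot described above. Uses only the standard condor_status tool.
import subprocess

TARGET = (8, 4, 4)

def parse_version(ad_value):
    """Pull (major, minor, patch) out of a CondorVersion string such as
    '$CondorVersion: 8.4.4 Feb 03 2016 BuildID: ... $'."""
    for token in ad_value.split():
        parts = token.split(".")
        if len(parts) == 3 and all(p.isdigit() for p in parts):
            return tuple(int(p) for p in parts)
    return (0, 0, 0)

def nodes_below_target():
    out = subprocess.check_output(
        ["condor_status", "-autoformat", "Machine", "CondorVersion"],
        universal_newlines=True)
    pending = set()
    for line in out.splitlines():
        machine, _, version_ad = line.partition(" ")
        if version_ad and parse_version(version_ad) < TARGET:
            pending.add(machine)
    return sorted(pending)

if __name__ == "__main__":
    for machine in nodes_below_target():
        print(machine)
```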
Declared in the GOC DB
- None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- Update the Castor repack instance from version 2.1.14.13 to 2.1.14.15. (Proposed for 10/2/16)
Listing by category:
- Databases:
  - Switch LFC/3D to new Database Infrastructure.
- Castor:
  - Update SRMs to new version (includes updating to SL6).
  - Update to Castor version 2.1.15.
  - Migration of data from T10KC to T10KD tapes (affects Atlas & LHCb data).
- Networking:
  - Replace the UKLight Router. Then upgrade the 'bypass' link to the RAL border routers to 2*10Gbit.
- Fabric:
  - Firmware updates on remaining EMC disk arrays (Castor, LFC).
- Grid Services:
  - A Load Balancer (HAProxy) will be used in front of the FTS service (see the sketch after this list).
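As a purely illustrative aside on the HAProxy-fronted FTS plan, the sketch below performs a basic TCP reachability check of a load-balanced alias and its backend nodes, the sort of quick sanity check useful when putting a load balancer in front of a service. All hostnames and the port are hypothetical placeholders, not the real RAL service names.

```python
# Minimal sketch (hostnames and port are hypothetical placeholders): confirm
# that a load-balanced FTS alias and each backend node accept TCP connections.
import socket

FTS_PORT = 8446                                   # assumed FTS REST port
HOSTS = [
    "fts-alias.example.org",                      # hypothetical balanced alias
    "fts-node01.example.org",                     # hypothetical backend nodes
    "fts-node02.example.org",
]

def tcp_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for host in HOSTS:
        state = "ok" if tcp_reachable(host, FTS_PORT) else "UNREACHABLE"
        print(f"{host}:{FTS_PORT} {state}")
```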
Entries in GOC DB starting since the last report.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
arc-ce04 | SCHEDULED | WARNING | 01/02/2016 11:00 | 01/02/2016 12:00 | 1 hour | Upgrade ARC-CE04 to version 5.0.5 |
arc-ce02, arc-ce03 | SCHEDULED | WARNING | 28/01/2016 12:00 | 28/01/2016 13:00 | 1 hour | Upgrading ARC-CE02 & ARC-CE03 to version 5.0.5 |
lcgvo07, lcgvo08, lcgwms04 | UNSCHEDULED | OUTAGE | 27/01/2016 13:00 | 27/01/2016 17:30 | 4 hours and 30 minutes | Some services unavailable following problem on hypervisor. |
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
119456 | Green | Urgent | In Progress | 2016-02-10 | 2016-02-10 | CMS | File waiting for staging |
119450 | Green | Less Urgent | In Progress | 2016-02-09 | 2016-02-09 | Atlas | RAL_LCG2: SOURCE SRM_GET_TURL error on the turl request |
119389 | Green | Urgent | In Progress | 2016-02-05 | 2016-02-05 | LHCb | Data transfer problem to RAL BUFFER |
118809 | Green | Urgent | On Hold | 2016-01-05 | 2016-01-13 | | Towards a recommendation on how to configure memory limits for batch jobs
117683 | Green | Less Urgent | In Progress | 2015-11-18 | 2016-01-05 | | CASTOR at RAL not publishing GLUE 2
116864 | Red | Urgent | In Progress | 2015-10-12 | 2016-01-29 | CMS | T1_UK_RAL AAA opening and reading test failing again... |
109358 | Green | Less Urgent | Waiting Reply | 2015-02-05 | 2016-02-09 | SNO+ | RAL WMS unavailable |
Availability Report
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud. All availability figures are percentages.
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment |
---|---|---|---|---|---|---|---|---|
27/01/16 | 100 | 100 | 99 | 96 | 96 | 77 | 100 | Atlas & CMS: VM crash took out one of the site BDIIs; tests could not find resources. LHCb: Single SRM test failure.
28/01/16 | 100 | 100 | 100 | 100 | 100 | 76 | 100 | |
29/01/16 | 100 | 100 | 100 | 100 | 100 | 72 | N/A | |
30/01/16 | 100 | 100 | 100 | 96 | 100 | 54 | N/A | Single SRM Test failure (“User timeout error”) |
31/01/16 | 100 | 100 | 100 | 96 | 100 | 58 | 100 | Single SRM Test failure (“User timeout error”) |
01/02/16 | 100 | 100 | 100 | 100 | 100 | 62 | 100 | Atlas HC failing with "Trf exit code 40. trans: Athena crash - consult log file" |
02/02/16 | 100 | 100 | 100 | 100 | 100 | 63 | 100 | |
03/02/16 | 100 | 100 | 100 | 100 | 100 | 61 | N/A | |
04/02/16 | 100 | 100 | 100 | 100 | 96 | 48 | 100 | Single SRM Test failure ([SRM_INVALID_PATH] No such file or directory) |
05/02/16 | 100 | 100 | 100 | 100 | 100 | 42 | 100 | |
06/02/16 | 100 | 100 | 100 | 100 | 100 | 88 | N/A | |
07/02/16 | 100 | 100 | 100 | 100 | 100 | 36 | 100 | |
08/02/16 | 100 | 100 | 100 | 100 | 100 | 78 | 100 |
09/02/16 | 100 | 100 | 100 | 100 | 100 | 83 | 100 |