Difference between revisions of "Tier1 Operations Report 2016-02-10"
From GridPP Wiki
Revision as of 09:59, 3 February 2016
RAL Tier1 Operations Report for 3rd February 2016
Review of Issues during the week 27th January to 3rd February 2016.
- On Wednesday afternoon (27th Jan) one of the Windows hypervisors crashed. This was a local-storage node with no failover to other hypervisors. A number of services were unavailable or degraded during the afternoon, and an outage was declared in the GOC DB for the affected services.
- The OPN link to CERN has been very busy during this last week.
Resolved Disk Server Issues
- GDSS620 (GenTape - D0T1) failed on 1st January and was returned to service on 1st February. This server had failed on repeated occasions before; two disks were replaced in it.
- GDSS682 (AtlasDataDisk - D1T0) failed on Sunday evening (31st Jan). It was returned to service at the end of the following afternoon, although investigations had failed to find a cause. One corrupt file was found on gdss682 when it returned to production; this file was declared lost to Atlas.
Current operational status and issues
- There is a problem, seen by LHCb, of a low but persistent rate of failures when copying the results of batch jobs to Castor. A further problem sometimes occurs when these (failed) writes are then attempted to storage at other sites. A recent modification has improved, but not completely fixed, this.
- The intermittent, low-level, load-related packet loss seen over external connections is still being tracked. Likewise we have been working to understand a remaining low level of packet loss seen within part of our Tier1 network.
Ongoing Disk Server Issues
- GDSS667 (AtlasScratchDisk - D1T0) failed on Monday 18th Jan with a read-only file system. On investigation, three disks in the RAID set had problems. After a lot of work a small number of files were recovered from the server, but the large majority of the files were declared lost to Atlas. The server is re-running the acceptance tests before being put back into service.
Notable Changes made since the last meeting.
- ARC-CE02, ARC-CE03 and ARC-CE04 have been upgraded to version 5.0.5 (ARC-CE01 was done last week).
Declared in the GOC DB
- None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Listing by category:
- Databases:
  - Switch LFC/3D to the new Database Infrastructure.
- Castor:
  - Update SRMs to a new version (includes updating to SL6).
  - Update to Castor version 2.1.15.
  - Migration of data from T10KC to T10KD tapes (affects Atlas & LHCb data).
- Networking:
  - Replace the UKLight Router.
- Fabric:
  - Firmware updates on remaining EMC disk arrays (Castor, LFC).
Entries in GOC DB starting since the last report.
| Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
|---|---|---|---|---|---|---|
| arc-ce04 | SCHEDULED | WARNING | 01/02/2016 11:00 | 01/02/2016 12:00 | 1 hour | Upgrade ARC-CE04 to version 5.0.5 |
| arc-ce02, arc-ce03 | SCHEDULED | WARNING | 28/01/2016 12:00 | 28/01/2016 13:00 | 1 hour | Upgrading ARC-CE02 & ARC-CE03 to version 5.0.5 |
| lcgvo07, lcgvo08, lcgwms04 | UNSCHEDULED | OUTAGE | 27/01/2016 13:00 | 27/01/2016 17:30 | 4 hours and 30 minutes | Some services unavailable following a problem on a hypervisor. |
Open GGUS Tickets (snapshot during the morning of the meeting)
| GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
|---|---|---|---|---|---|---|---|
| 119218 | Green | Less Urgent | On Hold | 2016-01-29 | 2016-01-29 | OPS | [Rod Dashboard] Issue detected: org.bdii.GLUE2-Validate@site-bdii.gridpp.rl.ac.uk |
| 118809 | Green | Urgent | On Hold | 2016-01-05 | 2016-01-13 | | Towards a recommendation on how to configure memory limits for batch jobs |
| 117683 | Green | Less Urgent | In Progress | 2015-11-18 | 2016-01-05 | | CASTOR at RAL not publishing GLUE 2 |
| 116864 | Red | Urgent | In Progress | 2015-10-12 | 2016-01-29 | CMS | T1_UK_RAL AAA opening and reading test failing again... |
Availability Report
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
| Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment |
|---|---|---|---|---|---|---|---|---|
| 27/01/16 | 100 | 100 | 99 | 96 | 96 | 77 | 100 | Atlas & CMS: VM crash took out one of the site BDIIs; tests could not find resources. LHCb: Single SRM test failure. |
| 28/01/16 | 100 | 100 | 100 | 100 | 100 | 76 | 100 | |
| 29/01/16 | 100 | 100 | 100 | 100 | 100 | 72 | N/A | |
| 30/01/16 | 100 | 100 | 100 | 96 | 100 | 54 | N/A | Single SRM test failure ("User timeout error"). |
| 31/01/16 | 100 | 100 | 100 | 96 | 100 | 58 | 100 | Single SRM test failure ("User timeout error"). |
| 01/02/16 | 100 | 100 | 100 | 100 | 100 | 62 | 100 | Atlas HC failing with "Trf exit code 40. trans: Athena crash - consult log file". |
| 02/02/16 | 100 | 100 | 100 | 100 | 100 | 63 | 100 | |