Difference between revisions of "Tier1 Operations Report 2017-02-22"
From GridPP Wiki
Revision as of 12:50, 22 February 2017
RAL Tier1 Operations Report for 22nd February 2017
Review of Issues during the week 15th to 22nd February 2017.
- There was a file access problem seen by LHCb last night. This appears to have been a temporary problem (thread starvation within Castor) that went away this morning.
- There remain some issues following the Castor 2.1.15 upgrade:
  - An occasional problem with a database resource (number of cursors) becoming exhausted. This has affected more than one of the instances. Investigations into this are ongoing. Castor version 2.1.16 contains a bugfix in this area.
  - We are managing memory leaks seen in the transfer manager component.
  - We still see some timeout failures in the SAM tests for CMS.
Resolved Disk Server Issues
- None
Current operational status and issues
- We are still seeing a rate of failures of the CMS SAM tests against the SRM. These are affecting our (CMS) availabilities but the level of failures has been reduced recently.
Ongoing Disk Server Issues
- GDSS663 (AtlasTape - D0T1) crashed on Saturday (18th Feb). Two faulty disks found and replaced. Expected back in service imminently.
Notable Changes made since the last meeting.
- The ECHO (Ceph) instance was upgraded to Kraken on Tuesday 14th February.
Declared in the GOC DB
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
Whole site | SCHEDULED | WARNING | 01/03/2017 07:00 | 01/03/2017 11:00 | 4 hours | Warning on site during network intervention in preparation for IPv6. |
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Pending - but not yet formally announced:
- Merge AtlasScratchDisk into larger Atlas disk pool.
Listing by category:
- Castor:
  - Update SRMs to new version, including updating to SL6. This will be done after the Castor 2.1.15 update.
- Networking:
  - Enabling IPv6 onto production network.
- Databases:
  - Removal of "asmlib" layer on Oracle database nodes.
Entries in GOC DB starting since the last report.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
lfc.gridpp.rl.ac.uk | SCHEDULED | WARNING | 22/02/2017 08:45 | 22/02/2017 13:00 | 4 hours and 15 minutes | LFC Oracle backend security updates |
All Castor and ECHO storage and Perfsonar. | SCHEDULED | WARNING | 22/02/2017 07:00 | 22/02/2017 11:00 | 4 hours | Warning on Storage and Perfsonar during network intervention in preparation for IPv6. |
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
1267 | Green | Very Urgent | In Progress | 2017-02-22 | 2017-02-22 | LHCb | File access problem at RAL |
126718 | Green | Urgent | In Progress | 2017-02-21 | 2017-02-21 | Atlas | UK RAL-LCG2-ECHO DATADISK: ~8k deletion error due to "Device or resource busy" |
126532 | Green | Urgent | In Progress | 2017-02-09 | 2017-02-21 | Atlas | RAL tape staging errors |
126184 | Green | Less Urgent | In Progress | 2017-01-26 | 2017-02-07 | Atlas | Request of inputs for new sites monitoring |
124876 | Red | Less Urgent | On Hold | 2016-11-07 | 2017-01-01 | OPS | [Rod Dashboard] Issue detected : hr.srce.GridFTP-Transfer-ops@gridftp.echo.stfc.ac.uk |
117683 | Red | Less Urgent | On Hold | 2015-11-18 | 2017-02-10 | | CASTOR at RAL not publishing GLUE 2. Looking at it again now (Feb), progress made on back end. Need to update ticket. |
Availability Report
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 845); Atlas HC ECHO = Atlas ECHO (Template 842); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | Atlas HC ECHO | CMS HC | Comment |
---|---|---|---|---|---|---|---|---|---|
15/02/17 | 100 | 100 | 100 | 96 | 100 | 99 | 100 | 100 | Timeouts on CMS SRM tests. |
16/02/17 | 100 | 100 | 100 | 92 | 100 | 100 | 100 | 100 | Timeouts on CMS SRM tests. |
17/02/17 | 100 | 100 | 100 | 88 | 100 | 100 | 99 | 100 | Timeouts on CMS SRM tests. |
18/02/17 | 100 | 100 | 100 | 97 | 100 | 100 | 96 | 100 | Timeouts on CMS SRM tests. |
19/02/17 | 100 | 100 | 100 | 97 | 100 | 100 | 99 | 100 | Timeouts on CMS SRM tests. |
20/02/17 | 100 | 100 | 100 | 96 | 100 | 98 | 97 | 100 | Timeouts on CMS SRM tests. |
21/02/17 | 100 | 100 | 100 | 98 | 100 | 98 | 93 | 100 | Timeouts on CMS SRM tests. |
Notes from Meeting.
- None yet