Difference between revisions of "Tier1 Operations Report 2017-02-22"
From GridPP Wiki
Latest revision as of 14:24, 22 February 2017
RAL Tier1 Operations Report for 22nd February 2017
Review of Issues during the week 15th to 22nd February 2017.
- There was a file access problem seen by LHCb last night. This appears to have been a temporary problem (thread starvation within the SRM) that went away this morning.
- There remain some issues following the Castor 2.1.15 upgrade:
  - An occasional problem with a database resource (number of cursors) becoming exhausted. This has affected more than one of the instances. Investigations into this are ongoing. There is a bugfix to Castor in version 2.1.16 in this area.
  - We are managing memory leaks seen in the transfer manager component.
  - We still see some timeout failures in the SAM tests for CMS.
Resolved Disk Server Issues
- None
Current operational status and issues
- We are still seeing a rate of failures of the CMS SAM tests against the SRM. These are affecting our (CMS) availabilities, but the level of failures has been reduced recently.
Ongoing Disk Server Issues
- GDSS663 (AtlasTape - D0T1) crashed on Saturday (18th Feb). Two faulty disks found and replaced. Expected back in service imminently.
Limits on concurrent batch system jobs.
- LHCb Pilot 4500
- Atlas Pilot (Analysis) 1600
- CMS Multicore 470
Notable Changes made since the last meeting.
- The OPNR (OPN router) was enabled for IPv6 this morning. A reboot was required to enable IPv6 ACLs.
- Various systems have had security and other patches applied. In particular, back-end database systems are being updated to remove a software layer ("asmlib").
- Two batches of worker nodes are running SL7 with the jobs themselves in SL6 containers.
Declared in the GOC DB
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
Whole site | SCHEDULED | WARNING | 01/03/2017 07:00 | 01/03/2017 11:00 | 4 hours | Warning on site during network intervention in preparation for IPv6. |
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Pending - but not yet formally announced:
- Merge AtlasScratchDisk into larger Atlas disk pool.
Listing by category:
- Castor:
  - Update SRMs to new version, including updating to SL6. This will be done after the Castor 2.1.15 update.
- Networking:
  - Enabling IPv6 onto production network.
- Databases:
  - Removal of "asmlib" layer on Oracle database nodes.
Entries in GOC DB starting since the last report.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
lfc.gridpp.rl.ac.uk | SCHEDULED | WARNING | 22/02/2017 08:45 | 22/02/2017 13:00 | 4 hours and 15 minutes | LFC Oracle backend security updates |
All Castor and ECHO storage and Perfsonar. | SCHEDULED | WARNING | 22/02/2017 07:00 | 22/02/2017 11:00 | 4 hours | Warning on Storage and Perfsonar during network intervention in preparation for IPv6. |
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
1267 | Green | Very Urgent | In Progress | 2017-02-22 | 2017-02-22 | LHCb | File access problem at RAL |
126718 | Green | Urgent | In Progress | 2017-02-21 | 2017-02-21 | Atlas | UK RAL-LCG2-ECHO DATADISK: ~8k deletion error due to "Device or resource busy" |
126532 | Green | Urgent | In Progress | 2017-02-09 | 2017-02-21 | Atlas | RAL tape staging errors |
126184 | Green | Less Urgent | In Progress | 2017-01-26 | 2017-02-07 | Atlas | Request of inputs for new sites monitoring |
124876 | Red | Less Urgent | On Hold | 2016-11-07 | 2017-01-01 | OPS | [Rod Dashboard] Issue detected : hr.srce.GridFTP-Transfer-ops@gridftp.echo.stfc.ac.uk |
117683 | Red | Less Urgent | On Hold | 2015-11-18 | 2017-02-10 | | CASTOR at RAL not publishing GLUE 2. Looking at it again now (Feb), progress made on back end. Need to update ticket. |
Availability Report
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 845); Atlas HC ECHO = Atlas ECHO (Template 842); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | Atlas HC ECHO | CMS HC | Comment |
---|---|---|---|---|---|---|---|---|---|
15/02/17 | 100 | 100 | 100 | 96 | 100 | 99 | 100 | 100 | Timeouts on CMS SRM tests. |
16/02/17 | 100 | 100 | 100 | 92 | 100 | 100 | 100 | 100 | Timeouts on CMS SRM tests. |
17/02/17 | 100 | 100 | 100 | 88 | 100 | 100 | 99 | 100 | Timeouts on CMS SRM tests. |
18/02/17 | 100 | 100 | 100 | 97 | 100 | 100 | 96 | 100 | Timeouts on CMS SRM tests. |
19/02/17 | 100 | 100 | 100 | 97 | 100 | 100 | 99 | 100 | Timeouts on CMS SRM tests. |
20/02/17 | 100 | 100 | 100 | 96 | 100 | 98 | 97 | 100 | Timeouts on CMS SRM tests. |
21/02/17 | 100 | 100 | 100 | 98 | 100 | 98 | 93 | 100 | Timeouts on CMS SRM tests. |
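The weekly figures above can be summarised with a short calculation. The following is a minimal sketch in Python that takes the three columns with values below 100 and reports the mean and the worst day for each. The unweighted daily mean, the dictionary layout and the helper name `weekly_summary` are assumptions made for illustration; they are not necessarily how the official availability figures are derived.

```python
from statistics import mean

# Daily availability (%) copied from the table above, 15/02/17 to 21/02/17.
# Only the columns that dropped below 100 on at least one day are included.
availability = {
    "CMS": [96, 92, 88, 97, 97, 96, 98],
    "Atlas HC": [99, 100, 100, 100, 100, 98, 98],
    "Atlas HC ECHO": [100, 100, 99, 96, 99, 97, 93],
}


def weekly_summary(daily):
    """Return (unweighted mean rounded to 1 d.p., minimum) for one column."""
    return round(mean(daily), 1), min(daily)


summary = {name: weekly_summary(values) for name, values in availability.items()}

for name, (avg, worst) in summary.items():
    print(f"{name}: mean {avg}%, worst day {worst}%")
```

On these numbers CMS is the lowest column, which matches the "Timeouts on CMS SRM tests" comment recorded against every day of the week.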
Notes from Meeting.
- CMS confirmed their xroot redirection tests are passing OK for RAL Castor.
- Catalin has carried out a survey of the users of the WMS service. This indicates ongoing interest in this service.
- Dirac VO site Edinburgh has made more test transfers. A problem with access to the Castor webdav interface for Dirac has been resolved.