RAL Tier1 Operations Report for 19th March 2014
Review of Issues during the week 12th to 19th March 2014.
- On Wednesday early evening (12th March) there was a failure of the primary link to CERN between 17:00 and 19:00. Traffic flowed over the backup link. However, the failover was not clean and during this time we were failing the VO SUM tests.
- There was a problem with one of the FTS2 agent systems in the early hours of Thursday 13th March. Owing to a configuration error the hypervisor hosting this virtual machine rebooted, and this particular system was not configured to restart automatically. This was resolved by the primary on-call.
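The report does not name the virtualisation platform hosting the FTS2 agents, so the following is only a minimal sketch, assuming a libvirt-managed hypervisor, of how a guest's "start on host boot" flag could be checked and enabled; the domain name fts2-agent is hypothetical.

```python
# Minimal sketch, assuming a libvirt-managed hypervisor (not confirmed by this report).
# Checks whether a guest is flagged to start automatically when the host (re)boots,
# and enables the flag if it is missing. The domain name below is hypothetical.
import libvirt

DOMAIN_NAME = "fts2-agent"  # hypothetical name for the FTS2 agent VM

conn = libvirt.open("qemu:///system")  # connect to the local hypervisor
try:
    dom = conn.lookupByName(DOMAIN_NAME)
    if not dom.autostart():
        print(f"{DOMAIN_NAME}: autostart disabled - enabling")
        dom.setAutostart(1)  # start this guest automatically when the host boots
    else:
        print(f"{DOMAIN_NAME}: autostart already enabled")
finally:
    conn.close()
```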
Resolved Disk Server Issues
- None
Current operational status and issues
- There have been problems with the CMS Castor instance through the last week. These are triggered by high load on CMS_Tape, with all the disk servers that provide the cache for this service class running flat out (as far as network connectivity goes). Work is underway to increase the throughput of this disk cache.
- The Castor Team are now able to reproduce the intermittent failures of Castor access via the SRM that have been reported in recent weeks. Understanding of the problem is significantly advanced and further investigations are ongoing using the Castor Preprod instance. Ideas for a workaround are being developed (see the probe sketch after this list).
- The problem of full Castor disk space for Atlas has been eased. Working with Atlas, the file deletion rate has been somewhat improved. However, there is still a problem that needs to be understood.
- Around 50 files in tape-backed service classes (mainly in GEN) have been found not to have migrated to tape. This is under investigation. The cause for some of these is understood (a bad tape at time of migration). CERN will provide a script to re-send the remaining ones to tape.
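To help quantify the intermittent SRM failures noted above, a simple probe can repeatedly exercise the endpoint and record how often requests fail. This is an illustrative sketch only, assuming the gfal2 Python bindings are installed; the SURL below points at the srm-cms endpoint with a hypothetical test path.

```python
# Illustrative probe: repeatedly stat a file through the SRM endpoint and count
# how many requests fail. Assumes the gfal2 Python bindings are available; the
# test path in the SURL is hypothetical.
import time
import gfal2

SURL = "srm://srm-cms.gridpp.rl.ac.uk/castor/ads.rl.ac.uk/test/probe-file"  # hypothetical path
PROBES = 100
PAUSE_SECONDS = 30

ctx = gfal2.creat_context()
failures = 0
for i in range(PROBES):
    try:
        ctx.stat(SURL)
    except gfal2.GError as err:
        failures += 1
        print(f"probe {i}: FAILED - {err.message}")
    time.sleep(PAUSE_SECONDS)

print(f"{failures}/{PROBES} probes failed")
```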
Ongoing Disk Server Issues
- None
Notable Changes made this last week.
- The move of the Tier1 to use the new site firewall took place on Monday 17th March between 07:00 and 07:30. FTS (2 & 3) services were drained and stopped during the change. The batch system was also reconfigured such that new batch jobs would not start during this period (see the sketch after this list). The change was successful. There was a routing problem that affected the LFC in particular, and external access from many worker nodes, but this was fixed in around an hour.
- One batch of WNs was updated to the EMI-3 version of the WN software a week ago. So far so good.
- The EMI-3 Argus server is in use for most of the CEs and one batch of worker nodes.
- The planned and announced UPS/Generator load test scheduled for this morning (19th March) was cancelled.
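The report does not say how the batch system was reconfigured, so the following is only a sketch, assuming an HTCondor farm, of how new job starts can be paused for a short intervention while running jobs are left to finish.

```python
# Sketch only: pause and resume new job starts on an HTCondor farm (the batch
# system actually used is not named in this report). "condor_off -peaceful"
# lets running jobs finish but prevents new ones from starting.
import subprocess

def pause_job_starts():
    # Stop the startd daemons peacefully on all execute nodes: running jobs
    # complete, but no new jobs are matched or started.
    subprocess.run(["condor_off", "-peaceful", "-startd", "-all"], check=True)

def resume_job_starts():
    # Bring the startd daemons back so new jobs can start again.
    subprocess.run(["condor_on", "-startd", "-all"], check=True)

if __name__ == "__main__":
    pause_job_starts()
    input("Intervention in progress - press Enter once it is complete...")
    resume_job_starts()
```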
Declared in the GOC DB
- None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Listing by category:
- Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- Castor 2.1.14 testing is largely complete. We are starting to look at possible dates for rolling this out (probably around April).
- Networking:
- Update core Tier1 network and change connection to site and OPN including:
- Install new Routing layer for Tier1 & change the way the Tier1 connects to the RAL network.
- These changes will lead to the removal of the UKLight Router.
- Fabric:
- We are phasing out the use of the software server used by the small VOs.
- Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC)
- There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 12th and 19th March 2014.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
Whole Site | SCHEDULED | WARNING | 17/03/2014 07:00 | 17/03/2014 17:00 | 10 hours | Site At Risk during and following change to use new firewall. |
lcgfts.gridpp.rl.ac.uk, lcgfts3.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 17/03/2014 06:00 | 17/03/2014 09:00 | 3 hours | Drain and stop of FTS services during update to new site firewall. |
srm-cms.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 14/03/2014 09:40 | 14/03/2014 10:26 | 46 minutes | Problem with CMS Castor instance being investigated. |
srm-cms.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 14/03/2014 04:15 | 14/03/2014 07:15 | 3 hours | Currently investigating problems with the Oracle DB behind Castor CMS |
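For reference, entries like those above can also be retrieved from the public GOCDB programmatic interface. The sketch below assumes the get_downtime method of the public PI; the XML element names are assumptions and may need adjusting against the actual response.

```python
# Sketch of pulling downtime entries for RAL-LCG2 from the public GOCDB PI.
# The method and topentity parameters follow the public interface; the XML
# element names below are assumptions and may differ from the real schema.
import urllib.request
import xml.etree.ElementTree as ET

URL = "https://goc.egi.eu/gocdbpi/public/?method=get_downtime&topentity=RAL-LCG2"

with urllib.request.urlopen(URL) as response:
    tree = ET.parse(response)

for downtime in tree.getroot().iter("DOWNTIME"):
    hostname = downtime.findtext("HOSTNAME", default="?")
    severity = downtime.findtext("SEVERITY", default="?")
    description = downtime.findtext("DESCRIPTION", default="?")
    print(f"{hostname}: {severity} - {description}")
```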
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
101968 | Green | Less Urgent | On Hold | 2014-03-11 | 2014-03-12 | Atlas | RAL-LCG2_SCRATCHDISK: One dataset to delete is causing 1379 deletion errors |
101079 | Red | Urgent | In Progress | 2014-02-09 | 2014-03-17 | | ARC CEs have VOViews with a default SE of "0" |
101052 | Red | Urgent | In Progress | 2014-02-06 | 2014-03-17 | Biomed | Can't retrieve job result file from cream-ce02.gridpp.rl.ac.uk |
99556 | Red | Very Urgent | In Progress | 2013-12-06 | 2014-03-06 | | NGI Argus requests for NGI_UK |
98249 | Red | Urgent | Waiting Reply | 2013-10-21 | 2014-03-13 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1 |
97025 | Red | Less Urgent | On Hold | 2013-09-03 | 2014-03-04 | | Myproxy server certificate does not contain hostname |
Availability Report
Day | OPS (%) | Alice (%) | Atlas (%) | CMS (%) | LHCb (%) | Comment |
---|---|---|---|---|---|---|
12/03/14 | 100 | 100 | 91.4 | 93.7 | 90.4 | There was a failure of the primary OPN link to CERN. Traffic flipped to the backup link but the failover was not complete. |
13/03/14 | 100 | 100 | 100 | 91.7 | 100 | 2 SRM test failures (both "User Timeout"). See next entry for cause. |
14/03/14 | 100 | 100 | 100 | 72.1 | 100 | SRM test failures. These appear as the same "User Timeout" problem as yesterday - a bad request in the database. |
15/03/14 | 100 | 100 | 100 | 100 | 100 | |
16/03/14 | 100 | 100 | 100 | 100 | 100 | |
17/03/14 | 100 | 100 | 99.1 | 87.7 | 100 | Atlas: Single SRM test failure ("User Timeout"); CMS: Continuation of what are believed to be load-triggered problems in CMS_Tape. |
18/03/14 | 100 | 100 | 97.9 | 64.2 | 100 | Atlas: Single SRM test failure ("could not open connection to srm-atlas"); CMS: Continuation of the above problems. |
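For reference, the daily figures above can be reduced to weekly averages per VO with a short script; a minimal sketch using the values copied from the table:

```python
# Weekly average availability per VO, using the daily figures (12th-18th March)
# copied from the table above.
availability = {
    "OPS":   [100, 100, 100, 100, 100, 100, 100],
    "Alice": [100, 100, 100, 100, 100, 100, 100],
    "Atlas": [91.4, 100, 100, 100, 100, 99.1, 97.9],
    "CMS":   [93.7, 91.7, 72.1, 100, 100, 87.7, 64.2],
    "LHCb":  [90.4, 100, 100, 100, 100, 100, 100],
}

for vo, daily in availability.items():
    print(f"{vo}: {sum(daily) / len(daily):.1f}%")
```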