Difference between revisions of "Tier1 Operations Report 2014-07-16"
From GridPP Wiki
(Created page with "=RAL Tier1 Operations Report for 16th July 2014= __NOTOC__ ====== ====== <!-- ************************************************************* -----> <!-- ***********Start Revie...") |
(→) |
||
(14 intermediate revisions by one user not shown) | |||
Line 9: | Line 9: | ||
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Review of Issues during the week 9th to 16th July 2014. | | style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Review of Issues during the week 9th to 16th July 2014. | ||
|} | |} | ||
− | * There | + | * The SRM processes for the Castor GEN instance have been crashing repeatedly since Friday (11th). This appears to be linked to a particular user and is under investigation. |
− | * | + | * There was a problem with Atlas file transfers that affected only their functional test transfers. Atlas resolved it. |
+ | * There were several cases where Atlas SRM servers were automatically restarted between Thursday and Saturday (10-12 July). | ||
<!-- ***********End Review of Issues during last week*********** -----> | <!-- ***********End Review of Issues during last week*********** -----> | ||
<!-- *********************************************************** -----> | <!-- *********************************************************** -----> | ||
Line 21: | Line 22: | ||
| style="background-color: #f8d6a9; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Resolved Disk Server Issues | | style="background-color: #f8d6a9; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Resolved Disk Server Issues | ||
|} | |} | ||
− | * | + | * On Wednesday 9th July GDSS546 (CMSTape - D0T1) crashed. It was returned to service on Friday (11th). The RAID array was reporting a problem - but no failed drives were found. One file was lost at the time the server crashed. |
+ | * On Thursday 10th July GDSS527 (CMSTape - D0T1) was taken out of service for a couple of hours to investigate why it did not see a replacement drive. | ||
+ | * On Sunday 13th July GDSS720 (AtlasDataDisk - D1T0) crashed. It was returned to service the next day (Monday 14th) although no fault was found. Fifteen files were lost at the time the server crashed. | ||
<!-- ***********End Resolved Disk Server Issues*********** -----> | <!-- ***********End Resolved Disk Server Issues*********** -----> | ||
<!-- ***************************************************** -----> | <!-- ***************************************************** -----> | ||
Line 55: | Line 58: | ||
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Notable Changes made this last week. | | style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Notable Changes made this last week. | ||
|} | |} | ||
− | * | + | * New CERN VOMS servers added. |
+ | * Castor re-pack instance being updated to version 2.1.14-13 (ongoing now). | ||
<!-- *************End Notable Changes made this last week************** -----> | <!-- *************End Notable Changes made this last week************** -----> | ||
<!-- ****************************************************************** -----> | <!-- ****************************************************************** -----> | ||
Line 66: | Line 70: | ||
| style="background-color: #d8e8ff; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Declared in the GOC DB | | style="background-color: #d8e8ff; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Declared in the GOC DB | ||
|} | |} | ||
− | + | {| border=1 align=center | |
+ | |- bgcolor="#7c8aaf" | ||
+ | ! Service | ||
+ | ! Scheduled? | ||
+ | ! Outage/At Risk | ||
+ | ! Start | ||
+ | ! End | ||
+ | ! Duration | ||
+ | ! Reason | ||
+ | |- | ||
+ | |perfsonar-ps01.gridpp.rl.ac.uk, perfsonar-ps02.gridpp.rl.ac.uk, | ||
+ | | SCHEDULED | ||
+ | | OUTAGE | ||
+ | | 14/07/2014 11:00 | ||
+ | | 14/08/2014 11:00 | ||
+ | | 31 days, | ||
+ | | Systems being decommissioned. They have been replaced by lcgps01.gridpp.rl.ac.uk and lcgps02.gridpp.rl.ac.uk | ||
+ | |} | ||
<!-- **********************End GOC DB Entries************************** -----> | <!-- **********************End GOC DB Entries************************** -----> | ||
<!-- ****************************************************************** -----> | <!-- ****************************************************************** -----> | ||
Line 81: | Line 102: | ||
<!-- ******* still to be formally scheduled and/or announced ******* -----> | <!-- ******* still to be formally scheduled and/or announced ******* -----> | ||
* We are planning the termination of the FTS2 service (announced for 2nd September) now that almost all use is on FTS3. | * We are planning the termination of the FTS2 service (announced for 2nd September) now that almost all use is on FTS3. | ||
+ | * The FTS3 service will be updated (to v3.2.26) on Monday morning, 21st July. | ||
+ | * The core RAL site network will be updated to use RIP for network routing on Tuesday 22nd July. | ||
'''Listing by category:''' | '''Listing by category:''' | ||
* Databases: | * Databases: | ||
Line 113: | Line 136: | ||
! Reason | ! Reason | ||
|- | |- | ||
− | | | + | |perfsonar-ps01.gridpp.rl.ac.uk, perfsonar-ps02.gridpp.rl.ac.uk, |
| SCHEDULED | | SCHEDULED | ||
| OUTAGE | | OUTAGE | ||
− | | | + | | 14/07/2014 11:00 |
− | | | + | | 14/08/2014 11:00 |
− | | | + | | 31 days, |
− | | | + | | Systems being decommissioned. They have been replaced by lcgps01.gridpp.rl.ac.uk and lcgps02.gridpp.rl.ac.uk |
|- | |- | ||
− | | Castor GEN | + | | Castor GEN SRMs. (srm-alice, srm-biomed, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k) |
| UNSCHEDULED | | UNSCHEDULED | ||
| WARNING | | WARNING | ||
− | | | + | | 11/07/2014 17:10 |
− | | | + | | 14/07/2014 12:00 |
− | | | + | | 2 days, 18 hours and 50 minutes |
− | | | + | | There was a problem with the Castor GEN instance SRMs (Castor OK, but not the SRMs). Now improved. Setting a WARNING state over weekend. |
|- | |- | ||
− | | | + | | Castor GEN SRMs. (srm-alice, srm-biomed, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k) |
+ | | UNSCHEDULED | ||
+ | | OUTAGE | ||
+ | | 11/07/2014 13:00 | ||
+ | | 11/07/2014 17:00 | ||
+ | | 4 hours | ||
+ | | We are investigating a problem with the Castor GEN instance SRMs (Castor OK, but not the SRMs). | ||
+ | |- | ||
+ | | srm-atlas.gridpp.rl.ac.uk, | ||
| SCHEDULED | | SCHEDULED | ||
− | | | + | | OUTAGE |
− | | | + | | 08/07/2014 06:00 |
− | | | + | | 09/07/2014 10:39 |
− | | 1 | + | | 1 day, 4 hours and 39 minutes |
− | | | + | | Atlas Castor instance down for Castor 2.1.14 Stager Update |
|} | |} | ||
<!-- **********************End GOC DB Entries************************** -----> | <!-- **********************End GOC DB Entries************************** -----> | ||
Line 152: | Line 183: | ||
! GGUS ID !! Level !! Urgency !! State !! Creation !! Last Update !! VO !! Subject | ! GGUS ID !! Level !! Urgency !! State !! Creation !! Last Update !! VO !! Subject | ||
|- | |- | ||
− | | | + | | 106927 |
+ | | Green | ||
+ | | Top Priority | ||
+ | | Waiting Reply | ||
+ | | 2014-07-16 | ||
+ | | 2014-07-16 | ||
+ | | | ||
+ | | GGUS Test alarm ticket | ||
+ | |- | ||
+ | | 106802 | ||
| Green | | Green | ||
| Less Urgent | | Less Urgent | ||
| In Progress | | In Progress | ||
− | | 2014-07- | + | | 2014-07-11 |
− | | 2014-07- | + | | 2014-07-11 |
− | | | + | | CMS |
− | | | + | | Several nodes contacting ILC/CMS VOs |
|- | |- | ||
− | | | + | | 106770 |
| Green | | Green | ||
| Less Urgent | | Less Urgent | ||
| In Progress | | In Progress | ||
− | | 2014-07- | + | | 2014-07-11 |
− | | 2014-07- | + | | 2014-07-14 |
− | | | + | | enmr.eu |
− | | | + | | Unable to add software tags |
|- | |- | ||
| 106655 | | 106655 | ||
Line 175: | Line 215: | ||
| In Progress | | In Progress | ||
| 2014-07-04 | | 2014-07-04 | ||
− | | 2014-07- | + | | 2014-07-10 |
| Ops | | Ops | ||
| [Rod Dashboard] Issues detected at RAL-LCG2 (srm-dteam) | | [Rod Dashboard] Issues detected at RAL-LCG2 (srm-dteam) | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
|- | |- | ||
| 106610 | | 106610 | ||
Line 198: | Line 229: | ||
|- | |- | ||
| 106324 | | 106324 | ||
− | | | + | | Red |
| Urgent | | Urgent | ||
| In Progress | | In Progress | ||
| 2014-06-18 | | 2014-06-18 | ||
− | | 2014-07- | + | | 2014-07-07 |
| CMS | | CMS | ||
| pilots losing network connections at T1_UK_RAL | | pilots losing network connections at T1_UK_RAL |
Latest revision as of 13:41, 16 July 2014
RAL Tier1 Operations Report for 16th July 2014
Review of Issues during the week 9th to 16th July 2014. |
- The SRM processes for the Castor GEN instance have been crashing repeatedly since Friday (11th). This appears to be linked to a particular user and is under investigation.
- There was a problem with Atlas file transfers that affected only their functional test transfers. Atlas resolved it.
- There were several cases where Atlas SRM servers were automatically restarted between Thursday and Saturday (10-12 July).
Resolved Disk Server Issues |
- On Wednesday 9th July GDSS546 (CMSTape - D0T1) crashed. It was returned to service on Friday (11th). The RAID array was reporting a problem - but no failed drives were found. One file was lost at the time the server crashed.
- On Thursday 10th July GDSS527 (CMSTape - D0T1) was taken out of service for a couple of hours to investigate why it did not see a replacement drive.
- On Sunday 13th July GDSS720 (AtlasDataDisk - D1T0) crashed. It was returned to service the next day (Monday 14th) although no fault was found. Fifteen files were lost at the time the server crashed.
Current operational status and issues |
- We are still investigating xroot access to CMS Castor following the upgrade on the 17th June.
- There is a problem with the dteam SRM regional nagios tests, which may be caused by how dteam is published by the CIP.
Ongoing Disk Server Issues |
- None
Notable Changes made this last week. |
- New CERN VOMS servers added.
- Castor re-pack instance being updated to version 2.1.14-13 (ongoing now).
Declared in the GOC DB |
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
perfsonar-ps01.gridpp.rl.ac.uk, perfsonar-ps02.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 14/07/2014 11:00 | 14/08/2014 11:00 | 31 days | Systems being decommissioned. They have been replaced by lcgps01.gridpp.rl.ac.uk and lcgps02.gridpp.rl.ac.uk |
Advance warning for other interventions |
The following items are being discussed and are still to be formally scheduled and announced. |
- We are planning the termination of the FTS2 service (announced for 2nd September) now that almost all use is on FTS3.
- The FTS3 service will be updated (to v3.2.26) on Monday morning, 21st July.
- The core RAL site network will be updated to use RIP for network routing on Tuesday 22nd July.
Listing by category:
- Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- None.
- Networking:
- Move switches connecting the 2011 disk servers batches onto the Tier1 mesh network.
- Make routing changes to allow the removal of the UKLight Router.
- Fabric
- We are phasing out the use of the software server used by the small VOs.
- Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC)
- There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 9th and 16th July 2014. |
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
perfsonar-ps01.gridpp.rl.ac.uk, perfsonar-ps02.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 14/07/2014 11:00 | 14/08/2014 11:00 | 31 days | Systems being decommissioned. They have been replaced by lcgps01.gridpp.rl.ac.uk and lcgps02.gridpp.rl.ac.uk |
Castor GEN SRMs. (srm-alice, srm-biomed, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k) | UNSCHEDULED | WARNING | 11/07/2014 17:10 | 14/07/2014 12:00 | 2 days, 18 hours and 50 minutes | There was a problem with the Castor GEN instance SRMs (Castor OK, but not the SRMs). Now improved. Setting a WARNING state over weekend. |
Castor GEN SRMs. (srm-alice, srm-biomed, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k) | UNSCHEDULED | OUTAGE | 11/07/2014 13:00 | 11/07/2014 17:00 | 4 hours | We are investigating a problem with the Castor GEN instance SRMs (Castor OK, but not the SRMs). |
srm-atlas.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 08/07/2014 06:00 | 09/07/2014 10:39 | 1 day, 4 hours and 39 minutes | Atlas Castor instance down for Castor 2.1.14 Stager Update |
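
The Duration column in the table above follows directly from the Start and End timestamps. As a rough cross-check, the sketch below recomputes each duration with Python's standard `datetime` module. The short entry labels and the helper name `duration_parts` are illustrative assumptions and do not come from the GOC DB; the timestamps are copied from the table.

```python
from datetime import datetime

# Timestamp layout used in the GOC DB table (DD/MM/YYYY HH:MM).
FMT = "%d/%m/%Y %H:%M"

# Start/end values copied from the table; the labels are hypothetical shorthand.
ENTRIES = [
    ("perfsonar decommissioning", "14/07/2014 11:00", "14/08/2014 11:00"),
    ("Castor GEN SRMs warning", "11/07/2014 17:10", "14/07/2014 12:00"),
    ("Castor GEN SRMs outage", "11/07/2014 13:00", "11/07/2014 17:00"),
    ("srm-atlas stager update", "08/07/2014 06:00", "09/07/2014 10:39"),
]


def duration_parts(start, end):
    """Return (days, hours, minutes) between two table-style timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    # timedelta stores whole days separately from the leftover seconds.
    hours, remainder = divmod(delta.seconds, 3600)
    return delta.days, hours, remainder // 60


for label, start, end in ENTRIES:
    days, hours, minutes = duration_parts(start, end)
    print(f"{label}: {days} days, {hours} hours and {minutes} minutes")
```

The timestamps are treated as naive local times, so the sketch assumes no daylight-saving change falls inside any interval, which holds for these July and August dates.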
Open GGUS Tickets (Snapshot during morning of meeting) |
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
106927 | Green | Top Priority | Waiting Reply | 2014-07-16 | 2014-07-16 | | GGUS Test alarm ticket |
106802 | Green | Less Urgent | In Progress | 2014-07-11 | 2014-07-11 | CMS | Several nodes contacting ILC/CMS VOs |
106770 | Green | Less Urgent | In Progress | 2014-07-11 | 2014-07-14 | enmr.eu | Unable to add software tags |
106655 | Green | Less Urgent | In Progress | 2014-07-04 | 2014-07-10 | Ops | [Rod Dashboard] Issues detected at RAL-LCG2 (srm-dteam) |
106610 | Green | Less Urgent | In Progress | 2014-07-02 | 2014-07-02 | HyperK | HyperK support |
106324 | Red | Urgent | In Progress | 2014-06-18 | 2014-07-07 | CMS | pilots losing network connections at T1_UK_RAL |
105405 | Red | Urgent | On Hold | 2014-05-14 | 2014-07-01 | | please check your Vidyo router firewall configuration |
Availability Report |
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment |
---|---|---|---|---|---|---|---|---|
09/07/14 | 100 | 100 | 99.3 | 96.1 | 95.8 | 100 | 100 | Central networking problem |
10/07/14 | 100 | 100 | 98.0 | 100 | 100 | 97 | 100 | srmServer restart. |
11/07/14 | 100 | 100 | 98.0 | 100 | 100 | 97 | 100 | srmServer restart. |
12/07/14 | 100 | 100 | 100 | 100 | 100 | 99 | 99 | |
13/07/14 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | |
14/07/14 | 100 | 100 | 100 | 100 | 100 | 99 | 100 | |
15/07/14 | 100 | 100 | 100 | 100 | 100 | 99 | 99 | |