Difference between revisions of "Tier1 Operations Report 2014-07-09"
From GridPP Wiki
Latest revision as of 13:17, 9 July 2014
RAL Tier1 Operations Report for 9th July 2014
Review of Issues during the week 2nd to 9th July 2014.
- The SRMs for the Castor GEN instance (not Castor itself) had problems on Thursday and Friday of last week (3rd and 4th July). The problem was fixed by a database edit.
- There were problems with Atlas multicore jobs on Friday 4th July. We believe this is an Atlas issue.
Resolved Disk Server Issues
- None
Current operational status and issues
- We are still investigating xroot access to CMS Castor following the upgrade on 17th June.
- There is a problem with the dteam SRM regional Nagios tests, which may be caused by how dteam is published by the CIP.
Ongoing Disk Server Issues
- None
Notable Changes made this last week.
- On Tuesday and Wednesday (8th and 9th July) the Atlas Castor instance was upgraded to version 2.1.14-13. It was returned to production at 10:40 this morning (9th July).
Declared in the GOC DB
- None
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- We plan to terminate the FTS2 service (announced for 2nd September) now that almost all use has moved to FTS3.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
  - None.
- Networking:
  - Move switches connecting the 2011 disk servers batches onto the Tier1 mesh network.
  - Make routing changes to allow the removal of the UKLight Router.
- Fabric:
  - We are phasing out the use of the software server used by the small VOs.
  - Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
  - There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 2nd and 9th July 2014.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
srm-atlas | SCHEDULED | OUTAGE | 08/07/2014 06:00 | 09/07/2014 10:40 | 1 day, 6 hours | Atlas Castor instance down for Castor 2.1.14 Stager Update |
Castor GEN: srm-alice, srm-biomed, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k | UNSCHEDULED | WARNING | 03/07/2014 07:45 | 03/07/2014 13:00 | 5 hours and 15 minutes | Problem with SRMs for Castor GEN instance. (However Castor itself - e.g. xroot access - working OK). |
Whole site | SCHEDULED | WARNING | 02/07/2014 10:00 | 02/07/2014 11:00 | 1 hour | RAL Tier1 site in warning state due to UPS/generator test. |
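
As a cross-check on the Duration column, the elapsed time can be recomputed from the Start and End timestamps. The following is a minimal sketch, not part of the original report; the names `ENTRIES`, `elapsed` and `describe` are hypothetical and chosen only for illustration. It assumes the timestamps are day/month/year on a 24-hour clock. For srm-atlas the recomputed span is shorter than the listed "1 day, 6 hours"; one possible explanation (an assumption, not stated in the report) is that the listed duration reflects the window originally declared rather than the actual end time.

```python
from datetime import datetime

# Start/end timestamps copied from the table above (day/month/year, 24-hour clock).
# The dictionary keys are shortened labels, not official GOC DB service names.
ENTRIES = {
    "srm-atlas": ("08/07/2014 06:00", "09/07/2014 10:40"),
    "Castor GEN SRMs": ("03/07/2014 07:45", "03/07/2014 13:00"),
    "Whole site": ("02/07/2014 10:00", "02/07/2014 11:00"),
}

FORMAT = "%d/%m/%Y %H:%M"


def elapsed(start, end):
    """Return the time between two table timestamps as a timedelta."""
    return datetime.strptime(end, FORMAT) - datetime.strptime(start, FORMAT)


def describe(delta):
    """Render a timedelta as whole days, hours and minutes."""
    minutes = int(delta.total_seconds()) // 60
    days, remainder = divmod(minutes, 24 * 60)
    hours, mins = divmod(remainder, 60)
    return f"{days} d {hours} h {mins} min"


if __name__ == "__main__":
    for service, (start, end) in ENTRIES.items():
        print(f"{service}: {describe(elapsed(start, end))}")
```
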
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
106753 | Green | Less Urgent | In Progress | 2014-07-09 | 2014-07-09 | Atlas | Errors in transfers to RAL-LCG2 |
106695 | Green | Less Urgent | In Progress | 2014-07-08 | 2014-07-08 | Ops | [Rod Dashboard] Issues detected at RAL-LCG2 |
106655 | Green | Less Urgent | In Progress | 2014-07-04 | 2014-07-04 | Ops | [Rod Dashboard] Issues detected at RAL-LCG2 (srm-dteam) |
106640 | Green | Less Urgent | In Progress | 2014-07-04 | 2014-07-04 | ILC | Failure to submit jobs to RAL-LCG2 CEs |
106610 | Green | Less Urgent | In Progress | 2014-07-02 | 2014-07-02 | HyperK | HyperK support |
106324 | Yellow | Urgent | In Progress | 2014-06-18 | 2014-07-01 | CMS | pilots losing network connections at T1_UK_RAL |
105405 | Red | Urgent | On Hold | 2014-05-14 | 2014-07-01 | | please check your Vidyo router firewall configuration |
Availability Report
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment |
---|---|---|---|---|---|---|---|---|
02/07/14 | 100 | 100 | 100 | 100 | 100 | 98 | 99 | |
03/07/14 | 100 | 100 | 100 | 100 | 100 | 99 | 100 | |
04/07/14 | 100 | 100 | 100 | 100 | 100 | 97 | 100 | |
05/07/14 | 100 | 100 | 100 | 100 | 100 | 92 | 100 | |
06/07/14 | 100 | 100 | 100 | 100 | 100 | 99 | 100 | |
07/07/14 | 100 | 100 | 100 | 100 | 100 | 97 | 100 | |
08/07/14 | 100 | 100 | 41 | 100 | 100 | 100 | 99 | Atlas Castor upgrade. |
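
To summarise the week, the daily figures can be averaged per column. The sketch below is illustrative only: it uses a simple unweighted mean of the seven daily percentages, which is an assumption and may not match how official availability figures are aggregated. The names `AVAILABILITY` and `weekly_mean` are hypothetical.

```python
# Daily availability percentages copied from the table above,
# in date order from 02/07/14 to 08/07/14.
AVAILABILITY = {
    "OPS": [100, 100, 100, 100, 100, 100, 100],
    "Alice": [100, 100, 100, 100, 100, 100, 100],
    "Atlas": [100, 100, 100, 100, 100, 100, 41],
    "CMS": [100, 100, 100, 100, 100, 100, 100],
    "LHCb": [100, 100, 100, 100, 100, 100, 100],
    "Atlas HC": [98, 99, 97, 92, 99, 97, 100],
    "CMS HC": [99, 100, 100, 100, 100, 100, 99],
}


def weekly_mean(values):
    """Unweighted mean of the daily percentages (an assumed aggregation)."""
    return sum(values) / len(values)


if __name__ == "__main__":
    for column, values in AVAILABILITY.items():
        print(f"{column}: {weekly_mean(values):.1f}")
```

On these figures the Atlas column averages roughly 91.6 for the week, with the whole shortfall coming from the Castor upgrade on 08/07/14.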