Difference between revisions of "Tier1 Operations Report 2014-05-21"
From GridPP Wiki
Latest revision as of 13:22, 21 May 2014
RAL Tier1 Operations Report for 21st May 2014
Review of Issues during the week 14th to 21st May 2014.
- The grumbling problems with the WMSs reported last week are ongoing. The developers have been contacted.
- The problem reported last week, in which some half dozen Atlas files were lost during the draining of a disk server at the end of February, is now understood. This was an isolated incident. Draining has been resumed.
- There was a problem with FTS2 last Friday (16th May), which led to a ticket from CMS.
- The checksum checker found a corrupt LHCb file in Castor, which has been declared lost.
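As an illustration of the kind of check a checksum checker performs, the sketch below recomputes a file's checksum and compares it with a recorded value. The report does not say which algorithm or tooling is used at RAL; Adler-32 and the helper names `adler32_of_file` and `is_corrupt` are assumptions made for this example only.

```python
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Return the Adler-32 checksum of a file as an 8-digit hex string.

    Hypothetical helper: the algorithm is an assumption, not taken
    from the report.
    """
    checksum = 1  # Adler-32 is defined to start from 1
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # Feed the running value back in so large files are
            # processed in chunks rather than read into memory at once.
            checksum = zlib.adler32(chunk, checksum)
    return format(checksum & 0xFFFFFFFF, "08x")


def is_corrupt(path, expected_hex):
    """Return True when the recomputed checksum differs from the recorded one."""
    return adler32_of_file(path) != expected_hex.lower()
```

A file for which `is_corrupt` returns True would be a candidate for the investigation described above; whether it is then declared lost depends on whether another good copy exists.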
Resolved Disk Server Issues
- None.
Current operational status and issues
- Nothing to report.
Ongoing Disk Server Issues
- None.
Notable Changes made this last week.
- CVMFS client version 2.1.19 is in the final stages of being rolled out to the whole farm, following successful testing so far.
- One new disk server has been deployed to CMS disk. This replaces a server (GDSS758) that failed a couple of weeks ago.
- A new tape controller server (ACSLS) was put into operation yesterday morning (Tuesday 20th May).
Declared in the GOC DB
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
lcgui02.gridpp.rl.ac.uk, | SCHEDULED | OUTAGE | 30/04/2014 14:00 | 29/05/2014 13:00 | 28 days, 23 hours | Service being decommissioned. |
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- Provisional dates for the Castor 2.1.14 upgrade: Nameserver: Tuesday 10th June; Stagers: CMS - Tue 17th June; LHCb - Thu 19th June; GEN - Tue 24th June; Atlas - Thu 26th June.
- We are starting to plan the termination of the FTS2 service now that almost all use is on FTS3.
- Maintenance will be carried out on the diesel generator tomorrow morning (22nd May) from 09:00 to 11:00. Should we suffer a mains power failure during this window, we will not have generator backup.
- On Wednesday 28th May we plan to move the network switches for some Castor disk servers to the mesh network to alleviate a bottleneck. This will take place during an "at risk" period on Castor.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
  - Castor 2.1.14 testing is largely complete, although a new minor version (2.1.14-12) will be released soon.
- Networking:
  - Move switches connecting recent disk server batches ('11, '12) onto the Tier1 mesh network.
  - Make routing changes to allow the removal of the UKLight Router.
- Fabric:
  - We are phasing out the use of the software server used by the small VOs.
  - Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
  - There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 14th and 21st May 2014.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
srm-lhcb-tape.gridpp.rl.ac.uk, | SCHEDULED | OUTAGE | 20/05/2014 08:00 | 20/05/2014 11:00 | 3 hours | Outage of tape system for update of tape library controller. (Postponed from 13th May). |
All SRM end points | SCHEDULED | WARNING | 20/05/2014 08:00 | 20/05/2014 11:00 | 3 hours | Outage of tape system for update of tape library controller. (Postponed from 13th May). |
lcgvo08.gridpp.rl.ac.uk, | SCHEDULED | OUTAGE | 15/05/2014 15:00 | 15/05/2014 16:00 | 1 hour | Downtime for system maintenance |
lcgvo07.gridpp.rl.ac.uk, | SCHEDULED | OUTAGE | 15/05/2014 14:00 | 15/05/2014 15:00 | 1 hour | Downtime for system maintenance |
lcglb02.gridpp.rl.ac.uk, | SCHEDULED | OUTAGE | 15/05/2014 13:00 | 15/05/2014 14:00 | 1 hour | Downtime for system maintenance |
lcglb01.gridpp.rl.ac.uk, | SCHEDULED | OUTAGE | 15/05/2014 10:45 | 15/05/2014 11:45 | 1 hour | Downtime for system maintenance |
lcglb04.gridpp.rl.ac.uk, | SCHEDULED | OUTAGE | 15/05/2014 09:30 | 15/05/2014 10:30 | 1 hour | Downtime for system maintenance |
lcgwms06.gridpp.rl.ac.uk, | SCHEDULED | OUTAGE | 14/05/2014 14:00 | 14/05/2014 15:20 | 1 hour and 20 minutes | Downtime for system maintenance |
lcgwms05.gridpp.rl.ac.uk, | SCHEDULED | OUTAGE | 14/05/2014 11:30 | 14/05/2014 13:30 | 2 hours | Downtime for system maintenance |
lcgfts3.gridpp.rl.ac.uk, | SCHEDULED | OUTAGE | 14/05/2014 10:00 | 14/05/2014 12:00 | 2 hours | FTS3 service at RAL unavailable for update to version 3.2.22 |
lcgwms04.gridpp.rl.ac.uk, | SCHEDULED | OUTAGE | 14/05/2014 09:30 | 14/05/2014 10:46 | 1 hour and 16 minutes | Downtime for system maintenance |
lcgui02.gridpp.rl.ac.uk, | SCHEDULED | OUTAGE | 30/04/2014 14:00 | 29/05/2014 13:00 | 28 days, 23 hours | Service being decommissioned. |
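The Duration column follows directly from the Start and End columns. The sketch below shows that arithmetic for timestamps in the `DD/MM/YYYY HH:MM` form used in the table. The function names are hypothetical, and the wording of the output ("1 hour, 20 minutes") is this example's own choice rather than the exact phrasing the GOC DB produces.

```python
from datetime import datetime

# Timestamp layout used in the Start and End columns of the table above.
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def downtime_duration(start, end):
    """Return the length of a downtime as a datetime.timedelta."""
    begin = datetime.strptime(start, GOCDB_FORMAT)
    finish = datetime.strptime(end, GOCDB_FORMAT)
    return finish - begin


def describe(delta):
    """Format a timedelta as days, hours and minutes, omitting zero parts."""
    total_minutes = int(delta.total_seconds()) // 60
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    parts = []
    for value, unit in ((days, "day"), (hours, "hour"), (minutes, "minute")):
        if value:
            suffix = "" if value == 1 else "s"
            parts.append(f"{value} {unit}{suffix}")
    return ", ".join(parts)


if __name__ == "__main__":
    # The lcgui02 decommissioning entry from the table above.
    print(describe(downtime_duration("30/04/2014 14:00", "29/05/2014 13:00")))
```

Applied to the lcgui02 entry, this reproduces the 28 days and 23 hours shown in the table.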
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
105405 | Green | Urgent | In Progress | 2014-05-14 | 2014-05-15 | | please check your Vidyo router firewall configuration |
105308 | Green | Less Urgent | On Hold | 2014-05-11 | 2014-05-19 | Atlas | Jobs at RAL-LCG2_MCORE are failing with "Failed to open shared memory object: Permission denied" |
105161 | Yellow | Less Urgent | In Progress | 2014-05-05 | 2014-05-16 | H1 | hone jobs submitted into CREAM queues through lcgwms05.gridpp.rl.ac.uk & lcgwms06.gridpp.rl.ac.uk WMSs are are Ready status long time (more as 5 hours) |
105100 | Green | Urgent | On Hold | 2014-05-02 | 2014-05-12 | CMS | T1_UK_RAL Consistency Check (May14) |
98249 | Red | Urgent | In Progress | 2013-10-21 | 2014-05-20 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1 |
Availability Report
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment |
---|---|---|---|---|---|---|---|---|
14/05/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
15/05/14 | 100 | 100 | 100 | 100 | 100 | 94 | 96 | |
16/05/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
17/05/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
18/05/14 | 100 | 100 | 100 | 100 | 100 | 99 | 100 | |
19/05/14 | 100 | 100 | 100 | 100 | 100 | 99 | 100 | |
20/05/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
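The report gives daily figures only. A weekly average per column can be derived from them, as in the sketch below. The two lists are copied from the Atlas HC and CMS HC columns above; the helper name `weekly_average` and the rounding to two decimal places are choices made for this example, not part of the report.

```python
from statistics import mean

# Daily percentages for 14/05/14 to 20/05/14, copied from the table above.
atlas_hc = [100, 94, 100, 100, 99, 99, 100]
cms_hc = [100, 96, 100, 100, 100, 100, 100]


def weekly_average(values):
    """Return the arithmetic mean of the daily figures, rounded to 2 places."""
    return round(mean(values), 2)


if __name__ == "__main__":
    print("Atlas HC:", weekly_average(atlas_hc))
    print("CMS HC:", weekly_average(cms_hc))
```

The same helper applies unchanged to the OPS, Alice, Atlas, CMS and LHCb columns, all of which were at 100 on every day of this week.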