Difference between revisions of "Tier1 Operations Report 2015-10-28"

From GridPP Wiki
(Created page with "DRAFT RAL Tier1 Operations Report for Date= {| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: coll...")
 
Line 1: Line 1:
DRAFT
+
==RAL Tier1 Operations Report for 28th October 2015==
RAL Tier1 Operations Report for Date=
+
__NOTOC__
 +
====== ======
  
 +
<!-- ************************************************************* ----->
 +
<!-- ***********Start Review of Issues during last week*********** ----->
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
|-
 
|-
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Review of Issues during the week 9th to 16th May 2012
+
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Review of Issues during the week 14th to 21st October 2015.
 
|}
 
|}
* Item 1
+
* item
* Item 2
+
* item
 +
<!-- ***********End Review of Issues during last week*********** ----->
 +
<!-- *********************************************************** ----->
 +
 
 +
====== ======
 +
<!-- ******************************************************* ----->
 +
<!-- ***********Start Resolved Disk Server Issues*********** ----->
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
|-
 
|-
 
| style="background-color: #f8d6a9; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Resolved Disk Server Issues
 
| style="background-color: #f8d6a9; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Resolved Disk Server Issues
 
|}
 
|}
<!-- ****** Resolved Disk Server Issues ******* ----->
+
* item
* GDSSXXX report
+
<!-- ***********End Resolved Disk Server Issues*********** ----->
<!-- ******************************************* ----->
+
<!-- ***************************************************** ----->
  
 +
====== ======
 +
<!-- ***************************************************************** ----->
 +
<!-- ***********Start Current operational status and issues*********** ----->
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
|-
 
|-
 
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Current operational status and issues
 
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Current operational status and issues
 
|}
 
|}
 +
* There is an LHCb problem with a low but persistent rate of failure when copying the results of batch jobs to Castor. A further problem sometimes occurs when these failed writes are then attempted to storage at other sites.
 +
* The intermittent, low-level, load-related packet loss seen over external connections is still being tracked. Likewise, we have been working to understand a remaining low level of packet loss seen within part of our Tier1 network.
 +
* Long-standing CMS issues. The two items that remain are CMS Xroot (AAA) redirection and file open times. Work on the Xroot redirection is ongoing, and a new server has been added in recent weeks. File open times using Xroot remain slow, but this is a less significant problem.
 +
<!-- ***********End Current operational status and issues*********** ----->
 +
<!-- *************************************************************** ----->
  
* Issue 1
+
====== ======
* Issue 2
+
<!-- *************************************************************** ----->
<!-- ******************************************* ----->
+
<!-- ***************Start Ongoing Disk Server Issues**************** ----->
 
+
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
|-
 
|-
 
| style="background-color: #f8d6a9; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Ongoing Disk Server Issues
 
| style="background-color: #f8d6a9; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Ongoing Disk Server Issues
 
|}
 
|}
* Ongoing issue 1
+
* GDSS663
* Ongoing issue 2
+
* GDSS665
<!-- ******* Ongoing Disk Server Issues ******** ----->
+
* GDSS644
 +
<!-- ***************End Ongoing Disk Server Issues**************** ----->
 +
<!-- ************************************************************* ----->
  
 +
====== ======
 +
<!-- ******************************************************************** ----->
 +
<!-- *************Start Notable Changes made since the last meeting************** ----->
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
|-
 
|-
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Notable Changes made this last week
+
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Notable Changes made since the last meeting.
 
|}
 
|}
<!-- **** Notable Changes made this last week ***** ----->
+
* item
* Change 1
+
* item
* Change 2
+
<!-- *************End Notable Changes made this last week************** ----->
<!-- ******************************************* ----->
+
<!-- ****************************************************************** ----->
  
 +
====== ======
 +
<!-- ******************************************************************** ----->
 +
<!-- **********************Start GOC DB Entries************************** ----->
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
|-
 
|-
 
| style="background-color: #d8e8ff; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Declared in the GOC DB
 
| style="background-color: #d8e8ff; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Declared in the GOC DB
 
|}
 
|}
<!-- ******* Declared in the GOC DB ******* ----->
+
* None
* Declared outage 1
+
<!-- **********************End GOC DB Entries************************** ----->
<!-- ******************************************* ----->
+
<!-- ****************************************************************** ----->
  
 +
====== ======
 +
<!-- ******************************************************************************* ----->
 +
<!-- ****************Start Advanced warning for other interventions***************** ----->
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
|-
 
|-
Line 56: Line 83:
 
| style="background-color: #d8e8ff; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;"| The following items are being discussed and are still to be formally scheduled and announced.
 
| style="background-color: #d8e8ff; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;"| The following items are being discussed and are still to be formally scheduled and announced.
 
|}
 
|}
<!-- ******* still to be formally scheduled and announced ******* ----->
+
<!-- ******* still to be formally scheduled and/or announced ******* ----->
* Item 1
+
* Upgrade of remaining Castor disk servers (those in tape-backed service classes) to SL6. This will be transparent to users.
** sub Item 1.1
+
* Some detailed internal network re-configurations to enable the removal of the old 'core' switch from our network. This includes changing the way the UKLIGHT router connects into the Tier1 network.
** sub Item 1.2
+
'''Listing by category:'''
* Item 2
+
* Databases:
** Sub Item 2.1
+
** Switch LFC/3D to new Database Infrastructure.
*** sub Item 2.1.1
+
* Castor:
*** sub Item 2.1.2
+
** Update SRMs to new version (includes updating to SL6).
 
+
** Update disk servers to SL6 (ongoing)
<!-- ******************************************* ----->
+
** Update to Castor version 2.1.15.
 +
* Networking:
 +
** Complete changes needed to remove the old core switch from the Tier1 network.
 +
** Make routing changes to allow the removal of the UKLight Router.
 +
* Fabric
 +
** Firmware updates on remaining EMC disk arrays (Castor, LFC)
 +
<!-- ***************End Advanced warning for other interventions*************** ----->
 +
<!-- ************************************************************************** ----->
  
 +
====== ======
 +
<!-- ******************************************************************** ----->
 +
<!-- **********************Start GOC DB Entries************************** ----->
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
|-
 
|-
| style="background-color: #7c8aaf; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Entries in GOC DB starting between 2nd and 9th May 2012
+
| style="background-color: #7c8aaf; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Entries in GOC DB starting since the last report.
 
|}
 
|}
 +
* None
 +
<!-- **********************End GOC DB Entries************************** ----->
 +
<!-- ****************************************************************** ----->
  
There were XXX  unscheduled outages during the last week.
+
====== ======
 
+
<!-- ****************************************************************** ----->
Put GOCDB table here.
+
<!-- **********************Start GGUS Tickets************************** ----->
 
+
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
|-
 
|-
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Open GGUS Tickets
+
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Open GGUS Tickets (Snapshot during morning of meeting)
 
|}
 
|}
 +
{| border="1" cellpadding="1"
 +
|+
 +
|-style="background:#b7f1ce"
 +
! GGUS ID !! Level !! Urgency !! State !! Creation !! Last Update !! VO !! Subject
 +
|-
 +
| 116866
 +
| Green
 +
| Less Urgent
 +
| On Hold
 +
| 2015-10-12
 +
| 2015-10-19
 +
| SNO+
 +
| snoplus support at RAL-LCG2 (pilot role)
 +
|-
 +
| 116864
 +
| Green
 +
| Urgent
 +
| In Progress
 +
| 2015-10-12
 +
| 2015-10-14
 +
| CMS
 +
| T1_UK_RAL AAA opening and reading test failing again...
 +
|}
 +
<!-- **********************End GGUS Tickets************************** ----->
 +
<!-- ****************************************************************** ----->
  
Put GGUS table here.
+
====== ======
 +
<!-- ************************************************************************* ----->
 +
<!-- **********************Start Availability Report************************** ----->
 +
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 +
|-
 +
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Availability Report
 +
|}
 +
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
 +
{| border="1" cellpadding="1"
 +
|+
 +
|-style="background:#b7f1ce"
 +
! Day !! OPS !! Alice !! Atlas !! CMS !! LHCb !! Atlas HC !! CMS HC !! Comment
 +
|-
 +
| 14/10/15 || 100 || 100 || 100 || 100 || 100 || 91 || n/a ||
 +
|-
 +
| 15/10/15 || 100 || 100 || style="background-color: lightgrey;" | 98 || 100 || 100 || 85 || 100 || Single SRM test failure (Could not open connection to srm-atlas.gridpp.rl.ac.uk:8443)
 +
|-
 +
| 16/10/15 || 100 || 100 || 100 || style="background-color: lightgrey;" | 98 || 100 || 89 || 100 || Short problem with glexec in the early hours of the morning.
 +
|-
 +
| 17/10/15 || 100 || 100 || 100 || 100 || 100 || 95 || 100 ||
 +
|-
 +
| 18/10/15 || 100 || 100 || 100 || 100 || 100 || 92 || n/a ||
 +
|-
 +
| 19/10/15 || 100 || 100 || 100 || 100 || 100 || 100 || 100 ||
 +
|-
 +
| 20/10/15 || 100 || 100 || 100 || 100 || 100 || 93 || 100 ||
 +
|}
 +
<!-- **********************End Availability Report************************** ----->
 +
<!-- *********************************************************************** ----->

Revision as of 15:50, 27 October 2015

==RAL Tier1 Operations Report for 28th October 2015==
Review of Issues during the week 14th to 21st October 2015.
* item
* item
Resolved Disk Server Issues
* item
Current operational status and issues
  • There is an LHCb problem with a low but persistent rate of failure when copying the results of batch jobs to Castor. A further problem sometimes occurs when these failed writes are then attempted to storage at other sites.
  • The intermittent, low-level, load-related packet loss seen over external connections is still being tracked. Likewise, we have been working to understand a remaining low level of packet loss seen within part of our Tier1 network.
  • Long-standing CMS issues. The two items that remain are CMS Xroot (AAA) redirection and file open times. Work on the Xroot redirection is ongoing, and a new server has been added in recent weeks. File open times using Xroot remain slow, but this is a less significant problem.
Ongoing Disk Server Issues
  • GDSS663
  • GDSS665
  • GDSS644
Notable Changes made since the last meeting.
  • item
  • item
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • Upgrade of remaining Castor disk servers (those in tape-backed service classes) to SL6. This will be transparent to users.
  • Some detailed internal network re-configurations to enable the removal of the old 'core' switch from our network. This includes changing the way the UKLIGHT router connects into the Tier1 network.

Listing by category:

  • Databases:
    • Switch LFC/3D to new Database Infrastructure.
  • Castor:
    • Update SRMs to new version (includes updating to SL6).
    • Update disk servers to SL6 (ongoing)
    • Update to Castor version 2.1.15.
  • Networking:
    • Complete changes needed to remove the old core switch from the Tier1 network.
    • Make routing changes to allow the removal of the UKLight Router.
  • Fabric
    • Firmware updates on remaining EMC disk arrays (Castor, LFC)
Entries in GOC DB starting since the last report.
  • None
Open GGUS Tickets (Snapshot during morning of meeting)
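{| border="1" cellpadding="1"
|- style="background:#b7f1ce"
! GGUS ID !! Level !! Urgency !! State !! Creation !! Last Update !! VO !! Subject
|-
| 116866 || Green || Less Urgent || On Hold || 2015-10-12 || 2015-10-19 || SNO+ || snoplus support at RAL-LCG2 (pilot role)
|-
| 116864 || Green || Urgent || In Progress || 2015-10-12 || 2015-10-14 || CMS || T1_UK_RAL AAA opening and reading test failing again...
|}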
Availability Report

Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud

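{| border="1" cellpadding="1"
|- style="background:#b7f1ce"
! Day !! OPS !! Alice !! Atlas !! CMS !! LHCb !! Atlas HC !! CMS HC !! Comment
|-
| 14/10/15 || 100 || 100 || 100 || 100 || 100 || 91 || n/a ||
|-
| 15/10/15 || 100 || 100 || 98 || 100 || 100 || 85 || 100 || Single SRM test failure (Could not open connection to srm-atlas.gridpp.rl.ac.uk:8443)
|-
| 16/10/15 || 100 || 100 || 100 || 98 || 100 || 89 || 100 || Short problem with glexec in the early hours of the morning.
|-
| 17/10/15 || 100 || 100 || 100 || 100 || 100 || 95 || 100 ||
|-
| 18/10/15 || 100 || 100 || 100 || 100 || 100 || 92 || n/a ||
|-
| 19/10/15 || 100 || 100 || 100 || 100 || 100 || 100 || 100 ||
|-
| 20/10/15 || 100 || 100 || 100 || 100 || 100 || 93 || 100 ||
|}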
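
The daily figures in the Availability Report can be condensed into one number per column for the week. The sketch below is only an illustration: it takes a plain arithmetic mean over the days that reported a value and skips "n/a" entries, which is an assumption and not necessarily how the official availability figures are aggregated. The names `ROWS`, `COLUMNS` and `weekly_means` are hypothetical and do not come from the report.

```python
from statistics import mean

# Daily figures copied from the Availability Report table above.
# None stands for an "n/a" entry (no result reported that day).
COLUMNS = ["OPS", "Alice", "Atlas", "CMS", "LHCb", "Atlas HC", "CMS HC"]
ROWS = [
    ("14/10/15", 100, 100, 100, 100, 100, 91, None),
    ("15/10/15", 100, 100, 98, 100, 100, 85, 100),
    ("16/10/15", 100, 100, 100, 98, 100, 89, 100),
    ("17/10/15", 100, 100, 100, 100, 100, 95, 100),
    ("18/10/15", 100, 100, 100, 100, 100, 92, None),
    ("19/10/15", 100, 100, 100, 100, 100, 100, 100),
    ("20/10/15", 100, 100, 100, 100, 100, 93, 100),
]


def weekly_means(rows, columns):
    """Return {column: mean over the days that reported a value}.

    Assumption: a simple unweighted mean, rounded to one decimal place.
    Days marked n/a (None) are left out rather than counted as zero.
    """
    result = {}
    for index, name in enumerate(columns, start=1):  # index 0 is the date
        values = [row[index] for row in rows if row[index] is not None]
        result[name] = round(mean(values), 1) if values else None
    return result


if __name__ == "__main__":
    for name, value in weekly_means(ROWS, COLUMNS).items():
        print(f"{name}: {value}")
```

Leaving out the n/a days, instead of treating them as zero, keeps a missing HammerCloud result from being read as a failure.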