=RAL Tier1 Operations Report for 9th April 2014=
 
__NOTOC__
 
====== ======
 
  
<!-- ************************************************************* ----->
 
<!-- ***********Start Review of Issues during last week*********** ----->
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
|-
 
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Review of Issues during the week 2nd to 9th April 2014.
 
|}
 
* During the afternoon of Thursday 3rd April all three WMS systems reported problems. The problems cleared without intervention on our part and are believed to have been caused by something in the jobs being submitted.
 
* Maintenance on the primary OPN link overnight Saturday to Sunday (5/6 April) took the link down for a few hours. The failover to the backup link did not work properly; the effect can be seen in the failure of the SUM tests from CERN during this period.
 
* It was reported last week that around 50 files in tape-backed service classes (mainly in GEN) had been found not to have been migrated to tape. This has now been fixed.
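For reference, the kind of check involved can be sketched as below: a minimal Python fragment, assuming that CASTOR's nsls -l marks migrated files with a leading 'm' in the permissions column (the directory path is purely illustrative).

<pre>
# Minimal sketch: flag files in a tape-backed CASTOR directory that have
# not yet been migrated to tape. Assumes CASTOR's "nsls -l" is available
# and that a leading 'm' in the permissions column marks a migrated file;
# the directory path below is purely illustrative.
import subprocess

def unmigrated_files(directory):
    """Return names of files whose nsls listing lacks the 'm' (migrated) flag."""
    out = subprocess.run(["nsls", "-l", directory],
                         capture_output=True, text=True, check=True).stdout
    pending = []
    for line in out.splitlines():
        fields = line.split()
        if len(fields) >= 9 and not fields[0].startswith("m"):
            pending.append(fields[-1])
    return pending

if __name__ == "__main__":
    # Hypothetical GEN service-class directory.
    for name in unmigrated_files("/castor/ads.rl.ac.uk/prod/gen"):
        print("not yet on tape:", name)
</pre>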
 
<!-- ***********End Review of Issues during last week*********** ----->
 
<!-- *********************************************************** ----->
 
 
====== ======
 
<!-- ******************************************************* ----->
 
<!-- ***********Start Resolved Disk Server Issues*********** ----->
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
|-
 
| style="background-color: #f8d6a9; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Resolved Disk Server Issues
 
|}
 
* Last Wednesday (2nd April) GDSS239 (AtlasHotDisk) crashed. Since AtlasHotDisk was about to be merged into another space token, and files in this space token should have multiple copies spread across its servers, it was decided to withdraw the server from use rather than spend time investigating the failure. In fact 329 files were unique to this server; following discussion with Atlas these were copied back in from other sites rather than investigating the server problems.
 
* In the early hours of Sunday 6th April GDSS600 (AtlasDataDisk - D1T0) failed. Multiple disk failures were being reported by the disk controller. The system was returned to production yesterday evening (8th April) and is being drained. It will be decommissioned after files have been copied off.
 
<!-- ***********End Resolved Disk Server Issues*********** ----->
 
<!-- ***************************************************** ----->
 
 
====== ======
 
<!-- ***************************************************************** ----->
 
<!-- ***********Start Current operational status and issues*********** ----->
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
|-
 
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Current operational status and issues
 
|}
 
* The load-related problems reported for the CMS Castor instance have not been seen this last week. However, work is underway to tackle these problems; in particular, servers with faster network connections will be moved into the disk cache in front of CMS_Tape when they become available.
 
* The Castor Team are now able to reproduce the intermittent failures of Castor access via the SRM that have been reported in recent weeks. Understanding of the problem is significantly advanced and further investigations are ongoing using the Castor Preprod instance. Ideas for a workaround are being developed.
 
* As reported previously, the file deletion rate was somewhat improved by working with Atlas. However, there is still a problem that needs to be understood.
 
* Problems with the infrastructure used to host many of our non-Castor services have largely been worked around, although not yet fixed. Some additional migrations of VMs have been necessary.
 
<!-- ***********End Current operational status and issues*********** ----->
 
<!-- *************************************************************** ----->
 
 
====== ======
 
<!-- *************************************************************** ----->
 
<!-- ***************Start Ongoing Disk Server Issues**************** ----->
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
|-
 
| style="background-color: #f8d6a9; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Ongoing Disk Server Issues
 
|}
 
* None.
 
<!-- ***************End Ongoing Disk Server Issues**************** ----->
 
<!-- ************************************************************* ----->
 
 
====== ======
 
<!-- ******************************************************************** ----->
 
<!-- *************Start Notable Changes made this last week************** ----->
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
|-
 
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Notable Changes made this last week.
 
|}
 
* The rollout of WNs updated to the EMI-3 version has been completed.
 
* The EMI-3 Argus server is now in use everywhere in the batch farm.
 
* Since the meeting last week three new disk servers have been deployed in CMSDisk and eight in AtlasDataDisk. (These are in addition to the nine servers added to LHCbDst as reported last week.)
 
* Batch farm fairshares have been adjusted for the 2014 pledges.
 
* Atlas have resumed using the RAL FTS3 server for many file transfers.
 
* A required change to ACLs in a router enabled our two new perfSONAR nodes to become active.
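As a rough way to confirm that an ACL change of this kind has taken effect, the following minimal sketch tests TCP reachability of the perfSONAR control ports. The hostnames are hypothetical; 861 (owamp) and 4823 (bwctl) are the usual control ports.

<pre>
# Minimal sketch: confirm the router ACL change by testing TCP reachability
# of the perfSONAR measurement-control ports. Hostnames are hypothetical.
import socket

NODES = ["perfsonar01.example.rl.ac.uk", "perfsonar02.example.rl.ac.uk"]
PORTS = [861, 4823]  # typical owamp / bwctl control ports

for host in NODES:
    for port in PORTS:
        try:
            with socket.create_connection((host, port), timeout=5):
                print(f"{host}:{port} reachable")
        except OSError as err:
            print(f"{host}:{port} blocked ({err})")
</pre>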
 
<!-- *************End Notable Changes made this last week************** ----->
 
<!-- ****************************************************************** ----->
 
 
====== ======
 
<!-- ******************************************************************** ----->
 
<!-- **********************Start GOC DB Entries************************** ----->
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
|-
 
| style="background-color: #d8e8ff; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Declared in the GOC DB
 
|}
 
<!-- ******* Declared in the GOC DB ******* ----->
 
{| border=1 align=center
 
|- bgcolor="#7c8aaf"
 
! Service
 
! Scheduled?
 
! Outage/At Risk
 
! Start
 
! End
 
! Duration
 
! Reason
 
|-
 
| Whole Site
 
| SCHEDULED
 
| OUTAGE
 
| 29/04/2014 07:00
 
| 29/04/2014 17:00
 
| 10 hours
 
| Site outage during Network Upgrade.
 
|-
 
| lcgrbp01.gridpp.rl.ac.uk
 
| SCHEDULED
 
| OUTAGE
 
| 02/04/2014 12:00
 
| 01/05/2014 12:00
 
| 29 days
 
| System being decommissioned. (Replaced by myproxy.gridpp.rl.ac.uk.)
 
|}
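Downtimes such as the above can also be retrieved from the GOC DB programmatic interface; the following is a minimal sketch, assuming the public get_downtime method and its XML element names.

<pre>
# Minimal sketch: fetch declared downtimes for this site from the GOC DB
# public programmatic interface. Element names (DOWNTIME, HOSTNAME,
# SEVERITY, DESCRIPTION) are assumed; no certificate is needed for the
# public PI.
import urllib.request
import xml.etree.ElementTree as ET

URL = ("https://goc.egi.eu/gocdbpi/public/"
       "?method=get_downtime&topentity=RAL-LCG2&ongoing_only=yes")

with urllib.request.urlopen(URL) as resp:
    tree = ET.parse(resp)

for dt in tree.getroot().iter("DOWNTIME"):
    print(dt.findtext("HOSTNAME"),
          dt.findtext("SEVERITY"),
          dt.findtext("DESCRIPTION"))
</pre>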
 
<!-- **********************End GOC DB Entries************************** ----->
 
<!-- ****************************************************************** ----->
 
 
====== ======
 
<!-- ******************************************************************************* ----->
 
<!-- ****************Start Advanced warning for other interventions***************** ----->
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
|-
 
| style="background-color: #d8e8ff; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Advanced warning for other interventions
 
|-
 
| style="background-color: #d8e8ff; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;"| The following items are being discussed and are still to be formally scheduled and announced.
 
|}
 
<!-- ******* still to be formally scheduled and/or announced ******* ----->
 
'''Listing by category:'''
 
* Databases:
 
** Switch LFC/FTS/3D to new Database Infrastructure.
 
* Castor:
 
** Castor 2.1.14 testing is largely complete. (A non-Tier1 production Castor instance has been successfully upgraded.) We are starting to look at possible dates for rolling this out (probably around May).
 
* Networking:
 
** Update core Tier1 network and change connection to site and OPN including:
 
*** Install new Routing layer for Tier1 & change the way the Tier1 connects to the RAL network. (Scheduled for 29th April)
 
*** These changes will lead to the removal of the UKLight Router.
 
* Fabric
 
** We are phasing out the use of the software server used by the small VOs.
 
** Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
 
** There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
 
<!-- ***************End Advanced warning for other interventions*************** ----->
 
<!-- ************************************************************************** ----->
 
 
====== ======
 
<!-- ******************************************************************** ----->
 
<!-- **********************Start GOC DB Entries************************** ----->
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
|-
 
| style="background-color: #7c8aaf; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Entries in GOC DB starting between the 2nd and 9th April 2014.
 
|}
 
 
 
{| border=1 align=center
 
|- bgcolor="#7c8aaf"
 
! Service
 
! Scheduled?
 
! Outage/At Risk
 
! Start
 
! End
 
! Duration
 
! Reason
 
|-
 
| lcgwms04, lcgwms05, lcgwms06
 
| UNSCHEDULED
 
| WARNING
 
| 03/04/2014 17:00
 
| 04/04/2014 09:25
 
| 16 hours and 25 minutes
 
| We are investigating problems with these WMS systems.
 
|-
 
| srm-lhcb-tape.gridpp.rl.ac.uk
 
| UNSCHEDULED
 
| WARNING
 
| 03/04/2014 08:00
 
| 03/04/2014 09:30
 
| 1 hour and 30 minutes
 
| Warning during further testing of the new tape interface (ACSLS).
 
|-
 
| lcgrbp01.gridpp.rl.ac.uk
 
| SCHEDULED
 
| OUTAGE
 
| 02/04/2014 12:00
 
| 01/05/2014 12:00
 
| 29 days
 
| System being decommissioned. (Replaced by myproxy.gridpp.rl.ac.uk.)
 
|}
 
<!-- **********************End GOC DB Entries************************** ----->
 
<!-- ****************************************************************** ----->
 
 
====== ======
 
<!-- ****************************************************************** ----->
 
<!-- **********************Start GGUS Tickets************************** ----->
 
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
|-
 
|-
Line 187: Line 7:
 
|+
 
|+
 
|-style="background:#b7f1ce"
 
|-style="background:#b7f1ce"
! GGUS ID !! Level !! Urgency !! State !! Creation !! Last Update !! VO !! Subject
+
! GGUS ID !! Level
 
|-
 
|-
 
| 103197
 
| 103197
 
| Green
 
| Green
| Less Urgent
 
| Waiting Reply
 
| 2014-04-09
 
| 2014-04-09
 
|
 
| RAL myproxy server and GridPP wiki
 
|-
 
| 102611
 
| Yellow
 
| Urgent
 
| In Progress
 
| 2014-03-24
 
| 2014-03-24
 
|
 
| NAGIOS *eu.egi.sec.Argus-EMI-1* failed on argusngi.gridpp.rl.ac.uk@RAL-LCG2
 
|-
 
| 101968
 
| Red
 
| Less Urgent
 
| On Hold
 
| 2014-03-11
 
| 2014-04-01
 
| Atlas
 
| RAL-LCG2_SCRATCHDISK: One dataset to delete is causing 1379 deletion errors
 
|-
 
| 101079
 
| Red
 
| Less Urgent
 
| In Progress
 
| 2014-02-09
 
| 2014-04-01
 
|
 
| ARC CEs have VOViews with a default SE of "0"
 
|-
 
| 98249
 
| Red
 
| Urgent
 
| In Progress
 
| 2013-10-21
 
| 2014-03-13
 
| SNO+
 
| please configure cvmfs stratum-0 for SNO+ at RAL T1
 
|}
 
<!-- **********************End GGUS Tickets************************** ----->
 
<!-- ****************************************************************** ----->
 
 
====== ======
 
<!-- ************************************************************************* ----->
 
<!-- **********************Start Availability Report************************** ----->
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
|-
 
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Availability Report
 
|}
 
 
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
 
 
{|border="1" cellpadding="1",center;
 
|+
 
|-style="background:#b7f1ce"
 
! Day !! OPS !! Alice !! Atlas !! CMS !! LHCb !! Atlas HC !! CMS HC !! Comment
 
|-
 
| 02/04/14 || 100 || 100 || 100 || 100 || 100 || 100 || 98 ||
 
|-
 
| 03/04/14 || 100 || 100 || 100 || 100 || 100 || 100 || 99 ||
 
|-
 
| 04/04/14 || 100 || 100 || 100 || 100 || 100 || 100 || 99 ||
 
|-
 
| 05/04/14 || 100 || 100 || 100 || 100 || 100 || 100 || 100 ||
 
|-
 
| 06/04/14 || 100 || 100 || style="background-color: lightgrey;" | 93.6 || style="background-color: lightgrey;" | 95.5 || style="background-color: lightgrey;" | 93.6 || 100 || 100 || Primary OPN link to CERN down. Failover to backup link didn't work properly.
 
|-
 
| 07/04/14 || 100 || 100 || style="background-color: lightgrey;" | 86.3 || style="background-color: lightgrey;" | 86.2 || style="background-color: lightgrey;" | 81.5 || 100 || 100 || Primary OPN link to CERN down. Failover to backup link didn't work properly.
 
|-
 
| 08/04/14 || 100 || 100 || 100 || 100 || 100 || 100 || 100 ||
 
 
|}
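For context, each daily figure above is essentially the percentage of that day's test intervals that passed; the sketch below shows the arithmetic with made-up outcomes.

<pre>
# Minimal sketch of how a daily availability figure is derived: the
# fraction of test intervals in the day that passed, expressed as a
# percentage. The sample outcomes below are made up for illustration.
def availability(outcomes):
    """outcomes: list of booleans, one per test interval in the day."""
    return 100.0 * sum(outcomes) / len(outcomes)

# E.g. 24 hourly tests with two failures during an OPN outage:
sample = [True] * 22 + [False, False]
print(f"{availability(sample):.1f}%")  # -> 91.7%
</pre>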
<!-- **********************End Availability Report************************** ----->
 
<!-- *********************************************************************** ----->
 
