Difference between revisions of "GarethSmithTestPage"

From GridPP Wiki
 
* Two of the CV2013 disk servers (120TB each) have been added to LHCbDst. A further 9 are being added today. Three further servers are in CMS non-prod and will be moved into production imminently.
 
 
<!-- *************End Notable Changes made this last week************** ----->
 
<!-- ****************************************************************** ----->
 
 
 
 
====== ======
 
<!-- ******************************************************************************* ----->
 
<!-- ****************Start Advanced warning for other interventions***************** ----->
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
|-
 
| style="background-color: #d8e8ff; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Advanced warning for other interventions
 
|-
 
| style="background-color: #d8e8ff; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;"| The following items are being discussed and are still to be formally scheduled and announced.
 
|}
 
<!-- ******* still to be formally scheduled and/or announced ******* ----->
 
* The installation of the new Tier1 Routing layer and the change in the way the Tier1 connects to the RAL network are expected to take place in one of the two weeks following Easter.
 
'''Listing by category:'''
 
* Databases:
 
** Switch LFC/FTS/3D to new Database Infrastructure.
 
* Castor:
 
** Castor 2.1.14 testing is largely complete. (A non-Tier1 production Castor instance was successfully upgraded yesterday, 1st April.) We are starting to look at possible dates for rolling this out (probably around May).
 
* Networking:
 
** Update core Tier1 network and change connection to site and OPN including:
 
*** Install new Routing layer for Tier1 & change the way the Tier1 connects to the RAL network.
 
*** These changes will lead to the removal of the UKLight Router.
 
* Fabric:
 
** We are phasing out the use of the software server used by the small VOs.
 
** Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
 
** There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
 
<!-- ***************End Advanced warning for other interventions*************** ----->
 
<!-- ************************************************************************** ----->
 
 
====== ======
 
<!-- ******************************************************************** ----->
 
<!-- **********************Start GOC DB Entries************************** ----->
 
{| width="100%" cellspacing="0" cellpadding="0" style="background-color: #ffffff; border: 1px solid silver; border-collapse: collapse; width: 100%; margin: 0 0 1em 0;"
 
|-
 
| style="background-color: #7c8aaf; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Entries in GOC DB starting between the 19th March and 2nd April 2014.
 
|}
 
 
{| border=1 align=center
 
|- bgcolor="#7c8aaf"
 
! Service
 
! Scheduled?
 
! Outage/At Risk
 
! Start
 
! End
 
! Duration
 
! Reason
 
|-
 
| lcgrbp01.gridpp.rl.ac.uk
 
| SCHEDULED
 
| OUTAGE
 
| 02/04/2014 12:00
 
| 01/05/2014 12:00
 
| 29 days
 
| System being decommissioned. (Replaced by myproxy.gridpp.rl.ac.uk).
 
|-
 
| All Castor endpoints (All SRMs)
 
| SCHEDULED
 
| WARNING
 
| 01/04/2014 09:00
 
| 01/04/2014 11:00
 
| 2 hours
 
| Testing of new interface to the tape library. During this time Castor disk services will remain up but there will be no tape access. Tape recalls will stall. Writes to tape-backed service classes will carry on, with files flushed from the disk caches to tape once the testing is completed.
 
|-
 
| srm-lhcb-tape.gridpp.rl.ac.uk
 
| SCHEDULED
 
| OUTAGE
 
| 01/04/2014 09:00
 
| 01/04/2014 11:00
 
| 2 hours
 
| Testing of new interface to the tape library. During this time Castor disk services will remain up but there will be no tape access. Tape recalls will stall. Writes to tape-backed service classes will carry on, with files flushed from the disk caches to tape once the testing is completed.
 
|}
 
<!-- **********************End GOC DB Entries************************** ----->
 
 
<!-- ****************************************************************** ----->
 
<!-- ****************************************************************** ----->
  

Revision as of 09:29, 16 September 2014

RAL Tier1 Operations Report for 2nd April 2014

Review of Issues during the fortnight 19th March to 2nd April 2014.
  • There was a short (around 5 minute) break in external connectivity to the Tier1 during the morning of Thursday 20th March, and a similar event the following morning.
  • There was a failover of an Atlas Castor Database early evening on Tuesday 25th March. The failover triggered a call-out and the database was put back onto its allocated node. The cause is a bug that has been reported to Oracle.
  • On Friday, 28th March, we were not running some of the CE SUM tests in a timely manner. It was found that owing to a separate change in the Condor configuration we were no longer prioritising the test jobs. This was fixed.


Ongoing Disk Server Issues
  • GDSS239 (Atlas HotDisk) crashed this morning. This is being investigated.
Notable Changes made this last fortnight.
  • The rollout of WNs updated to the EMI-3 version of WN continues and is expected to be completed this week.
  • The EMI3 Argus server is being rolled out for use across all CEs and WNs.
  • The old MyProxy server (lcgrbp01.gridpp.rl.ac.uk) has just been turned off today. Its replacement (myproxy.gridpp.rl.ac.uk) is in production.
  • The 2013 purchases of worker nodes are being added to the farm this week.
  • Two of the CV2013 disk servers (120TB each) have been added to LHCbDst. A further 9 are being added today. Three further servers are in CMS non-prod and will be moved into production imminently.
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
102902 Green Urgent In Progress 2014-04-01 2014-04-02 MICE & NA62 Stale .cvmfswhitelist file MICE VO
102611 Green Urgent In Progress 2014-03-24 2014-03-24 NAGIOS *eu.egi.sec.Argus-EMI-1* failed on argusngi.gridpp.rl.ac.uk@RAL-LCG2
101968 Yellow Less Urgent On Hold 2014-03-11 2014-0-01 Atlas RAL-LCG2_SCRATCHDISK: One dataset to delete is causing 1379 deletion errors
101079 Red Less Urgent In Progress 2014-02-09 2014-04-01 ARC CEs have VOViews with a default SE of "0"
99556 Red Very Urgent On Hold 2013-12-06 2014-03-21 NGI Argus requests for NGI_UK
98249 Red Urgent In Progress 2013-10-21 2014-03-13 SNO+ please configure cvmfs stratum-0 for SNO+ at RAL T1
Availability Report

Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud

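{| border=1 align=center
|- bgcolor="#7c8aaf"
! Day !! OPS !! Alice !! Atlas !! CMS !! LHCb !! Atlas HC !! CMS HC !! Comment
|-
| 19/03/14 || 100 || 100 || 100 || 88.6 || 100 || 99 || 73 || Multiple SRM test failures (load problems).
|-
| 20/03/14 || 100 || 100 || 99.7 || 99.6 || 100 || 100 || n/a || Atlas: One SRM Test failure; CMS - CE Test failures on all 3 Arc-ce’s (no compatible resources).
|-
| 21/03/14 || 100 || 100 || 100 || 100 || 100 || 100 || n/a ||
|-
| 22/03/14 || 100 || 100 || 100 || 100 || 100 || 100 || n/a ||
|-
| 23/03/14 || 100 || 100 || 100 || 100 || 100 || 100 || n/a ||
|-
| 24/03/14 || 100 || 100 || 100 || 100 || 100 || 100 || n/a ||
|-
| 25/03/14 || 100 || 100 || 99.0 || 89.8 || 100 || 98 || 99 || Atlas: Castor database problem (Atlas_srm DB moved to another RAC node following a DB crash); CMS SRM SUM test failures separated through day.
|-
| 26/03/14 || 100 || 100 || 100 || 87.1 || 100 || 100 || 99 || Four separate SRM test failures.
|-
| 27/03/14 || 100 || 100 || 100 || 96.5 || 100 || 97 || 100 || Two test failures of SRM Put test.
|-
| 28/03/14 || 100 || 100 || 100 || 100 || 100 || 100 || 100 ||
|-
| 29/03/14 || 100 || 100 || 100 || 100 || 100 || 99 || 100 ||
|-
| 30/03/14 || 100 || 100 || 100 || 100 || 100 || 100 || 99 ||
|-
| 31/03/14 || 100 || 100 || 100 || 100 || 100 || 100 || 99 ||
|-
| 01/04/14 || 100 || 100 || 100 || 100 || 100 || 100 || 99 ||
|}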