Latest revision as of 10:34, 16 April 2014

RAL Tier1 Operations Report for 16th April 2014

Review of Issues during the week 9th to 16th April 2014.

We note that Atlas has managed to fill the OPN link inbound to RAL for around 24 hours on 9th April.
Over the weekend (12/13 April) our Atlas Frontier service went down. This was part of extended Frontier problems that affected other Tier1 sites too.
Maintenance on the primary OPN link early evening on Monday 14th April took the link down for a few hours. The failover to the backup link again did not work properly. The effect of this can be seen in the failure of the SUM tests from CERN during this time.

Resolved Disk Server Issues

None.

Current operational status and issues

The load-related problems reported for the CMS Castor instance have not been seen this last fortnight. However, work is underway to tackle these problems; in particular, servers with faster network connections will be moved into the disk cache in front of CMS_Tape when they become available.
The Castor Team are now able to reproduce the intermittent failures of Castor access via the SRM that have been reported in recent weeks. Understanding of the problem is significantly advanced and further investigations are ongoing using the Castor Preprod instance. Ideas for a workaround are being developed.
As reported before, work with Atlas has somewhat improved the file deletion rate. However, there is still a problem that needs to be understood.
Problems with the infrastructure used to host many of our non-Castor services have largely been worked around, although not yet fixed. Some additional migrations of VMs have been necessary.
We have now had repeated instances where the OPN link has not cleanly failed over to the backup link during problems with the primary.
One of the network uplinks (for the 2012 disk servers) has been running at full capacity. We have a plan to move the switch into the new Tier1 mesh network to alleviate this.

Ongoing Disk Server Issues

GDSS403 (AtlasTape - D0T1) failed on Friday 11th April. There is one file on it waiting to go to tape. A drive was replaced and the server is currently rebuilding its RAID array before going back into service.

Notable Changes made this last week.

Appropriate systems patched for the "heartbleed" vulnerability.
Environment modification made to enable T2K jobs to run on the ARC CEs.
Disk usage limits (of 150 GB) added to CEs.
Alias "cernvmfs.gridpp.rl.ac.uk" updated to point to cvmfs-wlcg.gridpp.rl.ac.uk - the new CVMFS stratum-1 v2.1 on Tuesday (15th April).

Declared in the GOC DB

Service	Scheduled?	Outage/At Risk	Start	End	Duration	Reason
Whole site	SCHEDULED	WARNING	30/04/2014 10:00	30/04/2014 12:00	2 hours	RAL Tier1 site in warning state due to UPS/generator test.
Whole site	SCHEDULED	OUTAGE	29/04/2014 07:00	29/04/2014 17:00	10 hours	Site outage during Network Upgrade.
lcgrbp01.gridpp.rl.ac.uk	SCHEDULED	OUTAGE	02/04/2014 12:00	01/05/2014 12:00	29 days	System being decommissioned. (Replaced by myproxy.gridpp.rl.ac.uk).

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
Castor:
- Castor 2.1.14 testing is largely complete. (A non-Tier1 production Castor instance has been successfully upgraded.) We are starting to look at possible dates for rolling this out (probably around May).
Networking:
- Move switches connecting recent disk server batches ('11, '12) onto the Tier1 mesh network.
- Update core Tier1 network and change connection to site and OPN including:
  - Install new Routing layer for Tier1 & change the way the Tier1 connects to the RAL network. (Scheduled for 29th April)
  - These changes will lead to the removal of the UKLight Router.
Fabric:
- We are phasing out the use of the software server used by the small VOs.
- Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
- There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.

Entries in GOC DB starting between the 9th and 16th April 2014.

Service	Scheduled?	Outage/At Risk	Start	End	Duration	Reason
lcgrbp01.gridpp.rl.ac.uk	SCHEDULED	OUTAGE	02/04/2014 12:00	01/05/2014 12:00	29 days	System being decommissioned. (Replaced by myproxy.gridpp.rl.ac.uk).

Open GGUS Tickets (Snapshot during morning of meeting)


GGUS ID	Level	Urgency	State	Creation	Last Update	VO	Subject
103197	Green	Less Urgent	Waiting Reply	2014-04-09	2014-04-09		RAL myproxy server and GridPP wiki
102611	Red	Urgent	In Progress	2014-03-24	2014-04-09		NAGIOS eu.egi.sec.Argus-EMI-1 failed on argusngi.gridpp.rl.ac.uk@RAL-LCG2
101968	Red	Less Urgent	On Hold	2014-03-11	2014-04-01	Atlas	RAL-LCG2_SCRATCHDISK: One dataset to delete is causing 1379 deletion errors
98249	Red	Urgent	In Progress	2013-10-21	2014-03-13	SNO+	please configure cvmfs stratum-0 for SNO+ at RAL T1

Availability Report

Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud


Day	OPS	Alice	Atlas	CMS	LHCb	Atlas HC	CMS HC	Comment
09/04/14	100	100	100	100	100	100	99
10/04/14	100	100	100	100	100	100	100
11/04/14	100	100	100	100	100	100	100
12/04/14	100	100	100	100	100	100	100
13/04/14	100	100	100	100	100	28	100	Problem with Atlas Frontier across multiple sites.
14/04/14	100	100	94.2	91.8	87.1	67	100	Maintenance on primary CERN link - but didn't failover cleanly.
15/04/14	100	100	100	100	100	100	99
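
The figures in the availability table above can be summarised with a few lines of Python. This is only an illustrative sketch: the values are copied from the table, while the names `COLUMNS`, `ROWS`, `weekly_mean` and `degraded_days` are hypothetical and not part of any Tier1 tooling.

```python
# Availability percentages copied from the table above.
# Column order: OPS, Alice, Atlas, CMS, LHCb, Atlas HC, CMS HC.
COLUMNS = ["OPS", "Alice", "Atlas", "CMS", "LHCb", "Atlas HC", "CMS HC"]
ROWS = {
    "09/04/14": [100, 100, 100, 100, 100, 100, 99],
    "10/04/14": [100, 100, 100, 100, 100, 100, 100],
    "11/04/14": [100, 100, 100, 100, 100, 100, 100],
    "12/04/14": [100, 100, 100, 100, 100, 100, 100],
    "13/04/14": [100, 100, 100, 100, 100, 28, 100],
    "14/04/14": [100, 100, 94.2, 91.8, 87.1, 67, 100],
    "15/04/14": [100, 100, 100, 100, 100, 100, 99],
}


def weekly_mean(rows):
    """Return the mean availability per column over all listed days."""
    n = len(rows)
    return {
        col: sum(values[i] for values in rows.values()) / n
        for i, col in enumerate(COLUMNS)
    }


def degraded_days(rows, threshold=100.0):
    """Map each day to the columns that fell below the threshold."""
    result = {}
    for day, values in rows.items():
        low = [col for col, v in zip(COLUMNS, values) if v < threshold]
        if low:
            result[day] = low
    return result


if __name__ == "__main__":
    for col, mean in weekly_mean(ROWS).items():
        print(f"{col}: {mean:.1f}")
    for day, cols in degraded_days(ROWS).items():
        print(day, ", ".join(cols))
```

With these inputs, the days flagged by `degraded_days` line up with the two commented rows (13th and 14th April) plus the two days where CMS HC dipped to 99.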

@@ Line 9: / Line 9: @@
 | style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Review of Issues during the week 9th to 16th April 2014.
 |}
-* During the afternoon of Thursday 3rd April all three WMS systems reported problems. These problems went away without our intervention and are believed to be caused by something in jobs being submitted.
+* We note that Atlas has managed to fill the OPN link inbound to RAL for around 24 hours on 9th April.
-* Maintenence on Primary OPN link overnight Saturday - Sunday 5/6 April took the link down for a few hours. The failover to the backup link did not work properly. The effect of this can be seen in the failure of the SUM tests from CERN during this time.
+* Over the weekend (12/13 April) our Atlas Frontier service went down. This was part of extended Frontier problems that affected other Tier1 sites too.
-* It was reported last week that around 50 files in tape backed service classes (mainly in GEN) had been found not to have been migrated to tape. This is now fixed.
+* Maintenance on the primary OPN link early evening on Monday 14th April took the link down for a few hours. The failover to the backup link again did not work properly. The effect of this can be seen in the failure of the SUM tests from CERN during this time.
 <!-- ***********End Review of Issues during last week*********** ----->
 <!-- *********************************************************** ----->
@@ Line 22: / Line 22: @@
 | style="background-color: #f8d6a9; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Resolved Disk Server Issues
 |}
-* Last Wednesday (2nd April) GDSS239 (Atlas HotDisk) crashed. As AtlasHotDisk was about to be merged into another SpaceToken and there should be multiple copies of files on each server spread across the servers in AtlasHotDisk it was decided to withdraw the server from use rather than spend time investigating the failure. In fact there were 329 unique files on this server. Following discussion with Atlas these were copied back in from other sites rather than investigating the server problems.
+* None.
-* In the early hours of Sunday 6th April GDSS600 (AtlasDataDisk - D1T0) failed. Multiple disk failures were being reported by the disk controller. The system was returned to production yesterday evening (8th April) and is being drained. It will be decommissioned after files have been copied off.
 <!-- ***********End Resolved Disk Server Issues*********** ----->
 <!-- ***************************************************** ----->
@@ Line 34: / Line 33: @@
 | style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Current operational status and issues
 |}
-* The load related problems reported for the CMS Castor instance havenot been seen this last week. However, work is underway to tackle these problems, in particular servers with faster network connections will be moved into the disk cache in front of CMS_Tape when they become available.
+* The load related problems reported for the CMS Castor instance have not been seen this last fortnight. However, work is underway to tackle these problems, in particular servers with faster network connections will be moved into the disk cache in front of CMS_Tape when they become available.
 * The Castor Team are now able to reproduce the intermittent failures of Castor access via the SRM that have been reported in recent weeks. Understanding of the problem is significantly advanced and further investigations are ongoing using the Castor Preprod instance. Ideas for a workaround are being developed.
 * As reported before, working with Atlas the file deletion rate was somewhat improved. However, there is still a problem that needs to be understood.
-* Problems with the infrastructure used to host many of our non-Catsor services have largely been worked around, although not yet fixed. Some additional migrations of VMs has been necessary.
+* Problems with the infrastructure used to host many of our non-Castor services have largely been worked around, although not yet fixed. Some additional migrations of VMs have been necessary.
+* We have now had repeated instances where the OPN link has not cleanly failed over to the backup link during problems with the primary.
+* One of the network uplinks (for the 2012 disk servers) has been running at full capacity. We have a plan to move the switch into the new Tier1 mesh network to alleviate this.
 <!-- ***********End Current operational status and issues*********** ----->
 <!-- *************************************************************** ----->
@@ Line 48: / Line 49: @@
 | style="background-color: #f8d6a9; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Ongoing Disk Server Issues
 |}
-* None.
+* GDSS403 (AtlasTape - D0T1) failed on Friday 11th April. There is one file on it waiting to go to tape. A drive was replaced and the server is currently rebuilding its RAID array before going back into service.
 <!-- ***************End Ongoing Disk Server Issues**************** ----->
 <!-- ************************************************************* ----->
@@ Line 59: / Line 60: @@
 | style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Notable Changes made this last week.
 |}
-* The rollout of of WNs updated to the EMI-3 version has been completed.
+* Appropriate systems patched for the "heartbleed" vulnerability.
-* The EMI3 Argus server is now in use everywehere in the batch farm.
+* Environment modification made to enable T2K jobs to run on the ARC CEs.
-* Since the meeting last week three new disk servers have been deployed in CMSDisk and eight to AtlasDataDisk. (These are in addition to the nine servers added to LHCbDst as reported last week).
+* Disk usage limits (of 150 GB) added to CEs.
-* Batch farm fairshares have been adjusted for the 2014 pledges.
+* Alias "cernvmfs.gridpp.rl.ac.uk" updated to point to cvmfs-wlcg.gridpp.rl.ac.uk - the new CVMFS stratum-1 v2.1 on Tuesday (15th April).
-* Atlas have resumed using the RAL FTS3 server for many file transfers.
-* A required change to ACLs in a router enabled our two new Perfsonar nodes to become active.
 <!-- *************End Notable Changes made this last week************** ----->
 <!-- ****************************************************************** ----->
@@ Line 76: / Line 75: @@
 |}
 <!-- ******* Declared in the GOC DB ******* ----->
 {| border=1 align=center
 |- bgcolor="#7c8aaf"
@@ Line 86: / Line 86: @@
 ! Reason
 |-
-| Whole Site
+| Whole site
+| SCHEDULED
+| WARNING
+| 30/04/2014 10:00
+| 30/04/2014 12:00
+| 2 hours
+| RAL Tier1 site in warning state due to UPS/generator test.
+|-
+| Whole site
 | SCHEDULED
 | OUTAGE
@@ Line 100: / Line 108: @@
 | 01/05/2014 12:00
 | 29 days,
-| System be decommissioned. (Replaced my myproxy.gridpp.rl.ac.uk).
+| System being decommissioned. (Replaced by myproxy.gridpp.rl.ac.uk).
 |}
 <!-- **********************End GOC DB Entries************************** ----->
 <!-- ****************************************************************** ----->
@@ Line 121: / Line 131: @@
 ** Castor 2.1.14 testing is largely complete. (A non-Tier1 production Castor instance has been successfully upgraded.) We are starting to look at possible dates for rolling this out (probably around May).
 * Networking:
+** Move switches connecting recent disk server batches ('11, '12) onto the Tier1 mesh network.
 ** Update core Tier1 network and change connection to site and OPN including:
 *** Install new Routing layer for Tier1 & change the way the Tier1 connects to the RAL network. (Scheduled for 29th April)
@@ Line 149: / Line 160: @@
 ! Duration
 ! Reason
-|-
-| lcgwms04, lcgwms05, lcgwms06
-| UNSCHEDULED
-| WARNING
-| 03/04/2014 17:00
-| 04/04/2014 09:25
-| 16 hours and 25 minutes
-| We are investigating problems with these WMS systems
-|-
-| srm-lhcb-tape.gridpp.rl.ac.uk
-| UNSCHEDULED
-| WARNING
-| 03/04/2014 08:00
-| 03/04/2014 09:30
-| 1 hour and 30 minutes
-| Warning during further testing of new tape interface (ACSLS),
 |-
 | lcgrbp01.gridpp.rl.ac.uk
@@ Line 199: / Line 194: @@
 |-
 | 102611
-| Yellow
+| Red
 | Urgent
 | In Progress
 | 2014-03-24
-| 2014-03-24
+| 2014-04-09
 |
 | NAGIOS *eu.egi.sec.Argus-EMI-1* failed on argusngi.gridpp.rl.ac.uk@RAL-LCG2
@@ Line 215: / Line 210: @@
 | Atlas
 | RAL-LCG2_SCRATCHDISK: One dataset to delete is causing 1379 deletion errors
-|-
-| 101079
-| Red
-| Less Urgent
-| In Progress
-| 2014-02-09
-| 2014-04-01
-|
-| ARC CEs have VOViews with a default SE of "0"
 |-
 | 98249
@@ Line 252: / Line 238: @@
 ! Day !! OPS !! Alice !! Atlas !! CMS !! LHCb !! Atlas HC !! CMS HC !! Comment
 |-
-| 09/04/14 || 100 || 100 || 100 || 100 || 100 || 100 || 100 ||
+| 09/04/14 || 100 || 100 || 100 || 100 || 100 || 100 || 99 ||
 |-
 | 10/04/14 || 100 || 100 || 100 || 100 || 100 || 100 || 100 ||
@@ Line 260: / Line 246: @@
 | 12/04/14 || 100 || 100 || 100 || 100 || 100 || 100 || 100 ||
 |-
-| 13/04/14 || 100 || 100 || 100 || 100 || 100 || 100 || 100 ||
+| 13/04/14 || 100 || 100 || 100 || 100 || 100 || 28 || 100 || Problem with Atlas Frontier across multiple sites.
 |-
-| 14/04/14 || 100 || 100 || style="background-color: lightgrey;" | 94.2 || style="background-color: lightgrey;" | 91.8 || style="background-color: lightgrey;" | 87.1 || 100 || 100 || Maintenance on primary CERN link - but didn't failover cleanly.
+| 14/04/14 || 100 || 100 || style="background-color: lightgrey;" | 94.2 || style="background-color: lightgrey;" | 91.8 || style="background-color: lightgrey;" | 87.1 || 67 || 100 || Maintenance on primary CERN link - but didn't failover cleanly.
 |-
-| 15/04/14 || 100 || 100 || 100 || 100 || 100 || 100 || 100 ||
+| 15/04/14 || 100 || 100 || 100 || 100 || 100 || 100 || 99 ||
 |}
 <!-- **********************End Availability Report************************** ----->
 <!-- *********************************************************************** ----->

Difference between revisions of "Tier1 Operations Report 2014-04-16"
