RAL Tier1 Operations Report for 9th April 2014

Review of Issues during the week 2nd to 9th April 2014.
  • There was a short (around 5 minute) break in external connectivity to the Tier1 during the morning of Thursday 20th March, and a similar event occurred the following morning.
  • There was a failover of an Atlas Castor Database early evening on Tuesday 25th March. The failover triggered a call-out and the database was put back onto its allocated node. The cause is a bug that has been reported to Oracle.
  • On Friday 28th March we were not running some of the CE SUM tests in a timely manner. It was found that, owing to a separate change in the Condor configuration, we were no longer prioritising the test jobs. This was fixed; a toy sketch of the fair-share mechanism involved is given below this list.
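The prioritisation mentioned in the last item above relies on HTCondor's fair-share negotiation: each submitter is served in order of its effective user priority (roughly, accumulated usage scaled by a configurable per-user priority factor, with the lowest value served first). The Python sketch below is a toy illustration of that mechanism only; the submitter names, usage figures and factors are hypothetical and do not describe the site's actual Condor configuration.

  # Toy model of HTCondor fair-share ordering (illustrative only).
  # Effective user priority ~ recent usage x per-user priority factor;
  # the negotiator serves the submitter with the LOWEST value first.
  def effective_priority(recent_usage_cores, priority_factor):
      return recent_usage_cores * priority_factor

  # Hypothetical submitters: the ops/SUM test account keeps a very low
  # priority factor so its small test jobs always start promptly.
  submitters = {
      "ops-sam-tests": {"usage": 2.0,    "factor": 1.0},
      "atlas-pilots":  {"usage": 4000.0, "factor": 100.0},
      "cms-pilots":    {"usage": 3500.0, "factor": 100.0},
  }

  order = sorted(submitters,
                 key=lambda u: effective_priority(submitters[u]["usage"],
                                                  submitters[u]["factor"]))
  print(order)   # ['ops-sam-tests', 'cms-pilots', 'atlas-pilots']

If the low factor for the test account is lost, for example through an unrelated configuration change, its jobs are ordered purely by usage alongside the experiment queues and the SUM tests can sit idle, which is broadly consistent with the behaviour described above.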
Resolved Disk Server Issues
  • Last Wednesday (2nd April) GDSS239 (AtlasHotDisk) crashed. As AtlasHotDisk is being merged into another space token, and there should be multiple copies of each file spread across the servers in AtlasHotDisk, it was decided to withdraw the server from use rather than invest time investigating. (No of unique files found?)
  • In the early hours of Sunday 6th April GDSS600 (AtlasDataDisk - D1T0) failed. Multiple disk failures were being reported by the disk controller. The system was returned to production yesterday evening (8th April) and is being drained. It will be decommissioned after the files have been copied off.
Current operational status and issues
  • There have been problems with the CMS Castor instance in recent weeks. These are triggered by high load. Work is underway to alleviate these problems, in particular servers with faster network connections will be moved into the disk cache in front of CMS_Tape when they become available.
  • The Castor team are now able to reproduce the intermittent failures of Castor access via the SRM that have been reported in recent weeks. Understanding of the problem is significantly advanced and further investigations are ongoing using the Castor Preprod instance. Ideas for a workaround are being developed.
  • As reported before, working with Atlas, the file deletion rate was somewhat improved. However, there is still a problem that needs to be understood.
  • Around 50 files in tape-backed service classes (mainly in GEN) have been found not to have migrated to tape. This is under investigation. The cause for some of these is understood (a bad tape at the time of migration).
  • Problems with the infrastructure used to host many of our non-Castor services have largely been worked around, although not yet fixed. Some additional migrations of VMs have been necessary.
Ongoing Disk Server Issues
  • None.
Notable Changes made this last fortnight.
  • The rollout of WNs updated to the EMI-3 version of the worker node software has been completed.
  • The EMI-3 Argus server is now in use everywhere in the batch farm.
  • Two of the CV2013 disk servers (120TB each) have been added to LHCbDst. A further nine are being added today. Three further servers are in CMS non-prod, awaiting being moved into production imminently.
Declared in the GOC DB
Service Scheduled? Outage/At Risk Start End Duration Reason
Whole Site SCHEDULED OUTAGE 29/04/2014 07:00 29/04/2014 17:00 10 hours Site outage during Network Upgrade.
lcgrbp01.gridpp.rl.ac.uk SCHEDULED OUTAGE 02/04/2014 12:00 01/05/2014 12:00 29 days System being decommissioned. (Replaced by myproxy.gridpp.rl.ac.uk.)
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Castor 2.1.14 testing is largely complete. (A non-Tier1 production Castor instance was successfully upgraded yesterday, 1st April.) We are starting to look at possible dates for rolling this out (probably around May).
  • Networking:
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1 & change the way the Tier1 connects to the RAL network. (Scheduled for 29th April)
      • These changes will lead to the removal of the UKLight Router.
  • Fabric
    • We are phasing out the use of the software server used by the small VOs.
    • Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC)
    • There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 2nd and 9th April 2014.


Service Scheduled? Outage/At Risk Start End Duration Reason
lcgwms04, lcgwms05, lcgwms06 UNSCHEDULED WARNING 03/04/2014 17:00 04/04/2014 09:25 16 hours and 25 minutes We are investigating problems with these WMS systems
srm-lhcb-tape.gridpp.rl.ac.uk UNSCHEDULED WARNING 03/04/2014 08:00 03/04/2014 09:30 1 hour and 30 minutes Warning during further testing of new tape interface (ACSLS).
lcgrbp01.gridpp.rl.ac.uk SCHEDULED OUTAGE 02/04/2014 12:00 01/05/2014 12:00 29 days System being decommissioned. (Replaced by myproxy.gridpp.rl.ac.uk.)
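The Duration column in the GOC DB tables is just the difference between the Start and End timestamps. A minimal Python check of the entries above (illustrative only):

  from datetime import datetime

  FMT = "%d/%m/%Y %H:%M"   # date format used in the GOC DB tables above

  entries = [
      ("lcgwms04-06 warning",      "03/04/2014 17:00", "04/04/2014 09:25"),
      ("srm-lhcb-tape warning",    "03/04/2014 08:00", "03/04/2014 09:30"),
      ("lcgrbp01 decommissioning", "02/04/2014 12:00", "01/05/2014 12:00"),
  ]

  for name, start, end in entries:
      delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
      print(name, delta)   # 16:25:00, 1:30:00 and 29 days respectively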
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
103197 Green Less Urgent Waiting Reply 2014-04-09 2014-04-09 RAL myproxy server and GridPP wiki
102611 Yellow Urgent In Progress 2014-03-24 2014-03-24 NAGIOS *eu.egi.sec.Argus-EMI-1* failed on argusngi.gridpp.rl.ac.uk@RAL-LCG2
101968 Red Less Urgent On Hold 2014-03-11 2014-04-01 Atlas RAL-LCG2_SCRATCHDISK: One dataset to delete is causing 1379 deletion errors
101079 Red Less Urgent In Progress 2014-02-09 2014-04-01 ARC CEs have VOViews with a default SE of "0"
98249 Red Urgent In Progress 2013-10-21 2014-03-13 SNO+ please configure cvmfs stratum-0 for SNO+ at RAL T1
Availability Report

Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud

Day OPS Alice Atlas CMS LHCb Atlas HC CMS HC Comment
02/04/14 100 100 100 100 100 100 98
03/04/14 100 100 100 100 100 100 99
04/04/14 100 100 100 100 100 100 99
05/04/14 100 100 100 100 100 100 100
06/04/14 100 100 93.6 95.5 93.6 100 100 Primary OPN link to CERN down. Failover to backup link didn't work properly.
07/04/14 100 100 86.3 86.2 81.5 100 100 Primary OPN link to CERN down. Failover to backup link didn't work properly.
08/04/14 100 100 100 100 100 100 100
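As a rough guide only, a daily availability of A per cent corresponds to about (100 - A)% of 1440 minutes of failed tests; since availability is sampled by periodic SAM tests rather than measured continuously, the figures below are approximate. A short Python sketch using values from the table above:

  # Approximate downtime implied by the availability figures above.
  for label, availability in [("Atlas 06/04", 93.6),
                              ("CMS 07/04", 86.2),
                              ("LHCb 07/04", 81.5)]:
      downtime_min = (100.0 - availability) / 100.0 * 24 * 60
      print(f"{label}: ~{downtime_min:.0f} minutes unavailable")
  # Atlas 06/04: ~92, CMS 07/04: ~199, LHCb 07/04: ~266 minutes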