Difference between revisions of "Tier1 Operations Report 2014-05-21"


Latest revision as of 13:22, 21 May 2014

RAL Tier1 Operations Report for 21st May 2014

Review of Issues during the week 14th to 21st May 2014.
  • The persistent low-level problems with the WMSs reported last week are ongoing. The developers have been contacted.
  • The problem reported last week, in which some half dozen Atlas files were lost during the draining of a disk server at the end of February, is now understood. This was an isolated incident. Draining has been resumed.
  • There was a problem with FTS2 last Friday (16th), which led to a ticket from CMS.
  • The checksum checker found a corrupt LHCb file in Castor which has been declared lost.
Resolved Disk Server Issues
  • None.
Current operational status and issues
  • Nothing to report.
Ongoing Disk Server Issues
  • None.
Notable Changes made this last week.
  • CVMFS Client version 2.1.19 is in the final stages of being rolled out to the whole farm, following successful testing so far.
  • One new disk server has been deployed to CMS disk, replacing a server (GDSS758) that failed a couple of weeks ago.
  • A new tape controller server (ACSLS) was put into operation yesterday morning (Tuesday 20th May).
Declared in the GOC DB
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgui02.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 30/04/2014 14:00 | 29/05/2014 13:00 | 28 days, 23 hours | Service being decommissioned.
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • Provisional dates for the Castor 2.1.14 upgrade: Nameserver: Tuesday 10th June; Stagers: CMS - Tue 17th June; LHCb - Thu 19th June; GEN - Tue 24th June; Atlas - Thu 26th June.
  • We are starting to plan the termination of the FTS2 service now that almost all use is on FTS3.
  • Maintenance will be carried out on the diesel generator tomorrow morning (22nd May) from 09:00 - 11:00. Should we suffer a mains power failure during this time window we will not have generator backup.
  • On Wednesday 28th May we plan to move the network switches for some Castor disk servers to the mesh network to alleviate a bottleneck. This will take place during an "At Risk" period on Castor.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Castor 2.1.14 testing is largely complete, although a new minor version (2.1.14-12) will be released soon.
  • Networking:
    • Move switches connecting recent disk server batches ('11, '12) onto the Tier1 mesh network.
    • Make routing changes to allow the removal of the UKLight Router.
  • Fabric:
    • We are phasing out the use of the software server used by the small VOs.
    • Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
    • There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 14th and 21st May 2014.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-lhcb-tape.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 20/05/2014 08:00 | 20/05/2014 11:00 | 3 hours | Outage of tape system for update of tape library controller. (Postponed from 13th May).
All SRM end points | SCHEDULED | WARNING | 20/05/2014 08:00 | 20/05/2014 11:00 | 3 hours | Outage of tape system for update of tape library controller. (Postponed from 13th May).
lcgvo08.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 15/05/2014 15:00 | 15/05/2014 16:00 | 1 hour | Downtime for system maintenance
lcgvo07.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 15/05/2014 14:00 | 15/05/2014 15:00 | 1 hour | Downtime for system maintenance
lcglb02.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 15/05/2014 13:00 | 15/05/2014 14:00 | 1 hour | Downtime for system maintenance
lcglb01.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 15/05/2014 10:45 | 15/05/2014 11:45 | 1 hour | Downtime for system maintenance
lcglb04.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 15/05/2014 09:30 | 15/05/2014 10:30 | 1 hour | Downtime for system maintenance
lcgwms06.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 14/05/2014 14:00 | 14/05/2014 15:20 | 1 hour and 20 minutes | Downtime for system maintenance
lcgwms05.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 14/05/2014 11:30 | 14/05/2014 13:30 | 2 hours | Downtime for system maintenance
lcgfts3.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 14/05/2014 10:00 | 14/05/2014 12:00 | 2 hours | FTS3 service at RAL unavailable for update to version 3.2.22
lcgwms04.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 14/05/2014 09:30 | 14/05/2014 10:46 | 1 hour and 16 minutes | Downtime for system maintenance
lcgui02.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 30/04/2014 14:00 | 29/05/2014 13:00 | 28 days, 23 hours | Service being decommissioned.
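The Duration column in the table above can be cross-checked from the Start and End timestamps. The following is a minimal sketch, not part of the GOC DB tooling: the function name `downtime_duration` is hypothetical, and it assumes both timestamps use the day/month/year layout shown in the table and share one timezone with no clock change in between.

```python
from datetime import datetime

# Assumed timestamp layout, matching the table above: DD/MM/YYYY HH:MM
FMT = "%d/%m/%Y %H:%M"

def downtime_duration(start: str, end: str) -> tuple:
    """Return (days, hours, minutes) between two timestamps in FMT.

    Hypothetical helper for illustration only; timestamps are treated
    as naive local times in the same timezone.
    """
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    if delta.total_seconds() < 0:
        raise ValueError("end precedes start")
    total_minutes = int(delta.total_seconds() // 60)
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return days, hours, minutes

# Entries taken from the table above
print(downtime_duration("30/04/2014 14:00", "29/05/2014 13:00"))  # lcgui02
print(downtime_duration("14/05/2014 09:30", "14/05/2014 10:46"))  # lcgwms04
```

For the lcgui02 entry this gives 28 days and 23 hours, and for lcgwms04 1 hour and 16 minutes, in agreement with the listed durations.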
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
105405 | Green | Urgent | In Progress | 2014-05-14 | 2014-05-15 | | please check your Vidyo router firewall configuration
105308 | Green | Less Urgent | On Hold | 2014-05-11 | 2014-05-19 | Atlas | Jobs at RAL-LCG2_MCORE are failing with "Failed to open shared memory object: Permission denied"
105161 | Yellow | Less Urgent | In Progress | 2014-05-05 | 2014-05-16 | H1 | hone jobs submitted into CREAM queues through lcgwms05.gridpp.rl.ac.uk & lcgwms06.gridpp.rl.ac.uk WMSs are are Ready status long time (more as 5 hours)
105100 | Green | Urgent | On Hold | 2014-05-02 | 2014-05-12 | CMS | T1_UK_RAL Consistency Check (May14)
98249 | Red | Urgent | In Progress | 2013-10-21 | 2014-05-20 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1
Availability Report

Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud

Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment
14/05/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
15/05/14 | 100 | 100 | 100 | 100 | 100 | 94 | 96 |
16/05/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
17/05/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
18/05/14 | 100 | 100 | 100 | 100 | 100 | 99 | 100 |
19/05/14 | 100 | 100 | 100 | 100 | 100 | 99 | 100 |
20/05/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
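
The report does not quote a weekly figure, but the two HammerCloud columns above can be summarised as a simple seven-day mean. This is a minimal sketch under that assumption; the helper name `weekly_mean` and the choice of an unweighted mean rounded to one decimal place are illustrative, not the method used by the availability tools.

```python
from statistics import mean

# Daily values copied from the availability table above (14/05/14 to 20/05/14)
hammercloud = {
    "Atlas HC": [100, 94, 100, 100, 99, 99, 100],
    "CMS HC": [100, 96, 100, 100, 100, 100, 100],
}

def weekly_mean(values):
    """Unweighted mean of the daily values, rounded to one decimal place."""
    return round(mean(values), 1)

for name, values in hammercloud.items():
    print(name, weekly_mean(values))
```

All other columns are 100 on every day of the week, so their means are trivially 100.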