Difference between revisions of "Tier1 Operations Report 2014-07-09"

From GridPP Wiki
(12 intermediate revisions by one user not shown)
Line 9: Line 9:
 
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Review of Issues during the week 2nd to 9th July 2014.
 
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Review of Issues during the week 2nd to 9th July 2014.
 
|}
 
|}
* There were probelms with the SRM (not Castor) for the GEN instance on Thursday and Friday of last week (3/4 July). details....
+
* There were problems with the SRM (not Castor) for the GEN instance on Thursday and Friday of last week (3/4 July). It was fixed by a database edit.
* Problems with Atlas multicore jobs on Friday 4th July....
+
* Problems with Atlas multicore jobs on Friday 4th July. We believe it is an Atlas issue.
 
<!-- ***********End Review of Issues during last week*********** ----->
 
<!-- ***********End Review of Issues during last week*********** ----->
 
<!-- *********************************************************** ----->
 
<!-- *********************************************************** ----->
Line 32: Line 32:
 
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Current operational status and issues
 
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Current operational status and issues
 
|}
 
|}
* None
+
* We are still investigating xroot access to CMS Castor following the upgrade on the 17th June.
 +
* There is a problem with the dteam SRM regional nagios tests, which may be caused by how dteam is published by the CIP.
 
<!-- ***********End Current operational status and issues*********** ----->
 
<!-- ***********End Current operational status and issues*********** ----->
 
<!-- *************************************************************** ----->
 
<!-- *************************************************************** ----->
Line 54: Line 55:
 
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Notable Changes made this last week.
 
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Notable Changes made this last week.
 
|}
 
|}
* Tuesday (8th July) Atlas Castor instance upgraded to version 2.1.14-13. (to be confirmed....)
+
* Tuesday and Wednesday (8th and 9th July) Atlas Castor instance upgraded to version 2.1.14-13. Castor Atlas was returned to production at 10:40 this morning.
 
<!-- *************End Notable Changes made this last week************** ----->
 
<!-- *************End Notable Changes made this last week************** ----->
 
<!-- ****************************************************************** ----->
 
<!-- ****************************************************************** ----->
Line 116: Line 117:
 
| OUTAGE
 
| OUTAGE
 
| 08/07/2014 06:00
 
| 08/07/2014 06:00
| 09/07/2014 12:00
+
| 09/07/2014 10:40
 
| 1 day, 6 hours
 
| 1 day, 6 hours
 
| Atlas Castor instance down for Castor 2.1.14 Stager Update
 
| Atlas Castor instance down for Castor 2.1.14 Stager Update
Line 150: Line 151:
 
|-style="background:#b7f1ce"
 
|-style="background:#b7f1ce"
 
! GGUS ID !! Level !! Urgency !! State !! Creation !! Last Update !! VO !! Subject
 
! GGUS ID !! Level !! Urgency !! State !! Creation !! Last Update !! VO !! Subject
 +
|-
 +
| 106753
 +
| Green
 +
| Less Urgent
 +
| In Progress
 +
| 2014-07-09
 +
| 2014-07-09
 +
| Atlas
 +
| Errors in transfers to RAL-LCG2
 +
|-
 +
| 106695
 +
| Green
 +
| Less Urgent
 +
| In Progress
 +
| 2014-07-08
 +
| 2014-07-08
 +
| Ops
 +
| [Rod Dashboard] Issues detected at RAL-LCG2
 +
|-
 +
| 106655
 +
| Green
 +
| Less Urgent
 +
| In Progress
 +
| 2014-07-04
 +
| 2014-07-04
 +
| Ops
 +
| [Rod Dashboard] Issues detected at RAL-LCG2 (srm-dteam)
 
|-
 
|-
 
| 106640
 
| 106640
Line 168: Line 196:
 
| HyperK
 
| HyperK
 
| HyperK support
 
| HyperK support
|-
 
| 106480
 
| Green
 
| Less Urgent
 
| Waiting Reply
 
| 2014-06-25
 
| 2014-06-30
 
| dteam
 
| Publishing meaningful Castor version
 
 
|-
 
|-
 
| 106324
 
| 106324
Line 186: Line 205:
 
| CMS
 
| CMS
 
| pilots losing network connections at T1_UK_RAL
 
| pilots losing network connections at T1_UK_RAL
|-
 
| 105571
 
| Red
 
| Less Urgent
 
| In Progress
 
| 2014-05-21
 
| 2014-06-30
 
| LHCb
 
| BDII and SRM publish inconsistent storage capacity numbers
 
 
|-
 
|-
 
| 105405
 
| 105405
Line 223: Line 233:
 
! Day !! OPS !! Alice !! Atlas !! CMS !! LHCb !! Atlas HC !! CMS HC !! Comment
 
! Day !! OPS !! Alice !! Atlas !! CMS !! LHCb !! Atlas HC !! CMS HC !! Comment
 
|-
 
|-
| 25/06/14 || 100 || 100 || style="background-color: lightgrey;" |94.8 || 100 || 100 || 96 || 98 || Several SUM test failures (Invalid Argument).
+
| 02/07/14 || 100 || 100 || 100 || 100 || 100 || 98 || 99 ||
|-
+
| 26/06/14 || 100 || 100 || style="background-color: lightgrey;" |90.6 || style="background-color: lightgrey;" |95.8 || style="background-color: lightgrey;" |92.6 || 90 || 100 || LHCb Castor Stager 2.1.14 upgrade; Atlas: Several SRM test failures; CMS: Single SRM Put test failure.
+
|-
+
| 02/07/14 || 100 || 100 || 100 || 100 || 100 || 100 || 100 ||
+
 
|-
 
|-
| 03/07/14 || 100 || 100 || 100 || 100 || 100 || 100 || 100 ||
+
| 03/07/14 || 100 || 100 || 100 || 100 || 100 || 99 || 100 ||
 
|-
 
|-
| 04/07/14 || 100 || 100 || 100 || 100 || 100 || 100 || 100 ||
+
| 04/07/14 || 100 || 100 || 100 || 100 || 100 || 97 || 100 ||
 
|-
 
|-
| 05/07/14 || 100 || 100 || 100 || 100 || 100 || 100 || 100 ||
+
| 05/07/14 || 100 || 100 || 100 || 100 || 100 || 92 || 100 ||
 
|-
 
|-
| 06/07/14 || 100 || 100 || 100 || 100 || 100 || 100 || 100 ||
+
| 06/07/14 || 100 || 100 || 100 || 100 || 100 || 99 || 100 ||
 
|-
 
|-
| 07/07/14 || 100 || 100 || 100 || 100 || 100 || 100 || 100 ||
+
| 07/07/14 || 100 || 100 || 100 || 100 || 100 || 97 || 100 ||
 
|-
 
|-
| 08/07/14 || 100 || 100 || 100 || 100 || 100 || 100 || 100 ||
+
| 08/07/14 || 100 || 100 || style="background-color: lightgrey;" | 41 || 100 || 100 || 100 || 99 || Atlas Castor upgrade.
 
|}
 
|}
 
<!-- **********************End Availability Report************************** ----->
 
<!-- **********************End Availability Report************************** ----->
 
<!-- *********************************************************************** ----->
 
<!-- *********************************************************************** ----->

Latest revision as of 13:17, 9 July 2014

RAL Tier1 Operations Report for 9th July 2014

Review of Issues during the week 2nd to 9th July 2014.
  • There were problems with the SRM (not Castor) for the GEN instance on Thursday and Friday of last week (3/4 July). It was fixed by a database edit.
  • There were problems with Atlas multicore jobs on Friday 4th July. We believe this is an Atlas issue.
Resolved Disk Server Issues
  • None
Current operational status and issues
  • We are still investigating xroot access to CMS Castor following the upgrade on the 17th June.
  • There is a problem with the dteam SRM regional nagios tests, which may be caused by how dteam is published by the CIP.
Ongoing Disk Server Issues
  • None
Notable Changes made this last week.
  • The Atlas Castor instance was upgraded to version 2.1.14-13 on Tuesday and Wednesday (8th and 9th July). Castor Atlas was returned to production at 10:40 this morning.
Declared in the GOC DB
  • None
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • We are planning the termination of the FTS2 service (announced for 2nd September) now that almost all use is on FTS3.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • None.
  • Networking:
    • Move switches connecting the 2011 disk server batches onto the Tier1 mesh network.
    • Make routing changes to allow the removal of the UKLight Router.
  • Fabric:
    • We are phasing out the use of the software server used by the small VOs.
    • Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
    • There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 2nd and 9th July 2014.
{|
! Service !! Scheduled? !! Outage/At Risk !! Start !! End !! Duration !! Reason
|-
| srm-atlas || SCHEDULED || OUTAGE || 08/07/2014 06:00 || 09/07/2014 10:40 || 1 day, 6 hours || Atlas Castor instance down for Castor 2.1.14 Stager Update
|-
| Castor GEN: srm-alice, srm-biomed, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k || UNSCHEDULED || WARNING || 03/07/2014 07:45 || 03/07/2014 13:00 || 5 hours and 15 minutes || Problem with SRMs for Castor GEN instance. (However Castor itself - e.g. xroot access - working OK).
|-
| Whole site || SCHEDULED || WARNING || 02/07/2014 10:00 || 02/07/2014 11:00 || 1 hour || RAL Tier1 site in warning state due to UPS/generator test.
|}
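The Duration listed for the srm-atlas outage (1 day, 6 hours) appears to match the originally declared end time of 09/07/2014 12:00 rather than the actual return to production at 10:40. The following is a minimal Python sketch of that arithmetic, assuming the day/month/year timestamp format used in the entries above. The helper name `outage_duration` is hypothetical and is not part of any GOC DB tooling.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout, taken from the Start/End columns above: DD/MM/YYYY HH:MM
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def outage_duration(start: str, end: str) -> timedelta:
    """Return the elapsed time between two GOC DB style timestamps (hypothetical helper)."""
    return datetime.strptime(end, GOCDB_FORMAT) - datetime.strptime(start, GOCDB_FORMAT)


# Duration using the originally declared end time (12:00)
declared = outage_duration("08/07/2014 06:00", "09/07/2014 12:00")
# Duration using the time Castor Atlas actually returned to production (10:40)
actual = outage_duration("08/07/2014 06:00", "09/07/2014 10:40")

print(declared)  # 1 day, 6:00:00
print(actual)    # 1 day, 4:40:00
```

On these figures the Atlas Castor instance was back roughly 1 hour 20 minutes ahead of the declared end.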
Open GGUS Tickets (Snapshot during morning of meeting)
{|
! GGUS ID !! Level !! Urgency !! State !! Creation !! Last Update !! VO !! Subject
|-
| 106753 || Green || Less Urgent || In Progress || 2014-07-09 || 2014-07-09 || Atlas || Errors in transfers to RAL-LCG2
|-
| 106695 || Green || Less Urgent || In Progress || 2014-07-08 || 2014-07-08 || Ops || [Rod Dashboard] Issues detected at RAL-LCG2
|-
| 106655 || Green || Less Urgent || In Progress || 2014-07-04 || 2014-07-04 || Ops || [Rod Dashboard] Issues detected at RAL-LCG2 (srm-dteam)
|-
| 106640 || Green || Less Urgent || In Progress || 2014-07-04 || 2014-07-04 || ILC || Failure to submit jobs to RAL-LCG2 CEs
|-
| 106610 || Green || Less Urgent || In Progress || 2014-07-02 || 2014-07-02 || HyperK || HyperK support
|-
| 106324 || Yellow || Urgent || In Progress || 2014-06-18 || 2014-07-01 || CMS || pilots losing network connections at T1_UK_RAL
|-
| 105405 || Red || Urgent || On Hold || 2014-05-14 || 2014-07-01 || || please check your Vidyo router firewall configuration
|}
Availability Report

Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud

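{|
! Day !! OPS !! Alice !! Atlas !! CMS !! LHCb !! Atlas HC !! CMS HC !! Comment
|-
| 02/07/14 || 100 || 100 || 100 || 100 || 100 || 98 || 99 ||
|-
| 03/07/14 || 100 || 100 || 100 || 100 || 100 || 99 || 100 ||
|-
| 04/07/14 || 100 || 100 || 100 || 100 || 100 || 97 || 100 ||
|-
| 05/07/14 || 100 || 100 || 100 || 100 || 100 || 92 || 100 ||
|-
| 06/07/14 || 100 || 100 || 100 || 100 || 100 || 99 || 100 ||
|-
| 07/07/14 || 100 || 100 || 100 || 100 || 100 || 97 || 100 ||
|-
| 08/07/14 || 100 || 100 || 41 || 100 || 100 || 100 || 99 || Atlas Castor upgrade.
|}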