RAL Tier1 weekly operations castor 29/02/2016

Operations News

  • No disk server issues this week
  • glibc updates applied, all CASTOR systems rebooted. There were initial issues with the head nodes: 7 failed to reboot due to their build history. ACTION: RA to revisit the quattor build so that this does not recur.
  • The 11.2.0.4 DB client update had to be rescheduled and should go ahead on Monday 29th; it has been running in pre-prod for a considerable amount of time and should be transparent. ACTION: RA

Operations Problems

  • The main CIP system failed; we have failed over to the test CIP machine. The HW failure is to be fixed, then we will fail back over to the production system. ACTION: RA, CC and Fabric
  • OPN links: BD is investigating what data flow is filling the OPN and SuperJANET links at the same time.
  • LHCb job failures - GGUS ticket open.
  • Ongoing AAA issues in CMS.
  • CV '11 generation RAID card controller firmware update required - Western Digital drives are failing at a high rate. ACTION: RA to follow up with the Fabric team.

Planned, Scheduled and Cancelled Interventions

  • The 2.1.15 update to the nameserver will not go ahead, due to slow file-open-time issues on the stager. Testing/debugging of the stager issue is ongoing. If these issues are resolved, the proposal is a nameserver upgrade on 16th March with the stager the following week, 22nd March. (RA)
  • 11.2.0.4 client updates (running in preprod) - possible change control for prod (see above).
  • WAN tuning proposal - possibly to be put into change control. (BD)
  • CASTOR facilities patching scheduled for next week - detailed schedule to be agreed with the Fabric team.

Long-term projects

  • RA has produced a Python script to handle the SRM DB duplication issue which is causing callouts. The script has been tested and will now be put into production as a cron job. This should be a temporary fix, so a bug report should be made to the FTS development team via AL. (A minimal sketch of the approach appears after this list.)
  • JJ – GLUE 2 for CASTOR, used for publishing information. RA is writing the data-gathering end in Python, JJ is writing the GLUE 2 end in LISP. No schedule as yet. (A second sketch after this list illustrates the split.)
  • Facilities drive re-allocation. ACTION: RA
  • SRM 2.1.14 testing with SdW on VCERT
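
The cleanup script itself is referenced above but not included in these minutes; the fragment below is only an illustrative sketch of a cron-driven de-duplication pass, assuming an Oracle-backed SRM database reachable with cx_Oracle. The table and column names (srm_requests, file_name, request_id) and the connection string are hypothetical placeholders, not the actual RAL schema.

    # Illustrative only: cron-driven cleanup of duplicate SRM DB rows.
    # Schema names and connection string below are assumptions, not the RAL schema.
    import logging
    import cx_Oracle

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("srm-dedup")

    FIND_DUPLICATES = """
        SELECT file_name, MIN(request_id)
        FROM srm_requests
        GROUP BY file_name
        HAVING COUNT(*) > 1
    """

    def remove_duplicates(dsn):
        """Keep the earliest request per file and delete the later duplicates."""
        conn = cx_Oracle.connect(dsn)
        try:
            cur = conn.cursor()
            cur.execute(FIND_DUPLICATES)
            for file_name, keep_id in cur.fetchall():
                cur.execute(
                    "DELETE FROM srm_requests "
                    "WHERE file_name = :fn AND request_id <> :keep",
                    fn=file_name, keep=keep_id)
                log.info("removed duplicates for %s (kept request %s)",
                         file_name, keep_id)
            conn.commit()
        finally:
            conn.close()

    if __name__ == "__main__":
        # Run from cron, e.g.: */30 * * * * /usr/bin/python /usr/local/bin/srm_dedup.py
        remove_duplicates("srm_user/secret@SRMDB")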
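Similarly, no design details for the GLUE 2 chain are recorded here; purely to illustrate the split described above (Python gathers the figures, the LISP end renders GLUE 2), the gathering side could emit plain JSON records per service class for the publisher to consume. The query_capacity() helper, the service-class names and the numbers are invented for this sketch.

    # Illustrative only: data-gathering end feeding the GLUE 2 publisher.
    # query_capacity() and the service-class names are hypothetical placeholders;
    # in reality the figures would come from the CASTOR stager/nameserver.
    import json
    import time

    def query_capacity(service_class):
        """Placeholder: return (total_bytes, used_bytes) for a service class."""
        dummy = {"atlasTape": (4.0e15, 3.1e15), "cmsDisk": (2.5e15, 1.9e15)}
        return dummy[service_class]

    def gather(service_classes):
        records = []
        for sc in service_classes:
            total, used = query_capacity(sc)
            records.append({
                "service_class": sc,
                "total_bytes": int(total),
                "used_bytes": int(used),
                "free_bytes": int(total - used),
                "timestamp": int(time.time()),
            })
        return records

    if __name__ == "__main__":
        # The GLUE 2 end would read this JSON and render the published objects.
        print(json.dumps(gather(["atlasTape", "cmsDisk"]), indent=2))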

Advanced Planning

Tasks

  • CASTOR 2.1.15 implementation and testing

Interventions

Staffing

  • CASTOR on-call person next week:
    • RA until Thursday
    • Propose CP Friday - TBC
  • Staff absence/out of the office:
    • BD - Monday-Friday
    • CP - Monday-Tuesday
    • SdW - Tuesday-Wednesday

New Actions

  • RA to revisit quattor build for the head nodes which did not reboot so that this does not recur.
  • RA: 11.2.0.4 DB client update has been rescheduled and should go ahead on Monday 29th.
  • RA, CC and Fabric - fix the CIP production system and switch back from the test server.
  • RA to follow up with Fabric re: CV '11 generation RAID card controller firmware update.

Existing Actions

  • BD to coordinate with ATLAS re bulk deletion before TF starts repack
  • GS arrange a meeting to discuss remaining actions on CV11 and V12 (when KH is back)
  • BD to clarify if separating the DiRAC data is a necessity
  • BD to ensure the ATLAS consistency check is quattorised
  • SdW to send merging tape pools wiki to CERN for review
  • RA to deploy a 14 generation into preprod
  • BD re. WAN tuning proposal - discuss with GS, does it need a change control?
  • RA to try stopping tapeserverd mid-migration to see if it breaks.
  • RA (was SdW) to modify cleanlostfiles to log to syslog so we can track its use - under testing. (A minimal syslog sketch follows this list.)
  • GS to investigate how/if we need to declare xrootd endpoints in GOCDB BDII - progress
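
The cleanlostfiles changes themselves are not shown here; the fragment below is only a generic sketch of the kind of syslog hook the action describes, assuming cleanlostfiles is (or wraps) a Python script. The "cleanlostfiles" tag and message layout are assumptions for illustration.

    # Illustrative only: route cleanlostfiles activity to syslog so usage can be tracked.
    # The "cleanlostfiles" tag and message layout are assumptions for this sketch.
    import getpass
    import logging
    import logging.handlers
    import sys

    def make_syslog_logger():
        log = logging.getLogger("cleanlostfiles")
        log.setLevel(logging.INFO)
        handler = logging.handlers.SysLogHandler(address="/dev/log")
        handler.setFormatter(logging.Formatter("cleanlostfiles: %(message)s"))
        log.addHandler(handler)
        return log

    if __name__ == "__main__":
        log = make_syslog_logger()
        # Record who ran the tool and with what arguments (e.g. the file list being cleaned).
        log.info("run by %s with args %s", getpass.getuser(), " ".join(sys.argv[1:]))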