RAL Tier1 weekly operations castor 29/02/2016

Operations News

  • No disk server issues this week
  • glibc updates applied, all CASTOR systems rebooted. There were initial issues with the head nodes: 7 failed to reboot due to their build history. ACTION: RA to revisit the quattor build so that this does not recur.
  • The 11.2.0.4 DB client update had to be rescheduled and should go ahead on Monday 29th; it has been running in pre-prod for a considerable amount of time and should be transparent. ACTION: RA

Operations Problems

  • The main CIP system failed; we have failed over to the test CIP machine. The HW failure is to be fixed, then we will fail back over to the production system. ACTION: RA, CC and Fabric
  • OPN links: BD is investigating what data flow is filling the OPN and SuperJANET links at the same time.
  • LHCb job failures - GGUS ticket open.
  • Ongoing AAA issues in CMS.
  • CV '11 generation RAID card controller firmware update required - Western Digital drives are failing at a high rate. ACTION: RA to follow up with the Fabric team.

Planned, Scheduled and Cancelled Interventions

  • The 2.1.15 update to the nameserver will not go ahead, due to slow file-open-time issues on the stager. Testing/debugging of the stager issue is ongoing. If these issues are resolved, the proposal is a nameserver upgrade on 16th March with the stager the following week, 22nd March. (RA)
  • 11.2.0.4 client updates (running in preprod) - possible change control for prod (see above).
  • WAN tuning proposal - possibly to be put into change control. (BD)
  • CASTOR facilities patching scheduled for next week - detailed schedule to be agreed with the Fabric team.

Long-term projects

  • RA has produced a Python script to handle the SRM DB duplication issue which is causing callouts. The script has been tested and will now be put into production as a cron job. This should be a temporary fix, so a bug report should be made to the FTS development team via AL. (A minimal sketch of the approach appears after this list.)
  • JJ – GLUE 2 for CASTOR, used for publishing information. RA is writing the data-gathering end in Python, JJ is writing the GLUE 2 end in LISP. No schedule as yet. (A second sketch after this list illustrates the split.)
  • Facilities drive re-allocation. ACTION: RA
  • SRM 2.1.14 testing with SdW on VCERT
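
The cleanup script itself is referenced above but not included in these minutes; the fragment below is only an illustrative sketch of a cron-driven de-duplication pass, assuming an Oracle-backed SRM database reachable with cx_Oracle. The table and column names (srm_requests, file_name, request_id) and the connection string are hypothetical placeholders, not the actual RAL schema.

    # Illustrative only: cron-driven cleanup of duplicate SRM DB rows.
    # Schema names and connection string below are assumptions, not the RAL schema.
    import logging
    import cx_Oracle

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("srm-dedup")

    FIND_DUPLICATES = """
        SELECT file_name, MIN(request_id)
        FROM srm_requests
        GROUP BY file_name
        HAVING COUNT(*) > 1
    """

    def remove_duplicates(dsn):
        """Keep the earliest request per file and delete the later duplicates."""
        conn = cx_Oracle.connect(dsn)
        try:
            cur = conn.cursor()
            cur.execute(FIND_DUPLICATES)
            for file_name, keep_id in cur.fetchall():
                cur.execute(
                    "DELETE FROM srm_requests "
                    "WHERE file_name = :fn AND request_id <> :keep",
                    fn=file_name, keep=keep_id)
                log.info("removed duplicates for %s (kept request %s)",
                         file_name, keep_id)
            conn.commit()
        finally:
            conn.close()

    if __name__ == "__main__":
        # Run from cron, e.g.: */30 * * * * /usr/bin/python /usr/local/bin/srm_dedup.py
        remove_duplicates("srm_user/secret@SRMDB")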
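Similarly, no design details for the GLUE 2 chain are recorded here; purely to illustrate the split described above (Python gathers the figures, the LISP end renders GLUE 2), the gathering side could emit plain JSON records per service class for the publisher to consume. The query_capacity() helper, the service-class names and the numbers are invented for this sketch.

    # Illustrative only: data-gathering end feeding the GLUE 2 publisher.
    # query_capacity() and the service-class names are hypothetical placeholders;
    # in reality the figures would come from the CASTOR stager/nameserver.
    import json
    import time

    def query_capacity(service_class):
        """Placeholder: return (total_bytes, used_bytes) for a service class."""
        dummy = {"atlasTape": (4.0e15, 3.1e15), "cmsDisk": (2.5e15, 1.9e15)}
        return dummy[service_class]

    def gather(service_classes):
        records = []
        for sc in service_classes:
            total, used = query_capacity(sc)
            records.append({
                "service_class": sc,
                "total_bytes": int(total),
                "used_bytes": int(used),
                "free_bytes": int(total - used),
                "timestamp": int(time.time()),
            })
        return records

    if __name__ == "__main__":
        # The GLUE 2 end would read this JSON and render the published objects.
        print(json.dumps(gather(["atlasTape", "cmsDisk"]), indent=2))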

Advanced Planning

Tasks

  • CASTOR 2.1.15 implementation and testing

Interventions

Staffing

  • CASTOR on-call person next week:
    • RA until Thursday
    • Propose CP Friday - TBC
  • Staff absence/out of the office:
    • BD - Monday-Friday
    • CP - Monday-Tuesday
    • SdW - Tuesday-Wednesday

New Actions

  • RA to revisit quattor build for the head nodes which did not reboot so that this does not recur.
  • RA: 11.2.0.4 DB client update has been rescheduled and should go ahead on Monday 29th.
  • RA, CC and Fabric - fix the CIP production system and switch back from the test server.
  • RA to follow up with Fabric re: CV '11 generation RAID card controller firmware update.

Existing Actions

  • BD to coordinate with ATLAS re bulk deletion before TF starts repack
  • GS arrange a meeting to discuss remaining actions on CV11 and V12 (when KH is back)
  • BD to clarify if separating the DiRAC data is a necessity
  • BD to ensure the ATLAS consistency check is quattorised
  • SdW to send merging tape pools wiki to CERN for review
  • RA to deploy a 14 generation into preprod
  • BD re. WAN tuning proposal - discuss with GS, does it need a change control?
  • RA to try stopping tapeserverd mid-migration to see if it breaks.
  • RA (was SdW) to modify cleanlostfiles to log to syslog so we can track its use - under testing. (A minimal syslog sketch follows this list.)
  • GS to investigate how/if we need to declare xrootd endpoints in GOCDB BDII - progress
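
The cleanlostfiles changes themselves are not shown here; the fragment below is only a generic sketch of the kind of syslog hook the action describes, assuming cleanlostfiles is (or wraps) a Python script. The "cleanlostfiles" tag and message layout are assumptions for illustration.

    # Illustrative only: route cleanlostfiles activity to syslog so usage can be tracked.
    # The "cleanlostfiles" tag and message layout are assumptions for this sketch.
    import getpass
    import logging
    import logging.handlers
    import sys

    def make_syslog_logger():
        log = logging.getLogger("cleanlostfiles")
        log.setLevel(logging.INFO)
        handler = logging.handlers.SysLogHandler(address="/dev/log")
        handler.setFormatter(logging.Formatter("cleanlostfiles: %(message)s"))
        log.addHandler(handler)
        return log

    if __name__ == "__main__":
        log = make_syslog_logger()
        # Record who ran the tool and with what arguments (e.g. the file list being cleaned).
        log.info("run by %s with args %s", getpass.getuser(), " ".join(sys.argv[1:]))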