RAL Tier1 weekly operations castor 29/02/2016
From GridPP Wiki
Latest revision as of 11:32, 26 February 2016
Operations News
- No disk server issues this week
- glibc updates applied and all CASTOR systems rebooted. Initial issues with the head nodes: 7 failed to reboot due to their build history. ACTION: RA to revisit the Quattor build so that this does not recur.
- The 11.2.0.4 DB client update had to be rescheduled and should go ahead on Monday 29th; it has been running in pre-prod for a considerable amount of time. This should be transparent. ACTION: RA
Operations Problems
- The main CIP system failed; we have failed over to the test CIP machine. The HW failure is to be fixed, then we will fail back to the production system. ACTION: RA, CC and Fabric
- OPN links: BD is investigating what data flow is filling the OPN and SuperJANET at the same time.
- LHCb job failures: a GGUS ticket is open.
- Ongoing AAA issues in CMS.
- CV '11 generation RAID card controller firmware update required: Western Digital drives are failing at a high rate. ACTION: RA to follow up with the Fabric team
Planned, Scheduled and Cancelled Interventions
- The 2.1.15 update to the nameserver will not go ahead, due to slow file-open-time issues on the stager. Testing/debugging of the stager issue is ongoing. If these issues are resolved, the proposal is to upgrade the nameserver on 16th March, with the stager the following week, 22nd March. (RA)
- 11.2.0.4 client updates (running in pre-prod): possible change control for production (see above).
- WAN tuning proposal: possibly put into change control. (BD)
- CASTOR Facilities patching is scheduled for next week; the detailed schedule is to be agreed with the Fabric team.
Long-term projects
- RA has produced a Python script to handle the SRM DB duplication issue that has been causing callouts. The script has been tested and will now be put into production as a cron job. This should be a temporary fix, so a bug report should be made to the FTS development team, via AL.
- JJ: GLUE 2 for CASTOR, used for publishing information. RA is writing the data-gathering end in Python; JJ is writing the GLUE 2 end in LISP. No schedule as yet.
- Facilities drive re-allocation. ACTION: RA
- SRM 2.1.14 testing with SdW on VCERT
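The duplicate-handling logic of the SRM cleanup script above might look roughly like the following sketch. This is an illustration only: the real cron job runs against the SRM database, whereas here the rows are plain in-memory tuples, and the assumed row shape (`row_id`, `file_id`, `created`) is hypothetical, not the actual schema.

```python
# Hedged sketch of the duplicate-detection step of the SRM cleanup script.
# Rows are assumed to be (row_id, file_id, created_timestamp) tuples; the
# real script would fetch these from the SRM DB and delete the returned rows.

def find_duplicate_rows(rows):
    """Return row_ids of duplicate entries per file_id, keeping the
    oldest row (lowest created timestamp) for each file."""
    by_file = {}
    for row_id, file_id, created in rows:
        by_file.setdefault(file_id, []).append((created, row_id))

    to_delete = []
    for entries in by_file.values():
        entries.sort()  # oldest entry first
        to_delete.extend(row_id for _, row_id in entries[1:])  # drop the rest
    return to_delete

if __name__ == "__main__":
    sample = [
        (1, "fileA", 100),
        (2, "fileA", 200),   # later duplicate of fileA -> flagged
        (3, "fileB", 150),
    ]
    print(find_duplicate_rows(sample))  # prints [2]
```

Run from cron, the script would delete the flagged rows and log what it removed; as noted above, this is a stop-gap pending a proper fix from the FTS developers.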
Advanced Planning
Tasks
- CASTOR 2.1.15 implementation and testing
Interventions
Staffing
- CASTOR on-call person next week:
- RA until Thursday
- Propose CP Friday - TBC
- Staff absence/out of the office:
- BD - Monday-Friday
- CP - Monday-Tuesday
- SdW - Tuesday-Wednesday
New Actions
- RA to revisit the Quattor build for the head nodes which did not reboot, so that this does not recur.
- RA: the 11.2.0.4 DB client update had to be rescheduled; it should go ahead on Monday 29th.
- RA, CC and Fabric: fix the CIP production system and switch back from the test server.
- RA to follow up with Fabric re: the CV '11 generation RAID card controller firmware update.
Existing Actions
- BD to coordinate with ATLAS re: bulk deletion before TF starts the repack.
- GS to arrange a meeting to discuss remaining actions on CV11 and V12 (when KH is back).
- BD to clarify whether separating the DiRAC data is a necessity.
- BD to ensure the ATLAS consistency check is Quattorised.
- SdW to send the merging-tape-pools wiki page to CERN for review.
- RA to deploy a 14 generation into pre-prod.
- BD re: the WAN tuning proposal - discuss with GS; does it need a change control?
- RA to try stopping tapeserverd mid-migration to see if it breaks.
- RA (was SdW) to modify cleanlostfiles to log to syslog so we can track its use - under testing.
- GS to investigate how/if we need to declare xrootd endpoints in GOCDB BDII - in progress.
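The cleanlostfiles syslog action above could be implemented along these lines. A minimal sketch, assuming Python's standard `syslog` module; the message format and the `cleanlostfiles` invocation details shown are assumptions, not the tool's actual interface.

```python
import sys
import syslog

def usage_message(user, castor_path):
    """Build the audit line recorded each time cleanlostfiles is invoked."""
    return "cleanlostfiles invoked by %s on %s" % (user, castor_path)

def log_invocation(user, castor_path):
    # Tag entries with a fixed ident so uses are easy to grep in syslog.
    syslog.openlog(ident="cleanlostfiles", facility=syslog.LOG_DAEMON)
    syslog.syslog(syslog.LOG_INFO, usage_message(user, castor_path))
    syslog.closelog()

if __name__ == "__main__":
    # Hypothetical arguments for illustration only.
    path = sys.argv[1] if len(sys.argv) > 1 else "/castor/example"
    log_invocation("ra", path)
    print(usage_message("ra", path))
```

Logging each invocation this way would let the team track how often (and on what paths) the tool is actually used, which is the point of the action.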