RAL Tier1 weekly operations castor 01/02/2016
From GridPP Wiki
Latest revision as of 17:45, 29 January 2016
Operations News
- Merging tape pools wiki created by Shaun
- 2.1.15 name server tested
- New SRM on vcert2
- New SRM (SL6) with bug fixes available - needs test
- Gfal-cat command failing for ATLAS reading of nsdumps from castor: https://ggus.eu/index.php?mode=ticket_info&ticket_id=117846. Developers looking to fix this in: https://ggus.eu/index.php?mode=ticket_info&ticket_id=118842
- LHCb batch jobs failing to copy results into castor - the changes made seem to have improved the situation but not fixed it (Raja). Increasing the number of connections to the NS db (more threads)
- BD looking at porting persistent tests to Ceph
Operations Problems
- atlasScratchDisk: gdss667 triple disk failure, most data lost. 2 x callouts - disk servers down and the xroot connection limit hit - SAM test failing.
- gdss620 - gen tape server, still with fabric
- CV '11 gen RAID card controller firmware update required - Western Digital drives failing at a high rate
- fac tape drive broken - Tim has taken it out of the tape pool
Blocking Issues
Planned, Scheduled and Cancelled Interventions
- Merge all castor tape-backed pools into one
- 11.2.0.4 client updates (running in preprod) - possible change control for prod
- WAN tuning proposal - possibly put into change control (Brian)
- CASTOR 2.1.15
- Upgrade of Oracle clients to 11.2.0.4 - planning for production rollout Gen first
Long-term projects
- RA has produced a python script to handle the SRM db duplication issue that is causing callouts. Running the script is blocked because the version of python on the SRM servers is still 2.4; RA will pursue this. SdW has reviewed the script and is confident it is low risk.
- JJ – Glue 2 for CASTOR, something to do with publishing information??? Not sure there was a specific action associated with this
- JS – replacing Tier1 CASTOR db hardware, ACTION: RA/JS to discuss disk requirements
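The duplicate-handling step of RA's script might look something like the sketch below. This is an assumption-heavy illustration: the real script queries the SRM Oracle database, and the row shapes and SURL names here are invented. It deliberately avoids any post-2.4 syntax (no `with`, no comprehensions over sets/dicts) since the SRM servers still run python 2.4.

```python
# Hypothetical sketch of duplicate detection for the SRM db issue.
# Row shapes and SURLs are illustrative only, not the real schema.
# Written to run under python 2.4 as well as modern python.

def find_duplicate_requests(rows):
    """Given (request_id, surl) rows, return SURLs that appear more than once."""
    seen = {}
    duplicates = []
    for request_id, surl in rows:
        if surl in seen:
            if surl not in duplicates:
                duplicates.append(surl)
        else:
            seen[surl] = request_id
    return duplicates

rows = [(1, 'srm://example/file_a'),
        (2, 'srm://example/file_b'),
        (3, 'srm://example/file_a')]
print(find_duplicate_requests(rows))  # -> ['srm://example/file_a']
```

In practice the script would then delete or merge the duplicate rows; keeping detection separate from deletion is what makes the change low risk to review.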
Advanced Planning
Tasks
- CASTOR 2.1.15 implementation and testing
Interventions
- Remaining D0T1 disk servers
Staffing
- Castor on Call person next week
- Rob
- Staff absence/out of the office:
- Brian out Tues 26th - Thurs 4th
- Shaun out 1st Feb for a week, then back in until 1st March
New Actions
- Rob to double check 667 has had its firmware updated
- Chris to document the Corbin method (i.e. when a diskpool is having issues restart xroot daemons first, then try diskmanagerd, finally use the transfermanager stop - diskmanagerd restart - transfermanager start 'nuclear' method)
- Rob to send Shaun’s merging tape pools wiki to CERN for review
- Rob and Shaun to review Alison’s wiki page, docs from castor handover and update castor procedures
- Rob to send email to Kevin O'N re any issues with Fac tape config change level 2
- Tim to apply Fac tape config change level 1 - i.e. a DLS read-only tape on Monday 1st
- Gareth to share reporting available for fac tape
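The Corbin method escalation above could be sketched as a small dry-run script. The service names are taken from the minutes and may not match the exact init script names on the disk servers; treat this as a sketch of the ordering, not a definitive runbook.

```python
import subprocess

# Escalation order from the 'Corbin method': restart the xroot daemons
# first, then diskmanagerd, and only as a last resort the transfermanager
# stop / diskmanagerd restart / transfermanager start 'nuclear' sequence.
# Service names are assumptions based on the minutes.
ESCALATION = [
    ["service", "xrootd", "restart"],          # step 1: xroot daemons
    ["service", "diskmanagerd", "restart"],    # step 2: diskmanagerd
    ["service", "transfermanagerd", "stop"],   # step 3: 'nuclear' sequence
    ["service", "diskmanagerd", "restart"],
    ["service", "transfermanagerd", "start"],
]

def run_escalation(steps, dry_run=True):
    """Print (or execute) each restart step in order."""
    for cmd in steps:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.check_call(cmd)

run_escalation(ESCALATION)  # dry run: just prints the commands
```

Keeping the sequence in one place, with a dry-run default, makes it easier to document and hand over than a set of ad-hoc shell commands.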
Existing Actions
- Juan to discuss with Shaun what to do with the workaround applied to “subrequesttodo” procedure in CASTOR 2.1.15.
- Rob to create castor 14 gen build (for Ceph)
- Shaun to present plan to merge tape-backed pools into one on 29th Jan
- Rob to look at DLS / CEDA data rates in relation
- BD re. WAN tuning proposal - discuss with GS, does it need a change control?
- RA to try stopping tapeserverd mid-migration to see if it breaks.
- RA (was SdW) to modify cleanlostfiles to log to syslog so we can track its use - under testing
- GS to investigate providing checks for /etc/noquattor on production nodes & checks for
- GS to investigate how/if we need to declare xrootd endpoints in GOCDB BDII - progress
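The cleanlostfiles syslog change in the existing actions above might look something like this sketch. The tag and message format are assumptions (the real script's interface is not described in the minutes); the point is simply that each invocation leaves an auditable record in the system log.

```python
import syslog

# Hedged sketch of logging cleanlostfiles activity to syslog so its use
# can be tracked. Tag, facility, and message format are assumptions.

def format_audit(castor_path, operator):
    """Build the one-line audit record written for each cleaned file."""
    return "cleanlostfiles: removed %s (run by %s)" % (castor_path, operator)

def log_cleaned_file(castor_path, operator):
    """Write one audit line per cleaned file to the daemon syslog facility."""
    syslog.openlog("cleanlostfiles", syslog.LOG_PID, syslog.LOG_DAEMON)
    syslog.syslog(syslog.LOG_INFO, format_audit(castor_path, operator))
    syslog.closelog()

# Hypothetical example path in the RAL castor namespace:
log_cleaned_file("/castor/example/some/lost/file", "ra")
```

Because the records go through syslog rather than a local file, existing log aggregation on the disk servers would pick them up without further changes.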