RAL Tier1 weekly operations castor 10/08/2015
From GridPP Wiki
Latest revision as of 14:09, 7 August 2015
Operations News
Operations Problems
- CMS still upset. We believe that the current blocker is the read rate from disk - for this reason we are looking at undoing the 'heart bypass' implemented for CMS and using the scheduler again.
- AL is trying to reproduce this on preprod.
- Removal of 'heart bypass' may have led to timeouts.
- RA has contacted the CERN developers for a fix.
- The large numbers of checksum tickets seen by the production team are thought to be due to the rebalancer. This appears not to be caused by the rebalance un-sticking script.
- The gridmap file on the webdav host lcgcadm04.gridpp.rl.ac.uk is not auto-updating - needed for LHCb and vo.dirac.ac.uk.
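The stuck gridmap auto-update above is the sort of failure a simple freshness probe can catch. A minimal sketch, assuming an illustrative file path and a 2-hour staleness threshold (the real update interval and path on lcgcadm04 may differ):

```python
import os
import time

def gridmap_is_stale(path, max_age_seconds=7200, now=None):
    """Return True if the grid-mapfile is missing or has not been
    refreshed within max_age_seconds (2h is an assumed threshold)."""
    if now is None:
        now = time.time()
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return True  # a missing file counts as stale
    return (now - mtime) > max_age_seconds
```

A nightly cron or Nagios-style wrapper calling this would have flagged the non-updating file well before LHCb or vo.dirac.ac.uk noticed.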
Blocking Issues
- GridFTP bug in SL6 - stops any globus copy if a client is using a particular library. This is a show-stopper for SL6 on disk servers.
Planned, Scheduled and Cancelled Interventions
- Stress test the new SRM; possible deployment the week after (Shaun).
- Upgrade CASTOR disk servers to SL6
- All servers now PXE boot; CV11 machines have an issue with an incorrect sdb partitioning scheme.
- Oracle patching schedule planned (ends 13th October).
Advanced Planning
Tasks
- Proposed CASTOR face to face W/C Oct 5th or 12th
- Discussed CASTOR 2017 planning, see wiki page.
Interventions
Staffing
- Castor on Call person next week
- RA
- Staff absence/out of the office:
New Actions
- GS to investigate how/if we need to declare xrootd endpoints in GOCDB BDII
Existing Actions
- SdW to modify cleanlostfiles to log to syslog so we can track its use - under testing
- SdW to look into GC improvements - notify if file in inconsistent state
- SdW to replace canbemigr test - to test for files that have not been migrated to tape (warn 8h / alert 24h)
- SdW testing/working on gfalcopy RPMs
- RA/JJ to look at information provider re DiRAC (reporting disk only etc)
- RA to look into procedural issues with CMS disk server interventions
- RA to get jobs thought to cause CMS pileup
- AL to look at optimising lazy download
- RA to undo CMS's 'heart bypass' (unscheduled reads) to see if read rates are improved
- RA to investigate why we are getting partial files
- BC / RA to write change control doc for SL6 disk
- BC to document processes to control services previously controlled by puppet
- GS to arrange meeting castor/fab/production to discuss the decommissioning procedures
- GS to investigate providing checks for /etc/noquattor on production nodes & checks for fetch-crl - ONGOING
- BD to set all non-target ATLAS nodes to R/Only to see if it stops partial file draining problem
- BD to chase AD about using the space-reporting tool we made for him
- JS, RA and GS to propose dates for Oracle 11.2.0.4 patching.
- Someone to check what access protocol MICE use.
- All to book meeting with Rob re draining / disk deployment / decommissioning ...
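The /etc/noquattor and fetch-crl checks in GS's ongoing action above could look something like this sketch. The paths are the conventional ones (the noquattor flag file and a fetch-crl cron entry); the actual production check and its alerting mechanism are assumptions:

```python
import os

def check_node(root="/"):
    """Return a list of problem strings for one node.

    Assumed conventions: /etc/noquattor flags a node excluded from
    Quattor management, and fetch-crl installs a cron job in
    /etc/cron.d/fetch-crl to keep CRLs fresh.
    """
    problems = []
    if os.path.exists(os.path.join(root, "etc/noquattor")):
        problems.append("noquattor flag set - node not under Quattor control")
    if not os.path.exists(os.path.join(root, "etc/cron.d/fetch-crl")):
        problems.append("fetch-crl cron job missing - CRLs may go stale")
    return problems
```

Run across the production nodes, an empty list per node would mean both checks pass; anything returned would feed whatever ticketing or Nagios alerting the team prefers.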
Completed actions
- RA/GS to write some new docs to cover oncall procedures for CMS with the introduction of unscheduled xroot reads
- RA/AD to clarify what we are doing with 'broken' disk servers
- GS to ensure that there is a ping test etc to the atlas building
- BC to put SL6 on preprod disks.
- RA to remove Facilities disk servers from cedaRetrieve to go back to Fabric for acceptance testing.