RAL Tier1 weekly operations castor 17/08/2015
From GridPP Wiki
Latest revision as of 13:26, 14 August 2015
Operations News
- CMS disk read issues much improved.
  - Changed I/O scheduler for cmsDisk nodes from 'cfq' to 'noop'. We are now seeing performance roughly in line with other CMS Tier 1 sites.
- Disk server draining on hold due to ATLAS being very full
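For reference, the scheduler change noted above can be verified per block device. On Linux, `/sys/block/<device>/queue/scheduler` lists the available schedulers with the active one in square brackets (for example `noop deadline [cfq]`). The following is only a minimal sketch of reading that file; the function names and the device name passed in are illustrative assumptions, not part of any CASTOR tooling.

```python
import re
from pathlib import Path


def active_scheduler(sysfs_text):
    """Return the scheduler shown in square brackets, or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    return match.group(1) if match else None


def read_scheduler(device):
    """Read the active I/O scheduler for a block device name such as 'sda'.

    The device name is supplied by the caller; nothing here is specific
    to the cmsDisk nodes.
    """
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return active_scheduler(path.read_text())
```

Calling `read_scheduler("sda")` on a disk server would return the active scheduler name for that device, assuming the device exists and exposes the sysfs file.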
Operations Problems
- The large number of checksum tickets seen by the production team is not thought to be due to the rebalancer. The source of these needs to be identified. Shaun is investigating.
- The gridmap file on the webdav host lcgcadm04.gridpp.rl.ac.uk is not auto-updating.
Blocking Issues
Planned, Scheduled and Cancelled Interventions
- Stress test SRM; possibly deploy the week after (Shaun)
- Upgrade CASTOR disk servers to SL6
  - One tape-backed node being upgraded, more tape-backed nodes next week.
- Oracle patching schedule planned. (End 13th October)
Advanced Planning
Tasks
- Proposed CASTOR face to face W/C Oct 5th or 12th
- Discussed CASTOR 2017 planning, see wiki page.
Interventions
Staffing
- CASTOR on-call person next week
  - SdW
- Staff absence/out of the office:
New Actions
Existing Actions
- SdW to modify cleanlostfiles to log to syslog so we can track its use - under testing
- SdW to look into GC improvements - notify if file in inconsistent state
- SdW to replace canbemigr test - to test for files that have not been migrated to tape (warn 8h / alert 24h)
- SdW testing/working gfalcopy rpms
- RA/JJ to look at information provider re DiRAC (reporting disk only etc)
- RA to look into procedural issues with CMS disk server interventions
- RA to investigate why we are getting partial files
- BC to document processes to control services previously controlled by puppet
- GS to arrange meeting castor/fab/production to discuss the decommissioning procedures
- GS to investigate providing checks for /etc/noquattor on production nodes & checks for fetch-crl - ONGOING
- BD to chase AD about using the space reporting thing we made for him
- Someone - MICE: what access protocol do they use?
- GS to investigate how/if we need to declare xrootd endpoints in GOCDB BDII
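
As an illustration of the warn 8h / alert 24h thresholds mentioned for the canbemigr replacement in the actions above, a check might classify a file by how long it has waited for migration to tape. This is a hedged sketch only: the function name, the returned status strings, and the choice to treat the thresholds as inclusive are assumptions, and it does not reflect how the actual test is implemented.

```python
from datetime import datetime, timedelta

# Thresholds taken from the action item (warn 8h / alert 24h).
WARN_AFTER = timedelta(hours=8)
ALERT_AFTER = timedelta(hours=24)


def migration_status(created, now):
    """Classify an unmigrated file by its age.

    Hypothetical helper: 'created' is when the file became a migration
    candidate, 'now' is the time of the check. Thresholds are treated
    as inclusive, which is an assumption.
    """
    age = now - created
    if age >= ALERT_AFTER:
        return "alert"
    if age >= WARN_AFTER:
        return "warn"
    return "ok"
```

A monitoring probe could apply `migration_status` to the oldest unmigrated file and report the resulting state.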