RAL Tier1 weekly operations castor 17/08/2015
Operations News
- CMS disk read issues much improved.
- Changed the I/O scheduler on the cmsDisk nodes from 'cfq' to 'noop'. We are now seeing performance roughly in line with other CMS Tier 1 sites (see the sketch below).
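As a reference for the scheduler change above, a minimal sketch (not the production tooling) of switching block devices to 'noop' via sysfs is given below; the device names are examples only, and a sysfs write does not persist across reboots (the permanent change would go in via the elevator= kernel parameter or configuration management).

#!/usr/bin/env python
# Minimal sketch: set the 'noop' I/O scheduler on a set of block devices via
# sysfs. Device names are examples, not the actual cmsDisk layout.

DEVICES = ["sda", "sdb"]   # example device names
SCHEDULER = "noop"

def set_scheduler(dev, scheduler):
    path = "/sys/block/%s/queue/scheduler" % dev
    # The sysfs file lists the available schedulers, with the active one in [brackets].
    with open(path) as f:
        available = f.read().replace("[", "").replace("]", "").split()
    if scheduler not in available:
        raise ValueError("%s does not support scheduler '%s'" % (dev, scheduler))
    with open(path, "w") as f:
        f.write(scheduler)

if __name__ == "__main__":
    for dev in DEVICES:
        set_scheduler(dev, SCHEDULER)
        print("%s: scheduler set to %s" % (dev, SCHEDULER))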
Operations Problems
- The large number of checksum tickets seen by the production team is not thought to be due to the rebalancer; the source of these tickets still needs to be identified.
- The gridmap file on the WebDAV host lcgcadm04.gridpp.rl.ac.uk is not auto-updating; this is needed for LHCb and vo.dirac.ac.uk.
Blocking Issues
Planned, Scheduled and Cancelled Interventions
- Stress test the SRM; possible deployment the week after (Shaun).
- Upgrade CASTOR disk servers to SL6
- All servers now PXE boot; the CV11 generation has an issue with an incorrect sdb scheme.
- Oracle patching schedule planned (ending 13th October).
Advanced Planning
Tasks
- Proposed a CASTOR face-to-face meeting for the week commencing 5th or 12th October.
- Discussed CASTOR 2017 planning; see the wiki page.
Interventions
Staffing
- Castor on-call person next week: SdW
- Staff absence/out of the office:
New Actions
- GS to investigate how/if we need to declare xrootd endpoints in GOCDB and the BDII
Existing Actions
- SdW to modify cleanlostfiles to log to syslog so we can track its use (under testing)
- SdW to look into GC improvements; notify if a file is in an inconsistent state
- SdW to replace the canbemigr test with one that tests for files that have not been migrated to tape (warn 8h / alert 24h); see the sketch after this list
- SdW testing/working on the gfal-copy RPMs
- RA/JJ to look at the information provider regarding DiRAC (reporting disk-only, etc.)
- RA to look into procedural issues with CMS disk server interventions
- RA to get jobs thought to cause CMS pileup
- AL to look at optimising lazy download
- RA to undo CMS's 'heart bypass' (unscheduled reads) to see if read rates are improved
- RA to investigate why we are getting partial files
- BC/RA to write the change control doc for the SL6 disk server upgrade
- BC to document processes to control services previously controlled by Puppet
- GS to arrange a castor/fab/production meeting to discuss the decommissioning procedures
- GS to investigate providing checks for /etc/noquattor on production nodes & checks for fetch-crl - ONGOING
- BD to set all non-target ATLAS nodes to read-only to see if it stops the partial-file draining problem
- BD to chase AD about using the space reporting tool we made for him
- JS, RA and GS to propose dates for Oracle 11.2.0.4 patching.
- Someone to find out which access protocol MICE uses
- All to book a meeting with Rob re draining / disk deployment / decommissioning ...
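As a sketch for the canbemigr replacement noted in the list above, the warn/alert logic with the stated thresholds (warn 8h / alert 24h) could look like the following; get_unmigrated_file_ages() is a hypothetical placeholder for whatever query lists files not yet migrated to tape, not an existing CASTOR command.

#!/usr/bin/env python
# Sketch of a Nagios-style check for files awaiting tape migration.
# get_unmigrated_file_ages() is a placeholder; a real check would query CASTOR
# for files not yet on tape and return their ages in hours.

import sys

WARN_HOURS = 8    # warn if the oldest un-migrated file is older than 8h
CRIT_HOURS = 24   # alert if it is older than 24h

def get_unmigrated_file_ages():
    """Placeholder: return a list of ages (in hours) of files not yet on tape."""
    return []

def main():
    ages = get_unmigrated_file_ages()
    oldest = max(ages) if ages else 0.0
    if oldest >= CRIT_HOURS:
        print("CRITICAL: oldest un-migrated file is %.1fh old" % oldest)
        return 2
    if oldest >= WARN_HOURS:
        print("WARNING: oldest un-migrated file is %.1fh old" % oldest)
        return 1
    print("OK: %d files pending migration, oldest %.1fh" % (len(ages), oldest))
    return 0

if __name__ == "__main__":
    sys.exit(main())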
Completed actions
- RA/GS to write some new docs to cover oncall procedures for CMS with the introduction of unscheduled xroot reads
- RA/AD to clarify what we are doing with 'broken' disk servers
- GS to ensure that there is a ping test etc to the atlas building
- BC to put SL6 on preprod disks.
- RA to remove Facilities disk servers from cedaRetrieve to go back to Fabric for acceptance testing.