RAL Tier1 weekly operations castor 10/08/2015

Operations News

Operations Problems

  • CMS still unhappy. We believe that the current blocker is the read rate from disk; for this reason we are looking at undoing the 'heart bypass' implemented for CMS and using the scheduler again.
    • AL is trying to reproduce this on preprod.
  • Removal of the 'heart bypass' may have led to timeouts.
    • RA has contacted the CERN developers for a fix.
  • The large numbers of checksum tickets seen by the production team are thought to be due to the rebalancer.
    • Appears not to be caused by the rebalance un-sticking script.

  • The gridmap file on the webdav host lcgcadm04.gridpp.rl.ac.uk is not auto-updating - this is needed for LHCb and vo.dirac.ac.uk. A staleness-check sketch follows below.
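
A minimal staleness-check sketch for the gridmap problem above. It assumes the map file lives at the conventional /etc/grid-security/grid-mapfile path and that an update is expected at least every 6 hours; both the path and the threshold are assumptions, not confirmed values for lcgcadm04.

    #!/usr/bin/env python
    # Staleness check for the grid-mapfile on the WebDAV host (sketch).
    # The path and update cadence below are assumptions - adjust to the real deployment.
    import os
    import sys
    import time

    GRIDMAP = "/etc/grid-security/grid-mapfile"   # assumed location
    MAX_AGE_HOURS = 6                             # assumed update cadence

    def main():
        try:
            age_hours = (time.time() - os.path.getmtime(GRIDMAP)) / 3600.0
        except OSError as err:
            print("CRITICAL: cannot stat %s: %s" % (GRIDMAP, err))
            return 2
        if age_hours > MAX_AGE_HOURS:
            print("WARNING: %s last updated %.1f h ago (limit %d h)"
                  % (GRIDMAP, age_hours, MAX_AGE_HOURS))
            return 1
        print("OK: %s updated %.1f h ago" % (GRIDMAP, age_hours))
        return 0

    if __name__ == "__main__":
        sys.exit(main())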

Blocking Issues

  • GridFTP bug in SL6 - stops any globus copy if a client is using a particular library. This is a show-stopper for SL6 on disk servers.

Planned, Scheduled and Cancelled Interventions

  • Stress test the SRM; possible deployment the week after (Shaun).
  • Upgrade CASTOR disk servers to SL6
    • All servers now PXE boot; the CV11 servers have an issue with an incorrect sdb scheme.
  • Oracle patching schedule planned (ending 13th October).

Advanced Planning

Tasks

  • Proposed CASTOR face-to-face meeting, week commencing Oct 5th or 12th.
  • Discussed CASTOR 2017 planning, see wiki page.

Interventions

Staffing

  • CASTOR on-call person next week
    • RA
  • Staff absence/out of the office:

New Actions

  • GS to investigate how/whether we need to declare xrootd endpoints in GOCDB/BDII

Existing Actions

  • SdW to modify cleanlostfiles to log to syslog so we can track its use - under testing
  • SdW to look into GC improvements - notify if a file is in an inconsistent state
  • SdW to replace the canbemigr test - to test for files that have not been migrated to tape (warn 8h / alert 24h); see the threshold sketch after this list
  • SdW testing/working gfalcopy rpms
  • RA/JJ to look at information provider re DiRAC (reporting disk only etc)
  • RA to look into procedural issues with CMS disk server interventions
  • RA to get jobs thought to cause CMS pileup
    • AL to look at optimising lazy download
  • RA to undo CMS's 'heart bypass' (unscheduled reads) to see if read rates are improved
  • RA to investigate why we are getting partial files
  • BC / RA to write the change control document for SL6 disk servers
  • BC to document processes to control services previously controlled by puppet
  • GS to arrange a meeting between CASTOR, Fabric and Production to discuss the decommissioning procedures
  • GS to investigate providing checks for /etc/noquattor on production nodes & checks for fetch-crl - ONGOING
  • BD to set all non-target ATLAS nodes to read-only to see if it stops the partial-file draining problem
  • BD to chase AD about using the space-reporting tool we made for him
  • JS, RA and GS to propose dates for Oracle 11.2.0.4 patching.
  • Someone to find out what access protocol MICE use
  • All to book a meeting with Rob re draining / disk deployment / decommissioning ...
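
Re the canbemigr replacement above: a minimal sketch of the warn-8h / alert-24h threshold logic only. It assumes the caller has already obtained, via whatever CASTOR name-server query is appropriate, the creation times of the files still waiting to go to tape; that query is not shown, and the function and variable names here are illustrative.

    #!/usr/bin/env python
    # Threshold logic for a canbemigr-style "not yet on tape" check (sketch).
    import time

    WARN_HOURS = 8    # warn threshold from the action item
    ALERT_HOURS = 24  # alert threshold from the action item

    def classify(unmigrated_ctimes, now=None):
        """Return (exit_code, message) for a list of creation times (epoch seconds)."""
        now = time.time() if now is None else now
        if not unmigrated_ctimes:
            return 0, "OK: no files waiting for tape migration"
        oldest_hours = max((now - t) / 3600.0 for t in unmigrated_ctimes)
        if oldest_hours >= ALERT_HOURS:
            return 2, "ALERT: oldest un-migrated file is %.1f h old" % oldest_hours
        if oldest_hours >= WARN_HOURS:
            return 1, "WARN: oldest un-migrated file is %.1f h old" % oldest_hours
        return 0, "OK: oldest un-migrated file is %.1f h old" % oldest_hours

    if __name__ == "__main__":
        # Example with two hypothetical files created 2 h and 30 h ago.
        now = time.time()
        code, message = classify([now - 2 * 3600, now - 30 * 3600], now)
        print(message)  # -> ALERT: oldest un-migrated file is 30.0 h old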

Completed actions

  • RA/GS to write some new docs to cover on-call procedures for CMS with the introduction of unscheduled xroot reads
  • RA/AD to clarify what we are doing with 'broken' disk servers
  • GS to ensure that there is a ping test etc to the atlas building
  • BC to put SL6 on preprod disks.
  • RA to remove Facilities disk servers from cedaRetrieve to go back to Fabric for acceptance testing.