RAL Tier1 weekly operations castor 09/10/2015

Operations News

The big intervention to complete 'step 6' on the 11.2.0.4 DB upgrade (moving the Neptune nodes back to R89 and Pluto to R26) has been completed successfully. The remaining step is to move Pluto back to R89.

---

Draining of Atlas should be complete on Monday - seeing a few more stuck transfers, but could be related to the server type / files

CMS disk read issues much improved.
- Changed I/O scheduler for cmsDisk nodes from 'cfq' to 'noop'. We are now seeing performance roughly in line with other CMS Tier 1 sites.
- CPU efficiency now 85% was 40%

More SRM DB dups. Shaun / Juan possibly looking at making some schema changes to fix this? A Shaun to revisit W/C 12th Oct.

LHCB ticket re. low level of failures likely to be fail in nsmkdir. Look for timeouts in name server logs to prove where timeout is. Mitigation may be to increase thread count on NS. Be a little careful with this one as vastly increasing this could exceed connection limit to DB (number of NS daemons is quite high)

Putdone without a put, 4 threads on Stager for this type of request. Suggest from Shaun (with CERN agreement) increase to 12.

High loading seen on the atlasTape instance due to heavy ATLAS usage. We intend to add another five nodes to the disk buffer to improve throughput early next week.

---

The checksum issue/tickets still present. Thought to be tape recalls/rebalancing (disk to disk) or draining. The source of these needs to be identified. Tickets seem to have increased since June but steady increase so difficult to tie to any particular change.
Atlas moving large amount of data from atlasdisk to tape, slow write to tape experienced. Config changed to allow more tapes to write simultaneously, much improved.
The gridmap file on the the webdav host lcgcadm04.gridpp.rl.ac.uk is not auto-updating.

Tasks

Interventions

Tim to confirm how max number of tape drives is configured
SdW to submit a change control to make I/O scheduler for cmsDisk nodes from 'cfq' to 'noop' change persistent on CMSdisk / into SL6
RA / SdW to contact CERN Dev about checksum issue / partial file creation
SdW to modify cleanlostfiles to log to syslog so we can track its use - under testing
RA/JJ to look at information provider re DiRAC (reporting disk only etc)
All to investigate why we are getting partial files
BC to document processes to control services previously controlled by puppet
GS to arrange meeting castor/fab/production to discuss the decommissioning procedures
GS to investigate providing checks for /etc/noquattor on production nodes & checks for fetch-crl - ONGOING
BD to chase AD about using the space reporting thing we made for him
GS to investigate how/if we need to declare xrootd endpoints in GOCDB BDII

SdW to look into GC improvements - notify if file in inconsistent state
SdW to replace canbemigr test - to test for file that have not been migrated to tape (warn 8h / alert 24h)
SdW testing/working gfalcopy rpms
RA to look into procedural issues with CMS disk server interventions
Someone - mice, what access protocol do they use? A RFIO