RAL Tier1 weekly operations castor 31/08/2015
Operations News
- SL6 disk server deployment – successful
- Oracle prep for 11.2.0.4 on Neptune completed
- Draining – 7 servers left to drain out of ATLAS
- CMS disk read issues much improved.
- Changed the I/O scheduler on the cmsDisk nodes from 'cfq' to 'noop' (a sketch of the change is given after this list). We are now seeing performance roughly in line with other CMS Tier 1 sites.
- CPU efficiency is now 85% (previously 40%).
- Disk server draining on hold due to ATLAS being very full
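For reference, a minimal sketch of how the 'cfq' to 'noop' switch can be applied to a running node via sysfs. The device pattern and paths are assumptions for a standard SL5/SL6 kernel; this is illustrative rather than the exact procedure used on the cmsDisk nodes.
 #!/usr/bin/env python
 # Sketch: switch the I/O scheduler to 'noop' for every /dev/sd* device via sysfs.
 # Assumes the standard kernel sysfs layout; run as root and test on one node first.
 import glob
 
 TARGET = "noop"
 
 for sched_path in glob.glob("/sys/block/sd*/queue/scheduler"):
     with open(sched_path) as f:
         current = f.read().strip()   # e.g. "noop anticipatory deadline [cfq]"
     if "[%s]" % TARGET in current:
         continue                     # already using the target scheduler
     with open(sched_path, "w") as f:
         f.write(TARGET)              # the kernel switches scheduler immediately
     print("%s: %s -> %s" % (sched_path, current, TARGET))
Note that the sysfs setting does not survive a reboot; making it persistent (for example via the 'elevator=noop' kernel boot parameter or the SL6 build) is the subject of the change control under Existing Actions.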
Operations Problems
- The checksum issue and associated tickets are still present. Thought to be caused by tape recalls, disk-to-disk rebalancing, or draining; the source needs to be identified.
- External access problem – seems to have crept in over the period of the SL6 disk server deployment and the network interventions. Note there are now no iptables rules on the disk servers, including the ability to SSH directly to a disk server – BLOCK.
- AAA – the CMS xrootd redirection (DLF node) has not been working for some time. Shaun is working on it.
- Some of the Ganglia monitoring is not working on disk servers that have been updated to SL6.
- The ATLAS FAX test has been down since last Friday (xrdproxy).
- ATLAS is moving a large amount of data from atlasdisk to tape and slow writes to tape were experienced. The configuration was changed to allow writing to more tapes simultaneously, which has much improved matters.
- GGUS ticket from ATLAS – a file recalled from tape came back slightly larger than expected and with an incorrect checksum; it was also staged with the incorrect checksum. This needs further investigation.
- The gridmap file on the WebDAV host lcgcadm04.gridpp.rl.ac.uk is not auto-updating.
Blocking Issues
Planned, Scheduled and Cancelled Interventions
- Tape-backed node to SL6
- Pluto prep for 11.2.0.4
- Looking to schedule disk server SL6 deployments for next week:
  - plan to put the servers at risk all day (Wed/Thu)
  - all VOs / all CASTOR disks
  - action by server number, taking the highest number first (newest / biggest)
- GDSS720 / GDSS707 draining ahead of CPU / motherboard fixes
- Stress test the SRM, with possible deployment the week after (Shaun)
- Upgrade CASTOR disk servers to SL6
- One tape-backed node is being upgraded; more tape-backed nodes to follow next week.
- Oracle patching schedule planned (ending 13th October).
Advanced Planning
Tasks
- Proposed CASTOR face-to-face meeting, week commencing 5th or 12th October
- Discussed CASTOR 2017 planning; see the wiki page.
Interventions
Staffing
- Castor on-call person next week:
- RA (will need to confirm when Rob is taking over from Shaun this long weekend)
- Staff absence/out of the office:
- Bruno out
- Brian out
New Actions
- Tim to confirm how max number of tape drives is configured
Existing Actions
- GS to review tickets in an attempt to identify when checksum/partial file creation issues started
- SdW to submit a change control to make the 'cfq' to 'noop' I/O scheduler change persistent on cmsDisk and to carry it into the SL6 build
- RA / SdW to contact CERN Dev about checksum issue / partial file creation
- SdW to modify cleanlostfiles to log to syslog so we can track its use - under testing
- RA/JJ to look at information provider re DiRAC (reporting disk only etc)
- All to investigate why we are getting partial files
- BC to document processes to control services previously controlled by puppet
- GS to arrange a meeting between the CASTOR, Fabric and Production teams to discuss the decommissioning procedures
- GS to investigate providing checks for /etc/noquattor on production nodes and checks for fetch-crl (a sketch of one possible check follows this list) - ONGOING
- BD to chase AD about using the space reporting tool we made for him
- GS to investigate how/whether we need to declare xrootd endpoints in GOCDB / the BDII
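A minimal sketch of the kind of check proposed above; the 24-hour CRL age threshold and the Nagios-style exit codes are placeholders, not agreed values.
 #!/usr/bin/env python
 # Sketch: warn if /etc/noquattor is present on a production node, and warn if
 # fetch-crl has not refreshed any CRL (*.r0) under /etc/grid-security/certificates
 # within the last 24 hours. Exit codes follow the usual Nagios convention.
 import glob
 import os
 import sys
 import time
 
 MAX_CRL_AGE = 24 * 3600   # assumed threshold
 CERT_DIR = "/etc/grid-security/certificates"
 
 status = 0
 
 if os.path.exists("/etc/noquattor"):
     print("WARNING: /etc/noquattor present - Quattor management is disabled")
     status = 1
 
 crls = glob.glob(os.path.join(CERT_DIR, "*.r0"))
 if not crls:
     print("WARNING: no CRLs found - has fetch-crl ever run?")
     status = 1
 else:
     newest = max(os.path.getmtime(p) for p in crls)
     if time.time() - newest > MAX_CRL_AGE:
         print("WARNING: newest CRL is older than 24h - fetch-crl may be failing")
         status = 1
 
 if status == 0:
     print("OK: node is Quattor-managed and CRLs are fresh")
 sys.exit(status)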
Completed Actions
- SdW to look into GC improvements - notify if a file is in an inconsistent state
- SdW to replace the canbemigr test - to test for files that have not been migrated to tape (warn at 8h / alert at 24h)
- SdW testing/working on the gfalcopy RPMs
- RA to look into procedural issues with CMS disk server interventions
- Someone to find out which access protocol MICE use. Answer: RFIO