RAL Tier1 weekly operations castor 31/08/2015
Operations News
- SL6 disk server deployment – successful
- Oracle prep for 11.2.0.4 on Neptune completed
- Draining – 7 servers left to drain out of ATLAS
- CMS disk read issues much improved.
- Changed the I/O scheduler on the cmsDisk nodes from 'cfq' to 'noop' (a sketch of the change is given after this list). We are now seeing performance roughly in line with other CMS Tier 1 sites.
- CPU efficiency is now 85% (previously 40%).
- Disk server draining on hold due to ATLAS being very full
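For reference, a minimal sketch of how the 'cfq' to 'noop' switch can be applied to a running node via sysfs. The device pattern and paths are assumptions for a standard SL5/SL6 kernel; this is illustrative rather than the exact procedure used on the cmsDisk nodes.
 #!/usr/bin/env python
 # Sketch: switch the I/O scheduler to 'noop' for every /dev/sd* device via sysfs.
 # Assumes the standard kernel sysfs layout; run as root and test on one node first.
 import glob
 
 TARGET = "noop"
 
 for sched_path in glob.glob("/sys/block/sd*/queue/scheduler"):
     with open(sched_path) as f:
         current = f.read().strip()   # e.g. "noop anticipatory deadline [cfq]"
     if "[%s]" % TARGET in current:
         continue                     # already using the target scheduler
     with open(sched_path, "w") as f:
         f.write(TARGET)              # the kernel switches scheduler immediately
     print("%s: %s -> %s" % (sched_path, current, TARGET))
Note that the sysfs setting does not survive a reboot; making it persistent (for example via the 'elevator=noop' kernel boot parameter or the SL6 build) is the subject of the change control under Existing Actions.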
Operations Problems
- The checksum issue and associated tickets are still present. Thought to be caused by tape recalls, disk-to-disk rebalancing, or draining; the source needs to be identified.
- External access problem – seems to have crept in over the period of the SL6 disk server deployment and the network interventions. Note there are now no iptables rules on the disk servers, including the ability to SSH directly to a disk server – BLOCK.
- AAA – the CMS xrootd redirection (DLF node) has not been working for some time. Shaun is working on it.
- Some of the Ganglia monitoring is not working on disk servers that have been updated to SL6.
- The ATLAS FAX test has been down since last Friday (xrdproxy).
- ATLAS is moving a large amount of data from atlasdisk to tape and slow writes to tape were experienced. The configuration was changed to allow writing to more tapes simultaneously, which has much improved matters.
- GGUS ticket from ATLAS – a file recalled from tape came back slightly larger than expected and with an incorrect checksum; it was also staged with the incorrect checksum. This needs further investigation.
- The gridmap file on the WebDAV host lcgcadm04.gridpp.rl.ac.uk is not auto-updating.
Blocking Issues
Planned, Scheduled and Cancelled Interventions
- Tape-backed node to SL6
- Pluto prep for 11.2.0.4
- Looking to schedule disk server SL6 deployments for next week:
  - plan to put the servers at risk all day (Wed/Thu)
  - all VOs / all CASTOR disks
  - action by server number, taking the highest number first (newest / biggest)
- GDSS720 / GDSS707 draining ahead of CPU / motherboard fixes
- Stress test the SRM, with possible deployment the week after (Shaun)
- Upgrade CASTOR disk servers to SL6
- One tape-backed node is being upgraded; more tape-backed nodes to follow next week.
- Oracle patching schedule planned (ending 13th October).
Advanced Planning
Tasks
- Proposed CASTOR face-to-face meeting, week commencing 5th or 12th October
- Discussed CASTOR 2017 planning; see the wiki page.
Interventions
Staffing
- Castor on-call person next week:
- RA (will need to confirm when Rob is taking over from Shaun this long weekend)
- Staff absence/out of the office:
- Bruno out
- Brian out
New Actions
- Tim to confirm how max number of tape drives is configured
Existing Actions
- GS to review tickets in an attempt to identify when checksum/partial file creation issues started
- SdW to submit a change control to make the 'cfq' to 'noop' I/O scheduler change persistent on cmsDisk and to carry it into the SL6 build
- RA / SdW to contact CERN Dev about checksum issue / partial file creation
- SdW to modify cleanlostfiles to log to syslog so we can track its use - under testing
- RA/JJ to look at information provider re DiRAC (reporting disk only etc)
- All to investigate why we are getting partial files
- BC to document processes to control services previously controlled by puppet
- GS to arrange a meeting between the CASTOR, Fabric and Production teams to discuss the decommissioning procedures
- GS to investigate providing checks for /etc/noquattor on production nodes and checks for fetch-crl (a sketch of one possible check follows this list) - ONGOING
- BD to chase AD about using the space reporting tool we made for him
- GS to investigate how/whether we need to declare xrootd endpoints in GOCDB / the BDII
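A minimal sketch of the kind of check proposed above; the 24-hour CRL age threshold and the Nagios-style exit codes are placeholders, not agreed values.
 #!/usr/bin/env python
 # Sketch: warn if /etc/noquattor is present on a production node, and warn if
 # fetch-crl has not refreshed any CRL (*.r0) under /etc/grid-security/certificates
 # within the last 24 hours. Exit codes follow the usual Nagios convention.
 import glob
 import os
 import sys
 import time
 
 MAX_CRL_AGE = 24 * 3600   # assumed threshold
 CERT_DIR = "/etc/grid-security/certificates"
 
 status = 0
 
 if os.path.exists("/etc/noquattor"):
     print("WARNING: /etc/noquattor present - Quattor management is disabled")
     status = 1
 
 crls = glob.glob(os.path.join(CERT_DIR, "*.r0"))
 if not crls:
     print("WARNING: no CRLs found - has fetch-crl ever run?")
     status = 1
 else:
     newest = max(os.path.getmtime(p) for p in crls)
     if time.time() - newest > MAX_CRL_AGE:
         print("WARNING: newest CRL is older than 24h - fetch-crl may be failing")
         status = 1
 
 if status == 0:
     print("OK: node is Quattor-managed and CRLs are fresh")
 sys.exit(status)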
Completed Actions
- SdW to look into GC improvements - notify if a file is in an inconsistent state
- SdW to replace the canbemigr test - to test for files that have not been migrated to tape (warn at 8h / alert at 24h)
- SdW testing/working on the gfalcopy RPMs
- RA to look into procedural issues with CMS disk server interventions
- Someone to find out which access protocol MICE use. Answer: RFIO