RAL Tier1 weekly operations castor 31/08/2015

Operations News

  • SL6 disk server deployment – successful
  • Oracle prep for 11.2.0.4 on Neptune completed
  • Draining – 7 servers left to drain out of Atlas
  • CMS disk read issues much improved.
    • Changed the I/O scheduler for the cmsDisk nodes from 'cfq' to 'noop' (a minimal sketch follows this list). We are now seeing performance roughly in line with other CMS Tier 1 sites.
    • CPU efficiency is now 85%, up from 40%.
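
A minimal sketch of the scheduler change is below, assuming SL6-era kernels, root access and plain /dev/sd* data disks (device names are illustrative); it is not the exact change made, and it does not by itself make the setting persistent across reboots (see the change-control action under Existing Actions).

  #!/usr/bin/env python
  """Minimal sketch: switch the I/O scheduler on data disks from 'cfq' to 'noop'.

  Assumptions: run as root on an SL6-era kernel and the data disks appear as
  /dev/sd* block devices. The change is lost on reboot unless made persistent
  (e.g. via the 'elevator=' kernel parameter or configuration management).
  """
  import glob

  def set_scheduler(target="noop"):
      for path in glob.glob("/sys/block/sd*/queue/scheduler"):
          with open(path) as f:
              current = f.read().strip()   # e.g. "noop anticipatory deadline [cfq]"
          if "[%s]" % target in current:
              continue                     # already using the target scheduler
          with open(path, "w") as f:
              f.write(target)              # the kernel validates the name
          print("%s: %s -> %s" % (path, current, target))

  if __name__ == "__main__":
      set_scheduler()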

Operations Problems

  • The checksum issue tickets are still present. These are thought to be caused by tape recalls, rebalancing (disk to disk) or draining; the source needs to be identified (see the checksum sketch after this list).
  • External access problem – this seems to have crept in over the period of the SL6 disk server deployment and the network interventions. Note that there are no iptables rules on the disk servers now, so it is possible to SSH directly to a disk server; this needs to be blocked.
  • AAA – the CMS xrootd redirection (DLF node) has not been working for some time. Shaun is working on it.
  • Some of the Ganglia monitoring is not working on disk servers updated to SL6.
  • The Atlas FAX test has been down since last Friday (xrdproxy).
  • Atlas is moving a large amount of data from atlasDisk to tape, and slow writes to tape were experienced. The configuration was changed to allow more tape drives to write simultaneously, which has much improved matters.
  • GGUS ticket from Atlas – a file recalled from tape came back slightly bigger than expected (with an incorrect checksum) and was also staged with an incorrect checksum; this needs further investigation.
  • The gridmap file on the WebDAV host lcgcadm04.gridpp.rl.ac.uk is not auto-updating.
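
A minimal checksum-verification sketch is below, for confirming whether a recalled or drained copy matches its expected checksum. It assumes the reference value is the 8-hex-digit adler32 string held in the catalogue; the file path and expected value passed on the command line are examples only.

  #!/usr/bin/env python
  """Minimal sketch: verify a file's adler32 checksum against an expected value."""
  import sys
  import zlib

  def adler32_of(path, blocksize=1024 * 1024):
      """Stream the file and return its adler32 as an 8-digit hex string."""
      value = 1                                # standard adler32 seed
      with open(path, "rb") as f:
          for block in iter(lambda: f.read(blocksize), b""):
              value = zlib.adler32(block, value)
      return "%08x" % (value & 0xffffffff)     # force unsigned (Python 2 may return negative)

  if __name__ == "__main__":
      path, expected = sys.argv[1], sys.argv[2].lower()
      actual = adler32_of(path)
      print("%s actual=%s expected=%s %s"
            % (path, actual, expected, "OK" if actual == expected else "MISMATCH"))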

Blocking Issues

Planned, Scheduled and Cancelled Interventions

  • Tape-backed node to SL6
  • Pluto prep for 11.2.0.4
  • Looking to schedule disk server SL6 deployments for next week
    • plan to put the service at risk all day (Wednesday/Thursday)
    • all VOs / all CASTOR disks
    • proceed by server number, taking the highest number first (newest / biggest)
  • GDSS720 / GDSS707 draining, then CPU / motherboard fixes
  • Stress test the SRM; possible deployment the week after (Shaun)
  • Upgrade CASTOR disk servers to SL6
    • One tape-backed node being upgraded, more tape-backed nodes next week.
  • Oracle patching schedule planned (ending 13th October).

Advanced Planning

Tasks

  • Proposed CASTOR face-to-face meeting in the week commencing 5th or 12th October
  • Discussed CASTOR 2017 planning, see wiki page.

Interventions

Staffing

  • Castor on Call person next week
    • RA (will need to confirm when Rob is taking over from Shaun this long weekend)
  • Staff absence/out of the office:
    • Bruno out
    • Brian out

New Actions

  • Tim to confirm how max number of tape drives is configured

Existing Actions

  • GS to review tickets in an attempt to identify when checksum/partial file creation issues started
  • SdW to submit a change control to make the 'cfq' to 'noop' I/O scheduler change persistent on cmsDisk and to carry it into the SL6 build
  • RA / SdW to contact CERN Dev about checksum issue / partial file creation
  • SdW to modify cleanlostfiles to log to syslog so we can track its use - under testing (a minimal sketch follows this list)
  • RA/JJ to look at information provider re DiRAC (reporting disk only etc)
  • All to investigate why we are getting partial files
  • BC to document processes to control services previously controlled by puppet
  • GS to arrange meeting castor/fab/production to discuss the decommissioning procedures
  • GS to investigate providing checks for /etc/noquattor on production nodes & checks for fetch-crl - ONGOING
  • BD to chase AD about using the space reporting tool we made for him
  • GS to investigate how/if we need to declare xrootd endpoints in GOCDB BDII
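
A minimal sketch of the cleanlostfiles syslog change is below; it assumes cleanlostfiles is a locally maintained Python script, and the syslog tag and message format shown are illustrative only.

  #!/usr/bin/env python
  """Minimal sketch: record each cleanlostfiles invocation in syslog so its use can be tracked."""
  import getpass
  import socket
  import sys
  import syslog

  def log_invocation():
      # Use a dedicated identity so the entries are easy to grep in /var/log/messages.
      syslog.openlog("cleanlostfiles", syslog.LOG_PID, syslog.LOG_USER)
      syslog.syslog(syslog.LOG_INFO,
                    "invoked by %s on %s with args: %s"
                    % (getpass.getuser(), socket.gethostname(), " ".join(sys.argv[1:])))

  if __name__ == "__main__":
      log_invocation()
      # ... existing cleanlostfiles logic would run here ...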

Completed actions

  • SdW to look into GC improvements - notify if file in inconsistent state
  • SdW to replace the canbemigr test - to test for files that have not been migrated to tape (warn 8h / alert 24h)
  • SdW testing / working on the gfalcopy RPMs
  • RA to look into procedural issues with CMS disk server interventions
  • Someone to find out what access protocol MICE use. Answer: RFIO