RAL Tier1 weekly operations castor 02/10/2015

Operations News

  • GDSS707 / 720 are back in production
  • WAN tuning on the preprod disk servers is running and not causing any problems; stats to come from Brian.
  • Alastair has reviewed the disk server commissioning/decommissioning plan and made some updates.

---

  • Oracle 11.2.0.4 upgrade on Neptune completed (Atlas/Gen) - currently running on R26 standby until swap back on 6th Oct
  • Draining of Atlas should be complete on Monday. A few more stuck transfers are being seen, but these could be related to the server type or the files involved.
  • GDSS 720 / 707 ready to be returned (following conversation with Fabric)
  • CMS disk read issues much improved.
    • Changed the I/O scheduler on the cmsDisk nodes from 'cfq' to 'noop' (a sketch of the change is included after this list). Performance is now roughly in line with other CMS Tier 1 sites.
    • CPU efficiency is now 85%, up from 40%.
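
As a rough illustration of the scheduler change (not the actual Quattor/fabric change, which is tracked as an action below), the sketch below reads and sets the active I/O scheduler through the standard Linux sysfs interface; the device name used is an assumption.

    #!/usr/bin/env python
    # Illustrative only: report the active I/O scheduler for a block device via
    # sysfs and, if it is 'cfq', switch it to 'noop' for the current boot.
    # Persistence needs an elevator= kernel argument or configuration management.
    import sys

    def current_scheduler(dev):
        # /sys/block/<dev>/queue/scheduler lists all schedulers, active one in [...]
        with open("/sys/block/%s/queue/scheduler" % dev) as f:
            line = f.read().strip()
        return line[line.index("[") + 1:line.index("]")], line

    def set_scheduler(dev, name):
        with open("/sys/block/%s/queue/scheduler" % dev, "w") as f:
            f.write(name + "\n")

    if __name__ == "__main__":
        dev = sys.argv[1] if len(sys.argv) > 1 else "sda"   # device name assumed
        active, available = current_scheduler(dev)
        print("%s: active=%s (%s)" % (dev, active, available))
        if active == "cfq":
            set_scheduler(dev, "noop")
            print("%s: switched cfq -> noop (until next reboot)" % dev)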

Operations Problems

  • More SRM DB duplicates. Shaun / Juan are possibly looking at making some schema changes to fix this. Action: Shaun to revisit W/C 12th Oct.
  • LHCb ticket re. a low level of failures, likely failures in nsmkdir. Look for timeouts in the name server logs to establish where the timeout occurs (see the sketch after this list). Mitigation may be to increase the thread count on the NS; be a little careful here, as increasing it greatly could exceed the connection limit to the DB (the number of NS daemons is quite high).
  • PutDone without a Put: there are currently 4 threads on the Stager for this type of request. Shaun suggests (with CERN agreement) increasing this to 12.
  • GDSS661 removed from CASTOR and handed back to Fabric to gather spares (CV11 spec); no further CASTOR action.
  • There was a callout for SRM03 as CRLs did not get updated. Action: Production team to check srm03 regarding the lack of CRL updates.
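
A minimal sketch of the kind of log check meant above, counting timeout-related lines per hour in a name server log; the default log path, timestamp layout and message wording are assumptions, not the actual CASTOR log format.

    #!/usr/bin/env python
    # Count timeout-related lines per hour in a name server log.
    # The default path, timestamp layout and message wording are assumptions.
    import re, sys
    from collections import Counter

    logfile = sys.argv[1] if len(sys.argv) > 1 else "/var/log/castor/nsd.log"
    per_hour = Counter()
    with open(logfile) as f:
        for line in f:
            if re.search(r"timed?[\s_-]*out", line, re.IGNORECASE):
                # assume an ISO-style timestamp at the start, e.g. 2015-10-02T14:05:31
                per_hour[line[:13]] += 1
    for hour, count in sorted(per_hour.items()):
        print("%s:00  %d timeout lines" % (hour, count))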

---

  • FTS: heavy load from Atlas / DiRAC; new Atlas work is being directed to the test instance to mitigate this.
  • Some SRM DB duplicates on CMS and Atlas.
  • The checksum issues/tickets are still present; thought to be due to tape recalls, rebalancing (disk to disk) or draining. The source of these needs to be identified. Tickets seem to have increased since June, but the increase has been steady, so it is difficult to tie it to any particular change.
  • AAA: CMS xroot redirection (on the DLF node) has not been working for some time. A possible fix is in progress: a dedicated redirector running only the name server and xroot manager.
  • Some of the Ganglia monitoring is not working on disk servers updated to SL6.
  • The Atlas FAX test has been down since last Friday (xrdproxy).
  • Atlas is moving a large amount of data from atlasDisk to tape and slow writes to tape were experienced. The configuration was changed to allow writing to more tapes simultaneously, which has much improved matters.
  • GGUS ticket from Atlas: a file recalled from tape came back slightly bigger than expected (with an incorrect checksum) and was also staged with an incorrect checksum; this needs further investigation (see the checksum sketch after this list).
  • The gridmap file on the WebDAV host lcgcadm04.gridpp.rl.ac.uk is not auto-updating.
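
For the checksum items above, a minimal sketch of re-verifying a staged or recalled copy: it computes the file's size and ADLER32 checksum and compares the checksum with an expected value (for example, the one recorded in the name server). The command-line interface is illustrative only.

    #!/usr/bin/env python
    # Compute the ADLER32 checksum and size of a file and compare the checksum
    # with an expected value (e.g. the one recorded in the name server).
    import os, sys, zlib

    def adler32_of(path, blocksize=1024 * 1024):
        value = 1  # standard adler32 seed
        with open(path, "rb") as f:
            block = f.read(blocksize)
            while block:
                value = zlib.adler32(block, value)
                block = f.read(blocksize)
        return value & 0xffffffff

    if __name__ == "__main__":
        path, expected = sys.argv[1], sys.argv[2].lower()
        actual = "%08x" % adler32_of(path)
        status = "OK" if actual == expected.zfill(8) else "MISMATCH"
        print("%s size=%d adler32=%s expected=%s %s"
              % (path, os.path.getsize(path), actual, expected, status))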

Blocking Issues

Planned, Scheduled and Cancelled Interventions

  • Pluto upgrade and Neptune/Pluto switchback planned for Tues/Thurs.


  • Neptune upgrade to 11.2.0.4 on Tuesday 15th. Downtime for Atlas and Gen; running without a standby for a week.
  • Tape backed node to SL6
  • Pluto prep for 11.2.0.4
  • Looking to schedule disk server SL6 deployments for next week
    • plan to put the service At Risk all day (Wed/Thurs)
    • all VOs / all CASTOR disk servers
    • work through by server number, taking the highest number first (newest / biggest)
  • GDSS 720 / 707 draining, then CPU / motherboard fixes
  • Stress-test the SRM; possible deployment the week after (Shaun)
  • Upgrade CASTOR disk servers to SL6
    • One tape-backed node being upgraded, more tape-backed nodes next week.
  • Oracle patching schedule planned (ending 13th October).

Advanced Planning

Tasks

  • Proposed CASTOR face-to-face meeting W/C Oct 5th or 12th
  • Discussed CASTOR 2017 planning, see wiki page.

Interventions

Staffing

  • Castor on Call person next week
    •  ??
  • Staff absence/out of the office (GridPP week):
    • Shaun All week
    • Brian Tues – Thursday
    • Chris – Friday TBC


New Actions

  • Shaun to revisit schema changes to improve the SRM DB duplicates situation, W/C 12th Oct.
  • Production team to check srm03 regarding the lack of CRL updates (see the sketch after this list).
  • Rob to schedule a change so that the 'noop' scheduler setting is included in Quattor and made persistent (cmsDisk).
  • Bruno to document the procedure for returning hardware to CASTOR.
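
A rough sketch of the sort of check the production team might run on srm03, assuming the usual layout where fetch-crl writes CRLs as *.r0 files under /etc/grid-security/certificates; the staleness threshold is an arbitrary choice.

    #!/usr/bin/env python
    # Flag CRL files (*.r0) that have not been refreshed recently, as a quick
    # indication that fetch-crl is not running; path and threshold are assumptions.
    import glob, os, time

    CRL_DIR = "/etc/grid-security/certificates"
    MAX_AGE_HOURS = 24.0

    now = time.time()
    crls = glob.glob(os.path.join(CRL_DIR, "*.r0"))
    stale = [((now - os.path.getmtime(p)) / 3600.0, p) for p in crls
             if (now - os.path.getmtime(p)) / 3600.0 > MAX_AGE_HOURS]
    for age_h, path in sorted(stale, reverse=True):
        print("STALE %6.1f h  %s" % (age_h, path))
    print("%d of %d CRL files older than %.0f hours"
          % (len(stale), len(crls), MAX_AGE_HOURS))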

Existing Actions

  • Tim to confirm how max number of tape drives is configured
  • GS to review tickets in an attempt to identify when checksum/partial file creation issues started
  • SdW to submit a change control to make the cmsDisk I/O scheduler change ('cfq' to 'noop') persistent on cmsDisk and in the SL6 build
  • RA / SdW to contact CERN Dev about checksum issue / partial file creation
  • SdW to modify cleanlostfiles to log to syslog so we can track its use (see the sketch after this list) - under testing
  • RA/JJ to look at information provider re DiRAC (reporting disk only etc)
  • All to investigate why we are getting partial files
  • BC to document processes to control services previously controlled by puppet
  • GS to arrange meeting castor/fab/production to discuss the decommissioning procedures
  • GS to investigate providing checks for /etc/noquattor on production nodes & checks for fetch-crl - ONGOING
  • BD to chase AD about using the space reporting tool we made for him
  • GS to investigate how/whether we need to declare xrootd endpoints in GOCDB / the BDII
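
For the cleanlostfiles syslog action above, a minimal sketch of how an invocation could be recorded to syslog so each use can be tracked; this is illustrative only and not the real cleanlostfiles script.

    #!/usr/bin/env python
    # Record each invocation (user, host, arguments) to syslog so use of the
    # tool can be tracked; illustrative only, not the real cleanlostfiles script.
    import getpass, socket, sys, syslog

    def log_invocation(tool="cleanlostfiles"):
        syslog.openlog(tool, syslog.LOG_PID, syslog.LOG_USER)
        syslog.syslog(syslog.LOG_INFO, "invoked by %s on %s with args: %s"
                      % (getpass.getuser(), socket.gethostname(),
                         " ".join(sys.argv[1:]) or "<none>"))

    if __name__ == "__main__":
        log_invocation()
        # ... the actual file-cleanup logic would follow here ...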

Completed actions

  • SdW to look into GC improvements: notify if a file is in an inconsistent state
  • SdW to replace the canbemigr test with a test for files that have not been migrated to tape (warn at 8h / alert at 24h)
  • SdW testing/working on the gfalcopy RPMs
  • RA to look into procedural issues with CMS disk server interventions
  • Someone to check what access protocol MICE use. Answer: RFIO.