RAL Tier1 weekly operations castor 01/02/2016

Operations News

  • Merging tape pools wiki created by Shaun
  • 2.1.15 name server tested
  • New SRM on vcert2
  • New SRM (SL6) with bug fixes available - needs testing
  • Gfal-cat command failing for ATLAS reading of nsdumps from CASTOR: https://ggus.eu/index.php?mode=ticket_info&ticket_id=117846. Developers are looking to fix this within: https://ggus.eu/index.php?mode=ticket_info&ticket_id=118842 (see the gfal2 sketch after this list)
  • LHCb batch jobs failing to copy results into CASTOR - changes made seem to have improved the situation but not fixed it (Raja). Increasing the number of connections to the NS database (more threads)
  • BD looking at porting persistent tests to Ceph
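
For reference on the gfal-cat issue above, here is a minimal sketch of reproducing the read through the gfal2 Python bindings rather than the gfal-cat CLI. The URL is a hypothetical placeholder, and the open/read call pattern assumes the standard gfal2 bindings are installed on the client node.

    # Minimal sketch: read the start of an nsdump file the way gfal-cat would,
    # via the gfal2 Python bindings. The URL below is a hypothetical placeholder.
    import sys
    import gfal2

    URL = "root://example-castor-host//castor/ads.rl.ac.uk/atlas/nsdump.txt"  # hypothetical

    def main():
        ctx = gfal2.creat_context()         # gfal2 context (loads protocol plugins)
        try:
            f = ctx.open(URL, "r")          # open the remote file read-only
            sys.stdout.write(f.read(4096))  # read the first 4 kB, as gfal-cat streams it
        except gfal2.GError as err:         # gfal2 raises GError on protocol/transfer errors
            sys.stderr.write("gfal2 read failed: %s\n" % err)
            sys.exit(1)

    if __name__ == "__main__":
        main()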

Operations Problems

  • atlasScratchDisk ... gdss667 suffered a triple disk failure; most data lost. Two callouts: disk servers down and the limit on xroot connections was hit - SAM tests failing.
  • gdss620 - Gen tape, still with Fabric
  • CV '11 generation RAID card controller firmware update required - Western Digital drives failing at a high rate
  • Fac tape drive broken - Tim has taken it out of the tape pool

Blocking Issues

Planned, Scheduled and Cancelled Interventions

  • Merge all CASTOR tape-backed pools into one
  • 11.2.0.4 Oracle client updates (running in preprod) - possible change control for production
  • WAN tuning proposal - possibly put into change control (Brian)
  • CASTOR 2.1.15
  • Upgrade of Oracle clients to 11.2.0.4 - planning for production rollout, Gen first


Long-term projects

  • RA has produced a Python script to handle the SRM DB duplication issue which is causing callouts. Running the script is a problem because the Python version on the SRM servers is still 2.4; however, RA will pursue this. SdW has reviewed the script and is confident that it is low risk. (See the sketch after this list.)
  • JJ – GLUE 2 for CASTOR, related to publishing information; not sure there was a specific action associated with this
  • JS – replacing Tier1 CASTOR DB hardware. ACTION: RA/JS to discuss disk requirements
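
As a rough illustration of the kind of cleanup the duplication script performs, below is a minimal sketch kept to Python 2.4-compatible syntax (no with-statements), since that is the interpreter available on the SRM servers. The table and column names are hypothetical placeholders rather than the real SRM schema, and cx_Oracle is assumed as the database driver.

    # Rough sketch of duplicate-row cleanup against an Oracle SRM database,
    # restricted to Python 2.4-compatible syntax (no 'with' statements).
    # Table and column names are hypothetical placeholders, not the real schema.
    import cx_Oracle

    def remove_duplicates(dsn, user, password):
        conn = cx_Oracle.connect(user, password, dsn)
        cur = conn.cursor()
        try:
            # Keep the lowest-id row per request token and delete the rest.
            cur.execute("""
                DELETE FROM srm_requests
                 WHERE id NOT IN (SELECT MIN(id)
                                    FROM srm_requests
                                   GROUP BY request_token)""")
            conn.commit()
            return cur.rowcount
        finally:
            cur.close()
            conn.close()

    if __name__ == "__main__":
        print "removed %d duplicate rows" % remove_duplicates(
            "srmdb_alias", "srm_admin", "changeme")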


Advanced Planning

Tasks

  • CASTOR 2.1.15 implementation and testing

Interventions

  • Remaining D0T1 disk servers

Staffing

  • CASTOR on-call person next week
    • Rob
  • Staff absence/out of the office:
    • Brian out Tues 26th - Thurs 4th
    • Shaun out 1st Feb for a week, then back in until 1st March

New Actions

  • Rob to double-check gdss667 has had its firmware updated
  • Chris to document the Corbin method for a disk pool having issues: restart the xroot daemons first, then try restarting diskmanagerd, and finally use the 'nuclear' method of transfermanager stop - diskmanagerd restart - transfermanager start (see the sketch after this list)
  • Rob to send Shaun’s merging tape pools wiki to CERN for review
  • Rob and Shaun to review Alison’s wiki page, docs from castor handover and update castor procedures
  • Rob to send email to Kevin O'N re any issues with Fac tape config change level 2
  • Tim to apply Fac tape config change level 1 - i.e. a DLS read-only tape on Monday 1st
  • Gareth to share the reporting available for Fac tape
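
As a starting point for documenting the Corbin method above, here is a hedged sketch of the escalation as a small wrapper script. The exact service names and the use of the SysV 'service' command are assumptions about how the daemons are managed; escalate to the next step only if the pool is still misbehaving after the previous one.

    # Sketch of the 'Corbin method' escalation for a misbehaving disk pool:
    #   1) restart the xroot daemons,
    #   2) restart diskmanagerd,
    #   3) 'nuclear' option: stop transfermanager, restart diskmanagerd,
    #      start transfermanager.
    # Service names and the SysV 'service' command are assumptions here.
    import subprocess

    def service(name, action):
        """Run 'service <name> <action>' and return True if it exited cleanly."""
        return subprocess.call(["service", name, action]) == 0

    def step1_restart_xroot():
        service("xrootd", "restart")

    def step2_restart_diskmanager():
        service("diskmanagerd", "restart")

    def step3_nuclear():
        # Only reach for this if steps 1 and 2 did not clear the problem.
        service("transfermanagerd", "stop")
        service("diskmanagerd", "restart")
        service("transfermanagerd", "start")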


Existing Actions

  • Juan to discuss with Shaun what to do with the workaround applied to the "subrequesttodo" procedure in CASTOR 2.1.15.
  • Rob to create castor 14 gen build (for Ceph)
  • Shaun to present the plan to merge the tape-backed pools into one on 29th Jan
  • Rob to look at DLS / CEDA data rates in relation
  • BD re. WAN tuning proposal - discuss with GS, does it need a change control?
  • RA to try stopping tapeserverd mid-migration to see if it breaks.
  • RA (was SdW) to modify cleanlostfiles to log to syslog so we can track its use - under testing (see the syslog sketch after this list)
  • GS to investigate providing checks for /etc/noquattor on production nodes & checks for
  • GS to investigate how/if we need to declare xrootd endpoints in GOCDB / BDII - in progress
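
On the cleanlostfiles action above, here is a minimal sketch of the kind of syslog hook being tested, using the standard-library syslog module; the ident string and message format are assumptions, not the actual modification under test.

    # Minimal sketch: record each cleanlostfiles invocation in syslog so its
    # use can be tracked. Ident string and message format are assumptions.
    import os
    import sys
    import syslog

    def log_invocation():
        syslog.openlog("cleanlostfiles", syslog.LOG_PID, syslog.LOG_DAEMON)
        syslog.syslog(syslog.LOG_INFO,
                      "invoked by uid=%d args=%s" % (os.getuid(),
                                                     " ".join(sys.argv[1:])))
        syslog.closelog()

    if __name__ == "__main__":
        log_invocation()
        # ... existing cleanlostfiles logic would follow here ...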