RAL Tier1 weekly operations castor 08/02/2016

Operations News

  • The DiRAC VO, with people from Leicester, is coming online.
  • CASTOR 2.1.15 had its first airing at change control; 2.1.15 is currently not working for us.
  • New tape-backed disk servers for the Tier1, to replace the CV '11 generation; a recommendation has been made to Martin.
  • Repack upgrade from 2.1.14 to 2.1.15.
  • Wiki page on merging tape pools created by Shaun.
  • The 2.1.15 name server has been tested.
  • New SRM on vcert2.
  • New SRM (SL6) with bug fixes is available; needs testing.
  • The gfal-cat command is failing for ATLAS when reading nsdumps from CASTOR: https://ggus.eu/index.php?mode=ticket_info&ticket_id=117846. The developers are looking to fix this within: https://ggus.eu/index.php?mode=ticket_info&ticket_id=118842 (a read sketch follows this list).
  • LHCb batch jobs are failing to copy results into CASTOR; the changes made seem to have improved the situation but have not fixed it (Raja). The number of connections to the name server database is being increased (more threads).
  • BD is looking at porting the persistent tests to Ceph.
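
The gfal-cat failure above can be checked independently of the command-line tool through the gfal2 Python bindings. A minimal read sketch, assuming the gfal2 bindings are installed and using a hypothetical SURL (the real endpoint and nsdump path will differ):

    import gfal2

    ctx = gfal2.creat_context()
    # Hypothetical SURL -- the real RAL endpoint and nsdump path differ.
    url = "srm://srm-atlas.gridpp.rl.ac.uk/castor/ads.rl.ac.uk/prod/nsdump"
    f = ctx.open(url, "r")
    data = f.read(1048576)            # 1 MiB chunks, much as gfal-cat reads
    while data:
        # process or print the chunk here
        data = f.read(1048576)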

Operations Problems

  • Ongoing AAA issues in CMS.
  • Tape loss: a corrupt ATLAS MC tape. Tim recovered some files; Alastair reports it holds 2012 Monte Carlo data. Tim is sending the tape off for analysis.
  • gdss667 is back in production.
  • 20 GB/s out of atlasstripinput from 2000 jobs; a lot of the connections were RFIO (a sketch for counting these follows this list).
  • CV '11 generation RAID card controller firmware update required; Western Digital drives are failing at a high rate.
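
One quick way to see where the RFIO traffic originates is to count established connections to rfiod per remote host, straight from /proc/net/tcp on a disk server. A sketch in Python 2 (as shipped on the SL6 nodes), assuming rfiod is listening on its conventional port 5001 (check the local configuration):

    import socket
    import struct

    # rfiod's conventional port; an assumption -- verify on the disk servers.
    RFIO_PORT = 5001

    def remote_peers(port):
        """Count ESTABLISHED TCP connections to `port` by remote IP."""
        counts = {}
        f = open("/proc/net/tcp")
        try:
            f.readline()                          # skip the header row
            for line in f:
                fields = line.split()
                local, remote, state = fields[1], fields[2], fields[3]
                if state != "01":                 # 01 == ESTABLISHED
                    continue
                if int(local.split(":")[1], 16) != port:
                    continue
                # IPv4 address is little-endian hex here (on x86)
                raw = struct.pack("<I", int(remote.split(":")[0], 16))
                ip = socket.inet_ntoa(raw)
                counts[ip] = counts.get(ip, 0) + 1
        finally:
            f.close()
        return counts

    for ip, n in sorted(remote_peers(RFIO_PORT).items()):
        print "%-15s %d connections" % (ip, n)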

Blocking Issues

Planned, Scheduled and Cancelled Interventions

  • Merge all CASTOR tape-backed pools into one.
  • Upgrade of the Oracle clients to 11.2.0.4 (already running in preprod): planning the production rollout, Gen instance first, with a possible change control for production.
  • WAN tuning proposal; possibly to be put through change control (Brian).
  • CASTOR 2.1.15.


Long-term projects

  • RA has produced a Python script to handle the SRM database duplication issue that is causing callouts. Running the script is awkward because the Python version on the SRM servers is still 2.4, but RA will pursue this (a 2.4-compatible sketch follows this list). SdW has reviewed the script and is confident that it is low risk.
  • JJ – GLUE 2 for CASTOR, relating to publishing information; it is not clear that a specific action is associated with this.
  • JS – replacing the Tier1 CASTOR database hardware. ACTION: RA/JS to discuss disk requirements.
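
For reference, the 2.4 constraint mainly rules out later conveniences (the with statement, conditional expressions, collections.defaultdict), not the task itself. A sketch of the duplicate-detection step that stays within Python 2.4, assuming cx_Oracle and entirely hypothetical connection details and table/column names (the real SRM schema will differ):

    import cx_Oracle

    # Hypothetical DSN and schema -- the real SRM database details differ.
    conn = cx_Oracle.connect("srm_user/secret@srm_db")
    cur = conn.cursor()
    try:
        # SURLs recorded more than once: the duplication behind the callouts.
        cur.execute("""
            SELECT surl, COUNT(*)
              FROM srm_requests          -- hypothetical table name
             GROUP BY surl
            HAVING COUNT(*) > 1
        """)
        for surl, n in cur.fetchall():
            # Plain 2.4 constructs only: no 'with', no conditional expressions
            print "%s appears %d times" % (surl, n)
    finally:
        cur.close()
        conn.close()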


Advanced Planning

Tasks

  • CASTOR 2.1.15 implementation and testing

Interventions

  • Remaining D0T1 disk servers

Staffing

  • CASTOR on-call person next week:
    • Rob
  • Staff absence/out of the office:
    • Shaun is in until 1st March.

New Actions

  • BD to understand where the RFIO connections from ATLAS are coming from.
  • BD to track down current RFIO usage within CASTOR and plan a migration away from it.
  • GS to arrange a meeting to discuss the remaining actions on the CV '11 and '12 generations (when Kashif is back).
  • BD to clarify whether separating the DiRAC data is a necessity.
  • RA to take vcert to 2.1.14, install the new SRM on the vcert SRM node, and perform functional tests (see the sketch after this list).
  • BD to ensure the ATLAS consistency check is Quattorised.
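
A functional test of the new SRM on vcert could be a simple put/stat/delete round trip via the gfal2 Python bindings. A rough sketch, with a hypothetical endpoint and path (not the real vcert hostname):

    import gfal2

    ctx = gfal2.creat_context()
    # Hypothetical endpoint and path -- not the real vcert hostname.
    surl = ("srm://vcert-srm.gridpp.rl.ac.uk"
            "/castor/ads.rl.ac.uk/test/srm-smoke-test")

    # Upload a local file, stat it on the SRM, then clean up.
    params = ctx.transfer_parameters()
    ctx.filecopy(params, "file:///tmp/srm-smoke-test", surl)
    info = ctx.stat(surl)
    print "size reported by SRM: %d bytes" % info.st_size
    ctx.unlink(surl)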

Existing Actions

  • Rob to send Shaun’s merging tape pools wiki page to CERN for review.
  • Rob and Shaun to review Alison’s wiki page and the docs from the CASTOR handover, and update the CASTOR procedures.
  • Rob to deploy a '14 generation into preprod.
  • Shaun to present the plan to merge the tape-backed pools into one on 29th January.
  • Rob to look at DLS / CEDA data rates.
  • BD, re. the WAN tuning proposal: discuss with GS whether it needs a change control.
  • RA to try stopping tapeserverd mid-migration to see if it breaks.
  • RA (was SdW) to modify cleanlostfiles to log to syslog so we can track its use; under testing (see the sketch after this list).
  • GS to investigate how/whether we need to declare xrootd endpoints in GOCDB/BDII; in progress.
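
The cleanlostfiles change amounts to tagging each invocation in syslog. If done from Python, the standard syslog module is sufficient; a minimal sketch, with hypothetical message content:

    import os
    import sys
    import syslog

    # Tag each invocation so usage can be tracked in /var/log/messages.
    syslog.openlog("cleanlostfiles", syslog.LOG_PID, syslog.LOG_DAEMON)
    syslog.syslog(syslog.LOG_INFO,
                  "invoked by uid %d with args: %s"
                  % (os.getuid(), " ".join(sys.argv[1:])))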