RAL Tier1 weekly operations castor 15/02/2019

Standing agenda

1. Problems encountered this week

2. Upgrades/improvements made this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

5. Special topics

6. Actions

7. Review Fabric tasks

  1.   Link

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Operation problems

  • Last week's tape mishap was explained as a side effect of an engineer fixing a minor robot problem; it was resolved by a power cycle.
  • Problem following deployment of the new diamondRecall disk servers: the xrootd keys were left with incorrect ownership.
    • The deployment procedure will be updated to include a check for this (a sketch is below).
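
A minimal sketch of the kind of ownership check that could be added to the deployment procedure. The key path and the expected owner/group below are assumptions and should be replaced with whatever the local xrootd configuration actually uses.

  #!/usr/bin/env python
  # Sketch: verify ownership of the xrootd key file on a disk server.
  # KEY_PATH and EXPECTED_OWNER/EXPECTED_GROUP are assumptions; adjust them
  # to the values used by the local CASTOR/xrootd configuration.
  import grp
  import os
  import pwd
  import sys

  KEY_PATH = "/etc/xrd.sss"      # hypothetical key location
  EXPECTED_OWNER = "xrootd"      # hypothetical expected owner
  EXPECTED_GROUP = "xrootd"      # hypothetical expected group

  def check_key_ownership(path, owner, group):
      """Return True if the key file exists and is owned as expected."""
      try:
          st = os.stat(path)
      except OSError as err:
          print("FAIL: cannot stat %s: %s" % (path, err))
          return False
      actual_owner = pwd.getpwuid(st.st_uid).pw_name
      actual_group = grp.getgrgid(st.st_gid).gr_name
      if (actual_owner, actual_group) != (owner, group):
          print("FAIL: %s is owned by %s:%s, expected %s:%s"
                % (path, actual_owner, actual_group, owner, group))
          return False
      print("OK: %s is owned by %s:%s" % (path, owner, group))
      return True

  if __name__ == "__main__":
      sys.exit(0 if check_key_ownership(KEY_PATH, EXPECTED_OWNER, EXPECTED_GROUP) else 1)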

Operation news

  • New facd0t1 disk servers
    • Ready to go into production from a CASTOR perspective
    • Decommissioning tickets created for all machines occupying the rack (rack 214, row 2) in which these disk servers will be placed
    • Slow progress being made
    • Final deployment is waiting on the Fabric team to remove the decommissioned machines from the machine room.
  • Migrated ALICE to WLCGTape.
    • Argo tests are failing for srm-alice since the upgrade.
    • This has exposed that all of these tests point at the same thing.
    • Action on RA - contact GS to discuss what Argo tests are appropriate.
  • Tape accounting has been updated to give VO-specific readings of recall efficiency (see the sketch after this list).
  • Facilities tape busy due to an urgent 600TB recall request from CEDA.
  • Decommissioned some 11 generation hardware from preprodTape.
  • 11 d1t0 LHCb disk servers need to be physically moved around the machine room; agreed outline plan with Fabric, intervention date TBD, ~1 day downtime on lhcbDst expected
    • Also some d0t1 disk servers, no downtime needed.
    • We will do a kernel roll at the same time for the entire instance.
  • Started decommissioning genTape.
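
A minimal sketch of the per-VO recall-efficiency reading mentioned in the tape accounting item above. The input format (a CSV with vo, files_recalled and tape_mounts columns) is a hypothetical stand-in for whatever the accounting system actually records, and "files recalled per tape mount" is only one possible efficiency metric.

  # Sketch: aggregate recall efficiency per VO from an accounting dump.
  # The CSV columns are assumed, not the real tape accounting schema.
  import csv
  from collections import defaultdict

  def recall_efficiency(path):
      """Return {vo: files recalled per tape mount} from an accounting CSV."""
      files = defaultdict(int)
      mounts = defaultdict(int)
      with open(path) as fh:
          for row in csv.DictReader(fh):
              files[row["vo"]] += int(row["files_recalled"])
              mounts[row["vo"]] += int(row["tape_mounts"])
      return {vo: files[vo] / float(mounts[vo]) for vo in files if mounts[vo]}

  if __name__ == "__main__":
      for vo, eff in sorted(recall_efficiency("recalls.csv").items()):
          print("%-10s %.1f files per mount" % (vo, eff))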

Plans for next few weeks

  • Examine further standardisation of CASTOR pool settings.
    • CASTOR team to generate a list of nonstandard settings and consider whether they are justified (see the sketch after this list).
  • Replacement of Facilities CASTOR d0t1 ingest nodes
    • Now ready to deploy once Fabric team are ready.
  • Outage for LHCb disk server rack change Tuesday 19th.
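
A minimal sketch of what the comparison of pool settings against a standard baseline could look like. The setting names and baseline values are illustrative assumptions rather than the actual CASTOR defaults; the real list would be generated from the live pool configuration.

  # Sketch: report pool settings that deviate from an agreed standard.
  STANDARD = {"gcpolicy": "lru", "maxreplicanb": 1, "migrationpolicy": "default"}

  POOLS = {
      "lhcbDst":   {"gcpolicy": "lru", "maxreplicanb": 2, "migrationpolicy": "default"},
      "aliceDisk": {"gcpolicy": "lru", "maxreplicanb": 1, "migrationpolicy": "default"},
  }

  def nonstandard_settings(pools, standard):
      """Yield (pool, setting, value, expected) for every deviation."""
      for pool, settings in sorted(pools.items()):
          for key, expected in sorted(standard.items()):
              value = settings.get(key)
              if value != expected:
                  yield pool, key, value, expected

  if __name__ == "__main__":
      for pool, key, value, expected in nonstandard_settings(POOLS, STANDARD):
          print("%s: %s = %r (standard is %r)" % (pool, key, value, expected))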

Long-term projects

  • New CASTOR WLCGTape instance.
    • ALICE now using wlcgTape
    • Still need to move LHCb
  • CASTOR disk server migration to Aquilon.
    • Plan to apply the new Aquilon disk server profile and see what happens.
      • Negotiating some compilation errors.
    • The Quattor templates have been reviewed; feedback is being addressed.
  • Need to discuss Ganglia monitoring/metrics for Aquilonised disk servers. GP to try, follow up with JK.
    • No progress made.
    • One of the test InfluxDB servers (05) can be used for CASTOR InfluxDB development.
  • Various bits of monitoring/stats are broken for wlcgTape, investigate further.
    • We need a clear statement from stakeholders of what they need and how we fall short of that.
    • Some scripts need looking at.
  • Deadline of end of April to get Facilities moved to generic VM headnodes and 2.1.17 tape servers.
    • Ticket with Fabric team to make the VMs.
  • RA is working with James to sort out the gridmap-file distribution infrastructure and to get a machine with a better name for this than castor-function-test1.

Actions

  • AD wants us to make sure that experiments cannot write to the parts of the namespace that were used for d1t0 data: namespace cleanup/deletion of empty directories.
    • Some discussion about what exactly is required and how it can actually be implemented.
    • The CASTOR team's proposal is to switch all of these directories to a fileclass that requires a tape copy but has no migration route; this will cause an error whenever a write is attempted. A sketch of the empty-directory cleanup side is below.
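
A minimal sketch of the empty-directory side of the cleanup (it does not cover the fileclass change itself). It assumes a recursive listing of the old d1t0 area is available as two lists of absolute paths, one of directories and one of files; the paths shown are purely illustrative.

  # Sketch: find directories with no files anywhere beneath them.
  import posixpath

  def empty_directories(dir_paths, file_paths):
      """Return directories that contain no files, directly or indirectly."""
      non_empty = set()
      for f in file_paths:
          d = posixpath.dirname(f)
          while d and d not in non_empty:
              non_empty.add(d)
              d = posixpath.dirname(d)
      return sorted(d for d in dir_paths if d not in non_empty)

  if __name__ == "__main__":
      # Toy example; in practice the lists would come from a namespace dump.
      dirs = ["/castor/vo/old_d1t0", "/castor/vo/old_d1t0/run1", "/castor/vo/old_d1t0/run2"]
      files = ["/castor/vo/old_d1t0/run1/data.root"]
      for d in empty_directories(dirs, files):
          print(d)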

Staffing

  • CP out Thursday/Friday.

AoB

  • Nodes with names consistent with CASTOR Facilities disk servers are confusing because they often aren't CASTOR Facilities disk servers.
    • Legacy of purchasing decisions, names will not change.

On Call

RA on call