RAL Tier1 weekly operations castor 02/08/2019

From GridPP Wiki
Revision as of 10:02, 2 August 2019 by Rob Appleyard 7f7797b74a (Talk | contribs)

Standing agenda

1. Achievements this week

2. Problems encountered this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

5. Special topics

6. Actions

7. Review Fabric tasks

  1.   Link

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Achievements this week

  • Decommissioned the LHCb CASTOR instance.
  • Upgraded xrootd on the ALICE CASTOR xrootd redirector to version 4.10.0-1.
    • We are now members of the virtuous few.
  • Comparative testing of SL6 and SL7 disk servers using IOZONE completed
    • The Dell hardware did better.
    • Results suggested that OCF14 performance was significantly worse under SL7.
    • This is explained by the OCF14 server we tested missing many of its drives, which made it unrepresentative.
    • AD feels that while this test was inconclusive, there isn't any point doing more due to OCF14 being decommissioned shortly.
    • OCF14 to be left on SL6 until the latest Dell hardware is deployed.
  • New robot testing complete!!!!!
    • Next step: complete report.

Operation problems

  • Bellona had some ongoing hardware problems.
    • Intervention expected next Monday for further investigation.

Plans for next few weeks

  • Upgrade to new Facilities headnodes
    • Testing complete
    • Kevin's xrootd error was found to be due to a poorly-performing disk server
  • Sorting out xrootd functional test
    • Plan to create and destroy the robot proxy every time we run the test.
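
The create-and-destroy-per-run plan for the robot proxy could be sketched roughly as below. This is only an illustration, not the actual functional test: `fresh_proxy`, `run_functional_test`, and the `create`/`destroy` hooks are hypothetical names, and the real commands used to generate and invalidate the proxy are deliberately left to the caller.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def fresh_proxy(create, destroy):
    """Yield a path holding a newly created proxy; always clean it up after."""
    # Reserve a private temporary path for this run's proxy file.
    fd, path = tempfile.mkstemp(prefix="robot_proxy_")
    os.close(fd)
    try:
        create(path)  # hypothetical hook: write a new proxy to `path`
        yield path
    finally:
        try:
            destroy(path)  # hypothetical hook: invalidate the proxy
        finally:
            # Remove the file even if the destroy hook failed or left it behind.
            if os.path.exists(path):
                os.remove(path)


def run_functional_test(test, create, destroy):
    """Run one test with its own short-lived proxy and return the test result."""
    with fresh_proxy(create, destroy) as proxy_path:
        return test(proxy_path)
```

Because the cleanup sits in a `finally` block, the proxy is destroyed even when the test itself raises, so no stale proxy carries over between runs.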

Long-term projects

  • New CASTOR disk servers currently with Martin.
  • Migration of name server to VMs on 2.1.17-xx is waiting until aliceDisk is decommissioned.
  • CASTOR disk server migration to Aquilon.
    • Testing complete, schedule once new hardware deployed.
  • Facilities headnode replacement:
    • SL7 VM headnodes are being tested
  • Turn VCert into a facilities test instance.
  • Migrate CASTOR to Telegraf/Influx/Grafana (aka TIG)

Actions

  • AD wants us to make sure that experiments cannot write to the part of the namespace that was used for d1t0 data: namespace cleanup/deletion of empty dirs.
    • AD still wants to delete all the excess directories but is happy to do the migration route fix in the interim.

Staffing

  • Everybody in

AoB

  • Planning for Monday intervention.
    • DT is from 10 till 4.
    • Outline plan: Diamond/CEDA ingest down 9:00, no more recalls at 9:00, then clear CASTOR migration queue.

On Call

GP on Call