RAL Tier1 weekly operations castor 02/08/2019
From GridPP Wiki
Revision as of 10:02, 2 August 2019 by Rob Appleyard
Standing agenda
1. Achievements this week
2. Problems encountered this week
3. What are we planning to do next week?
4. Long-term project updates (if not already covered)
5. Special topics
6. Actions
7. Review Fabric tasks
1. Link
8. AoTechnicalB
9. Availability for next week
10. On-Call
11. AoOtherB
Achievements this week
- Decommissioned the LHCb CASTOR instance.
- Upgraded xrootd on the ALICE CASTOR xrootd redirector to version 4.10.0-1.
- We are now members of the virtuous few.
- Comparative testing of SL6 and SL7 disk servers using IOZONE completed
- The Dell hardware did better.
- Results suggested that the OCF14 performance was significantly worse under SL7
- Explained by the fact that the OCF14 we used was missing many of its drives and so was unrepresentative.
- AD feels that while this test was inconclusive, there isn't any point doing more due to OCF14 being decommissioned shortly.
- OCF14 to be left on SL6 until the latest Dell hardware is deployed.
- New robot testing complete!!!!!
- Next step: complete report.
Operation problems
- Bellona had some ongoing hardware problems.
- Intervention expected next Monday for further investigation.
Plans for next few weeks
- Upgrade to new Facilities headnodes
- Testing complete
- Kevin's xrootd error was found to be due to a poorly-performing disk server
- Sorting out xrootd functional test
- Plan to create and destroy the robot proxy every time we run the test.
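The create-and-destroy plan for the robot proxy could be sketched roughly as below. This is only an illustration, not the actual functional test: `fresh_proxy`, `run_functional_test`, and the `create`, `destroy` and `test` callables are all hypothetical names, and how the proxy is really created or destroyed is left to whatever the callables do.

```python
from contextlib import contextmanager


@contextmanager
def fresh_proxy(create, destroy):
    """Create a proxy, hand it to the caller, and always destroy it afterwards.

    `create` and `destroy` are hypothetical callables supplied by the caller;
    this sketch does not assume any particular proxy tool.
    """
    proxy = create()
    try:
        yield proxy
    finally:
        # Runs even if the test raises, so no stale proxy is left behind.
        destroy(proxy)


def run_functional_test(create, destroy, test):
    """Run `test` with a proxy that exists only for the duration of this run."""
    with fresh_proxy(create, destroy) as proxy:
        return test(proxy)
```

Because the destroy step sits in a `finally` clause, a failed test run still cleans up its proxy, and the next run starts from a fresh one.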
Long-term projects
- New CASTOR disk servers currently with Martin.
- Migration of name server to VMs on 2.1.17-xx is waiting until aliceDisk is decommissioned.
- CASTOR disk server migration to Aquilon.
- Testing complete, schedule once new hardware deployed.
- Facilities headnode replacement:
- SL7 VM headnodes are being tested
- Turn VCert into a facilities test instance.
- Migrate CASTOR to Telegraf/Influx/Grafana (aka TIG)
Actions
- AD wants us to make sure that experiments cannot write to the part of the namespace that was used for d1t0 data: namespace cleanup/deletion of empty dirs.
- AD still wants to delete all the excess directories but is happy to do the migration route fix in the interim.
Staffing
- Everybody in
AoB
- Planning for Monday intervention.
- DT is from 10 till 4.
- Outline plan: Diamond/CEDA ingest down 9:00, no more recalls at 9:00, then clear CASTOR migration queue.
On Call
GP on Call