RAL Tier1 weekly operations castor 26/07/2019
[https://www.gridpp.ac.uk/wiki/RAL_Tier1_weekly_operations_castor Parent article]
Latest revision as of 10:06, 26 July 2019
Standing agenda
1. Achievements this week
2. Problems encountered this week
3. What are we planning to do next week?
4. Long-term project updates (if not already covered)
5. Special topics
6. Actions
7. Review Fabric tasks
1. Link
8. AoTechnicalB
9. Availability for next week
10. On-Call
11. AoOtherB
Achievements this week
- Decommissioned all the lhcbDst disk servers and main headnodes
- SRMs to be done shortly
- Kernel upgrade on Facilities disk servers.
- Comparative testing of SL6 and SL7 disk servers using IOZONE ongoing
- Test complete for OCF, ongoing for Dell.
- New robot testing: BC ready to do the 'mixed' test.
Operation problems
- Bellona's hardware problems continued from last week.
- An intervention was attempted on Thursday, which triggered additional problems.
- The cause was a combination of hardware issues, mostly a failed array controller.
- This led to CASTOR downtime on Thursday (11-2).
- Still running with one controller; a replacement is expected Monday or Tuesday.
Plans for next few weeks
- Upgrade to new Facilities headnodes
- Final ET test showed a few errors, which need to be checked.
- Pencilled in for Thursday
- Minimum non-CASTOR staff needed for the intervention: Brian, Kevin.
- Kevin found an xrootd error that needs to be checked out.
- Sorting out xrootd functional test
- Plan to create and destroy the robot proxy every time we run the test.
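The create-and-destroy plan for the robot proxy in the xrootd functional test could be sketched roughly as follows. This is only an illustration under stated assumptions: the minutes do not say which tooling creates or destroys the proxy, so `create_cmd`, `destroy_cmd` and `test_cmd` are hypothetical argument lists to be filled in with the site's own commands, and the function names are made up for this example.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def fresh_proxy(create_cmd, destroy_cmd, run=subprocess.run):
    """Create a proxy before the test and destroy it afterwards.

    create_cmd and destroy_cmd are hypothetical argument lists; the
    actual proxy tooling is not named in the minutes.
    """
    # Fail early if the proxy cannot be created.
    run(create_cmd, check=True)
    try:
        yield
    finally:
        # Always attempt cleanup, even if the test raised an exception.
        run(destroy_cmd, check=False)


def run_functional_test(test_cmd, create_cmd, destroy_cmd, run=subprocess.run):
    """Run one functional test with a proxy that exists only for that run."""
    with fresh_proxy(create_cmd, destroy_cmd, run=run):
        return run(test_cmd, check=False).returncode
```

Wrapping the test in a context manager means no proxy is left behind between runs, and a stale or expired proxy from a previous run cannot affect the result.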
Long-term projects
- New CASTOR disk servers currently with Martin.
- Migration of name server to VMs on 2.1.17-xx is waiting until aliceDisk is decommissioned.
- CASTOR disk server migration to Aquilon.
- Agreed a testing plan with Fabric
- Facilities headnode replacement:
- SL7 VM headnodes are being tested
- Turn VCert into a facilities test instance.
- Migrate CASTOR to Telegraf/Influx/Grafana (aka TIG)
Actions
- AD wants us to make sure that experiments cannot write to the part of the namespace that was used for d1t0 data: namespace cleanup/deletion of empty dirs.
- AD still wants to delete all the excess directories, but is happy to do the migration route fix in the interim.
Staffing
- Everybody in
AoB
On Call
RA on Call