RAL Tier1 weekly operations castor 12/07/2019
From GridPP Wiki
Revision as of 10:39, 12 July 2019 by Rob Appleyard
Standing agenda
1. Achievements this week
2. Problems encountered this week
3. What are we planning to do next week?
4. Long-term project updates (if not already covered)
5. Special topics
6. Actions
7. Review Fabric tasks
1. Link
8. AoTechnicalB
9. Availability for next week
10. On-Call
11. AoOtherB
Achievements this week
- Cleanup of LHCb data from lhcbDst ongoing.
- Sorting out the personal proxy used to support the CASTOR xrootd functional test.
- The test is currently failing because the proxy has expired.
- GP to work with CC to resolve this.
- Action on Rob and Brian to understand the callout system, establish what it is supposed to do, and develop a plan for what it should do.
- Not yet complete, but expected soon.
- New Facilities headnodes on VMware have been tested in VCert and work for Diamond.
- Some problems remain with ET.
- Configured DUNE to work with WLCG, tests pass.
- Decommissioned a bunch of old HyperV VMs.
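The xrootd functional test above failed only once the proxy had already run out; a minimal pre-expiry check along these lines (a sketch, assuming the standard VOMS `voms-proxy-info` client is on PATH; the 24-hour threshold is an arbitrary choice, not anything agreed in the meeting) could be run from cron to warn before the test starts failing:

```shell
# Sketch: warn before the functional-test proxy expires.
# If voms-proxy-info is missing or no proxy exists, treat the
# remaining lifetime as zero seconds.
left=$(voms-proxy-info --timeleft 2>/dev/null) || left=0
left=${left:-0}

# 86400 s = 24 h; renew well before expiry rather than after the test fails
if [ "$left" -lt 86400 ]; then
    echo "WARNING: proxy expires in ${left}s - renew it"
fi
```

Hooking this into the callout system (per the Rob/Brian action) rather than cron would be another option.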
Operational problems
- Facilities tape drives were down for about an hour on Thursday morning for a hand-bot replacement.
Plans for next few weeks
- Decommission lhcbDst hardware.
- Brian C is currently testing StorageD/ET on the new robot
- Replace Facilities headnodes with VMs.
- Waiting until Kevin is back from holiday.
- Scheduled for the 30th July.
Long-term projects
- New CASTOR disk servers currently with Martin.
- Migration of name server to VMs on 2.1.17-xx is waiting until aliceDisk is decommissioned.
- CASTOR disk server migration to Aquilon.
- Agreed a testing plan with Fabric
- Facilities headnode replacement:
- SL7 VM headnodes are being tested
- Turn VCert into a facilities test instance.
Actions
- AD wants us to ensure that experiments cannot write to the part of the namespace that was used for d1t0 data: namespace cleanup/deletion of empty directories.
- There was some discussion about what exactly is required and how it can actually be implemented.
- The CASTOR team proposal is either:
- to switch all of these directories to a fileclass that requires a tape copy but has no migration route; this will cause an error whenever a write is attempted, or
- to run a recursive nschmod on all the unneeded directories to make them read-only.
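The two proposals could be sketched roughly as below. This is a hedged sketch only: `nschmod` and `nschclass` are the CASTOR name-server clients, but the fileclass name, the directory path, and the exact `nschclass` argument order are assumptions, recursion over subdirectories is elided, and `DRY_RUN=1` (the default) only prints the commands:

```shell
# Sketch of the two namespace-lockdown proposals; DRY_RUN=1 (default)
# prints each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "$@"; else "$@"; fi; }

# Option 1: switch the directory to a fileclass that requires a tape
# copy but has no migration route, so any write attempt errors out.
# ("no-migration-route" is a hypothetical fileclass name.)
block_writes_via_fileclass() {
    run nschclass no-migration-route "$1"
}

# Option 2: nschmod to read-only (555 = r-xr-xr-x, no write bits).
# Applying this recursively over the whole tree is left out here.
make_readonly() {
    run nschmod 555 "$1"
}

# Hypothetical d1t0 directory, for illustration only
make_readonly /castor/ads.rl.ac.uk/prod/old-d1t0-area
```

Either way a dry run against a scratch directory first would confirm the exact client syntax before touching the production namespace.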
Staffing
- Everybody in
AoB
- Discussion over how to do the upgrade of Facilities
- Idea 1: As planned. Upgrade CASTOR DB schema from 2.1.16 to 2.1.17 and bring in new headnodes as one intervention
- Pro: Smallest number of operations
- Con: We have never performed a 2.1.16 to 2.1.17 upgrade before.
- Idea 2: Create a new DB 2.1.17 stager schema on Bellona. Repoint CASTOR stager to use that.
- Pro: This is what we did on Tier 1 (But that was because of necessity rather than choice)
- Con: Complexity of two schemas.
- Con: More disruptive to users.
- Con: CASTOR team would need to add all the config entries and configure everything from scratch.
- Idea 3: Split the interventions. Move to new headnodes running 2.1.16, then upgrade them to 2.1.17.
- Pro: Split into discrete steps, easy to debug issues
- Con: Possible to end up debugging issues specific to 2.1.16/SL7 which is a config we do not expect to use long term.
- The meeting concluded on idea 1, with a dress rehearsal of the 2.1.16 to 2.1.17 upgrade required beforehand.
On Call
RA on Call