RAL Tier1 weekly operations castor 12/07/2019

From GridPP Wiki
Latest revision as of 10:39, 12 July 2019

Parent article: https://www.gridpp.ac.uk/wiki/RAL_Tier1_weekly_operations_castor

Standing agenda

1. Achievements this week

2. Problems encountered this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

5. Special topics

6. Actions

7. Review Fabric tasks

  1.   Link

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Achievements this week

  • Cleanup of LHCb data from lhcbDst ongoing.
  • Sorting out personal proxy being used to support CASTOR xrootd functional test.
    • Test is currently failing because the proxy has expired.
    • GP to work with CC to figure this out.
      • Action on Rob and Brian to understand the callout system, what it is supposed to do, and develop a plan of what it should do.
    • Not completed, but expected soon.
  • New Facilities headnodes on VMware have been tested in VCert and work for Diamond.
    • Some problems with ET.
  • Configured DUNE to work with WLCG, tests pass.
  • Decommissioned several old Hyper-V VMs.

Operation problems

  • Facilities tape drives went down for about an hour for a handbot replacement on Thursday morning.

Plans for next few weeks

  • Decommission lhcbDst hardware.
  • Brian C is currently testing StorageD/ET on the new robot.
  • Replace Facilities headnodes with VMs.
    • Waiting until Kevin is back from holiday.
    • Scheduled for the 30th July.

Long-term projects

  • New CASTOR disk servers currently with Martin.
  • Migration of name server to VMs on 2.1.17-xx is waiting until aliceDisk is decommissioned.
  • CASTOR disk server migration to Aquilon.
    • Agreed a testing plan with Fabric.
  • Facilities headnode replacement:
    • SL7 VM headnodes are being tested.
  • Turn VCert into a facilities test instance.

Actions

  • AD wants us to make sure that experiments cannot write to the part of the namespace that was used for d1t0 data: namespace cleanup/deletion of empty dirs.
    • Some discussion about what exactly is required and how this can be actually implemented.
    • CASTOR team proposal is either:
      • to switch all of these directories to a fileclass with a requirement for a tape copy but no migration route; this will cause an error whenever any writes are attempted.
      • to run a recursive nschmod on all the unneeded directories to make them read only.
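The second proposal could be prepared as a dry run along these lines. This is only a sketch under assumptions not stated in the minutes: that the list of unneeded directories (including subdirectories) has already been gathered separately, that nschmod is invoked as an octal mode followed by a path (worth checking against the local man page), and that mode 555 is what "read only" should mean here. The helper name build_nschmod_commands is hypothetical. The script only prints command lines and does not run them.

```python
import shlex

# Assumption: 555 (r-x for owner, group and others) is the intended
# read-only mode; the minutes do not specify one.
READ_ONLY_MODE = "555"


def build_nschmod_commands(directories, mode=READ_ONLY_MODE):
    """Return one 'nschmod <mode> <dir>' command line per directory.

    `directories` is a pre-collected list of name server paths (hypothetical
    input). Blank entries are skipped, trailing slashes are dropped, and
    duplicates are removed so each directory appears once.
    """
    cleaned = set()
    for entry in directories:
        path = entry.strip().rstrip("/")
        if path:
            cleaned.add(path)
    # Assumed invocation form: nschmod <octal mode> <path>.
    return ["nschmod {} {}".format(mode, shlex.quote(p)) for p in sorted(cleaned)]


if __name__ == "__main__":
    # Placeholder paths for illustration only; not real namespace entries.
    example_dirs = [
        "/castor/example.org/vo/disk",
        "/castor/example.org/vo/disk/2018",
        "/castor/example.org/vo/disk/2018/",
    ]
    for command in build_nschmod_commands(example_dirs):
        print(command)
```

Printing the commands first gives the team a list to review before anything is changed in the name server.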

Staffing

  • Everybody in

AoB

  • Discussion over how to do the upgrade of Facilities
    • Idea 1: As planned. Upgrade CASTOR DB schema from 2.1.16 to 2.1.17 and bring in new headnodes in a single intervention.
      • Pro: Smallest number of operations
      • Con: We have never upgraded from 2.1.16 to 2.1.17
    • Idea 2: Create a new DB 2.1.17 stager schema on Bellona. Repoint CASTOR stager to use that.
      • Pro: This is what we did on Tier 1 (But that was because of necessity rather than choice)
      • Con: Complexity of two schemas.
      • Con: More disruptive to users.
      • Con: CASTOR team would need to add all the config entries and configure everything from scratch.
    • Idea 3: Split the interventions. Move to new headnodes, running 2.1.16, then upgrade them to 2.1.17
      • Pro: Split into discrete steps, easy to debug issues
      • Con: Possible to end up debugging issues specific to 2.1.16/SL7 which is a config we do not expect to use long term.
    • The meeting settled on Idea 1, with the requirement to do a dress rehearsal upgrade from 2.1.16 to 2.1.17.

On Call

RA on Call