RAL Tier1 weekly operations castor 05/07/2019

From GridPP Wiki
Revision as of 14:44, 10 July 2019 by Rob Appleyard 7f7797b74a (Talk | contribs)

Standing agenda

1. Achievements this week

2. Problems encountered this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

5. Special topics

6. Actions

7. Review Fabric tasks

  1.   Link

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Achievements this week

  • Cleanup of LHCb data from lhcbDst ongoing.
  • Sorting out the personal proxy being used to support the CASTOR functional test.
    • The test is currently working, but doesn't appear to call out.
      • Action on Rob and Brian to understand the callout system and what it is supposed to do, and to develop a plan for what it should do.
    • Not completed, but expected soon.
    • The personal proxy that was being used expired early afternoon on Monday.
  • New Facilities headnodes on VMware have been tested in VCert and work for Diamond.
    • Some problems with ET.

Operation problems

  • (Old) physical Facilities headnodes don't seem to be producing tickets. Unclear why.
    • Not going to worry too much about the old ones
    • Going to make sure this works on the new ones.
  • KON has raised a mystery problem with Oracle recalls on the preprod setup (new robot). RA and GP to go and find out what that is.

Plans for next few weeks

  • Decommission lhcbDst hardware.
  • Brian C is currently testing StorageD/ET on the new robot
  • Replace Facilities headnodes with VMs.
    • Waiting until Kevin is back from holiday.
    • Scheduled for the 30th July.
  • Resolve the problem with the functional test node, which relies on a personal proxy that runs out shortly.

Long-term projects

  • New CASTOR disk servers currently with Martin.
  • Migration of name server to VMs on 2.1.17-xx is waiting until aliceDisk is decommissioned.
  • CASTOR disk server migration to Aquilon.
    • Agreed a testing plan with Fabric
  • Facilities headnode replacement:
    • SL7 VM headnodes are being tested
  • Implementing DUNE on Spectralogic robot is paused.
    • Decision pending on how far to proceed with setup with DUNE.
  • Migrate VCert to VMware.
  • Move VCert into the Facilities domain so we have a facilities test instance.

Actions

  • AD wants us to make sure that experiments cannot write to the part of the namespace that was used for d1t0 data: namespace cleanup/deletion of empty dirs.
    • Some discussion about what exactly is required and how this can be actually implemented.
    • CASTOR team proposal is either:
      • to switch all of these directories to a fileclass with a requirement for a tape copy but no migration route; this will cause an error whenever any writes are attempted.
      • to run a recursive nschmod on all the unneeded directories to make them read only.
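The second option could be prepared as a dry run before anything is changed. The Python sketch below is illustrative only: it clears the write bits from each directory's current mode and prints the command lines that would be issued, without running them. It assumes that nschmod takes an absolute octal mode followed by a path (to be checked against the installed man page) and that the list of directories and their current modes has already been gathered by some other means. The paths shown are hypothetical.

```python
import shlex

WRITE_BITS = 0o222  # owner, group and other write permission bits


def read_only_mode(mode: int) -> int:
    """Return the mode with every write bit cleared."""
    return mode & ~WRITE_BITS


def plan_nschmod(entries):
    """Build (but do not run) one command line per directory.

    entries: iterable of (path, current_mode) pairs, enumerated elsewhere.
    Assumption: nschmod is called as 'nschmod <octal mode> <path>'.
    """
    commands = []
    for path, mode in entries:
        new_mode = read_only_mode(mode)
        if new_mode != mode:  # skip directories that are already read-only
            commands.append(f"nschmod {new_mode:o} {shlex.quote(path)}")
    return commands


if __name__ == "__main__":
    # Hypothetical paths and modes, for illustration only.
    example = [
        ("/castor/example.org/vo/disk", 0o775),
        ("/castor/example.org/vo/disk/old", 0o555),
    ]
    for line in plan_nschmod(example):
        print(line)
```

Printing the plan first lets the list be reviewed with the experiments before any permissions are altered.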

Staffing

  • Everybody in

AoB

On Call

GP on call