RAL Tier1 weekly operations CASTOR 15/03/2019

Standing agenda

1. Achievements this week

2. Problems encountered this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

5. Special topics

6. Actions

7. Review Fabric tasks

  1.   Link

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

New facd0t1 disk servers
- 5 put into production
- 5 will go in next week
- We will then retire the old servers
- Monitoring also sorted
Facilities headnodes requested on VMware; ticket not done yet.
- Willing to accept delays on this until ~May.
- Queued behind new disk, tape robot and a number of Diamond ICAT tasks.
Testing of the new tape robot
- New-style tape server installation ongoing.
- Acceptance testing ongoing
- Expect it for CASTOR-side testing next week.
Aquilon disk servers ready to go, also queued behind tape robot.

ATLAS are periodically submitting silly SAM tests that impact availability and cause pointless callouts.
- Rob has created a ticket with Tim.
CASTOR metric reporting for GridPP.
- Looking for clarity on precisely what metrics are relevant, and given CASTOR's changed role, what system RA should report on.
lcgsrm10.gridpp.rl.ac.uk (LHCb) failed and was dropped out of the alias. It will (probably) not be fixed.
castor-stager01.gridpp.rl.ac.uk went read-only on Tuesday evening due to a hypervisor load issue. According to Fabric this is a known issue.
- A mitigation measure has been put into place (turning a high-load box on the same HV into a physical host)

Examine further standardisation of CASTOR pool settings.
- CASTOR team to generate a list of nonstandard settings and consider whether they are justified.
New Facilities disk servers
Tape robot testing.
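
For the pool-settings standardisation item above, a minimal sketch of how a list of nonstandard settings could be generated, assuming the per-pool settings have already been exported into a dictionary. The pool names, setting names and baseline values here are hypothetical placeholders, not real CASTOR configuration.

```python
# Hypothetical baseline; the real standard values would be agreed by the CASTOR team.
STANDARD = {"gc_policy": "default", "max_replicas": 1}


def nonstandard_settings(pools, standard):
    """Return {pool: {setting: (actual, expected)}} for every deviation from the baseline."""
    report = {}
    for pool, settings in pools.items():
        diffs = {
            key: (settings.get(key), expected)
            for key, expected in standard.items()
            if settings.get(key) != expected
        }
        if diffs:
            report[pool] = diffs
    return report


# Illustrative data only (not taken from a real instance).
pools = {
    "poolA": {"gc_policy": "default", "max_replicas": 1},
    "poolB": {"gc_policy": "custom", "max_replicas": 1},
}

for pool, diffs in nonstandard_settings(pools, STANDARD).items():
    print(pool, diffs)
```

Each reported deviation can then be reviewed and either justified or reset to the standard value.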

New CASTOR WLCGTape instance.
- LHCb migration is with LHCb at the moment; they are not blocked. Mirroring of lhcbDst to Echo is complete.
CASTOR disk server migration to Aquilon.
- Change ready to implement.
Deadline of end of April to get Facilities moved to generic VM headnodes and 2.1.17 tape servers.
- Ticket with Fabric team to make the VMs.
RA working with James to sort out the gridmap-file distribution infrastructure and get a machine with a better name for this than castor-functional-test1.
Bellona (new Facilities DB) migration - monitoring fixed.

AD wants us to make sure that experiments cannot write to the part of the namespace that was used for d1t0 data: namespace cleanup/deletion of empty dirs.
- Some discussion about what exactly is required and how this can actually be implemented.
- CASTOR team proposal is to switch all of these directories to a fileclass with a requirement for a tape copy but no migration route; this will cause an error whenever any writes are attempted.
RA to look at making all fileclasses have nbcopies >= 1.
Problem with functional test node using a personal proxy which expires some time in July.
- Rob met with Jens, requested an appropriate certificate.
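
For the action on fileclass nbcopies above, a minimal sketch of the check, assuming the fileclass names and their nbcopies values have already been extracted into a dictionary. The fileclass names below are hypothetical, and how the values are obtained from the name server is not shown.

```python
def fileclasses_without_copies(fileclasses):
    """Return the sorted names of fileclasses whose nbcopies is below 1."""
    return sorted(name for name, nbcopies in fileclasses.items() if nbcopies < 1)


# Illustrative data only: hypothetical fileclass names mapped to nbcopies.
example = {"class_a": 1, "class_b": 0, "class_c": 2}

print(fileclasses_without_copies(example))
```

Any fileclass listed by the check would need its nbcopies raised to meet the nbcopies >= 1 requirement.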