RAL Tier1 weekly operations castor 12/04/2019

Standing agenda

1. Achievements this week

2. Problems encountered this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

5. Special topics

6. Actions

7. Review Fabric tasks

  1.   Link

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Facilities headnodes requested on VMware; the ticket has not been completed yet. The Facilities VMware cluster is still under construction
- Willing to accept delays on this until ~May
- Queued behind tape robot and a number of Diamond ICAT tasks
Aquilon disk servers ready to go, also queued behind tape robot

TimF was doing a tape verify for Diamond and encountered issues with all files on the tape being verified (fd1866)
- We would like the test not to run on the read-only Diamond tape server, but this is not possible on the Facilities instance
- Not going to investigate further unless the problem recurs
ATLAS are periodically submitting SAM tests that impact availability and cause pointless callouts
- Tim A has updated the ticket, indicating he will raise the issue with the appropriate people
CASTOR metric reporting for GridPP
- Looking for clarity on precisely what metrics are relevant, and given CASTOR's changed role, what system RA should report on.
gdss700 failed: two drives have failed completely and a third drive is unhappy. The files unique to this server had to be identified, and those files have been moved over to Echo.
- Need to confirm files are present in Echo.
CEDA outbound certificates were about to expire, but Rob H has renewed and installed them
SQL error during an nsls on the Facilities instance
- This shows up as an error of 'no file on CASTOR'
- No issues seen on Hermes, was there any issue with CASTOR?
gdss811 has a failed system disk that is not easily repairable; it needs a new SSD taken from another disk server
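
For the gdss700 follow-up (confirming the moved files are present in Echo), one possible check is a simple comparison of two file listings. The sketch below is illustrative only: it assumes both listings have already been exported as plain lists of names normalised to the same form, and the function name and example names are hypothetical rather than taken from any CASTOR or Echo tooling.

```python
def missing_from_target(source_names, target_names):
    """Return the names in the source listing that are absent from the target, sorted."""
    # Strip whitespace and drop blank lines so listings read from files compare cleanly.
    source = {name.strip() for name in source_names if name.strip()}
    target = {name.strip() for name in target_names if name.strip()}
    return sorted(source - target)


# Hypothetical example data: names unique to the failed server vs. names found in Echo.
unique_to_server = ["run1/file_a", "run1/file_b\n", "run2/file_c"]
found_in_echo = ["run1/file_a\n", "run2/file_c"]

for name in missing_from_target(unique_to_server, found_in_echo):
    print("not yet in Echo:", name)
```

An empty result would indicate that every listed file has a counterpart in the target listing; it says nothing about checksums or file sizes, which would need a separate check.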

Examine further standardisation of CASTOR pool settings.
- CASTOR team to generate a list of nonstandard settings and consider whether they are justified.
CASTOR tape testing to continue after the production tape robot networking is installed
Testing the new Facilities Oracle DB Bellona in CASTOR pre-prod
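
The list of nonstandard pool settings mentioned above could be generated by comparing each pool against an agreed baseline. This is a minimal sketch under stated assumptions: the settings are assumed to have been dumped into dictionaries beforehand, and the pool and setting names shown are hypothetical placeholders, not real CASTOR parameters.

```python
def nonstandard_settings(baseline, pools):
    """Map each pool to its settings that differ from the baseline.

    Each difference is reported as (expected, actual); None marks a setting
    that is absent on one side.
    """
    report = {}
    for pool, settings in sorted(pools.items()):
        diffs = {}
        # Check every key seen on either side so missing settings are reported too.
        for key in sorted(set(baseline) | set(settings)):
            expected = baseline.get(key)
            actual = settings.get(key)
            if expected != actual:
                diffs[key] = (expected, actual)
        if diffs:
            report[pool] = diffs
    return report


# Hypothetical baseline and pools, for illustration only.
baseline = {"setting_a": 1, "setting_b": "on"}
pools = {
    "pool_one": {"setting_a": 1, "setting_b": "on"},
    "pool_two": {"setting_a": 2, "setting_b": "on", "setting_c": 5},
}

for pool, diffs in nonstandard_settings(baseline, pools).items():
    for key, (expected, actual) in diffs.items():
        print(f"{pool}: {key} expected={expected!r} actual={actual!r}")
```

Each reported difference can then be reviewed to decide whether it is justified or should be brought back to the standard value.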

New CASTOR WLCGTape instance.
- LHCb migration to Echo is in progress, being sped up by failing CASTOR disk servers
CASTOR disk server migration to Aquilon.
- Change ready to implement.
- More meaningful stress test needs to be carried out.
Deadline of end of April to get Facilities moved to generic VM headnodes and 2.1.17 tape servers.
- Ticket with Fabric team to make the VMs.
RA is working with James to sort out the gridmap-file distribution infrastructure and to get a machine with a better name for this than castor-functional-test1

AD wants us to make sure that experiments cannot write to the part of the namespace that was used for d1t0 data: namespace cleanup/deletion of empty dirs.
- Some discussion about what exactly is required and how it can actually be implemented.
- CASTOR team proposal is to switch all of these directories to a fileclass with a requirement for a tape copy but no migration route; this will cause an error whenever any writes are attempted.
RA to look at making all fileclasses have nbcopies >= 1.
Problem with the functional test node using a personal proxy, which expires some time in July.
- Rob met with Jens, requested an appropriate certificate.