RAL Tier1 weekly operations CASTOR 26/04/2019

Standing agenda

1. Achievements this week

2. Problems encountered this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

5. Special topics

6. Actions

7. Review Fabric tasks

  1.   Link

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Facilities headnodes requested on VMware; ticket not done yet. Facilities VMware cluster still under construction.
- Willing to accept delays on this until ~May
- Queued behind tape robot and a number of Diamond ICAT tasks
Aquilon disk servers ready to go, also queued behind tape robot
New Spectra tape robot
- Finalised configuration for the Tape servers
Produced lots of stats on CASTOR ingest rates

gdss738 (lhcbDst) failed, back in production read-only.
gdss811 (lhcbDst) returned to prod in the pre-Easter week with an HDD for the OS instead of an SSD
T2K issues with finding files on tape (GGUS 140870)
ATLAS are periodically submitting SAM tests that impact availability and cause pointless callouts
- TA has updated the ticket, indicating he will raise the issue with the appropriate people
LHCb raised an issue with xroot access to lhcbUser, believed to be resolved now.

Examine further standardisation of CASTOR pool settings.
- CASTOR team to generate a list of nonstandard settings and consider whether they are justified.
CASTOR tape testing to continue after the production tape robot networking is installed
Test preprod against Bellona (RT223698)
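
One way the list of nonstandard CASTOR pool settings mentioned above could be generated is to compare each pool against an agreed baseline. The sketch below is illustrative only: the setting names and values and the `nonstandard_settings` helper are hypothetical, not actual CASTOR parameters or tooling; only the pool names are taken from these notes.

```python
from typing import Dict, Optional, Tuple

Settings = Dict[str, str]
Diff = Dict[str, Tuple[Optional[str], Optional[str]]]


def nonstandard_settings(baseline: Settings, pools: Dict[str, Settings]) -> Dict[str, Diff]:
    """Return, per pool, the settings whose value differs from the baseline.

    Each difference is recorded as (expected, actual); None means the
    setting is absent on that side.
    """
    report: Dict[str, Diff] = {}
    for pool, settings in pools.items():
        diffs: Diff = {}
        for key in sorted(set(baseline) | set(settings)):
            expected = baseline.get(key)
            actual = settings.get(key)
            if expected != actual:
                diffs[key] = (expected, actual)
        if diffs:
            report[pool] = diffs
    return report


# Hypothetical example data: setting names and values are made up.
BASELINE = {"gc_policy": "default", "max_replicas": "1"}
POOLS = {
    "lhcbDst": {"gc_policy": "default", "max_replicas": "1"},
    "lhcbUser": {"gc_policy": "custom", "max_replicas": "1"},
}

if __name__ == "__main__":
    for pool, diffs in nonstandard_settings(BASELINE, POOLS).items():
        for key, (expected, actual) in diffs.items():
            print(f"{pool}: {key} = {actual!r} (standard: {expected!r})")
```

Each reported difference can then be reviewed to decide whether it is justified or should be brought back to the standard value.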

New CASTOR WLCGTape instance.
- LHCb migration to Echo is in progress, being sped up by failing CASTOR disk servers
CASTOR disk server migration to Aquilon.
- Need to work with Fabric to get a realistic.
Deadline of end of April to get Facilities moved to generic VM headnodes and 2.1.17 tape servers.
- Ticket with Fabric team to make the VMs.
The problem of castor-functional-test1 has been absorbed into the task of sorting out worker node grid-mapfile generation and distribution.

AD wants us to make sure that experiments cannot write to the part of the namespace that was used for d1t0 data (namespace cleanup/deletion of empty dirs).
- Some discussion about what exactly is required and how this can be actually implemented.
- CASTOR team proposal is either:
  - to switch all of these directories to a fileclass with a requirement for a tape copy but no migration route; this will cause an error whenever any writes are attempted; or
  - to run a recursive nschmod on all the unneeded directories to make them read-only.
- CASTOR team split over the correct approach.
Problem with functional test node using a personal proxy which expires some time in July.
- RA met with JJ, requested an appropriate certificate.
RA and DM to sit down to sort out storage metric question
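
For the read-only option in the namespace discussion above, the permission change amounts to clearing the write bits on each directory. The sketch below only computes the new modes as a dry run; it does not call nschmod or assume any of its options, and the paths, `read_only_mode` and `plan_chmod` are hypothetical names used for illustration.

```python
import stat
from typing import Iterable, List, Tuple

# Write permission for user, group and other (octal 222).
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def read_only_mode(mode: int) -> int:
    """Return the mode with all write bits cleared, other bits unchanged."""
    return mode & ~WRITE_BITS


def plan_chmod(entries: Iterable[Tuple[str, int]]) -> List[Tuple[str, int, int]]:
    """Dry run: list (path, old_mode, new_mode) for entries that would change."""
    plan = []
    for path, mode in entries:
        new_mode = read_only_mode(mode)
        if new_mode != mode:
            plan.append((path, mode, new_mode))
    return plan


if __name__ == "__main__":
    # Hypothetical directory listing: (path, current permission bits).
    listing = [
        ("/castor/example/d1t0/a", 0o775),
        ("/castor/example/d1t0/b", 0o555),
    ]
    for path, old, new in plan_chmod(listing):
        print(f"{path}: {old:o} -> {new:o}")
```

A dry-run list like this could be reviewed before any change is applied to the namespace.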

RA on call.