RAL Tier1 weekly operations castor 05/04/2019
From GridPP Wiki
Standing agenda
1. Achievements this week
2. Problems encountered this week
3. What are we planning to do next week?
4. Long-term project updates (if not already covered)
5. Special topics
6. Actions
7. Review Fabric tasks
1. Link
8. AoTechnicalB
9. Availability for next week
10. On-Call
11. AoOtherB
Achievements this week
- Old facd0t1 disk servers have been decommissioned
- Proved that Facilities can recall a zero-sized file
- Facilities headnodes requested on VMware; ticket not done yet. Facilities VMware cluster still under construction
- Willing to accept delays on this until ~May
- Queued behind tape robot and a number of Diamond ICAT tasks
- Acceptance testing of the new tape robot completed
- New-style tape server installation ongoing.
- Tape library for CASTOR-side testing in progress now
- Large-scale read testing completed and appears successful, but analysis is under way on a few outstanding queries
- Aquilon disk servers ready to go, also queued behind tape robot
Operation problems
- TimF was doing a tape verify for Diamond and encountered issues with all files on the tape being verified (fd1866)
- We would like the test not to run on the read-only Diamond tape server, but this is not possible on the Facilities instance
- Not going to investigate further unless the problem recurs
- ATLAS are periodically submitting SAM tests that impact availability and cause pointless callouts
- Tim A has updated the ticket, indicating he will raise the issue with the appropriate people
- CASTOR metric reporting for GridPP
- Looking for clarity on precisely what metrics are relevant, and given CASTOR's changed role, what system RA should report on.
- LHCb are trying to access files using the wrong service class (default)
- IO data traffic on gdss700 stalls, under investigation
- CEDA outbound certificates are going to expire, but Rob H is on the case
- SQL error during an nsls on the Facilities instance
- This shows up as an error of 'no file on CASTOR'
- No issues seen on Hermes, was there any issue with CASTOR?
Plans for next few weeks
- Examine further standardisation of CASTOR pool settings.
- CASTOR team to generate a list of nonstandard settings and consider whether they are justified.
- Continue CASTOR side tape robot testing.
Long-term projects
- New CASTOR WLCGTape instance.
- LHCb migration is with LHCb at the moment; they are not blocked. Mirroring of lhcbDst to Echo is complete.
- CASTOR disk server migration to Aquilon.
- Change ready to implement.
- More meaningful stress test needs to be carried out.
- Deadline of end of April to get Facilities moved to generic VM headnodes and 2.1.17 tape servers.
- Ticket with Fabric team to make the VMs.
- RA working with James to sort out the gridmap-file distribution infrastructure and get a machine with a better name for this than castor-functional-test1
Actions
- AD wants us to make sure that experiments cannot write to the part of the namespace that was used for d1t0 data: namespace cleanup/deletion of empty dirs.
- Some discussion about what exactly is required and how this can actually be implemented.
- CASTOR team proposal is to switch all of these directories to a fileclass with a requirement for a tape copy but no migration route; this will cause an error whenever any writes are attempted.
- RA to look at making all fileclasses have nbcopies >= 1.
- Problem with the functional test node using a personal proxy, which expires some time in July.
- Rob met with Jens, requested an appropriate certificate.
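The fileclass action above (all fileclasses having nbcopies >= 1) amounts to a simple audit. A minimal sketch, assuming the fileclass names and copy counts have already been exported into plain records; the FileClass type, the function name and the sample entries are hypothetical and are not taken from CASTOR itself:

```python
from typing import Iterable, List, NamedTuple


class FileClass(NamedTuple):
    """Hypothetical record for one fileclass (not a CASTOR data structure)."""
    name: str
    nbcopies: int


def fileclasses_without_tape_copy(fileclasses: Iterable[FileClass]) -> List[str]:
    """Return the names of fileclasses that do not satisfy nbcopies >= 1."""
    return [fc.name for fc in fileclasses if fc.nbcopies < 1]


# Made-up sample data for illustration only.
sample = [
    FileClass("example-disk-only", 0),
    FileClass("example-single-copy", 1),
    FileClass("example-dual-copy", 2),
]

print(fileclasses_without_tape_copy(sample))
```

Any fileclass returned by the check would be a candidate for review under the action; how the real list is obtained and changed is left to the CASTOR team.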
Staffing
- RA back next week
AoB
On Call
RA on call