Difference between revisions of "RAL Tier1 weekly operations castor 22/03/2019"

Revision as of 10:26, 22 March 2019

1. Achievements this week

2. Problems encountered this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

5. Special topics

6. Actions

7. Review Fabric tasks

  1.   Link

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

New facd0t1 disk servers
- All new facd0t1 disk servers are in production
- We will then retire the old servers
Facilities headnodes requested on VMware; ticket not done yet.
- Willing to accept delays on this until ~May.
- Queued behind new disk, tape robot and a number of Diamond ICAT tasks.
Acceptance testing of the new tape robot completed
- New-style tape server installation ongoing.
- Tape library ready for CASTOR-side testing
Aquilon disk servers ready to go, also queued behind tape robot.

ATLAS are periodically submitting silly SAM tests that impact availability and cause pointless callouts.
- Rob has created a ticket with Tim.
CASTOR metric reporting for GridPP.
- Looking for clarity on precisely which metrics are relevant and, given CASTOR's changed role, which system RA should report on.
lcgsrm10.gridpp.rl.ac.uk (LHCb) failed and was dropped out of the alias. It will (probably) not be fixed.
castor-stager01.gridpp.rl.ac.uk went read-only on Tuesday evening due to a hypervisor load issue. According to Fabric this is a known issue.
- A mitigation measure has been put into place (turning a high-load box on the same HV into a physical host)

Examine further standardisation of CASTOR pool settings.
- CASTOR team to generate a list of nonstandard settings and consider whether they are justified.
Tape robot testing.

New CASTOR WLCGTape instance.
- LHCb migration is with LHCb at the moment; they are not blocked. Mirroring of lhcbDst to Echo is complete.
CASTOR disk server migration to Aquilon.
- Change ready to implement.
Deadline of end of April to get Facilities moved to generic VM headnodes and 2.1.17 tape servers.
- Ticket with Fabric team to make the VMs.
RA working with James to sort out the gridmap-file distribution infrastructure and to get a machine with a better name for this than castor-functional-test1.
Bellona (new Facilities DB) migration - monitoring fixed.

AD wants us to make sure that experiments cannot write to the part of the namespace that was used for d1t0 data: namespace cleanup/deletion of empty dirs.
- Some discussion about what exactly is required and how this can actually be implemented.
- CASTOR team proposal is to switch all of these directories to a fileclass with a requirement for a tape copy but no migration route; this will cause an error whenever any writes are attempted.
RA to look at making all fileclasses have nbcopies >= 1.
Problem with functional test node using a personal proxy, which expires some time in July.
- Rob met with Jens, requested an appropriate certificate.

@@ Line 51: / Line 51: @@
 * Examine further standardisation of CASTOR pool settings.
 ** CASTOR team to generate a list of nonstandard settings and consider whether they are justified.
-* New Facilities disk servers
 * Tape robot testing.