RAL Tier1 weekly operations castor 29/03/2019
Standing agenda
1. Achievements this week
2. Problems encountered this week
3. What are we planning to do next week?
4. Long-term project updates (if not already covered)
5. Special topics
6. Actions
7. Review Fabric tasks
  1. Link
8. AoTechnicalB
9. Availability for next week
10. On-Call
11. AoOtherB
Achievements this week
- The old facd0tl disk servers have been in read-only mode for one week and can be decommissioned next week.
- Facilities headnodes have been requested on VMware; the ticket has not been actioned yet and we have not heard anything from RIG.
  - We are willing to accept delays on this until ~May.
  - The request is queued behind the new disk, the tape robot, and a number of Diamond ICAT tasks.
- Acceptance testing of the new tape robot has been completed.
- New-style tape server installation is ongoing.
- CASTOR-side testing of the tape library is now in progress.
- Aquilon disk servers are ready to go, also queued behind the tape robot work.
Operation problems
- gdss733 crashed and was removed from production for two days. It is now back in production with no issues so far.
- TimF was running a tape verification for Diamond (fd1866).
  - The verification was supposed to check the first 10 files, the last 10 files, and 10 random files in the middle, but for some reason the test ran over the whole tape, taking far longer than the usual 30 minutes. Data on the tape was inaccessible for a long time as a result (a sketch of the intended sampling follows this list).
  - We would like such tests not to run on the read-only Diamond tape server.
  - Tim is investigating why this happened.
- ATLAS are periodically submitting silly SAM tests that impact availability and cause pointless callouts.
  - Rob has created a ticket with Tim; work is in progress.
- CASTOR metric reporting for GridPP.
  - Looking for clarity on precisely which metrics are relevant and, given CASTOR's changed role, which system RA should report on.
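
As an aside on the Diamond verification item above: the intended sampling (first 10 files, last 10 files, 10 random files in the middle) is easy to express explicitly. The sketch below is purely illustrative, assumes the verification tool can be handed an explicit list of tape file sequence numbers, and uses hypothetical names; it is not the actual verification script Tim runs.

    import random

    def sample_fseqs(n_files, head=10, tail=10, middle=10, seed=None):
        """Pick tape file sequence numbers to verify: the first `head`,
        the last `tail`, and `middle` random files in between. Falls back
        to verifying everything if the tape holds fewer files than the
        requested sample."""
        all_fseqs = list(range(1, n_files + 1))
        if n_files <= head + tail + middle:
            return all_fseqs
        rng = random.Random(seed)
        first = all_fseqs[:head]
        last = all_fseqs[-tail:]
        between = all_fseqs[head:-tail]
        picked = sorted(rng.sample(between, middle))
        return first + picked + last

    # Example: a tape with 5000 files yields a 30-file sample rather
    # than a full-tape read.
    print(sample_fseqs(5000, seed=42))
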
Plans for next few weeks
- Decommission the old facilities ingest disk servers
- Examine further standardisation of CASTOR pool settings.
  - CASTOR team to generate a list of nonstandard settings and consider whether they are justified (a sketch of such a comparison follows this list).
- CASTOR side tape robot testing.
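
For the pool-settings item above, a comparison against an agreed baseline could look roughly like the sketch below. The baseline values, pool names, and setting names are made up for illustration; in practice the settings would come from the stager database or configuration management rather than a hard-coded dict.

    # Minimal sketch: diff per-pool settings against an agreed baseline.
    # Baseline values, pool names and setting names are hypothetical.
    BASELINE = {
        "gc_policy": "lru",
        "max_replicas": 1,
        "migration_enabled": True,
    }

    pools = {
        "atlasTape": {"gc_policy": "lru", "max_replicas": 1, "migration_enabled": True},
        "diamondIngest": {"gc_policy": "none", "max_replicas": 2, "migration_enabled": True},
    }

    def nonstandard(pools, baseline):
        """Return {pool: {setting: (actual, expected)}} for off-baseline settings."""
        report = {}
        for name, settings in pools.items():
            diffs = {k: (settings.get(k), v) for k, v in baseline.items()
                     if settings.get(k) != v}
            if diffs:
                report[name] = diffs
        return report

    for pool, diffs in nonstandard(pools, BASELINE).items():
        for setting, (actual, expected) in diffs.items():
            print(f"{pool}: {setting} = {actual!r} (baseline {expected!r})")
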
Long-term projects
- New CASTOR WLCGTape instance.
  - The LHCb migration is with LHCb at the moment; they are not blocked. Mirroring of lhcbDst to Echo is complete.
- CASTOR disk server migration to Aquilon.
  - The change is ready to implement.
- Deadline of end of April to get Facilities moved to generic VM headnodes and 2.1.17 tape servers.
  - A ticket is with the Fabric team to create the VMs.
- RA is working with James to sort out the gridmap-file distribution infrastructure and to get a machine with a better name for this than castor-functional-test1 (a hedged sketch of a possible distribution step follows this list).
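
For context on the gridmap-file item above, one possible shape of a distribution step is sketched below, assuming the standard /etc/grid-security/grid-mapfile path and an rsync push over ssh to a placeholder host list. This is only a sketch under those assumptions, not a description of the infrastructure RA and James are actually putting in place.

    # Minimal sketch: push a grid-mapfile to a set of hosts with rsync.
    # Host names are placeholders; the real distribution mechanism and
    # host list are still being worked out.
    import subprocess

    MAPFILE = "/etc/grid-security/grid-mapfile"
    HOSTS = ["disk-placeholder-01.example.org", "disk-placeholder-02.example.org"]

    def push_mapfile(hosts, mapfile=MAPFILE):
        """Copy the mapfile to each host, returning (host, error) pairs for failures."""
        failures = []
        for host in hosts:
            result = subprocess.run(["rsync", "-a", mapfile, f"{host}:{mapfile}"],
                                    capture_output=True, text=True)
            if result.returncode != 0:
                failures.append((host, result.stderr.strip()))
        return failures

    if __name__ == "__main__":
        for host, err in push_mapfile(HOSTS):
            print(f"push to {host} failed: {err}")
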
Actions
- AD wants us to make sure that experiments cannot write to the parts of the namespace that were used for d1t0 data: namespace cleanup / deletion of empty directories.
  - There was some discussion about what exactly is required and how it can actually be implemented.
  - The CASTOR team's proposal is to switch all of these directories to a fileclass that requires a tape copy but has no migration route; this will cause an error whenever a write is attempted.
  - RA to look at making all fileclasses have nbcopies >= 1 (a rough sketch of such a check follows this list).
- Problem with the functional test node using a personal proxy, which expires some time in July.
  - Rob has met with Jens and requested an appropriate certificate.
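
For the nbcopies action above, the kind of check involved is sketched below. It assumes the fileclass definitions have already been reduced to (name, number of tape copies) pairs, e.g. from a name server listing; parsing the real output and the fileclass switch itself are out of scope, and the class names shown are hypothetical.

    # Minimal sketch: flag fileclasses that would still allow disk-only
    # (d1t0-style) writes, i.e. nbcopies < 1. Class names and values are
    # hypothetical; parsing real name-server output is out of scope.
    fileclasses = {
        "temp": 0,        # hypothetical d1t0-style class: would be flagged
        "default": 1,
        "dualCopy": 2,
    }

    def classes_without_tape_copy(classes):
        """Return fileclass names whose nbcopies is below 1."""
        return sorted(name for name, nbcopies in classes.items() if nbcopies < 1)

    offenders = classes_without_tape_copy(fileclasses)
    if offenders:
        print("fileclasses with nbcopies < 1:", ", ".join(offenders))
    else:
        print("all fileclasses require at least one tape copy")
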
Staffing
- RA is out for the next two weeks: at HEPiX next week and on A/L the week after.
AoB
On Call
- GP on call