RAL Tier1 weekly operations castor 12/04/2019

Standing agenda

1. Achievements this week

2. Problems encountered this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

5. Special topics

6. Actions

7. Review Fabric tasks

  1.   Link

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Achievements this week

  • Facilities headnodes have been requested on VMware; the ticket is not done yet, as the Facilities VMware cluster is still under construction
    • Willing to accept delays on this until ~May
    • Queued behind the tape robot work and a number of Diamond ICAT tasks
  • Aquilon disk servers are ready to go, but are also queued behind the tape robot work

Operational problems

  • Tim F was doing a tape verification for Diamond and encountered issues with all of the files on the tape being verified (fd1866)
    • We would like the test not to run on the read-only Diamond tape server, but this is not possible on the Facilities instance
    • Not going to investigate further unless the problem recurs
  • ATLAS are periodically submitting SAM tests that impact availability and cause pointless callouts
    • Tim A has updated the ticket, indicating he will raise the issue with the appropriate people
  • CASTOR metric reporting for GridPP
    • Looking for clarity on precisely what metrics are relevant, and given CASTOR's changed role, what system RA should report on.
  • gdss700 failed with two drives completely dead and a third drive unhealthy; the files unique to this server were identified and have been moved over to Echo.
    • Need to confirm the files are present in Echo (see the sketch after this list).
  • CEDA outbound certificates were going to expire, but Rob H has renewed and installed them
  • SQL error during an nsls on the Facilities instance
    • This shows up as a 'no file on CASTOR' error
    • No issues seen on Hermes; was there any issue with CASTOR?
  • gdss811 has a failed system disk that is not easily repairable; it needs a new SSD from another disk server
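
To follow up the gdss700 item above, the sketch below shows one way the "files present in Echo" check could be scripted. It is a minimal sketch, assuming the gfal2 client tools (gfal-stat) are available on the node; the Echo endpoint and the file-list name are hypothetical placeholders, not real values.

#!/usr/bin/env python3
# Minimal sketch: check that every file migrated off gdss700 is visible in Echo.
# ECHO_PREFIX and FILE_LIST are placeholders; substitute the real endpoint and
# the list of paths identified as unique to gdss700.
import subprocess

ECHO_PREFIX = "root://echo.example.ac.uk:1094"   # hypothetical XRootD endpoint for Echo
FILE_LIST = "gdss700_unique_files.txt"           # one path per line (placeholder name)

missing = []
with open(FILE_LIST) as fh:
    for line in fh:
        path = line.strip()
        if not path:
            continue
        # gfal-stat exits non-zero if the object cannot be found at the destination
        rc = subprocess.call(["gfal-stat", ECHO_PREFIX + path],
                             stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        if rc != 0:
            missing.append(path)

print("%d file(s) missing from Echo" % len(missing))
for path in missing:
    print(path)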

Plans for next few weeks

  • Examine further standardisation of CASTOR pool settings.
    • CASTOR team to generate a list of nonstandard settings and consider whether they are justified.
  • CASTOR tape testing to continue once the production tape robot networking is installed
  • Testing of the new Facilities Oracle DB (Bellona) in the CASTOR pre-production instance

Long-term projects

  • New CASTOR WLCGTape instance.
    • LHCb migration to Echo is in progress, being sped up by failing CASTOR disk servers
  • CASTOR disk server migration to Aquilon.
    • Change ready to implement.
    • More meaningful stress test needs to be carried out.
  • Deadline of end of April to get Facilities moved to generic VM headnodes and 2.1.17 tape servers.
    • Ticket with Fabric team to make the VMs.
  • RA working with James to sort out the gridmap-file distribution infrastructure and to get a machine with a better name than castor-functional-test1 for this

Actions

  • AD wants us to make sure that experiments cannot write to the part of the namespace that was used for d1t0 data: namespace cleanup / deletion of empty directories.
    • There was some discussion about what exactly is required and how this can actually be implemented.
    • The CASTOR team proposal is to switch all of these directories to a fileclass that requires a tape copy but has no migration route; this will cause an error whenever a write is attempted (see the sketch after this list).
  • RA to look at making all fileclasses have nbcopies >= 1.
  • Problem with the functional test node using a personal proxy which expires some time in July.
    • Rob met with Jens and requested an appropriate certificate.
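
To make the d1t0 proposal above concrete, the sketch below shows one way the fileclass switch could be driven. It is a minimal sketch, assuming the standard CASTOR name-server client command nschclass (change the fileclass of a name-server entry) takes a class name and a path; the fileclass name and the directory-list file are hypothetical placeholders, and the actual class to use would need to be agreed by the CASTOR team.

#!/usr/bin/env python3
# Minimal sketch of the proposed write-blocking fileclass switch for retired d1t0 directories.
# BLOCKING_FILECLASS and DIR_LIST are placeholders; nschclass is assumed to take a class
# name and a path, like the other CASTOR name-server client commands.
import subprocess
import sys

BLOCKING_FILECLASS = "tape-nomigration"   # hypothetical fileclass: tape copy required, no migration route
DIR_LIST = "retired_d1t0_dirs.txt"        # one CASTOR directory per line (placeholder name)

with open(DIR_LIST) as fh:
    for line in fh:
        directory = line.strip()
        if not directory:
            continue
        # Once the directory's fileclass has no migration route, any new write
        # under it should fail, which is the behaviour the proposal relies on.
        rc = subprocess.call(["nschclass", BLOCKING_FILECLASS, directory])
        if rc != 0:
            sys.stderr.write("failed to change fileclass on %s\n" % directory)

A sensible first step would be a dry run that only prints the directories to be changed, before any fileclass is actually switched.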

Staffing

  • GP out Monday

AoB

On Call

GP on call