RAL Tier1 weekly operations castor 26/04/2019

Parent article

Standing agenda

1. Achievements this week

2. Problems encountered this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

5. Special topics

6. Actions

7. Review Fabric tasks

  1.   Link

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Achievements this week

  • Facilities headnodes requested on VMware; the ticket is not done yet, as the Facilities VMware cluster is still under construction.
    • We are willing to accept delays on this until ~May.
    • Queued behind the tape robot work and a number of Diamond ICAT tasks.
  • Aquilon disk servers are ready to go, but are also queued behind the tape robot work.
  • New Spectra tape robot:
    • Finalised the configuration for the tape servers.
  • Produced a set of statistics on CASTOR ingest rates.

Operational problems

  • gdss738 (lhcbDst) failed; it is back in production in read-only mode.
  • gdss811 (lhcbDst) returned to production in the pre-Easter week with an HDD for the OS instead of an SSD.
  • T2K reported issues with finding files on tape (GGUS 140870).
  • ATLAS are periodically submitting SAM tests that impact availability and cause unnecessary callouts.
    • TA has updated the ticket, indicating he will raise the issue with the appropriate people.
  • LHCb raised an issue with xroot access to lhcbUser; it is believed to be resolved now.

Plans for the next few weeks

  • Examine further standardisation of CASTOR pool settings.
    • CASTOR team to generate a list of nonstandard settings and consider whether each is justified (see the sketch after this list).
  • CASTOR tape testing to continue after the production tape robot networking is installed.
  • Test preprod against Bellona (RT223698).
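
As an illustration of the pool-settings item above, here is a minimal sketch of how the nonstandard-settings report could be produced, assuming the per-pool settings have already been collected into Python dicts (how they are fetched from the stager is left out). The pool names, setting keys, and "standard" values below are hypothetical, not the real ones.

 # Illustrative only: pool names, keys, and 'standard' values are
 # hypothetical; collecting the per-pool settings is left out.
 STANDARD = {"gc_policy": "lru", "max_replicas": 1}

 pools = {
     "lhcbDst": {"gc_policy": "lru", "max_replicas": 1},
     "genTape": {"gc_policy": "fifo", "max_replicas": 2},
 }

 def nonstandard(pools, standard):
     # Yield every (pool, setting, actual, expected) deviation.
     for pool, settings in sorted(pools.items()):
         for key, expected in sorted(standard.items()):
             actual = settings.get(key)
             if actual != expected:
                 yield pool, key, actual, expected

 for pool, key, actual, expected in nonstandard(pools, STANDARD):
     print("%s: %s = %r (standard: %r)" % (pool, key, actual, expected))

Each line of output is one deviation to review, which gives the team a concrete list to judge as justified or not.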

Long-term projects

  • New CASTOR WLCGTape instance.
    • The LHCb migration to Echo is in progress, accelerated by failing CASTOR disk servers.
  • CASTOR disk server migration to Aquilon.
    • Need to work with Fabric to agree a realistic timescale.
  • Deadline of the end of April to move Facilities to generic VM headnodes and 2.1.17 tape servers.
    • A ticket is open with the Fabric team to create the VMs.
  • The castor-functional-test1 problem has been absorbed into the task of sorting out worker node grid-mapfile generation and distribution.

Actions

  • AD wants us to make sure that experiments cannot write to the parts of the namespace that were used for d1t0 data: namespace cleanup and deletion of empty directories.
    • There was some discussion about what exactly is required and how it can actually be implemented.
    • The CASTOR team's proposal is to either:
      • switch all of these directories to a fileclass with a requirement for a tape copy but no migration route; this causes an error whenever a write is attempted; or
      • run a recursive nschmod on all the unneeded directories to make them read-only.
      • The CASTOR team is split over the correct approach (see the first sketch after this list).
  • Problem with the functional test node using a personal proxy which expires some time in July (see the second sketch after this list).
    • RA met with JJ and requested an appropriate certificate.
  • RA and DM to sit down and sort out the storage metrics question.
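
First, a minimal sketch of the two namespace lockdown options proposed above, assuming the CASTOR name server client tools nschclass and nschmod are on PATH on an admin node, that the target directories have already been enumerated one per line in a file (the recursive expansion, e.g. via nsls, is assumed to have been done when that list was generated), and that "NoMigrationTape" is a hypothetical fileclass name with a tape-copy requirement but no migration route.

 #!/usr/bin/env python
 # Sketch of the two lockdown options from the Actions list.
 # Assumptions (not confirmed in these minutes): nschclass/nschmod are
 # on PATH; 'NoMigrationTape' is a hypothetical fileclass name; the
 # directory list file already contains every directory to lock.
 import subprocess
 import sys

 def lock_with_fileclass(path, fileclass="NoMigrationTape"):
     # Option 1: a fileclass requiring a tape copy but with no
     # migration route makes any subsequent write fail with an error.
     subprocess.check_call(["nschclass", fileclass, path])

 def lock_with_chmod(path):
     # Option 2: drop write permission entirely (555 = r-xr-xr-x).
     subprocess.check_call(["nschmod", "555", path])

 if __name__ == "__main__":
     mode, dirlist = sys.argv[1], sys.argv[2]  # 'fileclass'|'chmod'
     for line in open(dirlist):
         path = line.strip()
         if not path:
             continue
         if mode == "fileclass":
             lock_with_fileclass(path)
         else:
             lock_with_chmod(path)

One possible trade-off between the two: the fileclass route fails writes with an explicit error and can be reverted per directory with another nschclass call, while the chmod route relies on permission bits and has to touch every directory in the tree.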
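Second, for the expiring-proxy action, a small sketch of a pre-expiry check that could run from cron on the functional test node, assuming voms-proxy-info is available there and that its --timeleft option prints the remaining validity in seconds; the one-week threshold is arbitrary.

 #!/usr/bin/env python
 # Pre-expiry warning for the functional test proxy (a hypothetical
 # helper, not an existing tool). Assumes voms-proxy-info is on PATH
 # and --timeleft prints the remaining validity in seconds.
 import subprocess
 import sys

 WARN_THRESHOLD = 7 * 24 * 3600  # warn when less than a week remains

 def proxy_seconds_left():
     out = subprocess.check_output(["voms-proxy-info", "--timeleft"])
     return int(out.decode().strip())

 if __name__ == "__main__":
     left = proxy_seconds_left()
     if left < WARN_THRESHOLD:
         print("WARNING: proxy has %d hours left" % (left // 3600))
         sys.exit(1)  # non-zero exit so cron/monitoring can flag it
     print("OK: proxy valid for %d more days" % (left // 86400))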

Staffing

AoB

On Call

RA on call.