RAL Tier1 weekly operations castor 10/05/2019

From GridPP Wiki

Latest revision as of 10:27, 10 May 2019

Parent article: https://www.gridpp.ac.uk/wiki/RAL_Tier1_weekly_operations_castor

Standing agenda

1. Achievements this week

2. Problems encountered this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

5. Special topics

6. Actions

7. Review Fabric tasks

  1.   Link

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Achievements this week

  • LHCb fully migrated to Echo :D :D :D
  • Facilities headnodes have been created and have inventory personalities.
    • Agreed that GP and CP will produce a plan on how to do this.
  • Aquilon disk servers ready to go, also queued behind tape robot
    • Designing a stress test based on CC meeting (IOZone on SL6, IOZone on SL7, compare)
  • DUNE have been set up on WLCGTape

Operation problems

  • T2K issues with finding files on tape (GGUS 140870) - Currently with Alastair
  • ATLAS are periodically submitting SAM tests that impact availability and cause pointless callouts - Currently with TA
  • Nagios alert for subject alt names on stagers needs updating. With production team.

Plans for next few weeks

  • Examine further standardisation of CASTOR pool settings.
    • CASTOR team to generate a list of nonstandard settings and consider whether they are justified.
  • CASTOR tape testing to continue after the production tape robot networking is installed
  • Decommission lhcbDst then move them to wlcgTape
    • Consult with Raja on decommissioning.

Long-term projects

  • New CASTOR WLCGTape instance.
    • Migration of name server to VMs on 2.1.17-xx is waiting until aliceDisk is decommissioned.
  • CASTOR disk server migration to Aquilon.
    • Need to work with Fabric to get a stress test (see above)
  • The problem of castor-functional-test1 has been absorbed into the task of sorting out worker node grid-mapfile generation and distribution.
    • RA to make a VM for James

Actions

  • AD wants us to make sure that experiments cannot write to the part of the namespace that was used for d1t0 data: namespace cleanup/deletion of empty dirs.
    • Some discussion about what exactly is required and how this can be actually implemented.
    • CASTOR team proposal is either:
      • to switch all of these directories to a fileclass that requires a tape copy but has no migration route, so that any attempted write causes an error; or
      • to run a recursive nschmod on all the unneeded directories to make them read-only.
    • CASTOR team is split over the correct approach.
  • Functional test node is using a personal proxy that expires some time in July.
    • RA met with JJ, requested an appropriate certificate.
    • GP to follow up with JJ
  • RA and DM to sit down to sort out storage metric question
    • Plan to create new metrics for GridPP6.
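
For the namespace action above, the read-only option amounts to clearing the write bits on each unneeded directory. The following is a minimal sketch of that arithmetic only, not an agreed procedure: it assumes nschmod takes an octal absolute mode followed by a path, the paths and starting modes are hypothetical, and it only prints the commands rather than running them.

```python
from __future__ import annotations

import stat

# Write permission for user, group and other (0o222).
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def read_only_mode(mode: int) -> int:
    """Return the permission bits of `mode` with all write bits cleared."""
    return mode & ~WRITE_BITS & 0o7777


def plan_chmod(dirs: dict[str, int]) -> list[str]:
    """Build (but do not run) one command line per directory that needs changing.

    `dirs` maps a namespace path to its current mode. The command format
    "nschmod <octal mode> <path>" is an assumption made for this sketch.
    """
    commands = []
    for path, mode in sorted(dirs.items()):
        new_mode = read_only_mode(mode)
        if new_mode != mode:  # skip directories that are already read-only
            commands.append(f"nschmod {new_mode:o} {path}")
    return commands


if __name__ == "__main__":
    # Hypothetical directories and modes, for illustration only.
    example = {
        "/castor/example.org/vo/d1t0": 0o775,
        "/castor/example.org/vo/d1t0/old": 0o755,
    }
    for line in plan_chmod(example):
        print(line)
```

Printing the plan first keeps the change reviewable before anything touches the name server; the fileclass option would need no per-directory mode arithmetic at all.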

Staffing

  • Rob out for next three weeks on A/L
  • GP out next week for an ALICE meeting.

AoB

On Call

GP on call.