RAL Tier1 weekly operations castor 08/03/2019

Parent article: https://www.gridpp.ac.uk/wiki/RAL_Tier1_weekly_operations_castor

Standing agenda

1. Problems encountered this week

2. Upgrades/improvements made this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

5. Special topics

6. Actions

7. Review Fabric tasks

  1.   Link

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Operation problems

  • The Great T2K Mystery
    • Some T2K files wouldn't recall
    • They were in fileclass 1 (mapped to 'default' in the nameserver)
    • The problem was two fileclasses in the stager sharing id=1, which confused the recall logic (see the sketch after this list).
    • Fixed; the problem is gone and the GGUS ticket has been resolved.
  • No disk problems.
  • ATLAS are periodically submitting silly SAM tests that impact availability and cause pointless callouts.
    • Rob to have a discussion with Tim about this.
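
A minimal sketch of the kind of check that would catch this again, assuming read access to the stager Oracle schema via cx_Oracle and a FILECLASS table with ID and NAME columns; the DSN, credentials and schema details below are assumptions for illustration, not taken from these minutes.

# Sketch: flag any fileclass id defined more than once in the stager DB.
# Table/column names, DSN and credentials are assumptions, not the real schema.
import cx_Oracle

QUERY = """
    SELECT id, COUNT(*) AS n_defs,
           LISTAGG(name, ', ') WITHIN GROUP (ORDER BY name) AS names
      FROM fileclass
     GROUP BY id
    HAVING COUNT(*) > 1
"""

def duplicate_fileclass_ids(dsn, user, password):
    """Return (id, n_defs, names) rows for every fileclass id defined more than once."""
    conn = cx_Oracle.connect(user=user, password=password, dsn=dsn)
    try:
        cur = conn.cursor()
        cur.execute(QUERY)
        return cur.fetchall()
    finally:
        conn.close()

if __name__ == "__main__":
    # Placeholder connection details.
    for fc_id, n_defs, names in duplicate_fileclass_ids("stager-db", "castor_read", "secret"):
        print(f"fileclass id {fc_id} defined {n_defs} times: {names}")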

Operation news

  • New facd0t1 disk servers
    • Firmware needed patching
    • Certificates ready
    • 3 are ready to go into production on Monday.
    • 7 are still down, waiting on the Fabric team.
  • New Facilities DB environment ready.
  • Facilities headnodes have been requested on VMware; the ticket has not been actioned yet.


Plans for next few weeks

  • Examine further standardisation of CASTOR pool settings.
    • CASTOR team to generate a list of nonstandard settings and consider whether they are justified (see the sketch after this list).
  • Replacement of Facilities CASTOR d0t1 ingest nodes
    • Now ready to deploy once the Fabric team are ready.
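
To illustrate what that list might look like, here is a rough sketch that compares per-pool settings against an agreed baseline and reports every deviation; the pool names, setting names and values are invented for illustration, and the real values would come from the CASTOR configuration rather than being hard-coded.

# Sketch: report pool settings that deviate from an agreed baseline.
# All pool names, setting names and values here are hypothetical examples.

BASELINE = {"gcpolicy": "lru", "maxreplicanb": 1, "migrationpolicy": "default"}

POOLS = {
    "exampleTapePool": {"gcpolicy": "lru", "maxreplicanb": 1, "migrationpolicy": "default"},
    "exampleDiskPool": {"gcpolicy": "fifo", "maxreplicanb": 2, "migrationpolicy": "default"},
}

def nonstandard_settings(pools, baseline):
    """Yield (pool, setting, actual, expected) for every deviation from the baseline."""
    for pool, settings in sorted(pools.items()):
        for key, expected in sorted(baseline.items()):
            actual = settings.get(key)
            if actual != expected:
                yield pool, key, actual, expected

for pool, key, actual, expected in nonstandard_settings(POOLS, BASELINE):
    print(f"{pool}: {key} = {actual!r} (baseline {expected!r})")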

Long-term projects

  • New CASTOR WLCGTape instance.
    • The LHCb migration is with LHCb at the moment; they are not blocked. Mirroring of lhcbDst to Echo is complete.
  • CASTOR disk server migration to Aquilon.
    • Change ready to implement.
  • Deadline of end of April to get Facilities moved to generic VM headnodes and 2.1.17 tape servers.
    • Ticket with Fabric team to make the VMs.
  • RA is working with James to sort out the gridmap-file distribution infrastructure and to get a machine with a better name for this than castor-functional-test1 (see the sketch after this list).
  • Bellona (new Facilities DB) migration - monitoring needs fixing.
  • New tape robot
    • New tape server has been installed with CASTOR RPMs
      • Undergoing Fabric-level acceptance testing.
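
For background on the gridmap-file item above: a grid-mapfile is a plain-text list of quoted certificate DNs mapped to local account names, one entry per line. Below is a minimal sanity check that could run before a regenerated file is distributed; the path is a placeholder and the entry format assumed is the common single-account form, so treat it as a sketch rather than a description of our actual setup.

# Sketch: basic sanity check of a grid-mapfile before distribution.
# Expected entry format (assumed):  "<certificate DN>" local_account
# The path passed on the command line is a placeholder, not the real distribution point.
import re
import sys

ENTRY = re.compile(r'^"(?P<dn>/[^"]+)"\s+(?P<account>\S+)$')

def malformed_entries(path):
    """Return (line number, line) pairs for lines that do not look like DN -> account mappings."""
    bad = []
    with open(path) as fh:
        for lineno, line in enumerate(fh, start=1):
            stripped = line.strip()
            if not stripped or stripped.startswith("#"):
                continue  # skip blank lines and comments
            if not ENTRY.match(stripped):
                bad.append((lineno, stripped))
    return bad

if __name__ == "__main__":
    for lineno, entry in malformed_entries(sys.argv[1]):
        print(f"malformed entry at line {lineno}: {entry}")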

Actions

  • AD wants us to make sure that experiments cannot write to the part of the namespace that was used for d1t0 data: namespace cleanup / deletion of empty directories.
    • Some discussion about what exactly is required and how this can actually be implemented.
    • The CASTOR team's proposal is to switch all of these directories to a fileclass that requires a tape copy but has no migration route; this will cause an error whenever a write is attempted.
  • RA to look at making all fileclasses have nbcopies >= 1 (see the fileclass sketch after this list).
  • Problem with the functional test node using a personal proxy which expires some time in July (see the proxy-expiry sketch after this list).
    • Needs converting to a robot certificate.
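
On the fileclass action above, a small sketch of the audit involved: given a dump of fileclass definitions as (name, nbcopies) pairs, flag anything that does not require at least one tape copy. The example entries are hypothetical; in practice the list would come from the nameserver rather than being hard-coded.

# Sketch: flag fileclasses that do not require a tape copy (nbcopies < 1).
# The entries below are illustrative only; a real run would read the
# fileclass definitions from the nameserver instead.

fileclasses = [
    {"name": "default", "nbcopies": 0},   # hypothetical example entry
    {"name": "tape1",   "nbcopies": 1},
    {"name": "dual",    "nbcopies": 2},
]

def classes_without_tape_copy(classes):
    """Return fileclasses whose nbcopies is below 1, i.e. no tape copy is required."""
    return [c for c in classes if c["nbcopies"] < 1]

for c in classes_without_tape_copy(fileclasses):
    print(f"fileclass {c['name']} has nbcopies={c['nbcopies']}; needs nbcopies >= 1")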
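
On the functional-test proxy above, a minimal sketch for tracking when the current personal proxy runs out, using the Python cryptography library; the proxy path is a placeholder. A robot certificate would remove the need for this kind of manual renewal tracking.

# Sketch: report how long the functional-test proxy certificate has left.
# The proxy path is a placeholder, not the node's real proxy location.
from datetime import datetime, timezone
from cryptography import x509

PROXY_PATH = "/tmp/x509up_functional_test"  # hypothetical location

def time_left(path):
    """Return the time remaining before the first certificate in the file expires."""
    with open(path, "rb") as fh:
        cert = x509.load_pem_x509_certificate(fh.read())
    expiry = cert.not_valid_after.replace(tzinfo=timezone.utc)
    return expiry - datetime.now(timezone.utc)

if __name__ == "__main__":
    remaining = time_left(PROXY_PATH)
    print(f"proxy expires in {remaining.days} days")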

Staffing

  • JK out Tues/Weds next week

AoB

On Call

RA on call