RAL Tier1 weekly operations castor 01/02/2019

Standing agenda

1. Problems encountered this week

2. Upgrades/improvements made this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

5. Special topics

6. Actions

7. Review Fabric tasks

  1.   Link

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Operation problems

  * The issue of CASTOR marking unmounted tapes as BUSY recurred. A migration backlog
    built up but cleared soon after Tim removed the BUSY flag.
  * gdss739 and gdss811 (lhcbDst) failed and were removed from production.
  * About 196,000 na62 files associated with the disk-only file class (na62-tape0) were found in the WLCGTape cache.
  * GC on WLCGTape is not working properly, and as a result the tape buffer ran out of space.
    Had to initiate deletion of 581,587 ATLAS files (see elog)
    as a temporary mitigation.

Operation news

  * New facd0t1 disk servers
     * Ready to go into production from a CASTOR perspective
     * Decommissioning tickets created for all machines occupying the rack (rack 214, row 2) in which
       these disk servers will be placed.
     * Final deployment is waiting on the Fabric team to remove the decommissioned machines from the machine room.
  * Decision to be made about adding excess C tape drives to Facilities.
     * Some have been added, but development work would be needed to add the remainder due to a shortage of servers.
     * The C drives are also running out of maintenance.
     * Action on Chris to discuss with Alison.

Plans for next few weeks

  * Replacement of Facilities CASTOR d0t1 ingest nodes.
     * Agreed to install these with only 1 CASTOR partition. This worked.
     * Now ready to deploy.
  * Migrate ALICE to WLCGTape

Long-term projects

  * New CASTOR WLCGTape instance. Outstanding task: create a separate xrootd redirector for ALICE.
     * The redirector is attached to the WLCGTape instance and passes all ALICE external tests.
  * CASTOR disk server migration to Aquilon: gdss742 has been compiled with a draft Aquilon profile,
    but there are problems with the SL7 installation (RT216885).
     * Action is with the Fabric team to sort out the SL7 installation. Possible hardware issue; a new, working node has been requested.
  * Need to discuss monitoring/metrics for Aquilonised disk servers. GP to try, follow up with JK.
  * Various bits of monitoring/stats are broken for WLCGTape; these need further investigation.
     * Discussion about the Tier 1 front page. Agreed to try changing the URL on the page to see if that produces the correct graph. Noted that Ganglia is going away and we don't want to spend too much time on this (action with production team).

Actions

  * AD wants us to make sure that experiments cannot write to the part of the namespace that was used for d1t0 data: namespace cleanup/deletion of empty dirs. Some discussion about what exactly is required and how this can actually be implemented.
  * CASTOR team proposal is to switch all of these directories to a fileclass with a requirement for a tape copy but no migration route; this will cause an error whenever any writes are attempted.

Staffing

  * All in

On Call

RA on call