Difference between revisions of "RAL Tier1 weekly operations castor 14/06/2019"

From GridPP Wiki

Revision as of 10:18, 14 June 2019

Parent article: https://www.gridpp.ac.uk/wiki/RAL_Tier1_weekly_operations_castor

Standing agenda

1. Achievements this week

2. Problems encountered this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

5. Special topics

6. Actions

7. Review Fabric tasks

  1.   Link

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Achievements this week

  • LHCb decommissioning ongoing.
  • Odd behaviour of xrdadler32 for files that have become corrupt on disk explained.
    • If there's a checksum in the extended attributes, it will return that instead of actually checksumming the file.
  • Migrated CASTOR gridmap-files generation away from castor-functional-test1 onto a system.
  • Decommissioned VCert2
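
The xrdadler32 finding above can be illustrated with a minimal sketch: a checksum cached alongside a file keeps its old value when the data on disk changes, so only re-reading the file reveals the corruption. This is not how xrdadler32 is implemented. The function names are hypothetical, and the cached value is passed in as a plain hex string rather than read from an extended attribute, since the attribute name and encoding are not given in these minutes.

```python
import zlib


def adler32_of_file(path, chunk_size=1 << 20):
    """Compute Adler-32 by actually reading the file contents."""
    value = 1  # Adler-32 is defined to start from 1
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Feed each chunk in, carrying the running checksum forward.
            value = zlib.adler32(chunk, value)
    return format(value & 0xFFFFFFFF, "08x")


def cached_checksum_is_stale(path, cached_hex):
    """Return True if the cached checksum no longer matches the data on disk.

    cached_hex stands in for whatever value was stored with the file
    (an assumption for this sketch; it is supplied by the caller).
    """
    return adler32_of_file(path) != cached_hex.lower()
```

A check like this forces a full read of the file, which is the step that is skipped when a stored checksum is returned directly.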

Operation problems

  • ATLAS are periodically submitting SAM tests that impact availability and cause pointless callouts. This is currently with TA.
  • Load issue on wlcgTape.
    • Thought to be due to the actions of a hyperactive T2K user.

Plans for next few weeks

  • Proposal: Move VCert into the Facilities domain so we have a facilities test instance.
  • Castor tape testing to continue after the production tape robot networking is installed
  • Decommissioned lhcbDst; hardware awaiting retirement.
  • Kevin has done some storageD functional tests with the new tape robot
    • He has a plan for CEDA to do this testing.

Long-term projects

  • New CASTOR WLCGTape instance.
    • Migration of name server to VMs on 2.1.17-xx is waiting until aliceDisk is decommissioned.
  • CASTOR disk server migration to Aquilon.
    • Need to work with Fabric to get a stress test (see above)
  • SL7 VM headnodes need changes to their personalities for the facilities.
    • SL7 headnodes are being tested by GP
  • Implementing DUNE on Spectralogic robot is paused.
  • Migrate VCert to VMware.

Actions

  • AD wants us to make sure that experiments cannot write to the part of the namespace that was used for d1t0 data: namespace cleanup/deletion of empty dirs.
    • Some discussion about what exactly is required and how this can be actually implemented.
    • The CASTOR team proposes one of two options:
      • Switch all of these directories to a fileclass that requires a tape copy but has no migration route; any attempted write will then fail with an error.
      • Run a recursive nschmod on all the unneeded directories to make them read-only.
    • The CASTOR team is split over the correct approach.
  • Problem with the functional test node using a personal proxy, which expires some time in July.
    • RA met with JJ and requested an appropriate certificate.
    • Follow up with JJ or ST next week.

Staffing

  • GP is back on the 24th.

AoB

  • CMS user chasing slow tape recall; user has been mollified.

On Call

RA on call