Difference between revisions of "RAL Tier1 weekly operations castor 19/07/2019"

From GridPP Wiki
Latest revision as of 10:02, 19 July 2019

Parent article: https://www.gridpp.ac.uk/wiki/RAL_Tier1_weekly_operations_castor

Standing agenda

1. Achievements this week

2. Problems encountered this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

5. Special topics

6. Actions

7. Review Fabric tasks

  1.   Link

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Achievements this week

  • Deleted remaining contents of lhcbDst.
  • New Facilities headnodes on VMware have been tested in VCert and work for Diamond.
  • Comparative testing of SL6 and SL7 disk servers using IOZONE is ongoing.

Operation problems

  • Facilities CASTOR DB (Bellona) has one RAC node out of production, being worked on by Fabric.
  • Facilities tape drives flapping a lot
    • Also some robot hardware issues.
  • CMS Rucio trouble
    • SURLs with double slashes don't work for CMS writing using GFAL.
    • This resembles an old CASTOR bug where double slashes would break transfers.
      • It was temporarily fixed long ago using Shaun's 'double-slash to single-slash' SRM trigger.
      • Giuseppe later fixed it properly (or so we thought).
      • We tried reapplying Shaun's trigger to wlcgTape, but it didn't help.
    • Investigations will continue. Next step: compare the CMS Rucio config with the ATLAS one.
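
For reference while investigating, the kind of rewrite a 'double-slash to single-slash' fix performs can be sketched in Python. This is only an illustration, not Shaun's SRM trigger or Giuseppe's fix: the function name and the example host and paths are hypothetical, and it simply collapses repeated slashes in the path and query of an SURL while leaving the `srm://` scheme separator alone.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def collapse_slashes(surl: str) -> str:
    """Collapse runs of '/' in the path and query of an SURL.

    Hypothetical helper for illustration only; the '//' that follows
    the scheme is preserved because it is not part of the path.
    """
    parts = urlsplit(surl)
    path = re.sub(r"/{2,}", "/", parts.path)
    query = re.sub(r"/{2,}", "/", parts.query)
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


if __name__ == "__main__":
    # Example SURL with a made-up host and path.
    print(collapse_slashes("srm://srm.example.org:8443//castor/example.org//prod/file"))
```

Running a few failing CMS SURLs through something like this and comparing with what ATLAS submits may help show whether the double slash originates in the Rucio config or further down the stack.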

Plans for next few weeks

  • Sorting out xrootd functional test
    • Plan to create and destroy the robot proxy every time we run the test.
  • Kernel upgrade for SL6 disk servers
    • No specific issue, but hasn't been done in a while.
    • Facilities on Wednesday
  • Decommission lhcbDst hardware.
  • Brian C is currently testing StorageD/ET on the new robot
  • Replace Facilities headnodes with VMs.
    • Waiting until Kevin is back from holiday.
    • Scheduled for the 30th July.
  • Snoplus is migrating to LFC; this probably won't need any action from us, but it might.

Long-term projects

  • New CASTOR disk servers currently with Martin.
  • Migration of name server to VMs on 2.1.17-xx is waiting until aliceDisk is decommissioned.
  • CASTOR disk server migration to Aquilon.
    • Agreed a testing plan with Fabric
  • Facilities headnode replacement:
    • SL7 VM headnodes are being tested
  • Turn VCert into a facilities test instance.
  • Migrate CASTOR to Telegraf/Influx/Grafana (aka TIG)

Actions

  • AD wants us to make sure that experiments cannot write to the part of the namespace that was used for d1t0 data (namespace cleanup and deletion of empty directories).
    • Some discussion about what exactly is required and how this can be actually implemented.
    • CASTOR team proposal is either:
      • to switch all of these directories to a fileclass with a requirement for a tape copy but no migration route; this will cause an error whenever any writes are attempted.
      • to run a recursive nschmod on all the unneeded directories to make them read only.
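
As a rough sketch of the second option, the Python below only builds the command lines for review and does not run anything. It makes several assumptions: that `nschmod` takes an absolute octal mode followed by a path, that mode 555 is an acceptable meaning of "read only" for directories, and that the list of unneeded directories is produced elsewhere. The function name and the example paths are hypothetical.

```python
import shlex


def build_nschmod_commands(directories, mode="555"):
    """Return one 'nschmod <mode> <dir>' command line per directory.

    Assumption: nschmod accepts an absolute octal mode and a path.
    Nothing is executed; the output is meant to be reviewed first.
    """
    return [f"nschmod {mode} {shlex.quote(d)}" for d in directories]


if __name__ == "__main__":
    # Made-up directory names standing in for the old d1t0 namespace.
    for cmd in build_nschmod_commands([
        "/castor/example.org/old_d1t0/dirA",
        "/castor/example.org/old_d1t0/dirB",
    ]):
        print(cmd)
```

Generating the commands as text first gives a list that can be checked against the agreed set of directories before anything is changed in the name server.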

Staffing

  • Everybody in

AoB

On Call

GP on Call