RAL Tier1 weekly operations castor 19/07/2019
From GridPP Wiki
[https://www.gridpp.ac.uk/wiki/RAL_Tier1_weekly_operations_castor Parent article]
Latest revision as of 10:02, 19 July 2019
Standing agenda
1. Achievements this week
2. Problems encountered this week
3. What are we planning to do next week?
4. Long-term project updates (if not already covered)
5. Special topics
6. Actions
7. Review Fabric tasks
1. Link
8. AoTechnicalB
9. Availability for next week
10. On-Call
11. AoOtherB
Achievements this week
- Deleted remaining contents of lhcbDst.
- New Facilities headnodes on VMware have been tested in VCert and work for Diamond.
- Comparative testing of SL6 and SL7 disk servers using IOZONE is ongoing.
Operation problems
- Facilities CASTOR DB (Bellona) has one RAC node out of production, being worked on by Fabric.
- Facilities tape drives flapping a lot
- Also some robot hardware issues.
- CMS Rucio trouble
- SURLs with double slashes do not work for CMS writes using GFAL.
- This resembles an old CASTOR bug we encountered where double slashes would break transfers.
- A temporary fix was applied some time ago using Shaun's 'double-slash to single slash' SRM trigger.
- Giuseppe later fixed the underlying bug properly (or so we thought).
- We tried reapplying Shaun's trigger to wlcgTape, but it did not help.
- Investigations will continue; the CMS Rucio configuration will be compared with ATLAS's.
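The double-slash symptom can be illustrated with a small sketch of the path normalisation that Shaun's SRM trigger effectively performed on the server side. This is not the actual trigger code (which ran inside the database); the function name and example SURL are hypothetical.

```python
# Illustrative sketch only: collapse repeated '/' in the path part of a SURL,
# while leaving the 'scheme://' separator intact. Hostname below is made up.
import re

def normalise_surl(surl: str) -> str:
    """Collapse runs of '/' after the scheme, e.g. '//castor//x' -> '/castor/x'."""
    scheme, sep, rest = surl.partition("://")
    if not sep:  # no scheme present; treat the whole string as a path
        return re.sub(r"/{2,}", "/", surl)
    return scheme + sep + re.sub(r"/{2,}", "/", rest)

print(normalise_surl("srm://example-srm.rl.ac.uk//castor//prod/file"))
# srm://example-srm.rl.ac.uk/castor/prod/file
```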
Plans for next few weeks
- Sorting out xrootd functional test
- Plan to create and destroy the robot proxy every time we run the test.
- Kernel upgrade for SL6 disk servers
- No specific issue, but it has not been done in a while.
- Facilities servers to be done on Wednesday.
- Decommission lhcbDst hardware.
- Brian C is currently testing StorageD/ET on the new robot
- Replace Facilities headnodes with VMs.
- Waiting until Kevin is back from holiday.
- Scheduled for the 30th July.
- Snoplus is migrating to LFC; this will probably not require anything from us, but it might.
Long-term projects
- New CASTOR disk servers currently with Martin.
- Migration of name server to VMs on 2.1.17-xx is waiting until aliceDisk is decommissioned.
- CASTOR disk server migration to Aquilon.
- Agreed a testing plan with Fabric
- Facilities headnode replacement:
- SL7 VM headnodes are being tested
- Turn VCert into a facilities test instance.
- Migrate CASTOR to Telegraf/Influx/Grafana (aka TIG)
Actions
- AD wants us to ensure that experiments cannot write to the parts of the namespace that were used for d1t0 data: namespace cleanup and deletion of empty directories.
- There has been some discussion about exactly what is required and how it can actually be implemented.
- The CASTOR team proposes either:
- switching all of these directories to a fileclass that requires a tape copy but has no migration route, which will cause an error whenever a write is attempted; or
- running a recursive nschmod on all the unneeded directories to make them read-only.
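The second option could be scripted along these lines. This is a hedged dry-run sketch, not the agreed plan: the directory paths are invented, and it assumes the list of unneeded directories has already been gathered from a recursive namespace listing and that `nschmod` accepts a symbolic mode like `ugo-w`.

```python
# Dry-run sketch of the 'recursive nschmod' proposal: build the nschmod
# commands that would strip write permission, executing nothing by default.
import subprocess

def make_read_only(dirs, dry_run=True):
    """Return the nschmod command lines; run them only when dry_run=False."""
    cmds = [["nschmod", "ugo-w", d] for d in dirs]
    if not dry_run:
        for cmd in cmds:
            # Would require the CASTOR client tools to be installed.
            subprocess.run(cmd, check=True)
    return cmds

# Hypothetical d1t0 namespace paths, for illustration:
for cmd in make_read_only(["/castor/prod/oldD1T0/a", "/castor/prod/oldD1T0/b"]):
    print(" ".join(cmd))
```

Keeping the command list separate from execution lets the plan be reviewed before anything is changed in the namespace.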
Staffing
- Everybody is in.
AoB
On Call
GP on Call