RAL Tier1 weekly operations castor 14/06/2019
From GridPP Wiki
Parent article: https://www.gridpp.ac.uk/wiki/RAL_Tier1_weekly_operations_castor
Latest revision as of 10:22, 14 June 2019
Standing agenda
1. Achievements this week
2. Problems encountered this week
3. What are we planning to do next week?
4. Long-term project updates (if not already covered)
5. Special topics
6. Actions
7. Review Fabric tasks
1. Link
8. AoTechnicalB
9. Availability for next week
10. On-Call
11. AoOtherB
Achievements this week
- LHCb decommissioning ongoing.
- Odd behaviour of xrdadler32 for files that have become corrupt on disk explained.
  - If there's a checksum in the extended attributes, it will return that instead of actually checksumming the file.
- Migrated CASTOR gridmap-files generation away from castor-functional-test1 onto a system.
- Decommissioned VCert2
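The xrdadler32 behaviour noted above means a cached checksum can hide on-disk corruption. A minimal sketch of an independent check follows: it recomputes Adler-32 from the file's bytes and compares the result with a cached value supplied by the caller. The function names are hypothetical, and reading the cached value out of the extended attributes is deliberately left out because the attribute name is not given in these notes.

```python
import zlib


def adler32_of_file(path, chunk_size=1 << 20):
    """Compute Adler-32 from the bytes on disk, ignoring any cached value."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # Feed the running value back in so the file is read in chunks.
            value = zlib.adler32(chunk, value)
    return format(value & 0xFFFFFFFF, "08x")


def cached_checksum_is_stale(path, cached_hex):
    """Return True when the cached checksum no longer matches the file."""
    return adler32_of_file(path) != cached_hex.lower().zfill(8)
```

If `cached_checksum_is_stale` returns True for a file whose cached checksum was correct when written, the file content has changed since the checksum was stored.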
Operation problems
- ATLAS are periodically submitting SAM tests that impact availability and cause pointless callouts. Currently with TA.
- Load issue on wlcgTape.
  - Thought to be due to the actions of a hyperactive T2K user.
Plans for next few weeks
- Proposal: Move VCert into the Facilities domain so we have a facilities test instance.
- Castor tape testing to continue after the production tape robot networking is installed
- Decommissioned lhcbDst; hardware awaiting retirement.
- Kevin has done some storageD functional tests with the new tape robot
  - He has a plan for CEDA to do this testing.
Long-term projects
- New CASTOR WLCGTape instance.
- Migration of name server to VMs on 2.1.17-xx is waiting until aliceDisk is decommissioned.
- CASTOR disk server migration to Aquilon.
- Need to work with Fabric to get a stress test (see above)
- SL7 VM headnodes need changes to their personalities for the facilities.
- SL7 headnodes are being tested by GP
- Implementing DUNE on Spectralogic robot is paused.
- Migrate VCert to VMWare.
Actions
- AD wants us to make sure that experiments cannot write to the part of the namespace that was used for d1t0 data: namespace cleanup/deletion of empty dirs.
  - Some discussion about what exactly is required and how this can actually be implemented.
  - CASTOR team proposal is either:
    - to switch all of these directories to a fileclass with a requirement for a tape copy but no migration route; this will cause an error whenever any writes are attempted.
    - to run a recursive nschmod on all the unneeded directories to make them read-only.
  - CASTOR team split over the correct approach.
- Problem with functional test node using a personal proxy which expires some time in July.
  - RA met with JJ, requested an appropriate certificate.
  - Follow up with JJ or ST next week.
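The read-only option in the namespace action above can be illustrated with a local analogue. This is only a sketch on an ordinary POSIX filesystem using Python's standard library; it is not the CASTOR name server and does not call nschmod. The function name `make_tree_read_only` is hypothetical.

```python
import os
import stat

# All three write permission bits: user, group and other.
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def make_tree_read_only(root):
    """Clear the write bits on root and every directory beneath it.

    Returns the list of directories whose mode was changed.
    """
    changed = []
    # Bottom-up walk: children are visited before their parents.
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        mode = stat.S_IMODE(os.stat(dirpath).st_mode)
        new_mode = mode & ~WRITE_BITS
        if new_mode != mode:
            os.chmod(dirpath, new_mode)
            changed.append(dirpath)
    return changed
```

Only directory modes are touched here, matching the stated aim of stopping new writes into the unneeded directories; read and execute bits are kept so the tree can still be listed.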
Staffing
- GP is back on the 24th.
AoB
- CMS user chasing slow tape recall; user has been mollified.
- BD requests an inventory of CASTOR HyperV machines.
  - Machines with a good reason to be VMWare should go there.
  - Otherwise Cloudify.
On Call
RA on call