RAL Tier1 weekly operations castor 08/03/2019
Parent article: https://www.gridpp.ac.uk/wiki/RAL_Tier1_weekly_operations_castor
Standing agenda
1. Problems encountered this week
2. Upgrades/improvements made this week
3. What are we planning to do next week?
4. Long-term project updates (if not already covered)
5. Special topics
6. Actions
7. Review Fabric tasks
1. Link
8. AoTechnicalB
9. Availability for next week
10. On-Call
11. AoOtherB
Operation problems
- The Great T2K Mystery
- Some T2K files wouldn't recall
- They were in fileclass 1 (NS mapped to 'default')
- Root cause: two fileclasses in the stager DB shared id=1, which confused the recall machinery (see the query sketch after this list).
- Fixed, problem gone, GGUS resolved.
- No disk problems.
- ATLAS are periodically submitting silly SAM tests that impact availability and cause pointless callouts.
- Rob to have a discussion with Tim about this.
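
For the record, a minimal diagnostic sketch for the duplicate-fileclass check above, assuming (unverified) that the stager schema has a FileClass table with an id column and that cx_Oracle plus read-only credentials are available; all names here are illustrative.

  # Hedged sketch: flag any fileclass id that appears more than once in the
  # stager DB -- the T2K recall failures were traced to two rows sharing id=1.
  # Credentials/DSN are placeholders, not real values.
  import cx_Oracle

  conn = cx_Oracle.connect("stager_ro", "********", "stagerdb")
  cur = conn.cursor()
  cur.execute("""
      SELECT id, COUNT(*)
        FROM FileClass
       GROUP BY id
      HAVING COUNT(*) > 1
  """)
  for fc_id, n in cur:
      print("duplicate fileclass id %s occurs %d times" % (fc_id, n))
  conn.close()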
Operation news
- New facd0t1 disk servers
- Firmware needed patching
- Certificates ready
- 3 are ready to go into production on Monday.
- 7 are still down, waiting on the Fabric team
- New Facilities DB environment ready.
- Facilities headnodes requested on VMware; the ticket has not been actioned yet.
Plans for next few weeks
- Examine further standardisation of CASTOR pool settings.
- CASTOR team to generate a list of nonstandard settings and consider whether they are justified (see the listing sketch below).
- Replacement of Facilities CASTOR d0t1 ingest nodes
- Now ready to deploy once Fabric team are ready.
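
A possible starting point for the nonstandard-settings list, assuming (unverified) that per-service-class settings live in a SvcClass table with maxReplicaNb and gcPolicy columns; the baseline values are placeholders for whatever the team agrees is standard.

  # Hedged sketch: report service classes whose settings deviate from an
  # agreed baseline. Table and column names are assumptions about the
  # stager schema, not confirmed.
  import cx_Oracle

  BASELINE = {"maxReplicaNb": 1, "gcPolicy": "default"}  # assumed standards

  conn = cx_Oracle.connect("stager_ro", "********", "stagerdb")
  cur = conn.cursor()
  cur.execute("SELECT name, maxReplicaNb, gcPolicy FROM SvcClass")
  for name, max_replica, gc_policy in cur:
      found = {"maxReplicaNb": max_replica, "gcPolicy": gc_policy}
      diffs = {k: v for k, v in found.items() if v != BASELINE[k]}
      if diffs:
          print(name, "deviates:", diffs)
  conn.close()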
Long-term projects
- New CASTOR WLCGTape instance.
- LHCb migration is with LHCb at the moment; they are not blocked. Mirroring of lhcbDst to Echo is complete.
- CASTOR disk server migration to Aquilon.
- Change ready to implement.
- Deadline of end of April to get Facilities moved to generic VM headnodes and 2.1.17 tape servers.
- Ticket with Fabric team to make the VMs.
- RA working with James to sort out the gridmap-file distribution infrastructure and to get a machine with a better name for this than castor-functional-test1.
- Bellona (new Facilities DB) migration - monitoring needs fixing.
- New tape robot
- New tape server has been installed with CASTOR RPMs
- Undergoing Fabric-level acceptance testing.
Actions
- AD wants us to make sure that experiments cannot write to the parts of the namespace that were used for d1t0 data: namespace cleanup/deletion of empty directories.
- Some discussion about what exactly is required and how this can be actually implemented.
- CASTOR team proposal is to switch all of these directories to a fileclass that requires a tape copy but has no migration route; this will cause an error whenever a write is attempted (see the sketch after the Actions list).
- RA to look at making all fileclasses have nbcopies >= 1.
- Problem with functional test node using a personal proxy which runs out some time in July.
- Needs converting to a robot certificate (see the proxy-renewal sketch after this list)
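
As a sketch of the fileclass proposal above: walk a prepared list of the old d1t0 directories and re-class them. The class name, the list file, and the "nschclass <class> <dir>" invocation are assumptions to check against the local CASTOR setup; this is illustrative, not a tested procedure.

  # Hedged sketch: point legacy d1t0 directories at a fileclass that
  # requires a tape copy but has no migration route, so writes fail.
  import subprocess

  BLOCKING_CLASS = "no-migration"        # hypothetical fileclass name
  DIR_LIST = "/tmp/empty_d1t0_dirs.txt"  # one namespace path per line

  with open(DIR_LIST) as f:
      for path in (line.strip() for line in f):
          if not path:
              continue
          # nschclass sets the directory's fileclass; new files inherit it,
          # and with no migration route any write should then error out.
          subprocess.run(["nschclass", BLOCKING_CLASS, path], check=True)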
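
And a sketch of the robot-certificate fix: renew the functional-test proxy from a robot credential under cron instead of the personal proxy that expires in July. Paths, the VO name, and the proxy lifetime are illustrative assumptions.

  # Hedged sketch: regenerate the functional-test proxy from a robot
  # certificate; run daily from cron so the proxy never goes stale.
  import subprocess

  ROBOT_CERT = "/etc/grid-security/robot/robotcert.pem"   # hypothetical path
  ROBOT_KEY = "/etc/grid-security/robot/robotkey.pem"     # hypothetical path
  PROXY_OUT = "/var/spool/castor-tests/x509up_functest"   # hypothetical path

  subprocess.run([
      "voms-proxy-init",
      "-cert", ROBOT_CERT,
      "-key", ROBOT_KEY,
      "-voms", "dteam",    # assumed VO for the functional tests
      "-valid", "24:00",   # renewed daily, so 24h is plenty
      "-out", PROXY_OUT,
  ], check=True)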
Staffing
- JK out Tues/Weds next week
AoB
On Call
RA on call