RAL Tier1 weekly operations castor 15/02/2019
From GridPP Wiki
Latest revision as of 16:03, 18 February 2019
Standing agenda
1. Problems encountered this week
2. Upgrades/improvements made this week
3. What are we planning to do next week?
4. Long-term project updates (if not already covered)
5. Special topics
6. Actions
7. Review Fabric tasks
1. Link
8. AoTechnicalB
9. Availability for next week
10. On-Call
11. AoOtherB
Operation problems
- Last week's tape mishap was a side effect of an engineer fixing a minor robot problem; it was resolved by a power cycle.
- Problem following deployment of new diamondRecall disk servers: xrootd keys were left with incorrect ownership.
- Update procedure to include a check for this.
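The ownership check to be added to the procedure could look something like the sketch below. It is illustrative only: the function name is hypothetical, and the actual key paths and the account that should own them are not recorded in these minutes, so both have to be supplied by whoever runs it.

```python
import os
from pathlib import Path


def find_wrong_owner(paths, expected_uid):
    """Return the paths (as strings) whose owning UID differs from expected_uid.

    Hypothetical helper: which files to check and which UID should own them
    are assumptions to be filled in from the real deployment procedure.
    """
    wrong = []
    for path in map(Path, paths):
        # stat() follows symlinks, so a link is judged by its target's owner.
        if path.stat().st_uid != expected_uid:
            wrong.append(str(path))
    return wrong
```

An empty list means every file checked has the expected owner. If the owner is known by account name rather than UID, `pwd.getpwnam(name).pw_uid` from the standard library gives the numeric UID on the node.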
Operation news
- New facd0t1 disk servers
- Ready to go into production from a CASTOR perspective
- Decommissioning tickets created for all machines occupying the rack (rack 214, row 2) in which these disk servers will be placed
- Slow progress being made
- Final deployment is waiting on the Fabric team to remove the decommissioned machines from the machine room.
- Migrated ALICE to WLCGTape.
- Argo tests are failing for srm-alice since the upgrade.
- Exposed that all these tests are pointing at the same thing.
- Action on RA - contact GS to discuss what Argo tests are appropriate.
- Update to tape accounting to give vo-specific readings of recall efficiency.
- Facilities tape busy due to an urgent 600TB recall request from CEDA.
- Decommissioned some 11 generation hardware from preprodTape.
- 11 d1t0 LHCb disk servers need to be physically moved around the machine room. Outline plan agreed with Fabric; intervention date TBD; ~1 day downtime on lhcbDst expected.
- Some d0t1 disk servers also need to move; no downtime needed.
- We will do a kernel roll at the same time for the entire instance.
- Started decommissioning genTape.
Plans for next few weeks
- Examine further standardisation of CASTOR pool settings.
- CASTOR team to generate a list of nonstandard settings and consider whether they are justified.
- Replacement of Facilities CASTOR d0t1 ingest nodes
- Now ready to deploy once Fabric team are ready.
- Outage for LHCb disk server rack change Tuesday 19th.
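For the pool-settings standardisation item above, the list of nonstandard settings could be produced by comparing each pool against an agreed baseline. A minimal sketch, assuming the settings have already been exported into plain dictionaries; the function name, the setting names and the values in the example are all made up for illustration and are not real CASTOR configuration.

```python
def nonstandard_settings(baseline, pools):
    """Map each pool to {setting: (actual, standard)} where it differs from baseline.

    A setting absent on one side is reported with None for that side.
    Pools that match the baseline exactly are left out of the report.
    """
    report = {}
    for pool, settings in pools.items():
        diffs = {
            key: (settings.get(key), baseline.get(key))
            for key in sorted(set(baseline) | set(settings))
            if settings.get(key) != baseline.get(key)
        }
        if diffs:
            report[pool] = diffs
    return report


# Illustrative values only: "gc_policy" and "max_replicas" are hypothetical names.
baseline = {"gc_policy": "default", "max_replicas": 1}
pools = {
    "lhcbDst": {"gc_policy": "default", "max_replicas": 2},
    "genTape": {"gc_policy": "default", "max_replicas": 1},
}
report = nonstandard_settings(baseline, pools)
```

Each entry in the report is then a candidate for the "is this justified?" review.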
Long-term projects
- New CASTOR WLCGTape instance.
- ALICE now using wlcgTape
- Still need to move LHCb
- CASTOR disk server migration to Aquilon.
- Plan to apply the new Aquilon disk server profile and see what happens.
- Negotiating some compilation errors.
- Quattor templates have been reviewed, busy addressing feedback.
- Need to discuss Ganglia monitoring/metrics for Aquilonised disk servers. GP to try, follow up with JK.
- No progress made.
- One of the test InfluxDB servers (05) can be used for CASTOR InfluxDB development.
- Various bits of monitoring/stats are broken for wlcgTape; to be investigated further.
- We need a clear statement from stakeholders of what they need and how we fall short of that.
- Some scripts need looking at.
- Deadline of end of April to get Facilities moved to generic VM headnodes and 2.1.17 tape servers.
- Ticket with Fabric team to make the VMs.
- RA working with James to sort out the gridmap-file distribution infrastructure and get a machine with a better name for this than castor-function-test1.
Actions
- AD wants us to make sure that experiments cannot write to the part of the namespace that was used for d1t0 data: namespace cleanup/deletion of empty dirs.
- Some discussion about what exactly is required and how this can actually be implemented.
- CASTOR team proposal is to switch all of these directories to a fileclass with a requirement for a tape copy but no migration route; this will cause an error whenever any writes are attempted.
Staffing
- CP out Thursday/Friday.
AoB
- Nodes with names consistent with CASTOR Facilities disk servers are confusing because they often aren't CASTOR Facilities disk servers.
- Legacy of purchasing decisions, names will not change.
On Call
RA on call