Difference between revisions of "RAL Tier1 weekly operations castor 15/02/2019"

Latest revision as of 16:03, 18 February 2019

Standing agenda

1. Problems encountered this week

2. Upgrades/improvements made this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

5. Special topics

6. Actions

7. Review Fabric tasks

  1.   Link

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Operation problems

Last week's tape mishap was explained as a side effect of an engineer fixing a minor robot problem; it was fixed by a power cycle.
Problem following deployment of new diamondRecall disk servers - xrootd keys left with incorrect ownership
- Update procedure to include a check for this.
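
The check to be added to the procedure could look something like the sketch below. It is only an illustration in Python: the function names are hypothetical, and the key file path and the expected owner and group are site-specific values that are not recorded in these minutes, so they are left as parameters.

```python
import grp
import os
import pwd


def owner_of(path):
    """Return (user, group) owning path, falling back to numeric ids as strings."""
    st = os.stat(path)
    try:
        user = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        # uid has no passwd entry on this host
        user = str(st.st_uid)
    try:
        group = grp.getgrgid(st.st_gid).gr_name
    except KeyError:
        # gid has no group entry on this host
        group = str(st.st_gid)
    return user, group


def ownership_problems(path, expected_user, expected_group):
    """Return a list of human-readable problems; an empty list means ownership is as expected."""
    user, group = owner_of(path)
    problems = []
    if user != expected_user:
        problems.append(f"{path}: owner is {user}, expected {expected_user}")
    if group != expected_group:
        problems.append(f"{path}: group is {group}, expected {expected_group}")
    return problems
```

A deployment checklist step could call `ownership_problems()` on each key file of a newly installed disk server and refuse to put the server into production while the returned list is non-empty.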

Operation news

New facd0t1 disk servers
- Ready to go into production from a CASTOR perspective
- Decommissioning tickets created for all machines occupying the rack (rack 214, row 2) in which these disk servers will be placed
- Slow progress being made
- Final deployment is waiting on the Fabric team to remove the decommissioned machines from the machine room.
Migrated ALICE to WLCGTape.
- Argo tests are failing for srm-alice since the upgrade.
- Exposed that all these tests are pointing at the same thing.
- Action on RA - contact GS to discuss what Argo tests are appropriate.
Update to tape accounting to give vo-specific readings of recall efficiency.
Facilities tape busy due to an urgent 600TB recall request from CEDA.
Decommissioned some generation-11 hardware from preprodTape.
11 d1t0 LHCb disk servers need to be physically moved around the machine room; agreed outline plan with Fabric, intervention date TBD, ~1 day downtime on lhcbDst expected
- Also some d0t1 disk servers, no downtime needed.
- We will do a kernel roll at the same time for the entire instance.
Started decommissioning genTape.

Plans for next few weeks

Examine further standardisation of CASTOR pool settings.
- CASTOR team to generate a list of nonstandard settings and consider whether they are justified.
Replacement of Facilities CASTOR d0t1 ingest nodes
- Now ready to deploy once Fabric team are ready.
Outage for LHCb disk server rack change Tuesday 19th.

Long-term projects

New CASTOR WLCGTape instance.
- ALICE now using wlcgTape
- Still need to move LHCb
CASTOR disk server migration to Aquilon.
- Plan to apply the new Aquilon disk server profile and see what happens.
  - Negotiating some compilation errors.
- Quattor templates have been reviewed, busy addressing feedback.
Need to discuss Ganglia monitoring/metrics for Aquilonised disk servers. GP to try, follow up with JK.
- No progress made.
- One of the test InfluxDB servers (05) can be used for CASTOR InfluxDB development.
Various bits of monitoring/stats are broken for wlcgTape, investigate further.
- We need a clear statement from stakeholders of what they need and how we fall short of that.
- Some scripts need looking at.
Deadline of end of April to get Facilities moved to generic VM headnodes and 2.1.17 tape servers.
- Ticket with Fabric team to make the VMs.
RA working with James to sort out the gridmap-file distribution infrastructure and get a machine with a better name for this than castor-function-test1

Actions

AD wants us to make sure that experiments cannot write to the part of the namespace that was used for d1t0 data: namespace cleanup/deletion of empty dirs.
- Some discussion about what exactly is required and how this can be actually implemented.
- CASTOR team proposal is to switch all of these directories to a fileclass with a requirement for a tape copy but no migration route; this will cause an error whenever any writes are attempted.

Staffing

CP out Thursday/Friday.

AoB

Nodes with names consistent with CASTOR Facilities disk servers are confusing because they often aren't CASTOR Facilities disk servers.
- Legacy of purchasing decisions, names will not change.

On Call

RA on call
