Difference between revisions of "RAL Tier1 weekly operations castor 30/09/2016"

Line 48:
  == Actions ==
- RA disks servers requiring RAID update - locate servers and plan for update with fabric
+ RA disks servers requiring RAID update - locate servers and plan for update with fabric [https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=175801 RT175801]
- Follow up the impact of the new WAN parameters deployed on CMS disk servers
+ Follow up the impact of the new WAN parameters deployed on ~50% CMS disk servers
- Talk to AL about the issue with unrouted files to tape
+ Talk to AL about the issue with unrouted files to tape in CMS
  RA to identify a spare machine to be used for the tape server migration to aquilon
Line 60:
  RA/GP to deploy the former Ceph OCF14 servers into aliceDisk (see RAL disk server deployment plan by Alastair)
- John Kelly t0 enquire about the nagios messages on gdss619
+ John Kelly to enquire about the nagios messages on gdss619
  == Completed actions ==

Latest revision as of 09:05, 5 October 2016

Draft agenda

1. Problems encountered this week

2. Upgrades/improvements made this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

 1. Castor 2.1.15
 2. SL7 upgrade on tape servers

5. Special topics

6. Actions

7. Anything for CASTOR-Fabric?

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB


Operation problems

gdss677 (cmsTape) and gdss739 (lhcbDst) failed and went out of prod

puppetdev failed

A number of facilities tape drives were down

Operation news

The firmware was upgraded on a number of CV11 servers: gdss662 (atlasTape) and gdss655, gdss656, gdss657 and gdss673 (lhcbRawDst) RT175801

Long-term projects

The Castor 2.1.15 upgrade has been postponed until January 2017

Development work on migrating the castor tape servers to aquilon continues

Actions

RA disk servers requiring RAID update - locate servers and plan for update with fabric RT175801

Follow up the impact of the new WAN parameters deployed on ~50% CMS disk servers

Talk to AL about the issue with unrouted files to tape in CMS

RA to identify a spare machine to be used for the tape server migration to aquilon

Check if there is a nagios test that checks for facilities tape drives being down

RA/GP to deploy the former Ceph OCF14 servers into aliceDisk (see RAL disk server deployment plan by Alastair)

John Kelly to enquire about the nagios messages on gdss619

Completed actions

Andrey to create a wiki page to capture the details of the DB issue that caused problems in Castor 2.1.15 draining

RA to find a machine with SL6 to be used as a spare head node

GP to come up with a procedure to deal with a failed head node

Staffing

GP on call this week with RA as backup. Handover to Chris on Friday 7/10.