RAL Tier1 weekly operations castor 30/09/2016

From GridPP Wiki

Draft agenda

1. Problems encountered this week

2. Upgrades/improvements made this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

 1. Castor 2.1.15
 2. SL7 upgrade on tape servers

5. Special topics

6. Actions

7. Anything for CASTOR-Fabric?

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB


Operation problems

gdss677 (cmsTape) and gdss739 (lhcbDst) failed and were taken out of production

puppetdev failed

A number of facilities tape drives were down

Operation news

The firmware was upgraded on a number of CV11 servers: gdss662 (atlasTape) and gdss655, gdss656, gdss657 and gdss673 (lhcbRawDst) RT175801

Long-term projects

The Castor 2.1.15 upgrade has been postponed until January 2017

Development work continues on migrating the Castor tape servers to Aquilon

Actions

RA: disk servers requiring RAID update - locate the servers and plan the update with Fabric RT175801

Follow up the impact of the new WAN parameters deployed on ~50% CMS disk servers

Talk to AL about the issue with unrouted files to tape in CMS

RA to identify a spare machine to be used for the tape server migration to aquilon

Check if there is a nagios test that checks for facilities tape drives being down

RA/GP to deploy the former Ceph OCF14 servers into aliceDisk (see RAL disk server deployment plan by Alastair)

John Kelly to enquire about the nagios messages on gdss619

Completed actions

Andrey to create a wiki page to capture the details of the DB problem that affected Castor 2.1.15 draining

RA to find a machine with SL6 to be used as a spare head node

GP to come up with a procedure to deal with a failed head node

Staffing

GP on call this week with RA as backup. Hand over to Chris on Friday 7/10.