RAL Tier1 weekly operations castor 30/09/2016
Draft agenda
1. Problems encountered this week
2. Upgrades/improvements made this week
3. What are we planning to do next week?
4. Long-term project updates (if not already covered)
   1. Castor 2.1.15
   2. SL7 upgrade on tape servers
5. Special topics
6. Actions
7. Anything for CASTOR-Fabric?
8. AoTechnicalB
9. Availability for next week
10. On-Call
11. AoOtherB
Operation problems
gdss677 (cmsTape) and gdss739 (lhcbDst) failed and were taken out of production
puppetdev failed
A number of facilities tape drives were down
Operation news
The firmware was upgraded on a number of CV11 servers: gdss662 (atlasTape) and gdss655, gdss656, gdss657 and gdss673 (lhcbRawDst) RT175801
Long-term projects
Castor 2.1.15 upgrade has been postponed until January 2017
Development continues to migrate castor tape servers to aquilon
Actions
RA: disk servers requiring RAID update - locate servers and plan for update with fabric RT175801
Follow up the impact of the new WAN parameters deployed on ~50% CMS disk servers
Talk to AL about the issue with unrouted files to tape in CMS
RA to identify a spare machine to be used for the tape server migration to aquilon
Check if there is a nagios test that checks for facilities tape drives being down
RA/GP to deploy the former Ceph OCF14 servers into aliceDisk (see RAL disk server deployment plan by Alastair)
John Kelly to enquire about the nagios messages on gdss619
Completed actions
Andrey to create a wiki page to capture the details of the DB issue that caused problems in Castor 2.1.15 draining
RA to find a machine with SL6 to be used as a spare head node
GP to come up with a procedure to deal with a failed head node
Staffing
GP on call this week with RA as backup. Handover to Chris on Friday 7/10.