Difference between revisions of "RAL Tier1 weekly operations castor 09/09/2016"

From GridPP Wiki
Latest revision as of 11:09, 9 September 2016

Draft agenda

1. Problems encountered this week

2. Upgrades/improvements made this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

   1. Castor 2.1.15
   2. SL7 upgrade on tape servers

5. Special topics

6. Actions

7. Anything for CASTOR-Fabric?

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Operation problems

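gdss665 (atlasTape) and gdss776 (lhcbDst) failed and went out of production - see RT 175224 (https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=175224) and RT 175196 (https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=175196)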

gdss763 (preprod) is still down. There is no display when logging in via IPMI, so it may be a memory or motherboard issue. Chetan will check and report it to the vendor.

Offsite ceda ran out of tape space. No tapes of that media type were available. Tim was contacted and the problem is solved

The nameserver dump script for atlas failed to execute on the scheduled date because the db login credentials are no longer correct

There was a Nagios warning about a build-up of transfer jobs on the atlas scheduler. It cleared after an hour. Response procedures were clarified
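
As an illustration of the kind of check behind the Nagios warning above, here is a minimal sketch of a threshold check on a pending-transfer count. It is not the check actually deployed at the Tier1: the function name and the `warn`/`crit` thresholds are hypothetical, and only the exit codes (0 OK, 1 WARNING, 2 CRITICAL) follow the standard Nagios plugin convention.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_pending_transfers(pending, warn=500, crit=1000):
    """Map a pending-transfer count to a Nagios (exit_code, status_line) pair.

    `warn` and `crit` are hypothetical thresholds chosen for illustration only.
    """
    if pending >= crit:
        return CRITICAL, f"CRITICAL - {pending} pending transfer jobs"
    if pending >= warn:
        return WARNING, f"WARNING - {pending} pending transfer jobs"
    return OK, f"OK - {pending} pending transfer jobs"


if __name__ == "__main__":
    # Example with a made-up count; a real plugin would query the scheduler
    # and pass the code to sys.exit().
    code, line = check_pending_transfers(620)
    print(code, line)
```

A real plugin would obtain the count from the scheduler and exit with the returned code, so that a build-up which clears by itself shows up as a WARNING that later returns to OK.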

Operation news

Tim's new version of the check_tape_pools script has been deployed to production with Quattor

New host certificates that contain srm-cms-disk.gridpp.rl.ac.uk as an additional DNS name were deployed on the CMS SRM nodes, as required by the latest version of FTS (3.5), see RT 175210 (https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=175210)

gdss651 (preprod) is back in production
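
For the host certificate item above, a quick way to confirm that a certificate carries the extra DNS name is to inspect its subjectAltName entries. The sketch below is an assumption about how one might do this, not the procedure used for RT 175210. It works on a certificate dictionary in the format returned by Python's `ssl.SSLSocket.getpeercert()`; the helper names and the `host.example.org` entry are hypothetical.

```python
def san_dns_names(cert):
    """Return the DNS entries of subjectAltName from a decoded certificate.

    `cert` is a dict in the format returned by ssl.SSLSocket.getpeercert().
    """
    return [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]


def has_dns_name(cert, hostname):
    """True if `hostname` appears (case-insensitively) among the SAN DNS names."""
    wanted = hostname.lower()
    return any(name.lower() == wanted for name in san_dns_names(cert))


if __name__ == "__main__":
    # Hand-written example certificate; host.example.org is a placeholder.
    example_cert = {
        "subjectAltName": (
            ("DNS", "host.example.org"),
            ("DNS", "srm-cms-disk.gridpp.rl.ac.uk"),
        )
    }
    print(has_dns_name(example_cert, "srm-cms-disk.gridpp.rl.ac.uk"))
```

Against a live node, the dictionary would come from a TLS connection to the SRM endpoint rather than from a hand-written example.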

Long-term projects

Stress test on Castor 2.1.15 continues. The problem with the draining persists.

GP to intensify the tape server SL7 upgrade effort

Actions

RA: disk servers requiring RAID update - locate servers and plan the update with Fabric

Stress test Castor 2.1.15 on the vCert nameserver

Follow up the impact of the new WAN parameters deployed on CMS disk servers

Completed actions

RA to decide what to do with the persistent data (for the daily test) that is still on GenScratch

Staffing

CP on call this weekend and RA for the rest of the week

GP away on Wednesday and Thursday