Difference between revisions of "RAL Tier1 weekly operations castor 05/05/2014"

From GridPP Wiki
 
Line 1: Line 1:
 
== Operations News ==
 
== Operations News ==
* The atlasHotDisk pool has been eliminated. The files that were in it have been moved to atlasStripInput and attempts to access atlasHotDisk will be redirected.
+
* 3 new V'13 disk servers were deployed into cmsDisk.
* The draining of old storage hardware is now proceeding overnight without incident.
+
* A new version of CASTOR 2.1.14 (2.1.14-12) has been released. This version makes no changes to the DB from the previous one and so can be brought into production with a service restart. The team agreed to proceed with a normal upgrade to 2.1.14-11 and then proceed onwards to 2.1.14-12 at a later date.
+
  
 
== Operations Problems ==
 
== Operations Problems ==
* Quiet week!
+
* cmsDisk was very full: all but three recently added CV'13 disks were full, which resulted in timeouts and a string of callouts. CMS have since deleted many files, which has improved matters.
 +
* One of the 3 new V'13 disk servers installed in cmsDisk on 1st May has failed (others of this revision have also failed before going into production). The issue is currently being investigated by fabric, and the remaining 2 servers are to stay in cmsDisk for now.
 +
* A few SUM test failures for Atlas over the weekend of 26/27th April; the cause is not obvious and the issue has not recurred.
  
 
== Blocking Issues ==
 
== Blocking Issues ==
Line 11: Line 11:
  
 
== Planned, Scheduled and Cancelled Interventions ==
 
== Planned, Scheduled and Cancelled Interventions ==
'''Entries in/planned to go to GOCDB'''
+
* CASTOR 2.1.14 upgrade for Tier 1. Possible date for first stage of intervention (NS upgrade) is May 27th.
* CASTOR 2.1.14 upgrade for Tier 1.
+
* A CASTOR procedure for Tuesday's network intervention has been agreed. CASTOR will be shut down well in advance and disk servers will be monitored for network issues after the intervention is complete.
+
 
* Deployment of 2013 generation disk servers.
 
* Deployment of 2013 generation disk servers.
  
Line 19: Line 17:
 
'''Tasks'''
 
'''Tasks'''
  
* Atlas would like to store c. 2 million EVNT Monte Carlo files – Brian to discuss with Alastair. Other Tier 1s are not keen, but the RAL Tier 1 / CASTOR should be able to cope with this.
 
 
* CASTOR 2.1.14 for Tier 1
 
* CASTOR 2.1.14 for Tier 1
  
Line 26: Line 23:
 
== Staffing ==
 
== Staffing ==
 
* Castor on Call person
 
* Castor on Call person
** Matt
+
** Matt until Tuesday / Rob thereafter
 
* Staff absence/out of the office:
 
* Staff absence/out of the office:
** (Thu-Fri) Shaun A/L
+
** Chris out Tues/Wed

Latest revision as of 15:56, 2 May 2014

Operations News

  • 3 new V'13 disk servers were deployed into cmsDisk.

Operations Problems

  • cmsDisk was very full: all but three recently added CV'13 disks were full, which resulted in timeouts and a string of callouts. CMS have since deleted many files, which has improved matters.
  • One of the 3 new V'13 disk servers installed in cmsDisk on 1st May has failed (others of this revision have also failed before going into production). The issue is currently being investigated by fabric, and the remaining 2 servers are to stay in cmsDisk for now.
  • A few SUM test failures for Atlas over the weekend of 26/27th April; the cause is not obvious and the issue has not recurred.

Blocking Issues

  • none

Planned, Scheduled and Cancelled Interventions

  • CASTOR 2.1.14 upgrade for Tier 1. Possible date for first stage of intervention (NS upgrade) is May 27th.
  • Deployment of 2013 generation disk servers.

Advanced Planning

Tasks

  • CASTOR 2.1.14 for Tier 1

Interventions

Staffing

  • Castor on Call person
    • Matt until Tuesday / Rob thereafter
  • Staff absence/out of the office:
    • Chris out Tues/Wed