Difference between revisions of "RAL Tier1 weekly operations castor 09/06/2014"

Revision as of 12:34, 9 June 2014

Operations News

  • Planning for the 2.1.14 upgrade is complete. The NS upgrade and stage 2 (stagers) are both scheduled. Note, however, that 2.1.14-13 will now be deployed, because 2.1.14-11 has some serious issues (silent failure).
  • The Facilities upgrade from 2.1.14-11 to 2.1.14-13 on Wednesday 11th June has been postponed.
  • Elastic Search has been through some testing and others are encouraged to use it; see Rob for details.
  • A bug in the ATLAS deletion system has been identified that may have contributed to the deletion problems on their CASTOR instance. However, the key test of running the ATLAS deletion scripts locally at RAL has still not been done and awaits Alastair and Shaun being in the same place.
  • We are experimenting with using pinning to improve tape recalls on Facilities.
  • We continue to decommission, prep for redeploy and deploy disk servers.

Operations Problems

  • Fabric acceptance testing of the V13 RAID firmware upgrade has completed. Machines that have been upgraded need further configuration (James) before being released to the CASTOR team.
  • V13 machines in production should have the firmware update; the best approach is TBD (it requires a reboot).


Blocking Issues

  • none

Planned, Scheduled and Cancelled Interventions

  • CASTOR 2.1.14-13 upgrade for Tier 1. First stage of intervention (NS upgrade) is booked for Tues 10th June, second stage (stagers) in phases over the following weeks.

Advanced Planning

Tasks

  • Switch admin machines from lcgccvm02 to lcgcadm05
  • Replace DLF with Elastic Search
    • Pending scheduling.

Interventions

  • CASTOR 2.1.14-13 Nameserver upgrade for Tier 1 - Tues 10th June
  • CASTOR 2.1.14-13 stager upgrades for Tier 1 - 17th June CMS / 19th June LHCb / 24th June GEN / 26th June ATLAS

Staffing

  • CASTOR on-call person
    • Matt
  • Staff absence/out of the office:
    • Shaun may take a day off (TBC)