RAL Tier1 weekly operations castor 16/06/2014

From GridPP Wiki

Operations News

  • 2.1.14-13 Name server upgrade completed successfully on 10th June.
  • 2.1.14-13 upgrade stage 2 (Stagers): proposed dates are Tuesday 17th for CMS, Thursday 19th for LHCb, Tuesday 24th for Gen and Thursday 26th for ATLAS. There are open questions about the outage required for the ATLAS intervention, relating to the database upgrade script step; these are being investigated.
  • Following Fabric acceptance testing of the new V13 RAID firmware, we will be upgrading all other V13 servers (with the possible exception of 3 x V13 in cmsDisk).
  • All SL10s are out of production
  • Plans to ensure PreProd represents production in terms of hardware generation are underway.
  • Elastic Search has been through some testing and others are encouraged to use it; see Rob for details.
  • A bug in the ATLAS deletion system has been identified that may have contributed to the deletion problems on their CASTOR instance. However, the key test of running the ATLAS deletion scripts locally at RAL has still not been done and awaits Alastair and Shaun being in the same place.
  • We continue to decommission, prep for redeploy and deploy disk servers.

Operations Problems

  • A partition alignment issue (3rd CASTOR partition) has been identified. The proposal is to resolve this for new machines only, i.e. not to pull machines out of production to correct it. James A is driving this.
  • Gdss586 did not come back after kernel/errata updates; it was taken out of production and its motherboard replaced. This needed some follow-up work from Tiju on Nagios, as the machine ID is calculated from the motherboard. Resolved.
  • CIP changes for 2.1.14 – there are some changes to how the data is presented; we need to get the VOs to look at the new presentation.

Blocking Issues

  • Issues with server redeployment – 5 into cmsNonProd (for cmsTape). This seems to be related to rmmasterd holding on to shared memory, which needs clearing. We are looking to test the process on preprod or vcert.

Planned, Scheduled and Cancelled Interventions

  • CASTOR 2.1.14-13 upgrade stage 2 (Stagers) for Tier 1 - Tuesday 17th for CMS, Thursday 19th for LHCb, Tuesday 24th for Gen and Thursday 26th for ATLAS.

Advanced Planning

Tasks

  • Correct partition alignment issue (3rd CASTOR partition) on new CASTOR disk servers
  • Update RAID firmware on V13s currently not in production
  • Switch admin machines from lcgccvm02 to lcgcadm05
  • Replace DLF with Elastic Search
    • Pending scheduling.


Interventions

  • CASTOR 2.1.14-13 stager upgrades for Tier 1 - 17th June CMS / 19th June LHCb / 24th June GEN / 26th June ATLAS

Staffing

  • CASTOR on-call person
    • Rob
  • Staff absence/out of the office:
    • Chris off Friday 20th
    • Bruno off Monday 16th
    • Brian off next week