RAL Tier1 weekly operations castor 23/06/2014

From GridPP Wiki

Operations News

  • 2.1.14-13 Stager upgrade for CMS. This did not occur without trouble:
    • A problem was encountered where the new rebalancing feature implemented with 2.1.14 proved overly aggressive and DoS-ed the transfer manager. The rebalancer was disabled and the rebalancing threshold increased to 100%.
    • Some problems were also encountered with xroot. Our initial weighting of xroot transfers proved overly restrictive and CMS suffered problems with transfers going to xroot fallback rather than reading data locally.
  • 2.1.14-13 upgrade stage 2 (Stagers) dates proposed: Thursday 19th for LHCb, Tuesday 24th for Gen and Thursday 26th for ATLAS. There are some open questions about the outage required for the ATLAS intervention, relating to the database upgrade script step; these are being investigated.
    • The gen instance will require an additional Change Control review because the Alice xroot configuration is not testable in preproduction.
  • Following Fabric acceptance testing of the new V13 RAID firmware, we will be upgrading all other V13 servers (with the possible exception of 3 x V13 in cmsDisk).
  • Plans to ensure PreProd represents production in terms of hardware generation are underway.
  • Elastic Search has been through some testing; others are encouraged to use it. See Rob for details.
  • A bug in the ATLAS deletion system has been identified that may have contributed to the deletion problems on their CASTOR instance. However, the key test of running the ATLAS deletion scripts locally at RAL has still not been done and awaits Alastair and Shaun being in the same place.
  • Deployment of disk servers is on hold pending completion of the 2.1.14 upgrade.

Operations Problems

  • A partitioning alignment issue (3rd CASTOR partition) has been identified. The proposal is to resolve this for new machines only, i.e. not to pull machines out of production to correct it. James A is driving this.
  • Some SUM test failures on the ATLAS instance occurred on the afternoon of Friday 13th. The issue cleared up by itself, but was well understood.

Blocking Issues

  • None

Planned, Scheduled and Cancelled Interventions

  • CASTOR 2.1.14-13 upgrade stage 2 (Stagers) for Tier 1 - Thursday 19th for LHCb, Tuesday 24th for Gen and Thursday 26th for ATLAS.

Advanced Planning

Tasks

  • Correct partitioning alignment issue (3rd CASTOR partition) on new CASTOR disk servers
  • Update RAID firmware on V13s currently not in production
  • Switch admin machines from lcgccvm02 to lcgcadm05
  • Replace DLF with Elastic Search
    • Pending scheduling.


Interventions

  • CASTOR 2.1.14-13 stager upgrades for Tier 1 - 17th June CMS / 19th June LHCb / 24th June GEN / 26th June ATLAS

Staffing

  • CASTOR on-call person
    • Shaun
  • Staff absence/out of the office:
    • Bruno out of the office next week.
    • Matt on special leave Monday 23rd.