RAL Tier1 weekly operations castor 07/07/2014

From GridPP Wiki

Operations News

  • 2.1.14-13 upgrade for Atlas Stagers scheduled for Tuesday 8th (published completion 9th at 12:00).
  • Plans to ensure PreProd represents production in terms of hardware generation are underway.
  • Elastic Search has been through some testing and others are encouraged to use it. See Rob for details.
  • A bug in the ATLAS deletion system has been identified that may have contributed to the deletion problems on their CASTOR instance. However, the key test of running the ATLAS deletion scripts locally at RAL has still not been done and awaits Alastair and Shaun being in the same place.
  • Deployment of disk servers is on hold pending completion of the 2.1.14 upgrade.

Operations Problems

  • A potential race condition which could result in data loss has been seen on CMS (2.1.14-13) while investigating a file that would not migrate to tape. CERN have been notified.
  • Gen SRM failures (daemon crashing) have been occurring since 3/7/14. They were initially resolved by clearing some rogue data from the SRM users table. However, the failures returned on 4/7/14 and are currently under investigation.
  • A CMS db locking issue in the early hours of 3/7/14 resulted in a lost CMS test file. CASTOR currently shows diskcopy_failed in the stager logs. The proposal is to identify whether the failure was passed back to the user, how frequent such failures are, and whether they result in file loss.
  • Atlas SUM test failures have stopped since the dark data search ceased. The proposal is to provide a mechanism to query a non-production copy of the db.
  • A partitioning alignment issue (3rd CASTOR partition) has been identified. The proposal is to resolve this for new machines only, i.e. not to pull machines out of production to correct it. James A is driving this.
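
As an illustration of the partitioning alignment item above: a partition is aligned when its starting byte offset is a multiple of the device's alignment boundary, so the check reduces to simple arithmetic. The following is a minimal sketch only. The function name, the 512-byte sector size and the 1 MiB boundary are assumptions made for the example, not details recorded in these minutes; the real values depend on the disk server hardware.

```python
# Hypothetical helper: the values below are illustrative assumptions,
# not taken from the RAL disk servers.
SECTOR_SIZE = 512          # assumed logical sector size in bytes
BOUNDARY = 1024 * 1024     # assumed alignment boundary (1 MiB)


def is_aligned(start_sector, sector_size=SECTOR_SIZE, boundary=BOUNDARY):
    """Return True if the partition's starting byte offset is a multiple of boundary."""
    return (start_sector * sector_size) % boundary == 0


if __name__ == "__main__":
    # 2048 * 512 bytes is exactly 1 MiB, so that start is aligned;
    # 63 * 512 bytes is not a multiple of 1 MiB.
    for start in (2048, 63):
        print(start, is_aligned(start))
```

The start sector of each partition would have to be read from the machine's own partition table before a check like this could be applied.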

Blocking Issues

  • None

Planned, Scheduled and Cancelled Interventions

  • CASTOR 2.1.14-13 upgrade stage 2 (Stagers) for Tier 1 - Tuesday 8th (and possibly 9th) July for Atlas.

Advanced Planning

Tasks

  • Correct partitioning alignment issue (3rd CASTOR partition) on new castor disk servers
  • Put V13 servers in NonProd into production
  • Switch admin machines from lcgccvm02 to lcgcadm05
  • Replace DLF with Elastic Search
    • Pending scheduling.


Interventions

  • CASTOR 2.1.14-13 stager upgrades for Tier 1 - 8th July for Atlas

Staffing

  • Castor on Call person
    • Matt
  • Staff absence/out of the office:
    • Matt out Monday.
    • Chris out Friday.
    • Production team limited due to conference/course