RAL Tier1 weekly operations castor 30/06/2014

From GridPP Wiki

Operations News

  • 2.1.14-13 Stager upgrade for LHCb was completed successfully with no issues.
  • 2.1.14-13 Stager upgrade for Gen was completed successfully, with one issue: an ALICE security library was needed, and it depends on components only available on SL6. CERN provided a solution for SL5.9. We need to consider an SL6 upgrade after the CASTOR 2.1.14-13 upgrades.
  • Plans to ensure PreProd represents production in terms of hardware generation are underway.
  • Elastic Search has been through some testing. Others are encouraged to use it; see Rob for details.
  • A bug in the ATLAS deletion system has been identified that may have contributed to the deletion problems on their CASTOR instance. However, the key test of running the ATLAS deletion scripts locally at RAL has still not been done and awaits Alastair and Shaun being in the same place.
  • Deployment of disk servers is on hold pending completion of the 2.1.14 upgrade.

Operations Problems

  • A potential race condition which could result in data loss has been seen on CMS (2.1.14-13) while investigating a file that would not migrate to tape. CERN have been notified.
  • Many more SUM test failures were seen on ATLAS this week; the root cause could not be located. We now believe they may have been caused by additional load from the dark data analysis (now on hold). The issues have not recurred - MONITORING.
  • Puppet issue when the CASTOR config was updated (26/6): the CMS request handler did not restart and all jobs went to fallback. Everything returned to normal after Puppet's next run.
  • A partition alignment issue (3rd CASTOR partition) has been identified. The proposal is to resolve this for new machines only, i.e. not to pull machines out of production to correct it. James A is driving this.

Blocking Issues

  • None

Planned, Scheduled and Cancelled Interventions

  • CASTOR 2.1.14-13 upgrade stage 2 (Stagers) for Tier 1 - Tuesday 8th July for ATLAS.

Advanced Planning

Tasks

  • Correct the partition alignment issue (3rd CASTOR partition) on new CASTOR disk servers
  • Put V13 servers in NonProd into production
  • Switch admin machines from lcgccvm02 to lcgcadm05
  • Replace DLF with Elastic Search
    • Pending scheduling.


Interventions

  • CASTOR 2.1.14-13 stager upgrades for Tier 1 - 8th July for ATLAS

Staffing

  • CASTOR on-call person
    • Rob
  • Staff absence/out of the office:
    • Bruno out next week.
    • Brian off Friday.