Difference between revisions of "RAL Tier1 weekly operations castor 07/07/2014"

From GridPP Wiki

Revision as of 15:22, 4 July 2014

Operations News

  • 2.1.14-13 upgrade for Atlas Stagers scheduled for Tuesday 8th (published completion 9th at 1pm).
  • Plans to ensure PreProd represents production in terms of hardware generation are underway.
  • Elastic Search has been through some testing; others are encouraged to use it. See Rob for details.
  • A bug in the ATLAS deletion system has been identified that may have contributed to the deletion problems on their CASTOR instance. However, the key test of running the ATLAS deletion scripts locally at RAL has still not been done and awaits Alastair and Shaun being in the same place.
  • Deployment of disk servers is on hold pending completion of the 2.1.14 upgrade.

Operations Problems

  • A potential race condition which could result in data loss has been seen on CMS (2.1.14-13) while investigating a file that would not migrate to tape. CERN have been notified.
  • Gen SRM failures (daemon crashing) have been occurring since 3/7/14. Initially resolved by clearing some rogue data from the SRM users table. However, failures returned on 4/7/14 and are currently under investigation.
  • CMS db locking issue in the early hours of 3/7/14 resulted in a lost CMS test file; CASTOR currently shows diskcopy_failed in stager logs. Proposal is to identify whether the failure was passed back to the user, how frequent the failures are, and whether they result in file loss.
  • Atlas SUM test failures have stopped since the dark data search ceased. Proposal is to provide a mechanism to query a non-production copy of the db.
  • A partitioning alignment issue (3rd CASTOR partition) has been identified. Proposal is to resolve this for new machines only, i.e. not to pull machines out of production to correct it. James A driving.

Blocking Issues

  • None

Planned, Scheduled and Cancelled Interventions

  • CASTOR 2.1.14-13 upgrade stage 2 (Stagers) for Tier 1 - Tuesday 8th July for Atlas.

Advanced Planning

Tasks

  • Correct partitioning alignment issue (3rd CASTOR partition) on new castor disk servers
  • Put V13 servers in NonProd into production
  • Switch admin machines from lcgccvm02 to lcgcadm05
  • Replace DLF with Elastic Search
    • Pending scheduling.


Interventions

  • CASTOR 2.1.14-13 stager upgrades for Tier 1 - 8th July for Atlas

Staffing

  • Castor on Call person
    • Rob
  • Staff absence/out of the office:
    • Bruno out next week.
    • Brian off Friday.