Difference between revisions of "RAL Tier1 weekly operations castor 19/05/2014"

From GridPP Wiki

Revision as of 11:05, 19 May 2014

Operations News

  • Planning for the 2.1.14 upgrade is ongoing. The NS upgrade is now booked; stage 2 (Stager) needs to be scheduled.
    • A decision has been taken that running the upgrade with only one Oracle DBA (Juan) in the office is acceptable. In the event that Juan becomes unavailable, we will postpone the upgrade.
  • Elastic Search has been through some testing; others are encouraged to use it. See Rob for details.
  • Brian has been stress testing a DDN server on preprod.
  • A bug in the ATLAS deletion system has been identified that may have contributed to the deletion problems on their CASTOR instance. However, the key test of running the ATLAS deletion scripts locally at RAL has still not been done and awaits Alastair and Shaun being in the same place.

Operations Problems

  • Another V13 disk server (gdss767) has failed while in NonProd and remains with the Fabric team for investigation. Deployment of the remainder of the V13 generation is on hold pending their findings.
  • Rob is currently investigating Facilities CASTOR issues; the suggestion is that garbage collection is not functioning correctly.
  • 6 ATLAS files were found to be lost from gdss479 during draining. We reviewed the historical draining of two servers (one from ATLAS and one from LHCb); all files were accounted for in both cases. Rob investigated the actual file loss incident with staff at CERN: the file deletion was consistent with use of the clearAllFiles command, i.e. no logging. Draining has restarted, and one additional file check was completed for a newly drained server, with all files accounted for.
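
The post-drain file check mentioned above amounts to comparing the list of files a server held before draining with the files still known afterwards. A minimal sketch follows, assuming both lists are available as plain path strings; the function name and the example paths are hypothetical and do not correspond to any CASTOR tool or to the actual procedure used at RAL.

```python
def find_unaccounted_files(before_drain, after_drain):
    """Return paths present before draining that are missing afterwards.

    Both arguments are iterables of file path strings (an assumption;
    the real listings would come from whatever tooling the site uses).
    """
    # Set difference: everything seen before that is not seen after.
    return sorted(set(before_drain) - set(after_drain))


# Hypothetical listings for illustration only.
before = ["/castor/example/file1", "/castor/example/file2", "/castor/example/file3"]
after = ["/castor/example/file1", "/castor/example/file3"]

missing = find_unaccounted_files(before, after)
if missing:
    print("Unaccounted files:", missing)
else:
    print("All files accounted for")
```

An empty result corresponds to the "all files accounted for" outcome reported for the reviewed servers.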

Blocking Issues

  • none

Planned, Scheduled and Cancelled Interventions

  • CASTOR 2.1.14 upgrade for Tier 1. First stage of intervention (NS upgrade) is booked for Tues 10th June.
  • Deployment of 2013 generation disk servers.

Advanced Planning

Tasks

  • Switch admin machines from lcgccvm02 to lcgcadm05
  • Replace DLF with Elastic Search
    • Pending scheduling.

Interventions

  • CASTOR 2.1.14 stager upgrades for Tier 1 - Rob to schedule

Staffing

  • Castor on Call person
    • Rob on-call Monday 19th - Monday 26th Inc. (BH)
  • Staff absence/out of the office:
    • All in