Difference between revisions of "RAL Tier1 weekly operations castor 28/07/2014"

From GridPP Wiki
 
(2 intermediate revisions by one user not shown)

Line 3: Line 3:
  * Elastic Search infrastructure has been fixed by James – we need to put tools on the admin node.
  * Plan to ensure PreProd represents production in terms of hardware generation are underway. A student will be starting soon with the task of investigating visualisation and querying solutions for CASTOR use.
- * Deployment of disk servers is due to restart next week.

Line 14: Line 13:
  * LHCb possible IO problems - ticket raised and investigation started.
  * Need to update dteam and ATLAS VOs' voms-servers
+
  == Blocking Issues ==
- * Deployment issues - lsfadmin and amanda backup users clashing / look at removing amanda backup payload
+ * V13 Deployment issues - lsfadmin and amanda backup users clashing / look at removing amanda backup payload
+
  == Planned, Scheduled and Cancelled Interventions ==
- * Switch-off of compatibility mode for Tier 1 Name Server
+
- * Upgrade of Facilities CASTOR from 2.1.14-11 to 2.1.14-13.
+
  == Advanced Planning ==
  '''Tasks'''
- * Put V13 servers in NonProd into production (once name server compatibility mode change complete)
  * Resume draining on the ATLAS instance (again, once name server compatibility mode change complete)
  * Switch from admin machines: lcgccvm02 to lcgcadm05

Line 30: Line 30:
  * Replace DLF with Elastic Search
  * Correct partitioning alignment issue (3rd CASTOR partition) on new castor disk servers
- * Facilities upgrade – possibly plan for next week
  '''Interventions'''
+ * Upgrade of Facilities CASTOR from 2.1.14-11 to 2.1.14-13  Wed 30th July

Line 42: Line 42:
  * Staff absence/out of the office:
  ** Dataservices away day Monday
- ** Chris Tuesday
+ ** Chris out Tuesday

Latest revision as of 15:46, 25 July 2014

Operations News

  • 2.1.14-13 upgrades are now complete, including switching off the compatibility mode.
  • Elastic Search infrastructure has been fixed by James – we need to put tools on the admin node.
  • Plans to ensure PreProd represents production in terms of hardware generation are underway. A student will be starting soon with the task of investigating visualisation and querying solutions for CASTOR use.


Operations Problems

  • Low-level DB locking issues continue (various VOs); these have been reported to the developers at CERN.
  • A potential race condition which could result in data loss has been seen on CMS (2.1.14-13) while investigating a file that would not migrate to tape. CERN have been notified.
  • ATLAS xrootd proxy issues - investigation starting.
  • CMS xroot issues
  • Facilities CASTOR error
  • LHCb possible IO problems - ticket raised and investigation started.
  • Need to update dteam and ATLAS VOs' voms-servers


Blocking Issues

  • V13 Deployment issues - lsfadmin and amanda backup users clashing / look at removing amanda backup payload


Planned, Scheduled and Cancelled Interventions

Advanced Planning

Tasks

  • Resume draining on the ATLAS instance (again, once name server compatibility mode change complete)
  • Switch from admin machines: lcgccvm02 to lcgcadm05
  • A new VM, configured to run against the standby CASTOR database, will be created as a front end for dark data etc. queries.
  • Replace DLF with Elastic Search
  • Correct partitioning alignment issue (3rd CASTOR partition) on new CASTOR disk servers


Interventions

  • Upgrade of Facilities CASTOR from 2.1.14-11 to 2.1.14-13 Wed 30th July


Staffing

  • CASTOR on-call person
    • Rob - Weekend 26/27 July and following week
  • Staff absence/out of the office:
    • Dataservices away day Monday
    • Chris out Tuesday