Difference between revisions of "RAL Tier1 weekly operations castor 14/07/2014"

From GridPP Wiki

Latest revision as of 13:48, 14 July 2014

Operations News

  • 2.1.14-13 upgrade for Atlas Stagers completed over the course of last Tuesday/Wednesday (8th-9th July)
  • Plans to ensure PreProd represents production in terms of hardware generation are underway.
  • Elastic Search has been through some testing; others are encouraged to use it. A student will be starting next week with the task of investigating visualisation and querying solutions for CASTOR use.
  • Deployment of disk servers is due to restart next week.

Operations Problems

  • A potential race condition which could result in data loss has been seen on CMS (2.1.14-13) while investigating a file that would not migrate to tape. CERN have been notified.
  • The srmbed daemon on the Gen SRMs was unstable on 2014-07-11 (Friday). The reason for the problem was not identified, but it went away, apparently by itself.
  • A CMS database locking issue was seen during working hours on 2014-07-11. This will be reported to the developers.
  • Atlas SUM test failures have stopped since the dark data search ceased. A new VM configured to run against the standby database will be created as a front-end for such queries. Chris will be leading this.

Blocking Issues

Planned, Scheduled and Cancelled Interventions

  • CASTOR 2.1.14-13 upgrade for Repack - planned for Tuesday or Wednesday this week.
  • Switch-off of compatibility mode for Tier 1 Name Server
  • Upgrade of Facilities CASTOR from 2.1.14-11 to 2.1.14-13.

Advanced Planning

Tasks

  • Correct partition alignment issue (3rd CASTOR partition) on new CASTOR disk servers
  • Put V13 servers in NonProd into production (once name server compatibility mode change complete)
  • Resume draining on the ATLAS instance (again, once name server compatibility mode change complete)
  • Switch admin machines from lcgccvm02 to lcgcadm05
  • Replace DLF with Elastic Search


Interventions

  • CASTOR 2.1.14-13 stager upgrades for Tier 1 - 8th July for Atlas

Staffing

  • CASTOR on-call person
    • Rob
  • Staff absence/out of the office:
    • Matt interviewing on Monday/Tuesday.