Difference between revisions of "RAL Tier1 weekly operations castor 27/10/2014"

From GridPP Wiki
 
(4 intermediate revisions by 2 users not shown)
Line 8: Line 8:
  
 
== Operations Problems ==
 
== Operations Problems ==
* ddss720 / gdss763 are both drained, out of production and waiting for Fabric work on (poss RAID and other work)
+
* gdss720 / gdss763 are both drained, out of production and waiting for Fabric work on (poss RAID and other work)
 
* A few CMS SUM test failures this week, investigations inconclusive
 
* A few CMS SUM test failures this week, investigations inconclusive
  
Line 14: Line 14:
 
== Blocking Issues ==
 
== Blocking Issues ==
 
* grid ftp bug in SL6 - stops any globus copy if a client is using a particular library. This is a show stopper for SL6 on disk server.
 
* grid ftp bug in SL6 - stops any globus copy if a client is using a particular library. This is a show stopper for SL6 on disk server.
* LHCb ‘nonprod’ disk servers – still outstanding / waiting on James/fabric [some were having the same issue as the 673 machine (mellanox N/W card issues)]
 
  
  
Line 29: Line 28:
 
* Switch from admin machines: lcgccvm02 to lcgcadm05
 
* Switch from admin machines: lcgccvm02 to lcgcadm05
 
* New VM configured to run against the standby CASTOR database will be created as a front-end for dark data etc queries.
 
* New VM configured to run against the standby CASTOR database will be created as a front-end for dark data etc queries.
* Replace DLF with Elastic Search
 
 
* Correct partitioning alignment issue (3rd CASTOR partition) on new castor disk servers
 
* Correct partitioning alignment issue (3rd CASTOR partition) on new castor disk servers
  
Line 44: Line 42:
 
** Shaun Monday
 
** Shaun Monday
 
** Bruno Following 2 weeks
 
** Bruno Following 2 weeks
 +
** Chris Tues-Thurs

Latest revision as of 14:50, 27 October 2014

Operations News

  • xrootd security advisory with FAX component within xrootd
  • SL6 Headnode work - tested in vcert, next test in preprod including stress testing
  • Final 5 servers have been deployed into lhcbRawRdst
  • Draining improvement workaround: put full or almost full disk servers into read-only mode
  • CASTOR 2.1.14-14 upgrade priority dropped as we have a draining workaround. Revisit once SL6 work is done (in the new year)


Operations Problems

  • gdss720 / gdss763 are both drained, out of production and waiting for Fabric work (possibly RAID and other work)
  • A few CMS SUM test failures this week, investigations inconclusive


Blocking Issues

  • GridFTP bug in SL6 - stops any globus copy if a client is using a particular library. This is a showstopper for SL6 on disk servers.


Planned, Scheduled and Cancelled Interventions

  • A Tier 1 Database cleanup is planned so as to eliminate a number of excess tables and other entities left over from previous CASTOR versions. This will be change-controlled in the near future.
  • Juan to further patch CASTOR DBs (PSU patches for Pluto and Juno) – standard change ... TBC
  • Functional testing new errata in preprod


Advanced Planning

Tasks

  • Plans to ensure PreProd represents production in terms of hardware generation are underway
  • Possible future upgrade to CASTOR 2.1.14-15 post-Christmas
  • Switch from admin machines: lcgccvm02 to lcgcadm05
  • New VM configured to run against the standby CASTOR database will be created as a front-end for dark data etc queries.
  • Correct partitioning alignment issue (3rd CASTOR partition) on new castor disk servers

Interventions


Staffing

  • CASTOR on-call person
    • Matt V


  • Staff absence/out of the office:
    • Shaun Monday
    • Bruno Following 2 weeks
    • Chris Tues-Thurs