RAL Tier1 weekly operations castor 22/04/2016

== Operations News ==
 
* Preprod DBs - the new servers (a 3-node rack) start testing with Preprod on Monday.
* 2.1.16 on the vcert tape servers - in progress; the logging is different and there are various other changes.
* New MICE user set up.
* Rob has changed the rsyslog configuration on CASTOR production.
* New tape pool for MICE.
* CEDA retrieve D0T1 - a retrieval-only pool.
  
  
* 2.1.16 CASTOR needs to be deployed to the tape servers.
 
* New ATLAS tape pools have been planned and documented (Brian).
 
* ALICE disk will not be supported after September 2017.
 
* CERN steered us not to move to SRM 2.1.14 before castor 2.1.15
 
* CMS requesting new tape pools
 
* Atlas now writing to D drives
 
* A new corrected checksum validator is available.
 
* NSS patching on Tier1 was successful
 
* 8 x 2014-generation disk nodes deployed to atlasStripInput to ensure the 2016-17 storage pledges are met.
 
* CERN suggested 2.1.16 be deployed to the tape servers (Steve Murray).
 
  
  
 
== Operations Problems ==
 
 
* gfal-cat does not work with CASTOR; the underlying issue has been fixed for gfal-copy but not for gfal-cat (the gfal developers are responsible) - Tracking.
 
* several disk server issues
 
* A file was declared lost - checksum issues whenever it was copied anywhere.

* FTS issues causing data duplication (ATLAS).
  
  
 
* AtlasScratch - ATLAS users are still having problems accessing atlasScratch files; investigations ongoing.
 
 
* GDSS771 crashed - now in draining       
 
* Preprod - the disk array on the new servers is not visible; Fabric are working on this (ticketed).
 
* ATLAS apparent duplicates caused by a job being run twice.
 
 
* Draining is not working for ATLAS (it does, however, seem to work for LHCb). Brian has changed parameters as recommended by Shaun, with no improvement. The manual method of draining still works - diskServerLs and stager_get (to move files to another disk server); see the sketch after this list.
 
* The transfermanager on the ATLAS DLF node was not performing TM tasks but was reporting as up.
 
* 2.1.15 - problems with the configuration required in production to solve slow file-open times. Andrey reports that CERN use 100 GB of memory on their CASTOR DB servers to run 2.1.15 (vs. our 32 GB), and Oracle are not providing adequate support at the moment. 2.1.15 deployment will not be scheduled for now.
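The manual draining workaround mentioned in the list above can be scripted. The following is a minimal sketch only, not the production draining mechanism: it assumes a plain-text list of /castor/... paths on the affected disk server has already been produced with diskServerLs, and that stager_get accepts the usual -M (file name) and -S (service class) options on our admin nodes; both assumptions should be checked against the installed CASTOR release.

<pre>
#!/usr/bin/env python
# Minimal sketch of the manual draining workaround (diskServerLs + stager_get).
# Assumptions: the input file is a plain list of /castor/... paths dumped with
# diskServerLs for the server being drained, and stager_get takes -M/-S as on
# our current release. Not the production draining mechanism.
import subprocess
import sys

def restage(file_list_path, svcclass):
    """Ask the stager for each file so a copy lands on another disk server."""
    with open(file_list_path) as fh:
        for line in fh:
            path = line.strip()
            if not path:
                continue
            # With the source disk server disabled, the stager should place
            # the requested copy on a different server in the service class.
            rc = subprocess.call(["stager_get", "-M", path, "-S", svcclass])
            print("%s rc=%d" % (path, rc))

if __name__ == "__main__":
    # usage: restage.py <file_list_from_diskServerLs> <service_class>
    restage(sys.argv[1], sys.argv[2])
</pre>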
  
 
== Planned, Scheduled and Cancelled Interventions ==
 
* CASTOR 2.1.15
  
 

== Long-term projects ==

* RA has produced a python script to handle the SRM DB duplication issue which is causing callouts. The script has been tested and will now be put into production as a cron job. This should be a temporary fix, so a bug report should be made to the FTS development team via AL. A rough outline of the duplicate check is sketched after this list.
* JJ - GLUE 2 for CASTOR, used for publishing information. RA is writing the data-gathering end in python, JJ is writing the GLUE 2 end in LISP. No schedule as yet.
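As a rough outline of the duplicate check referred to above (a sketch only, not RA's script): group the SRM file entries by SURL and report any SURL that occurs more than once. The credentials, DSN, table name (srm_file_requests) and column name (surl) below are placeholders and would need to be replaced with the real SRM schema.

<pre>
#!/usr/bin/env python
# Illustrative only - not the production SRM duplicate-handling script.
# The credentials, DSN, table name (srm_file_requests) and column name (surl)
# are placeholders for whatever the real SRM schema uses.
import cx_Oracle

def find_duplicate_surls(user, password, dsn):
    """Return (surl, count) pairs for SURLs that appear more than once."""
    conn = cx_Oracle.connect(user, password, dsn)
    try:
        cur = conn.cursor()
        cur.execute("""
            SELECT surl, COUNT(*)
              FROM srm_file_requests
             GROUP BY surl
            HAVING COUNT(*) > 1
        """)
        return cur.fetchall()
    finally:
        conn.close()

if __name__ == "__main__":
    # Intended to run from cron, as noted above.
    for surl, n in find_duplicate_surls("srm_ro", "secret", "SRMDB"):
        print("%s appears %d times" % (surl, n))
</pre>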

== Advanced Planning ==

=== Tasks ===

* CASTOR 2.1.15 implementation and testing
* Deployment of SRM 2.14

== Staffing ==

* All in
* Castor on Call person next week
** CP


== New Actions ==

* GS to ask Kashif about RAID firmware updates on the d0t1 v2011 machines and whether there are other batches of machines that should be upgraded.


== Existing Actions ==

* BD to check whether the D drives have arrived for WLCG
* BD to report the draining issues to CERN
* BD to work with George P (new hire) to hand over the WAN tuning work
* RA/AS new tool for monitoring SRM DB dups - the user type
* RA to get someone to code-review his SRM_DB_DUPLICATES blatting script
* BD MICE ticket - asking for a separate D0T1 tape pool for Monte Carlo
* GS is there any documentation on handling broken CIPs (raised following the CIP failure at the weekend)?
* GS callout for the CIP only in waking hours?
* RA to ensure the ATLAS consistency check is quattorised - Rob to talk to Andrew L
* RA to try stopping tapeserverd mid-migration to see if it breaks - ask Tim.
* RA (was SdW) to modify cleanlostfiles to log to syslog so we can track its use - under testing; see the sketch below
* GS to investigate how/whether we need to declare xrootd endpoints in GOCDB BDII - progress
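For the cleanlostfiles action above, the sketch below shows the kind of syslog logging being added so that use of the tool can be tracked; the logger name and message format are illustrative rather than the final implementation.

<pre>
#!/usr/bin/env python
# Sketch of logging cleanlostfiles invocations to syslog so its use can be
# tracked. The tag and message format are illustrative only.
import getpass
import logging
import logging.handlers
import sys

def make_syslog_logger():
    logger = logging.getLogger("cleanlostfiles")
    logger.setLevel(logging.INFO)
    # /dev/log is the local syslog socket on typical Linux hosts.
    handler = logging.handlers.SysLogHandler(address="/dev/log")
    handler.setFormatter(logging.Formatter("cleanlostfiles: %(message)s"))
    logger.addHandler(handler)
    return logger

if __name__ == "__main__":
    log = make_syslog_logger()
    # Record who ran the tool and with which arguments before doing any work.
    log.info("run by %s with args %s" % (getpass.getuser(), sys.argv[1:]))
</pre>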