RAL Tier1 weekly operations castor 01/06/2015

From GridPP Wiki

List of CASTOR meetings

Operations News

  • Manual rebalancing of the CMS hot spots complete
  • CIP change for SNOplus
  • MICE (CASTOR Gen) will be operating overnight and will be able to call the primary on-call


  • All tape servers have now been upgraded to SL6 and are running smoothly
  • On a related note, the disk server SL6 configuration is ready but waiting for the Oracle updates to be completed.
    • We are examining options for running this in a slow-and-steady fashion with CASTOR up.
  • Testing CASTOR rebalancer on preproduction, and developing associated tools. We hope to use the rebalancer to prevent future hotspotting issues.
  • 13-generation disk servers are being prepared for deployment into CASTOR production
  • The move of the standby DB racks to R26 has been successfully completed. Some issues remained with the hardware following the move, resulting in an unplanned at-risk, but these were resolved.
  • We are examining options for the upgrade of the CASTOR DBs to Oracle version 11.2.0.4. The experiments are keen to avoid downtime early in run 2, so some careful scheduling will be necessary.


Operations Problems

  • Disk server gdss682 (atlasStripInput) out of production - back into production Tuesday
  • A few SRM DB dups on Atlas and CMS
  • CMS remains hot but functioning
  • Possible problem identified when creating a new service class for DiRAC; CASTOR external emailed
  • Standby DBs for CASTOR occasionally fall 10 or 15 minutes behind, then return to sync
  • xroot redirection - works with our redirector but not with others; this started during a past upgrade - Shaun debugging.


  • CMS CASTOR file open time issues affecting batch farm efficiency
    • We have determined that the most serious incidence of this problem is due to a number of hot datasets that are located almost entirely on one node. Shaun has implemented a process to redistribute this data across the rest of the cmsDisk pool.
  • Retrieval errors from Facilities CASTOR - a number of similar incidents seen in the last few weeks.
  • GDSS757 cmsDisk / 763 atlasStripInput - new motherboards, fabric acceptance test complete, ready for deployment.
  • A higher number of checksum errors (Alice) - found to be due to VO actions. This is being cleared up.
  • Atlas are putting files (sonar.test files) into un-routable paths - this looks like an issue with the space token used. Brian working on this.
    • This is related to a problem with the latest gfal libraries - not a new problem, but Atlas are starting to exercise this functionality and are identifying these issues.
  • CASTOR functional test on lcgccvm02 causing problems - Gareth reviewing.


Blocking Issues

  • GridFTP bug in SL6 - stops any globus copy if a client is using a particular library. This is a show stopper for SL6 on disk servers.


Planned, Scheduled and Cancelled Interventions

Advanced Planning

Tasks

  • Create new small VOs for DiRAC (and LIGO?) - for meeting w/c 8th June
  • Disk deployments
  • Bruno working on SL6 disk servers
  • Switch admin machines: from lcgccvm02 to lcgcadm05
  • Correct partitioning alignment issue (3rd CASTOR partition) on new castor disk servers
  • Intervention to upgrade CASTOR DBs to Oracle 11.2.0.4

Interventions

Staffing

  • CASTOR on-call person next week
    • Rob
  • Staff absence/out of the office:
    • Rob, Bruno, Brian at Hepsysman Monday - Wednesday
    • Chris – A/L Monday


Actions

  • Rob/Alastair to clarify what we are doing with 'broken' disk servers
  • Bruno - castor template documentation/training
  • Someone - MICE, what access protocol do they use?
  • Rob - rebalancing change control
  • Gareth/Matt - suggestion to change monitoring to call out in daytime only
  • Gareth to ensure that there is a ping test etc to the atlas building
  • Chris to raise a ticket in the CASTOR queue to track the xroot redirection bug
  • Rob/Jens to add SNOplus to CIP, plus possibly DiRAC.
  • Rob/Shaun to try and reproduce SRM DB Dups issue with FTS transfers
  • Bruno to document processes to control services previously controlled by puppet
  • Gareth to arrange a CASTOR/Fabric/Production meeting to discuss the decommissioning procedures
  • Gareth to investigate providing checks for /etc/noquattor on production nodes & checks for fetch-crl
  • Rob/Bruno: Rob to send the changes made to xroot timeouts to Bruno for implementation in Quattor.
  • Rob to remove Facilities disk servers from cedaRetrieve to go back to Fabric for acceptance testing.

Completed Actions

  • Brian - On Wed, tell Tim if he can start repacking mctape (Done)
  • Shaun to send plots to Matt to support above action
  • Rob to book meeting to discuss possible workarounds/investigations for CMS issue - xroot timeout / server(read-ahead) or Network tuning / CASTOR bug / deploying more disk servers
  • Rob - talk with Shaun to see if it's possible to reject anything that does not have a spacetoken (answer: this is not a good idea)
  • Brian - to discuss unroutable files / spacetokens at the DDM meeting on Tuesday (Done)
  • Rob to pick up DB cleanup change control (Done)
  • Chris/Rob to arrange a meeting to discuss CMS performance/xroot issues (is performance appropriate, if not plan to resolve) - inc. Shaun, Rob, Brian, Gareth (Potential fix implemented)
  • Rob and Shaun to continue fixing CMS