RAL Tier1 weekly operations castor 13/07/2015

List of CASTOR meetings

Operations News

  • Proposed CASTOR face to face W/C Oct 5th or 12th
  • Update on failure of GDSS763 (atlasStrip): CPUs replaced, drives returned to the original hardware, and GDSS782 (the donor headnode) also returned to its original state. Both servers have passed acceptance testing and are ready to be returned to production.

--old--

  • Juno (CASTOR Facilities) Oracle update to 11.2.0.4
  • xroot update for Alice
  • DiRAC data being written
  • Brian investigating rebalancing changes
  • Change to improve file open times on CASTOR (central db, subrequest todo procedure) - LHCb open times almost halved, additional tuning next week
  • Checksum issues – Shaun’s tool has been run on all VOs; currently investigating further changes following discussion with CERN (see the checksum-verification sketch after this list).
  • Draining on Atlas
  • Facilities CASTOR - time before files are written to tape reduced from 30 mins to 5 mins
  • CASTOR rebalancing underway
  • SL6 - tape x3 in Facilities completed, 9 to go
  • New LHCb tape pools created
  • Mice (Castor Gen) will be operating overnight and are able to call the primary on-call
  • All tape servers have now been upgraded to SL6 and are running smoothly
  • On a related note, the disk server SL6 configuration is ready but waiting for the Oracle updates to be completed.
    • We are examining options for running this in a slow-and-steady fashion with CASTOR up.
  • 13-generation disk servers are being prepared for deployment into CASTOR production
  • We are examining options for the upgrade of the CASTOR DBs to Oracle version 11.2.0.4. The experiments are keen to avoid downtime early in run 2, so some careful scheduling will be necessary.
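
As background to the checksum work noted above, the following is a minimal, illustrative sketch (not Shaun's actual tool) of a checksum verification pass: it recomputes the adler32 of each file under a disk server data directory and compares it with the checksum recorded in the catalogue. The lookup_catalogue_checksum helper is a hypothetical placeholder for the real catalogue query.

  #!/usr/bin/env python
  # Illustrative checksum-verification sketch (not the production tool).
  # Recomputes the adler32 of every file under a data directory and compares it
  # with the checksum recorded for that file in the catalogue.
  import os
  import sys
  import zlib

  def adler32_of_file(path, blocksize=4 * 1024 * 1024):
      """Return the adler32 of a file as an 8-digit lowercase hex string."""
      value = 1  # adler32 seed
      with open(path, 'rb') as f:
          for block in iter(lambda: f.read(blocksize), b''):
              value = zlib.adler32(block, value)
      return '%08x' % (value & 0xffffffff)

  def lookup_catalogue_checksum(path):
      """Hypothetical placeholder: return the checksum recorded in the
      catalogue (name server) for this file, or None if none is recorded."""
      raise NotImplementedError

  def verify_tree(datadir):
      for dirpath, _dirs, files in os.walk(datadir):
          for name in files:
              path = os.path.join(dirpath, name)
              expected = lookup_catalogue_checksum(path)
              if expected is None:
                  print('NO-CHECKSUM %s' % path)
              elif adler32_of_file(path) != expected.lower():
                  print('MISMATCH    %s' % path)

  if __name__ == '__main__':
      verify_tree(sys.argv[1])  # path to the disk server data directory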


Operations Problems

  • GDSS657 (LHCb tape-backed) failed. 31 files were in canbemigr state; 6 of these had failed checksums and the rest were copied back into CASTOR.

--old--

  • Lots of checksum errors this week – LHCb and some others; under investigation at this point (need to rule out a connection to the rebalancing reconfiguration)
  • There are some problems with the information provider for DiRAC - reporting disk only etc
  • A flood of SRM DB duplicates on Atlas - all cleared Fri 26th
  • Data loss on CASTOR Facilities (Diamond), which appears to have been caused by the failure of fdsdss30; files were not written to tape, possibly because of a bug in CASTOR (Matt doing post mortem). The files are still present on Diamond's network and will be rewritten.
  • Fermilab transfers: 50MB/s for some and 0.1MB/s for others; might be an issue with a particular server at Fermilab, Brian investigating
  • The gridmap file on the webdav host lcgcadm04.gridpp.rl.ac.uk is not auto-updating - needed for LHCb and vo.dirac.ac.uk
  • Atlas running out of space (read-only servers being included in the free-space accounting is not helping)
  • CLF VO issues - configuration error corrected
  • GDSS711 cmsDisk – issue around startup of xroot needs to be considered further
  • AtlasStripInput is filling up - need to deploy new servers
  • CMS file-open issue - unscheduled xroot reads have been allowed / needs to be documented for on-call
  • Atlas files without NS checksums causing file copies to fail (may affect other VOs)
  • Possible problem identified when creating the new service class for DiRAC; castor external emailed
  • Standby DBs for CASTOR are occasionally 10 or 15 mins behind and then return to sync (see the lag-check sketch after this list)
  • xroot redirection - works with our redirector but not others; this started during a past upgrade - Shaun debugging.
    • We have determined that the most serious incidence of this problem is due to a number of hot datasets that are located almost entirely on one node. Shaun has implemented a process to redistribute this data across the rest of the cmsDisk pool.
  • Retrieval errors from Facilities CASTOR - a number of similar incidents seen in the last few weeks.
  • GDSS757 cmsDisk / 763 atlasStripInput - new motherboards, fabric acceptance test complete, ready for deployment.
  • A higher number of checksum errors (Alice) - found to be due to VO actions. This is being cleared up.
  • Atlas are putting files (sonar.test files) into un-routable paths - this looks like an issue with the space token used. Brian working on this.
    • This is related to a problem with latest gfal libraries - not a new problem but Atlas are starting to exercise functionality and identifying these issues
  • castor functional test on lcgccvm02 causing problems - Gareth reviewing.
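
For the standby-DB lag noted above, a minimal sketch of how the lag reported by an Oracle standby could be polled via v$dataguard_stats. It assumes the standbys are managed by Data Guard and that the cx_Oracle module is available; the connection string is a placeholder, not a real account.

  #!/usr/bin/env python
  # Illustrative sketch: poll the apply/transport lag reported by an Oracle
  # standby database via v$dataguard_stats. Assumes a Data Guard setup and the
  # cx_Oracle module; the connection string below is a placeholder.
  import cx_Oracle

  def get_lag(connect_string):
      """Return a dict mapping 'apply lag'/'transport lag' to interval strings."""
      conn = cx_Oracle.connect(connect_string)
      try:
          cur = conn.cursor()
          cur.execute(
              "SELECT name, value FROM v$dataguard_stats "
              "WHERE name IN ('apply lag', 'transport lag')"
          )
          return dict(cur.fetchall())
      finally:
          conn.close()

  if __name__ == '__main__':
      lags = get_lag('monitor/secret@standby-db')  # placeholder credentials/DSN
      for name, value in sorted(lags.items()):
          print('%-14s %s' % (name, value))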


Blocking Issues

  • GridFTP bug in SL6 - stops any globus copy if a client is using a particular library. This is a show-stopper for SL6 on the disk servers.


Planned, Scheduled and Cancelled Interventions

  • Stress test SRM; possible deployment the week after (Shaun)


Advanced Planning

Tasks

  • Proposed CASTOR face to face W/C Oct 5th or 12th

--old--

  • Disk deployments (SL6) - dev mostly complete, deployment planning needed
  • srm 2.1.14 - functional testing positive / stress testing required
  • Switch from admin machines: lcgccvm02 to lcgcadm05
  • Correct the partition alignment issue (3rd CASTOR partition) on new CASTOR disk servers (see the alignment-check sketch after this list)
  • Intervention to upgrade CASTOR DBs to Oracle 11.2.0.4
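
For the alignment task above, a minimal sketch of how a partition's alignment could be checked from sysfs. It assumes 512-byte sectors and a 1 MiB alignment target; the device and partition names are examples only.

  #!/usr/bin/env python
  # Illustrative sketch: check whether a partition starts on a 1 MiB boundary.
  # Reads the start sector from sysfs; assumes 512-byte sectors and a 1 MiB
  # (2048-sector) target alignment. Device/partition names are examples.
  import sys

  SECTORS_PER_MIB = 2048  # 1 MiB / 512-byte sectors

  def partition_start(disk, part):
      """Return the start sector of e.g. part='sda3' on disk='sda'."""
      with open('/sys/block/%s/%s/start' % (disk, part)) as f:
          return int(f.read().strip())

  if __name__ == '__main__':
      disk, part = sys.argv[1], sys.argv[2]  # e.g. sda sda3
      start = partition_start(disk, part)
      aligned = (start % SECTORS_PER_MIB == 0)
      print('%s starts at sector %d: %s' % (part, start,
                                            'aligned' if aligned else 'MISALIGNED'))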

Interventions


Staffing

  • Castor on Call person next week
    • Rob
  • Staff absence/out of the office:
    • Rob out Monday afternoon
    • Chris out Wed morning


Actions

  • Shaun to look into GC improvements - notify if a file is in an inconsistent state
  • Shaun to replace the canbemigr test - to test for files that have not been migrated to tape (warn 8h / alert 24h); see the check sketch after this list
  • Rob/Jens to look at information provider re DiRAC (reporting disk only etc)
  • All to book meeting with Rob re draining / disk deployment / decommissioning ...
  • Rob to look into procedural issues with CMS disk server interventions
  • Bruno to document processes to control services previously controlled by puppet
  • Gareth to arrange meeting castor/fab/production to discuss the decommissioning procedures
  • Gareth to investigate providing checks for /etc/noquattor on production nodes & checks for fetch-crl - ONGOING
  • Rob to remove Facilities disk servers from cedaRetrieve to go back to Fabric for acceptance testing.
  • Rob to get jobs thought to cause CMS pileup
  • Bruno to put SL6 on preprod disk
  • Bruno / Rob to write change control doc for SL6 disk
  • Shaun testing/working on the gfalcopy RPMs
  • Someone to find out which access protocol Mice use
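
For the replacement migration check above, a minimal sketch of the warn-8h / alert-24h threshold logic using Nagios-style exit codes. How the list of files still awaiting migration to tape is obtained is left as a hypothetical placeholder.

  #!/usr/bin/env python
  # Illustrative sketch of a warn-8h / alert-24h check for files that have not
  # yet been migrated to tape, using Nagios-style exit codes (0 OK, 1 WARNING,
  # 2 CRITICAL). The source of the file ages is left as a placeholder.
  import sys

  WARN_HOURS = 8
  CRIT_HOURS = 24

  def unmigrated_file_ages_hours():
      """Hypothetical placeholder: return a list of ages (in hours) of files
      that are still waiting to be migrated to tape."""
      raise NotImplementedError

  def main():
      ages = unmigrated_file_ages_hours()
      oldest = max(ages) if ages else 0.0
      if oldest >= CRIT_HOURS:
          print('CRITICAL: oldest un-migrated file is %.1fh old' % oldest)
          return 2
      if oldest >= WARN_HOURS:
          print('WARNING: oldest un-migrated file is %.1fh old' % oldest)
          return 1
      print('OK: %d file(s) pending migration, oldest %.1fh' % (len(ages), oldest))
      return 0

  if __name__ == '__main__':
      sys.exit(main())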

Completed actions

  • Rob/Gareth to write some new docs to cover oncall procedures for CMS with the introduction of unscheduled xroot reads
  • Rob/Alastair to clarify what we are doing with 'broken' disk servers
  • Gareth to ensure that there is a ping test etc to the atlas building