RAL Tier1 weekly operations castor 21/03/2016

From GridPP Wiki
Revision as of 10:31, 18 March 2016 by Rob Appleyard

Operations News

  • 4 servers put into read-only in atlasStripInput as part of the plan to decommission 10% after servers go out of warranty (AD & BD); drain next week
  • CERN (Steve Murray) suggested deploying 2.1.16 to tape servers

Operations Problems

  • 2.1.15: problems with the configuration required in production to solve slow file open times. Andrey reports that CERN use 100GB of memory on the CASTOR DB servers to run 2.1.15 (vs our 32GB). Oracle are not providing adequate support at the moment. 2.1.15 deployment will not be scheduled for now.
  • Could not drain gdss702 (castor 2.1.15) in Preprod (all files failed according to draindiskserver -q) - does draining work in 2.1.15?
  • LHCb job failures - GGUS ticket still open
  • CMS AAA issues
  • OPN links. BD investigating what data flow is filling the OPN and SuperJANET at the same time.
  • CV '11 gen RAID card controller firmware update required - WD (Western Digital) drives failing at a high rate

Planned, Scheduled and Cancelled Interventions

  • 2.1.15 update to nameserver will not go ahead due to performance issues on the DB.


Long-term projects

  • RA has produced a Python script to handle the SRM DB duplication issue that is causing callouts. The script has been tested and will now be put into production as a cron job. This should be a temporary fix, so a bug report should be made to the FTS development team via AL.
  • JJ – Glue 2 for CASTOR, used for publishing information. RA is writing the data-getting end in Python; JJ is writing the Glue 2 end in Lisp. No schedule as yet.
  • Facilities drive re-allocation. ACTION: RA
  • SRM 2.1.14 testing with SdW on VCERT

Advanced Planning

Tasks

  • CASTOR 2.1.15 implementation and testing


Interventions

  • Juno reboot for patching - when?

Staffing

  • Castor on Call person next week
    • RA week of 21/03/16 onwards & discuss with RA on his return
  • Staff absence/out of the office:
    • RA - out Monday
    • SdW - out


New Actions

Existing Actions

  • BD to check that SNO+ FTS transfers are migrating to tape if required
  • BD MICE ticket
  • CC is there any documentation re handling broken CIPs? (raised following CIP failure at weekend)
  • GS Callout for CIP only in waking hours?
  • RA CV11 firmware updates
  • RA follow up with Fabric re: CV '11 gen RAID card controller firmware update
  • BD to clarify if separating the DiRAC data is a necessity
  • BD to ensure the ATLAS consistency check is quattorised
  • RA to deploy a 14 generation into preprod
  • BD re. WAN tuning proposal - discuss with GS, does it need a change control?
  • RA to try stopping tapeserverd mid-migration to see if it breaks.
  • RA (was SdW) to modify cleanlostfiles to log to syslog so we can track its use - under testing
  • GS to investigate how/if we need to declare xrootd endpoints in GOCDB BDII - progress