RAL Tier1 weekly operations castor 22/02/2010

From GridPP Wiki
Revision as of 16:50, 22 February 2010 by Matt viljoen (Talk | contribs)


Summary of Previous Week

  • Matthew:
    • Production resiliency investigations
    • CoD work
    • Facilities evaluation support
    • CASTOR database - Disaster recovery coordination
  • Shaun:
    • ..
  • Chris:
    • Testing the maximum number of job slots on a per-protocol basis (still ongoing)
    • Working on the LHCb d2d (disk-to-disk copy) problem
    • Investigating the ATLAS problem with jobs being dispatched to full disk servers, and LSF mbatchd memory usage. Testing fixes that might improve the situation.
    • Investigating CMS slowness in the scheduler
  • Cheney:
    • ..
  • Tim:
    • Investigating CMS migration backlog
    • Working on new hardware
  • Richard:
    • ..
  • Brian:
    • Investigating LSF jobs pending to full disk servers in ATLAS
    • Disk deployment strategy and prioritisation of outstanding DD tickets
  • Jens:
    • ..

Developments for this week

  • Matthew:
    • CASTOR database - Disaster recovery coordination
    • Forming our 2.1.8/2.1.9 strategy
    • Coordinating interventions
  • Shaun:
    • SRM development
    • Investigating LHCb disk to disk copy problems
    • CoD work
  • Chris:
    • Continue testing the number of job slots on a per-protocol basis
    • Do some work with polymorphic machines
    • Concentrate on Quattor Tape Server
    • Start preparing test infrastructure for CASTOR upgrades
    • Finish testing fixes for ATLAS
  • Cheney:
    • ..
  • Tim:
    • None, I'm on leave for most of it :-)
  • Richard:
    • ..
  • Brian:
    • Testing a solution to LSF jobs pending to full disk servers in ATLAS
  • Jens:
    • ..

Operations Issues

  • LSF slowdowns affecting ATLAS (Sunday, Thursday)
  • Unscheduled reboot of a Pluto node caused a 10-minute hang (Tuesday)
  • Jobs writing into CMSWanIn are taking a disproportionate amount of time. This seems to coincide with high activity writing into CMSWanOut.

Blocking issues

  • None

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

Description                              Start             End               Type      Affected VO(s)
NFS reconfiguration on database          23/02/2010 10:00  23/02/2010 16:00  At-risk   All
Clusterware reconfiguration on database  24/02/2010 10:00  24/02/2010 11:00  Downtime  All

Advanced Planning

  • Gen upgrade to 2.1.8 (2010 Q1)
  • Install/enable gridftp-internal on Gen (this year/before the 2.1.8 upgrade)

Staffing

  • CASTOR on-call person: Shaun
  • Staff absences: Cheney (Mon), Tim (Tue-Fri)