RAL Tier1 weekly operations Fabric 20110117

From GridPP Wiki
Revision as of 14:36, 24 January 2011 by James adams (Talk | contribs)

Developments

  • All:
  • Martin:
  • Ian:
    • Work on virtualisation
    • Preparatory work on cluster groups in Quattor
    • Setting up acceptance testing on new db nodes
    • Prep for CERN visit
  • Tim:
  • James A:
    • Working through issues with ClusterVision WNs with Dell.
    • Preparation for next Atlas power off.
    • Benchmarking.
    • Annual Leave on Wednesday.
  • James T:
    • Prep for ATLAS SL5 x86_64 upgrade
    • RAID controller summary to Sam
    • AFS L&D
  • Cheney:
    • Amanda performance testing
    • clear down ancient database backups
    • design model for disk server analysis
    • investigate DMF DR
    • investigate DMF rsync
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems (R 27).
    • gdss380 received from Streamline and moved into rack.
    • gdss606 fixed for testing.
    • gdss496 SCSI errors. (Intervention)
    • gdss305 and gdss327 given back to Castor team.
    • Fabric Hardware failure metrics.
    • Streamline/Areca disk servers crashed due to a single faulty drive. (Ongoing)
    • gdss576 and gdss577 not in testing. (Informed James T)
    • gdss337 kernel panic (faulty memory).
    • gdss283 crashed with a file system problem. (Intervention)
    • gdss68 ready for decommission.
    • SL 2010 and Viglen 2010 disk servers in testing.
    • SL 2009: automatic rebuild onto hot spare fails.


Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Entries of type Down, At Risk or Cancelled that are in, or planned to go into, GOCDB

Component | Description | Start | End | Affected VO(s) | Type

Development priorities

  • All:
  • Martin:
  • Ian:
    • Visiting CERN
    • HEPiX virtualisation working group meeting
    • Meeting with CVMFS developers
  • Tim:
  • Cheney:
    • DMF DR
    • DMF rsync
    • Prep for TDG talk
  • James T:
    • ATLAS SL5 x86_64 upgrade
    • First aid course Wednesday and Thursday
    • iSCSI and AFS L&D
  • James A:
    • Working through issues with ClusterVision WNs with Dell.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • SL 2009: automatic rebuild onto hot spare fails.
    • Hardware failure metrics continue.
    • SL08 testing.
    • Continuing decommissioning of old batch systems (R 27).

Absences

  • Ian out Tuesday-Monday - A/L Friday 21st and Monday 24th
  • JRHA out Wednesday (Annual Leave)

Fabric On-Call

  • Monday - Sunday - Kashif

Advanced Warning of Requirements and Blocking issues

Services Issues


RAL Tier1 weekly operations fabric

Category:RAL_Tier1