RAL T1 weekly ops Fabric 20110725


Developments

  • All:
  • Tim:
    • T10KC drive issues
    • T10KC tape server work
    • Castor rack reconciliation
    • Work experience student
    • SGI Test kit install
    • Atlas tape pool consolidation
    • CMS tape pool consolidation
  • James A:
    • Added support for Nortel/Avaya switches to Observium.
    • Gave a tour to work experience students from ISIS.
    • Added checks to disk server deployment scripts to avoid multiple template snafus.
    • Changed IPs of Streamline 2008 WNs.
  • Cheney:
    • Backups - stability and performance improvements
    • DMF - testing of integrity and performance
    • Look through DNS for servers to back up
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old disk servers/batch systems.
    • Appointment with Physio.
    • gdss335 went down. (Kernel panic)
    • SL08 15 disk servers (with no errors) for deployment.
    • EMC PSU failure. (Report)
    • gdss208: re-created RAID array.
    • gdss193 back into production.
    • Replaced 10 drives in Viglen 2009 disk servers. (SMART errors)
    • Configured network ports and re-installed SL10 disk servers to fix network.
    • Added SL10 disk servers to Adaptec Storage Manager for monitoring.
    • gdss96 Kernel panic. (Started memory test)
  • Martin:
  • Ian:
    • On Leave


Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component | Description | Start | End | Affected VO(s) | Type

Development priorities

  • All:
  • Tim:
    • Install T10KC tape servers
    • Further T10KC tape drive problem investigation
    • SGI test kit testing
    • ADS shutdown work
    • Amanda install time-line planning
  • Cheney:
    • Backups
    • DMF testing
  • James A:
    • Change IPs of Viglen 2008 WNs.
    • Feed back patches to Observium.
    • Request and installation of certificates on all 2010 Storage Nodes.
    • Fixing and repackaging acceptance tests for distribution.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Hardware failure review and metrics continue.
    • Continue decommissioning old disk servers/batch systems. (R 27)
    • Continue labelling racks and systems in UPS and HPD room.
  • Martin:
  • Ian:
    • Helping Aslan (Nuffield Bursary student) get started
    • Install additional hypervisor
    • Start evaluation of additional EqualLogic array

Absences

  • Ian on leave Wednesday PM, Thursday and Friday
  • Tim on leave Thursday and Friday

Fabric On-Call

  • Ian Monday-Tuesday; Kash Wednesday - Sunday

Advanced Warning of Requirements and Blocking issues

Services Issues

