RAL Tier1 weekly operations Fabric 20100614

From GridPP Wiki

Developments

  • All:
  • Martin:
    • Completed Disk ITT document
    • SSC/Oracle training
    • HEPSysMan
      • Site report for Tier-1
      • Disk Failure stats and network configuration
  • Ian:
    • CRISTAL 2
    • HEPSysMan and Quattor talk
    • Planning for Facilities Castor instance
  • Tim:
    • T10K migration work
    • Modify check_tape_pools scripts
    • Push 9940 migration
    • Monitor air quality situation
  • Jonathan:
    • updated RPMs on core servers and Nagios slave servers
    • worked on AFS presentation for Fabric team
    • Other People’s Business: SSTD Groundstation Event
    • issued new versions of RPMs tier1-nagios-plugins and tier1-nrpe-config
    • updated SVN source for RPM tier1-sudo-config (changes made directly on affected servers)
    • 2 Nagios configuration updates
    • HEPSysMan
  • James A:
  • James T:
    • On leave all week
  • Cheney:
    • tried and failed to fix SLS availability stats
    • fix DMF spool out of space
    • fix Samba gone haywire
    • investigate hinode website security alerts
    • added some servers to Nagios
    • upgrade tsbn spreadsheet
    • patching of Xen dev servers
    • bring up dCache Xen for testing
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems (R 27).
    • Testing Streamline 2009 disk servers.
    • gdss67: filesystem problem. James A is investigating it. (Intervention)
    • gdss469, 473, 474 and 476: replaced RAID controller cards with Matt V.
    • gdss474: probably a faulty backplane; arranging an engineer.
    • gdss207: tried installing it twice, but it didn't work.
    • gdss390: need to replace fan and memory.
    • gdss420: given back to Castor.
    • Streamline/Areca disk servers crashed due to a single faulty drive (ongoing).

Absences

  • James T away all week
  • Jonathan on partial retirement (not in on Monday and Friday)

Operational Issues and Incidents

Index Description Start End Severity Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component Description Start End Affected VO(s) Type

Development priorities

  • All:
  • Martin:
    • Prepare CPU ITT document draft
    • Move Repack servers
    • Staff review
    • Spend plans
  • Ian:
    • Virtualisation platform planning
    • Facilities Castor planning for Quattor
    • Away day planning
    • Preparation for CERN visit next week
  • Tim:
    • DMF single copy for BADC backup
    • Plan Facilities Castor install
  • Cheney:
    • Xen dCache testing
    • finish off tsbn upgrade
  • Jonathan:
  • James T:
    • Catch up after leave
    • Increase size of /var/lib/ganglia (RAM disk) on ganglia01
    • SL09 testing problems
    • Think about quattorising /etc/services changes for SL5 disk servers.
  • James A:
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Continuing decommissioning of old batch systems (R 27).

Absences

  • Jonathan on partial retirement (not in on Monday and Friday)
  • Jonathan on leave 15th (Tuesday) and 17th (Thursday) June
  • Martin A/L Friday PM.
  • Kashif A/L Tuesday and Thursday.

Fabric On-Call

Ian: primary on-call Monday-Sunday

Advanced Warning of Requirements and Blocking issues

Services Issues



Category:RAL_Tier1