RAL Tier1 weekly operations Fabric 20100809

From GridPP Wiki

Developments

  • All:
  • Martin:
  • Ian:
    • Worked on iSCSI & Virtualisation
    • Out two days for op
  • Tim:
    • On Leave
  • Jonathan:
    • fixed minor atlasbackup problems
    • wrote more archive tapes for old NFS filesystems
    • updated SVN source for RPM tier1-sudo-config
    • prepared spreadsheet of systems still powered on in R27/A5 Lower
    • 1 Nagios configuration update
    • entered job plan into SSC
    • fixed bug in oncall REXX program that was stopping callouts to OPS pager
  • James A:
    • Took over responsibility for minuting e-MROG meetings.
    • Started testing with new Quattor server.
  • James T:
    • Catch up
    • Added job plan to SSC
    • Acceptance testing Streamline 2009 kit
    • Work on gdss417
    • TDG talk
    • Disk server IPMI rollout plan
  • Cheney:
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems (R27).
    • gdss419 given back to Castor team.
    • Replaced 3 drives in Streamline 2009 (Test) disk servers.
    • bfcar01 replaced drive. (Transtec)
    • gdss475 given back to Castor team.
    • lcg1212 re-installed and batch enabled.
    • lcgfts02 replaced drive (sdb).
    • Hardware failure stats/graphs.
    • gdss452 given back to Castor team.
    • Preparing Viglen 2006 disk servers with new RAID configuration for Castor Preprod.
    • Streamline/Areca disk servers crashed due to a single faulty drive (ongoing).


Absences

  • Jonathan on partial retirement (not in on Monday and Friday)
  • Kash on sick leave Thursday

Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Entries of type Down, At Risk, or Cancelled that are in, or planned to go into, GOCDB

Component | Description | Start | End | Affected VO(s) | Type

Development priorities

  • All:
  • Martin:
  • Ian:
    • Virtualisation testbed
    • Castor facilities basics in Quattor
    • Planning for Atlas power outage
  • Tim:
    • Keep an eye on repack
    • SSC stuff
    • DMF small files
  • Cheney:
  • Jonathan:
  • James T:
    • Disk server IPMI rollout
    • Streamline 2009 acceptance testing
    • Disk server work in Kash's absence
    • Plan to migrate central loggers to disk server hardware
  • James A:
    • Add squid metrics to CVMFS server.
    • Continue testing new Quattor server.
    • Develop migration plan for new Quattor server.
    • Migrate direct thermal event paging to Tiju's new paging system.
    • Move OPN test off dcache-head.
    • Complete network cabling for CASTOR facilities instance.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • gdss417 crashed again. (Intervention)
    • Update daily status of Streamline 2009 disk servers testing.
    • Continue decommissioning old batch systems (R27).

Absences

  • Jonathan on partial retirement (not in on Monday and Friday)
  • Kash on sick leave Monday
  • Advance warning: James T on A/L Monday 16 to Tuesday 17 August.
  • Tim on leave 16 to 20 August

Fabric On-Call

  • Ian Fabric on-call Monday-Sunday

Advance Warning of Requirements and Blocking Issues

Services Issues

