RAL Tier1 weekly operations Fabric 20100906


Developments

  • All:
  • Martin:
  • Ian:
  • Tim:
    • Investigate DMF backup issues
    • Monitor repack progress
    • Monthly stats
    • Investigate enlarging ADS virtual tape size
    • Tape drive issues
  • Jonathan:
  • James A:
  • James T (last fortnight)
    • GridPP 25
    • Returned gdss81 to service.
    • Resolved tuning errors in Quattor disk server configuration.
    • Liaised with Streamline to arrange swap out of all RAID controllers in the Streamline 2009 procurement.
    • Fixed mis-configured chkconfig for (3ware) disk servers in Quattor.
    • Work on new central loggers, required prior to Atlas power shutdown and to cope with increased log activity after CASTOR 2.1.9 upgrade.
  • Cheney
    • Untangle Quattor Gordian knot
    • Try to fix multipath error message on preprod DB


  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems. (R 27)
    • gdss110: replaced 4 x 1 GB memory.
    • Replaced 1 drive in Streamline 2009 (Test) disk servers.
    • gdss468 replaced drive in port 7 and given back to Castor. (Fixed)
    • gdss381 given back to Castor. (Crashed with single faulty drive)
    • gdss280 fixed hardware. (Intervention)
    • gdss81 file system went read-only.
    • gdss470 and gdss475 crashed with heavy load. (Fixed)
    • Hardware failure stats/graphs.
    • Preparing Viglen 2006 disk servers with new raid configuration for Castor Preprod.
    • Streamline/Areca disk servers crashed due to single faulty drive. (Ongoing)


Absences

  • Jonathan on partial retirement (not in on Monday and Friday)
  • Bank Holiday Monday / Privilege day Tuesday

Operational Issues and Incidents

Index Description Start End Severity Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component Description Start End Affected VO(s) Type

Development priorities

  • All
  • Martin:
  • Ian:
  • Tim:
    • Fix DMF backup problems
    • Test ADS virtual tape size increase
    • Speed up repack
    • Facilities castor progress
    • Usage numbers for non-LHC systems
  • Cheney
    • Continue grappling with Quattor Gordian knot
    • Prep buxton for swap-in
  • Jonathan:
  • James T:
    • Cover disk faults in Kash's absence.
    • Test area risk assessment.
    • Test of CASTOR 2.1.9 upgrade on disk servers.
    • Plan move of disk servers to UPS power.
    • Preparation of new machines needed before Atlas power shutdown.
    • Liaise with Streamline regarding testing of 2009 kit.
  • James A:
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • IPMI web interface configuration with James Thorne.
    • gdss470 and gdss475: chase vendor about logs/fault.
    • Update daily status of Streamline 2009 disk servers testing.
    • Continue decommissioning old batch systems. (R 27)

Absences

  • Jonathan on partial retirement (not in on Monday and Friday)

Fabric On-Call

  • Kashif Hafeez

Advanced Warning of Requirements and Blocking issues

Services Issues

