RAL Tier1 weekly operations Fabric 20100927

From GridPP Wiki

Developments

  • All:
  • Martin:
    • Final iterations of disk orders
    • CPU ITT
    • Prep for Atlas power down w/e 1 Oct.
    • Intervention on Array 2 (Oracle backups array).
  • Ian:
    • Some work on virtualisation evaluation
    • cvmfs evaluation
    • Organising eScience StratusLab talk for October
    • Moving servers in preparation for Atlas power down
  • Tim:
  • Jonathan:
  • James A:
    • Lots of work benchmarking nodes for tender.
    • Re-cabling Service 4 rack in UPS room.
    • Preparing and applying security updates across the farm.
  • James T:
    • LHCb CASTOR 2.1.9 upgrade preparation
    • Atlas power off preparation:
      • New loggers built
      • Provided replacement for gdss51 (dteamTest SAM test box)
      • Provided new box to replace csfnfs58 (non-LHC VO software server) and performed initial rsync of data.
    • A/L Wednesday PM


  • Cheney:
    • Quatt the castor facilities
    • Patching


  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems. (R 27)
    • gdss110 fsprobe errors. (Acceptance testing)
    • gdss380 crashed during acceptance testing. (Crashed with single faulty drive)
    • gdss417 started acceptance testing. (Crashed with single faulty drive)
    • lcgfts01 crashed because of second drive failure. (sdb)
    • gdss280 acceptance testing. (Intervention)
    • lcglb01 faulty drives reported to Streamline.
    • gdss490 taken by Streamline for fix.
    • Hardware failure stats/graphs.
    • Moved PAT, Wyett, Morgan, Virgil and xrootd systems (602, 603) from Atlas to R89.
    • Preparing Viglen 2006 disk servers with new RAID configuration for Castor Preprod.
    • Streamline/Areca disk servers crashed due to single faulty drive. (ongoing)


Absences

  • Jonathan on partial retirement (not in on Monday and Friday)
  • James T working at home Monday from 11.00 due to leaking mains water supply.
  • Cheney at dentist on Friday

Operational Issues and Incidents

Index Description Start End Severity Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component Description Start End Affected VO(s) Type

Development priorities

  • All:
  • Martin:
    • Finalising CPU ITT
    • Common technology discussions
    • Organise HEPiX trip
    • Preparation for Atlas weekend power down
  • Ian:
    • Continue work on Virtualisation platform
    • Preparation for Quattor Workshop
    • Work on prospective Ganga bid


  • Tim:
  • Cheney:
    • Quatt the facilities
    • Power down Atlas
    • Prep kiki
  • Jonathan:
  • James T:
    • LHCb CASTOR 2.1.9 upgrade
    • Atlas power off preparation
    • Acceptance tests on SL09 machines
    • Prepare for A/L
  • James A:
    • Preparing replacement cacti box.
    • Cleaning up tail end of security updates on farm.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Update daily status of Streamline 2009 disk servers testing.
    • Continuing decommissioning of old batch systems. (R 27)

Absences

  • Jonathan on partial retirement (not in on Monday and Friday)
  • Advance warning: James T on A/L 2-13 October

Fabric On-Call

  • James T

Advanced Warning of Requirements and Blocking issues

Services Issues

