RAL Tier1 weekly operations Fabric 20100208

From GridPP Wiki

Latest revision as of 14:50, 8 February 2010

Summary of week gone

Developments

  • All:
  • Martin:
    • Minor procurements
  • Ian:
    • Carried out first CIP upgrade and prepared second.
    • Tested WN minor upgrade to SL5.4
    • Reviewed and updated some local Quattor docs
    • Made first contribution to SCDB Quattor docs at LAL
    • Reinstituted fabric automation steering group
  • James T:
    • Viglen '08 disk servers
      • 5 installed with production config
      • 22 finished testing over the weekend.
    • Quattorisation of disk servers
      • OPN routing
      • SSH lockdown
      • rc.local tuning
      • Script to import disk servers from hardware database
  • Jonathan:
    • Administrator on Duty (Wednesday)
    • restarted password cracker on enigma; updated iptables configuration to stop logging dropped packets
    • sorted out atlasbackup problems on various nodes
    • fixed ntpd process on lcgfts0423
    • NIS configuration changes
    • installed local NRPE and Ganglia configurations on ccse01
    • fixed access to Bfactory disk servers for userid bbdatsrv
    • Nagios configuration updates
    • new versions of RPMs tier1-nagios-plugins, tier1-nrpe-config and tier1-sudo-config
    • worked on Quattor configuration of Nagios slave servers; reinstalled new slave server (found Quattor bug)
  • James A:
    • Prepared and shipped equipment for integration at supplier's premises in preparation for delivery.
    • Quattorising various pieces of hand-configured functionality on quattor01 so that SINDES can be integrated.
    • Network cabling for CASTOR team.
    • Kickstart and Quattor trouble-shooting for various people.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems (R 27).
    • gdss211 completed 7 days acceptance test.
    • gdss150 and gdss226 fixed and given back to CASTOR.
    • gdss77: no display (found faulty memory). Intervention.
    • nc21 (lcg0280): found faulty memory. Intervention.
    • lcglb01 replaced drive with hotswap.
    • lcgvo-alice has offline sectors; started long SMART test (offline mode).
    • Moved streamline switches and other parts to (R56) logistics.
    • Replaced 9 faulty drives in Viglen 2008 disk servers with Viglen engineer.
    • Working on 2008 disk servers and worker nodes.
    • Working on gdss77, 282 and 364.

Absences

Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)
 | EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays | Tuesday 6/Oct am | UPS issues to be fixed | Catastrophic | All

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component | Description | Start | End | Affected VO(s) | Type

Development priorities

  • All
  • Martin:
    • Minor procurements
  • Ian:
    • Further research into virtualization platforms
    • Plan for rolling upgrade of WNs to SL5.4
    • Further work on integration of Castor fabric management into Fabric team
  • James T:
    • Quattorisation of disk servers.
      • Get CASTOR info directly from Overwatch
      • Testing
    • Re-install latest tranche (22) of Viglen '08 disk servers.
    • Writing Nagios checks
  • Jonathan:
    • Administrator on Duty (Wednesday)
    • implement cron job to run daily test restores of the home filesystem, with checks
    • complete work on installing Nagios slave server via Quattor
    • Nagios configuration updates
  • James A:
    • Continue with SINDES integration where possible.
    • Spend some time developing the Hardware Database with Kash.
    • Prepare machine room to accept deliveries of new hardware.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • lcgce07 drive replacement (hot swap).
    • gdss77 and gdss86: replace 4x 1 GB memory (recently bought by Martin).
    • Continue memory replacement work with Cheney.
    • Viglen 2006 (8) disk servers for decommissioning/preprod (label and configure).
    • Continue decommissioning old batch systems (R 27).
    • Continue working on 2008 disk servers and worker nodes.
    • Continue working on gdss77, 130, 282 and 364.

Absences

  • Jonathan - from this week, changing work pattern to 3 days per week (normally Tuesday, Wednesday, Thursday)

Fabric On-Call

Ian Mon-Sun

Advanced Warning of Requirements and Blocking issues

  • Unable to proceed with Atlas TAG migration to 64-bit because the arrays are being used for 3D systems while the EMC kit is flaky.

Services Issues

  • Various requests for hardware.
    • Working on hardware provision for Services team testbeds.

Category:RAL_Tier1

RAL Tier1 weekly operations fabric