RAL Tier1 weekly operations Fabric 20100201

Summary of week gone

Developments

  • All:
    • Strategy meeting
  • Martin:
    • Minor procurements
  • Ian:
    • Upgraded CIP filesystem layouts
    • Upgraded batch server binaries
    • Upgraded kernels on SL5 WNs
    • Planning for handover of fabric management for Castor systems
  • James T:
    • "Mega intervention" preparation/documentation
    • Mega Intervention
    • First Viglen '08 disk servers out of testing.
    • Ongoing quattorisation of disk servers.
    • Primary on call
  • Jonathan:
    • added new NIS groups and created new pool accounts
    • checked SSH problem on lcgdb05; removed special userids oracle, lsfadmin, stage and corresponding groups oinstall, lsfadmin, st from NIS (NIS entries sometimes take precedence over local entries whatever the setting of /etc/nsswitch.conf; this can cause system problems)
    • updated RPMs on core systems and rebooted where required
    • reconfigured and restarted ntpd on lcgvo0425 (updating the ntp RPM can sometimes lose the local NTP configuration)
    • Nagios configuration updates
    • reinstalled and reconfigured nagios04 after disk replacement
  • James A:
    • Networking preparations ahead of mega-intervention.
    • Added snapshotting feature to cacti weather-map.
    • Finished cabling IPMI ports in castor racks B&E.
    • Updated certificate on t1pg0373.
    • Fixed bug in check_spma for handling rotated logs.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems (R27).
    • gdss211 running 7 days acceptance test.
    • gdss70 given back to castor. (Fixed)
    • gdss77 no display. (Found faulty memory) - Intervention
    • gdss87 given back to castor for testing.
    • nagios04 replaced drive.
    • gdss170 given back to castor.
    • Moved switches and cables from R27 with James A.
    • Working on 2008 disk servers and worker nodes.
    • Working on gdss77, 282 and 364.
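
Jonathan's note above about NIS entries shadowing local ones suggests a simple audit: list every account or group name that is defined both locally and in NIS. The following is only a minimal sketch of that idea, not the procedure actually used on lcgdb05. The function names are hypothetical, and it assumes both inputs are in the usual colon-separated passwd/group format.

```python
def parse_names(passwd_text):
    """Return the set of names (first colon-separated field) found in
    passwd- or group-format text."""
    names = set()
    for line in passwd_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and NIS compat markers such as "+::::::"
        if not line or line.startswith(("#", "+", "-")):
            continue
        names.add(line.split(":", 1)[0])
    return names


def find_overlaps(local_text, nis_text):
    """Return the names defined both locally and in NIS, sorted so the
    output is stable between runs."""
    return sorted(parse_names(local_text) & parse_names(nis_text))
```

In practice the two inputs could be the contents of /etc/passwd and the output of `ypcat passwd` (or /etc/group and `ypcat group`). Any name returned is a candidate for removal from one of the two sources.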

Absences

  • Jonathan (1/2 day, domestic reasons)

Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)
- | EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays | Tuesday 6/Oct am | UPS issues to be fixed | Catastrophic | All

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component Description Start End Affected VO(s) Type

Development priorities

  • All
  • Martin:
    • Minor procurements
  • Ian:
    • Upgrading and reconfiguring CIPs
    • Work with Catalin on Quattorising further grid services nodes
    • Quattor documentation
    • (Re-)Instituting steering group for Fabric automation project
    • Researching Virtualisation platform options
  • James T:
    • Ongoing quattorisation of disk servers.
    • Install first Viglen '08 disk servers.
    • Writing nagios checks
    • Apply WAN tuning
  • Jonathan:
    • implement cron job with checks to run daily test restores of home filesystem
    • complete work on installing Nagios slave server via Quattor
    • Nagios configuration updates
  • James A:
    • Two days of SINDES integration.
    • Connect uplinks to CASTOR IPMI switches.
    • Ensure IPMI on CASTOR boxes comes up.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • lcglb01 drive replacement. (Hot swap)
    • Continuing work (memory replacement) with Cheney.
    • Continuing to decommission old batch systems (R27).
    • Continuing work on 2008 disk servers and worker nodes.
    • Continuing work on gdss77, 282 and 364.
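
The daily test restore planned in Jonathan's list above could be verified by comparing a restored file against its original. The sketch below is one possible shape for such a check, not the planned implementation. The function names are hypothetical, and it assumes the restore has already been written to a scratch path and that the chosen test file does not change between backup and check.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks so that
    large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_restore(original_path, restored_path):
    """Compare a restored file with its original.

    Returns (exit_code, message) using the Nagios plugin convention:
    0 = OK, 2 = CRITICAL.
    """
    try:
        same = sha256_of(original_path) == sha256_of(restored_path)
    except OSError as exc:
        # A missing or unreadable file means the restore cannot be trusted
        return 2, "CRITICAL: cannot read file: %s" % exc
    if same:
        return 0, "OK: restored copy matches original"
    return 2, "CRITICAL: restored copy differs from original"
```

A cron wrapper could run the restore, call `check_restore`, print the message and exit with the returned code, so that the result can also be picked up by Nagios.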

Absences

  • Jonathan - as from week beginning 8th February, changing work pattern to 3 days per week (normally Tuesday, Wednesday, Thursday)

Fabric On-Call

Ian Primary on call

Advanced Warning of Requirements and Blocking issues

  • Unable to proceed with Atlas TAG migration to 64-bit because the arrays are being used for 3D systems while the EMC kit is flaky.

Services Issues

  • Various requests for hardware.
    • Working on hardware provision for Services team testbeds.
