RAL Tier1 weekly operations Fabric 20100125


Summary of week gone

Developments

  • All:
    • Strategy away day
    • Metrics discussion
  • Martin:
    • Minor procurements
  • Ian:
    • Updated Quattor server
    • Prepared for CIP upgrade
    • Plans for batch01 updates
    • Plans and checks for WN kernel updates
  • James T:
    • "Mega intervention" preparation
    • CRISTAL2 support group.
    • Progress meeting with Viglen.
    • Updates to some systems.
    • Ongoing quattorisation of disk servers.
  • Jonathan:
    • removed old (redundant) directories from /home/tier1
    • wrote and test-ran a script that performs automatic test restores of the home filesystem
    • updated RPMs on AFS servers and rebooted for new kernel
    • added production disk servers gdss380-417 to NIS
    • created publicly accessible directory for local Nagios plugins and updated reference at www.gridpp.ac.uk to use this directory
    • Nagios configuration updates
    • updated NRPE configuration to add entry for check_log plugin
    • built alert2rss RPM for SL5 and installed on nagger (with dependencies) to allow notification by RSS to work
  • James A:
    • Solved rrdtool dependency problems on thor; the fix moved rrdtool from 1.2.x to 1.3.x.
    • Fixed display issues in ARTEMIS caused by the rrdtool version jump.
    • Created diagnostics utility boot menu for Quattorised machines.
    • Ported SINDES-CA to SL5 and co-hosted it on the Quattor server, rebuilt RPMs as RAL-specific version and issued first SINDES certificate!
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems (R 27).
    • gdss380 given back to Castor.
    • gdss148: added an additional RAID card (upgraded BIOS and firmware).
    • gdss66: replaced RAID card memory and gave it back to Castor.
    • Memory replacement with Castor team (Cheney).
    • afs2: replaced drive.
    • lcgfts01: replaced two drives.
    • gdss160 and 143 given back to Castor.
    • Working on 2008 disk servers and worker nodes.
    • Working on gdss70, 282 and 364.

Absences

  • Jonathan 1 day Sick Leave

Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)
 | EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays | Tuesday 6/Oct am | UPS issues to be fixed | Catastrophic | All

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component | Description | Start | End | Affected VO(s) | Type

Development priorities

  • All
    • Tier1 Strategy Day
    • Batch and Storage interventions
  • Martin:
    • Minor procurements
    • Network interventions
  • Ian:
    • CIP upgrades
    • lcgbatch01 interventions
    • WN kernel upgrades
    • other kernel upgrades
    • hepix virtualisation working group
  • James T:
    • Document procedures for "mega intervention".
    • Mega intervention
    • Ongoing quattorisation of disk servers.
    • Work on nagios checks
  • Jonathan:
    • implement cron job with checks to run daily test restores of home filesystem
    • implement change to restrict SSH login on disk servers
    • complete work on installing Nagios slave server via Quattor
    • update RPMs on various servers
    • Nagios configuration updates
  • James A:
    • Networking moves ahead of mega-intervention.
    • Finish last of IPMI cabling in CASTOR racks.
    • Finish SINDES-CA tweaks in Quattor.
    • Bring up SINDES-secured vhost for Quattor profiles.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Continuing memory replacement work with Cheney.
    • gdss211: run 7-day acceptance test.
    • Continuing decommissioning of old batch systems (R 27).
    • Continuing work on 2008 disk servers and worker nodes.
    • Continuing work on gdss70, 282 and 364.
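
The daily test restore of the home filesystem planned above needs some check that the restored data matches the source. The report does not describe how the real script does this, so the following is only a minimal sketch of one possible check, assuming that a restore has already been written to a scratch directory and that comparing SHA-256 checksums is acceptable. The function names and the directory layout are hypothetical.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_restore(original_root, restored_root):
    """Compare every regular file under original_root with its restored copy.

    Returns a sorted list of relative paths that are missing from the
    restore or whose contents differ. An empty list means the check passed.
    Both arguments are hypothetical paths chosen by the caller.
    """
    problems = []
    for dirpath, _dirnames, filenames in os.walk(original_root):
        for name in filenames:
            original = os.path.join(dirpath, name)
            relative = os.path.relpath(original, original_root)
            restored = os.path.join(restored_root, relative)
            if not os.path.isfile(restored):
                # File was not restored at all.
                problems.append(relative)
            elif file_digest(original) != file_digest(restored):
                # File was restored but its contents differ.
                problems.append(relative)
    return sorted(problems)
```

A cron job could call compare_restore on a small sample directory and report any non-empty result. Files modified since the backup was taken would show up as differences, so a real check would probably need to restrict itself to files that are known to be unchanged.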

Absences

  • Martin: A/L Thursday and Friday pm
  • Kashif: A/L Thursday

Fabric On-Call

James T: primary on call, Monday to Sunday


Advanced Warning of Requirements and Blocking issues

  • Unable to proceed with Atlas TAG migration to 64-bit because the arrays are being used for 3D systems while the EMC kit is flaky.

Services Issues

  • Various requests for hardware.
    • Working on hardware provision for Services team testbeds.
