RAL Tier1 weekly operations Fabric 20090928

From GridPP Wiki

Summary of week gone

Developments

  • All
  • Martin:
    • Disk procurement ITT evaluation
    • CPU procurement ITT drafting
    • New RAC for LHCb3D, rebuilding (reformatting) Array2
  • Ian:
    • Work on Quest FP7 bid
    • Updated gLite update repositories and templates
    • Fixing problem with batch server with James A
    • Automation of repository updates and maintenance
    • Began work on procurements
  • James T:
    • Applied WAN tuning to gen disk servers
    • Enabled extended file system attributes on gdss51 (for CASTOR checksumming)
    • Fixed a problem with gmetric_ipmi where it was trying to create Ganglia RRDs with a '/' in the name.
    • Drive replacements in Kash's absence.
    • Viglen 2008 testing
      • Swapped drives in a _good_ and a _bad_ host and restarted tests to reconfirm that the problem followed the disks
      • Installed and tested new disk firmware on a rack of machines. Ongoing.
      • Telecon with Viglen on Friday
  • Jonathan:
    • Administrator on Duty (Wednesday)
    • deleted 9 databases named cms* from MySQL server
    • filesystem quota updates
    • with James A completed addition of new userid suprnsgm
    • Nagios configuration updates
    • worked on configuration of nagger (replacement Nagios master server)
    • rebooted nagiosdb to use latest kernel and started gmond service
    • updated RPM tier1-nrpe-config for CRL age check on SRM systems
  • James A:
    • Debugging problems with scheduler.
    • Extended QWG system schema to support scaling factors.
    • Modularised service node templates.
    • Created "sysadmin tools" template with useful tools for diagnosing node problems.
    • Created repo definition for Pakiti 2.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • gdss110: battery replaced; server fixed and given back to CASTOR.
    • gdss127 has been given back to CASTOR.
    • gdss243 reinstalled and given back to CASTOR.
    • gdss443 and gdss448 swapped drives.
    • Working on 2008 disk servers and worker nodes.
    • Working on gdss67, 78, 85, 86, and 270.

Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component | Description | Start | End | Affected VO(s) | Type

Development priorities

  • All
  • Martin:
    • Disk procurement ITT evaluation
    • CPU procurement ITT finalising
    • Meeting with Seagate about disk problems
  • Ian:
    • Quest FP7 bid meeting
    • Complete Automation of repository updates and maintenance
    • Work on test batch server
    • Further work on procurements
  • James T:
    • Quattor
      • Disk Servers
      • Ganglia
    • Viglen testing
    • Talk at GridPP storage meeting on our testing setup.
  • Jonathan:
    • complete configuration of nagger and migrate Nagios master service
    • migrate home filesystem to nfs1.gridpp.rl.ac.uk (from csfnfs02.rl.ac.uk)
    • work on moving Nagios slaves to new hosts managed by Quattor
    • work on moving NIS servers to new hosts managed by Quattor
    • Nagios configuration updates as required
  • James A:
    • Look at changes to Nagios database schema for v3.x
    • Start looking at SINDES.
    • Continue trying to diagnose CPU usage errors with a subset of nodes.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Continuing work on 2008 disk servers and worker nodes.
    • Continuing work on gdss67, 78, 85, 86, and 270.

Absences

Fabric On-Call

  • Mon-Sun: Ian (Primary On-Call)

Advanced Warning of Requirements and Blocking issues

  • Kernel patching to fix CVE-2009-2692 and CVE-2009-2698 is required to be completed by the end of this week. For systems where this is not possible, please discuss the issues with me. Kernels required:
    • SL3: 2.4.21-60
    • SL(C)4: 2.6.9-89.0.9 or better
    • SL(C)5: 2.6.18-128.7.1 or better
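
A check of the form "is this kernel at or above the required one" can be sketched as below. This is a minimal illustration, not a tool in use at the Tier1: the function names kernel_key and meets_minimum are hypothetical, and the comparison looks only at the leading numeric fields of the release string, which is a simplification of full RPM version ordering. It is only meaningful within one OS series (for example, an SL4 kernel against the SL4 minimum).

```python
import re


def kernel_key(release):
    """Turn a kernel release string into a tuple of its leading numeric fields.

    '2.6.9-89.0.9.EL' becomes (2, 6, 9, 89, 0, 9); parsing stops at the
    first non-numeric field such as 'EL' or 'ELsmp'.
    """
    key = []
    for field in re.split(r"[.-]", release):
        if not field.isdigit():
            break
        key.append(int(field))
    return tuple(key)


def meets_minimum(running, minimum):
    """Return True if the running release is at or above the minimum.

    Assumption: both strings belong to the same OS series.
    """
    have, want = kernel_key(running), kernel_key(minimum)
    # Pad the shorter tuple with zeros so '2.4.21-60' equals '2.4.21-60.0'.
    width = max(len(have), len(want))
    have += (0,) * (width - len(have))
    want += (0,) * (width - len(want))
    return have >= want


if __name__ == "__main__":
    import platform

    # The minimum here is the SL(C)4 value from the list above; pick the
    # value that matches the OS series of the host being checked.
    print(meets_minimum(platform.release(), "2.6.9-89.0.9"))
```

The fields are compared as integers rather than as strings, so that a release ending in 89.0.11 sorts above 89.0.9.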

Services Issues

  • RT# 44835 – non-capacity HW for testing (Services)
