RAL Tier1 weekly operations Fabric 20090921

From GridPP Wiki

Summary of week gone

Developments

  • All
  • Martin:
    • catch-up
    • CPU procurement ITT drafting
    • HEPiX travel details
  • Ian:
    • Refining Quattor WN install
    • Updated errata repositories
    • Work on Quest FP7 bid
    • Primary on call for much of the week
  • James T:
    • Continued to track the 2008 Viglen disk server problems, inc. disaster management meeting.
    • gdss164 back to production for BaBar.
    • Built SL5 64-bit s/w server for CMS via Quattor.
    • A/L Thursday, Friday
  • Jonathan:
    • investigated NIS error messages and added some nodes to NIS map netgroup
    • created wiki page describing how to create filesystem archives using Datastore volumes, including list of existing archives
    • reinstalled sv-08-04 using Quattor and IPMI over LAN to fix access problem
    • archived /stage/vo-sw-atlas/atlassgm to Datastore volumes and deleted directory to release space
    • revised list of users able to view yumit
    • Nagios configuration updates
  • James A:
    • Finished migrating ~73% of farm capacity to SL5 64-bit.
    • Migrated remaining SL4 capacity to lcgbatch01.
    • Debugged a few problems with new farm.
    • Wed-Fri: took on responsibility for HW.
  • Kash:
    • On leave.

Operational Issues and Incidents

Index Description Start End Severity Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component Description Start End Affected VO(s) Type

Development priorities

  • All
  • Martin:
    • Disk procurement ITT evaluation
    • Next steps in database migration plan
  • Ian:
    • Further work on Quest FP7 bid
    • Automation of repository updates and maintenance
    • Further work on organisation of Quattor templates
  • James T:
    • Viglen 2008 disk server problems.
    • Gen disk server tuning.
    • Quattorisation of disk servers and ganglia configs.
    • New disk servers to Overwatch.
  • Jonathan:
    • complete work for adding SGM userid for SuperNemo VO
    • work on moving Nagios slaves to new hosts managed by Quattor
    • work on migration of home filesystem to new hardware and new version of SL
    • work on moving NIS servers to new hosts managed by Quattor
    • Nagios configuration updates as required
  • James A:
    • Debug and fix routine segfaults of torque on batch01.
    • Start work on SINDES for secure credential distribution.
  • Kash:

Absences

Fabric On-Call

Advanced Warning of Requirements and Blocking issues

Services Issues

  • RT# 44835 – non-capacity HW for testing (Services)
