Production Team Report 2010-03-29


RAL Tier1 Production Team Report for 29th March 2010.

AoD This Week

Mon & Tues: Tiju; Wed: Ian; Thu: Gareth; Fri: N/A

Last Week (15-19 March)

  • All: Training on batch system, training on CMS monitoring.
  • Gareth: AoD (<1 day), still preparing presentation for Open Day.
  • John: AoD (3+ days), tour training, fixed CMS migrations, looked at ATLAS file transfer problems, fixed weathermap emails, added disk servers to Nagios, discovered ld.so.conf problems on gdss346, deployed a disk server.
  • Tiju: Dashboard work, deployed disk server, updated Tier1 on-call contact list.

Changes to Operating procedures

  • The Tier1 contacts list has been separated out from the on-call rota. Please check at: https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/ContactList
  • Open Day (Tuesday):
    • A significant number of staff are either giving talks or leading tours. One member of each team is asked to be present in the office area to keep an eye on things.
    • All staff to check that their contact details on the Wiki are up to date.
    • Staff (primarily those not giving talks or tours) to remain vigilant, looking for failures in their areas to ensure rapid response should there be any issues.
    • Staff (where appropriate) to have mobile phones on (and charged!).
    • Restart Puppetmaster before the Open Day itself (Monday early afternoon), as it is known to degrade over time.
    • Note that racks in machine room will be locked for tours. (Fabric staff have keys.)
  • Easter preparations.
    • This is effectively a four-day weekend. Run ‘on-call’ as per a usual weekend.
    • Fabric on-call (James A) will make a check (remotely unless on-site intervention is needed) on disk failures part-way through the weekend.
    • On Thursday afternoon carry out the usual checks ahead of a weekend, checking there are no ‘loose ends’ etc. Check that migration candidates have been flushed out.

Declared Outages in GOC DB

  • Today (Monday) - At Risk on Castor for LSF license key update.

Advance Warning

  • Ongoing: Upgrade SL5 worker nodes to SL 5.4.