Production Team Report 2010-03-29
From GridPP Wiki
Revision as of 11:15, 29 March 2010 by Gareth smith (Talk | contribs)
RAL Tier1 Production Team Report for 29th March 2010.
AoD This Week
Mon & Tues: Tiju; Wed: Ian; Thu: Gareth; Fri: N/A
Last Week (15-19 March)
- All: Training on batch system, training on CMS monitoring.
- Gareth: AoD (<1 day), still preparing presentation for Open Day.
- John: AoD (3+ days), tour training, fixed CMS migrations, looked at ATLAS file transfer problems, fixed weathermap emails, added disk servers to Nagios, discovered ld.so.conf problems on gdss346, deployed a disk server.
- Tiju: Dashboard work, deployed disk server, updated Tier1 on-call contact list.
Changes to Operating Procedures
- The Tier1 contacts list has been separated out from the on-call rota. Please check at: https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/ContactList
- Open Day (Tuesday):
  - A significant number of staff are either giving talks or leading tours. One member of each team is asked to be present in the office area to keep an eye on things.
  - All staff to check that their contact details on the Wiki are up to date.
  - Staff (primarily those not giving talks or tours) to remain vigilant, looking for failures in their areas to ensure a rapid response should there be any issues.
  - Staff (where appropriate) to have mobile phones on (and charged!).
  - Restart Puppetmaster before the Open Day itself (Monday early afternoon), as it is known to degrade over time.
  - Note that racks in the machine room will be locked for tours. (Fabric staff have keys.)
- Easter preparations:
  - This is effectively a four-day weekend. Run ‘on-call’ as per a usual weekend.
  - Fabric on-call (James A) will check for disk failures part way through the weekend (remotely, unless an on-site intervention is needed).
  - On Thursday afternoon, carry out the usual checks ahead of a weekend, checking that there are no ‘loose ends’ etc. Check that migration candidates have been flushed out.
Declared Outages in GOC DB
- Today (Monday) - At Risk on Castor for LSF license key update.
Advance Warning
- Ongoing: Upgrade SL5 worker nodes to SL 5.4.