RAL Tier1 weekly operations Fabric 20100419

From GridPP Wiki
Revision as of 10:02, 21 April 2010 by Jonathan wheeler (Talk | contribs)

Developments

  • All:
  • Martin:
  • Ian:
  • Tim:
    • Installing new ADS servers (three done, one to go).
    • Configuring new disk on DMF service.
    • New enlarged DMF area put into production to improve stability of system
    • Investigating CMS migration problems
  • Cheney:
    • Racked up the Infortrends
    • Wrote some documentation on CASTOR
    • Started on documentation for DR
    • Wrote up next year's job plan
    • Planned a new backup server for the DB
    • Wrote change control documentation
  • James T:
    • GridPP storage workshop Monday and Tuesday
    • GridPP 24 Wednesday and Thursday
    • Worked on SL5 disk server build with Ian
    • Played with Hadoop Distributed File System
  • Jonathan:
    • On leave
  • James A:
    • Swapped RAM in gdss370 and gdss405 to expedite gdss405's return to service.
    • Updated ARTEMIS room images.
    • Ran HEPSPEC2006 ten times on all new Viglen 2009 WNs to verify performance.
    • Generated Weathermap video for Gareth.
    • Wrote a Nagios plugin to check for web connectivity.
    • Prepared Quattor for Streamline 2009 WNs.
    • Switched ARTEMIS to XML feed monitoring of temperatures for a 100x increase in precision.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems (R 27).
    • gdss405 kernel panic (faulty memory); fixed by James A.
    • propod1 reported single PSU failure (Transtec).
    • lcg1235/1236 BIOS settings updated by Streamline engineer.
    • ccse03 faulty PSU reported (intervention).

Absences

  • Jonathan on leave all week
  • Kashif on annual leave (Tuesday)
  • James T at GridPP until Thursday
  • Tim working at home Monday

Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component | Description | Start | End | Affected VO(s) | Type

Development priorities

  • All:
  • Martin:
  • Ian:
  • Tim:
    • Finish installing ADS servers.
    • Finish configuring new ADS servers (Nagios, Ganglia, etc.)
    • Continue investigating CMS migration problems
    • Investigate Atlas recall problems
    • APR/Job plans
    • DB Backup procedures
    • SSC finance training Thursday
  • Cheney:
    • Last year's APR/job plan
    • Evaluate management utilities for the SRB team
    • Write more documentation
    • Install Nagios on new ADS servers
  • James T:
    • Fix disk server kickstart issues
    • Quattor disk server build tweaks
    • Follow up problems with new disk servers with the vendor
    • APR and job plan
    • SSC finance training Tuesday morning
  • Jonathan:
    • Catch up after being away
    • APR and job plan
    • SSC finance training Tuesday morning
  • James A:
    • Sort out flaky ARTEMIS base unit in LPD room.
    • Begin acceptance testing Streamline 2009 WNs when handed over.
    • Establish Viglen 2009 WN compatibility with SL4.
    • Reclaim Nortel 5510 from CASTOR rack.
    • Put Viglen 2009 WNs into production.
    • Investigate cool-sounding PBS idle WN control.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Continue decommissioning old batch systems (R 27).

Absences

  • Jonathan on partial retirement (not in on Monday and Friday)

Fabric On-Call

Advanced Warning of Requirements and Blocking issues

Services Issues

