RAL Tier1 weekly operations Fabric 20100426

From GridPP Wiki

Developments

  • All:
  • Martin:
    • @ HEPiX (virtually)
    • @ HEPiX Virtualisation F2F (virtually)
  • Ian:
    • Attending HEPiX remotely
    • Some work on Castor tape server & Quattor
  • Tim:
    • CMS migration issue
    • ADS hardware install finished.
    • Talking to IBM about maintenance contracts for the above
    • DMF tape problems, draining some tapes.
    • Configuring new tape server for T10KB testing. Now working OK.
  • Cheney:
    • testing of new backup server for database backups
    • documentation: tsbn & sls instructions, added c2probe, fixed hardware info errors
    • did some investigations for SRB chaps
    • installed Nagios on AIX (ADS)
    • mucho fiddling with Nagios on AIX to get it working
    • set up config files for new backup server
    • sorted out array crash on preprod db
  • James T:
    • Worked on adding SL4.8 to Quattor (for Viglen '09 disk)
    • SL5 disk server build
    • Met face to face with Streamline regarding disk problems
    • Resolved kickstart problems (thanks to John K. for building a new RPM)
    • Fixed tuning errors on cmsFarmRead
    • SSC finance training
    • Security group task planning
  • Jonathan:
    • fixed space problem on lcgui01 (/tmp) by arranging deletion of old files
    • fixed load problems on install01/02 by killing old lftp processes
    • fixed atlasbackup problem on several nodes
    • updated iptables on several systems to fix connection problems for new Nagios slave
    • fixed yum problem on enigma
    • added AFS userid atlas147 and increased quota for volume atlassw for Atlas software installation testing
    • updated NIS netgroup to add new batch workers
    • created directory /scratch/jacksonj for James Jackson on lcgui02 for CMS development work
    • Nagios configuration updates
    • Oracle Finance Self-Service course
  • James A:
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems (R 27).
    • Daily hardware failures status of Streamline 2009 disk servers to James T.
    • Cabling in HPD room with James A.
    • Faxed Viglen copies of dispatch notes as proof of delivery.
    • Castor preprod: replaced 1U power supply (fixed).
    • Castor ccse03: replaced motherboard, memory and power distribution board; moved back into rack (fixed by engineer).
    • Castor C2certdb: reported faulty drive.

Absences

  • Jonathan on partial retirement (not in on Monday and Friday)

Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component | Description | Start | End | Affected VO(s) | Type

Development priorities

  • All
    • APRs
  • Martin:
    • Interviewing
    • Preparation work for Tier1 Supplier day (Thursday)
      • Disk server specifications for ITT
      • Presentations
  • Ian:
    • APR
    • Quattor support for Castor
    • Virtualisation work (HEPiX VWG and Tier1 services)
    • Atlas SW server
    • CMS vobox
  • Tim:
    • APR/Job plans
    • Finish T10KB testing
    • Install remaining new tape servers
  • Cheney:
    • APR
    • backup server intervention
    • more testing of backup server
    • docco
  • James T:
    • Quattor SL4.8/SL5 disk servers
    • Streamline '09 testing problems
      • Telecon with Streamline, LSI and Western Digital on Wednesday
    • AoD on Thursday
    • APR stuff
    • Security group
  • Jonathan:
    • Stop pacman mirror on csfmove02
    • start regular check restores of home filesystem
    • final checks of new Nagios slave
    • continue to work on setting up AFS directory as Atlas software server
    • APR
    • Nagios configuration updates
  • James A:
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Continuing decommissioning of old batch systems (R 27).
    • Viglen engineers' service call on Wednesday 28th April 2010.
    • gdss290: fs errors and probable data loss (intervention).
    • gdss312 and gdss337: replace IPMI cards.
    • gdss420: replace 24-port RAID controller card.
    • Daily hardware failures status of Streamline 2009 disk servers to James T.

Absences

  • Jonathan on partial retirement (not in on Monday and Friday)

Fabric On-Call

Ian primary on call Monday-Sunday

Advanced Warning of Requirements and Blocking issues

Services Issues


Category:RAL_Tier1