RAL Tier1 weekly operations Fabric 20100222

From GridPP Wiki

Summary of week gone

Developments

  • All:
  • Martin:
    • Delivery of Viglen 09 Disk servers part 1 (26/38) and Viglen 09 CPU.
    • Minor procurements
      • iSCSI arrays
      • Work on additional C300 switch
      • Work on Services/Virtualisation/Quattor servers
  • Ian:
    • Work on Castor-Fabric handover
    • Prototype core Castor server in Quattor
  • James T:
    • Accepted Viglen 2008 disk.
    • Assisted Viglen with the installation of 2009 disk.
    • Quattor.
      • Some progress on ironing out disk server installation problems.
      • Fixed stage/st user/group not existing when CASTOR RPMs installed.
      • Thought about 64-bit + XFS.
    • Worked on tour for Tier1 open day with James Adams.
    • Dealt with disk faults in Kash's absence.
  • Jonathan:
    • nuked lcgdb98 and removed from yumit, mimic, atlasbackup
    • updated RPM tier1-nrpe-config on many farm nodes for new Nagios slave server
    • sent message to user about cracked password
    • issued Change Control for new Nagios server
    • started preparing new versions of RPMs tier1-nrpe-config, tier1-sudo-config and tier1-nagios-plugins
    • Nagios configuration updates
    • worked on Quattor configuration of Nagios slave servers
  • James A:
    • Supported Viglen during delivery.
    • Completed preparation for Streamline delivery.
    • Blanked loaned hardware.
    • Started work on Tier1 tours with JIT.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems. (R 27)
    • gdss211 reinstalling.
    • gdss294: replaced 4x2GB memory and given back to Castor. (Fixed)
    • gdss282: replaced 16-port RAID card. (Fixed)
    • nc21 (lcg0280) found faulty memory. - Intervention
    • Eight Viglen 2006 disk servers for decommissioning/preprod. (Labelled and configured)
    • lcgce07 faulty drive. (Scheduled for Monday 22nd Feb 2010)
    • gdss171 given back to castor.
    • Cabling for new systems in HPD room with James A.
    • Working on 2008 disk servers and worker nodes.
    • Working on gdss211 and 295.

Absences

  • Jonathan on partial retirement, worked Tuesday, Thursday and Friday

Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)
- | EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays | Tuesday 6/Oct am | UPS issues to be fixed | Catastrophic | All
- | gdss364 disk controller sick | Friday ~20:00 | Ongoing | Severe | CMS (FarmRead)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component | Description | Start | End | Affected VO(s) | Type

Development priorities

  • All
  • Martin:
    • Delivery of remaining Viglen 09 disk (12/38) - Wednesday
    • Delivery of Streamline 09 disk (40/60) - Thursday/Friday
    • Minor procurements:
      • Additional C300
      • Services/Virtualisation/Quattor servers
  • Ian:
    • Helping James T with Quattorised disk server
    • Castor handover work
    • Further work on Quattorised Castor servers
    • Investigate options for further hardware monitoring
  • James T:
    • Quattor:
      • Resolve disk server install problems.
      • Document deployment procedure using Quattor
    • Work on Tier1 tour
    • Identify 5 machines to deploy for ATLAS
    • ATLAS WAN tuning
    • RA training Monday 11.00 to 14.00
    • Admin on duty Tuesday and Wednesday
    • Fabric on call Monday 22nd to Monday 1st
  • Jonathan:
    • complete work on installing Nagios slave servers via Quattor
    • implement cron job with checks to run daily test restores of home filesystem
    • Nagios configuration updates
    • release new versions of RPMs
  • James A:
    • Focusing on last of SINDES integration.
    • Annual leave on Friday.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • lcgce07 drive replacement. (Hot swap)
    • Replacing memory in nc21 (lcg280).
    • Continuing to decommission old batch systems. (R 27)
    • Continuing work on 2008 disk servers and worker nodes.
    • Continuing work on gdss211 and 295.

Absences

  • Jonathan: working Tuesday, Wednesday and Thursday
  • Kashif: A/L Wednesday and Thursday.
  • James T: Training 11.00 - 14.00 Monday
  • Martin: A/L Friday pm

Fabric On-Call

James T. Monday - Monday

Advanced Warning of Requirements and Blocking issues

  • Unable to proceed with ATLAS TAG migration to 64-bit because the arrays are being used for 3D systems while the EMC kit is flaky.

Services Issues

  • Various requests for hardware.
    • Working on hardware provision for Services team testbeds.
