RAL Tier1 weekly operations Fabric 20100111

From GridPP Wiki

Summary of week gone

Developments

  • All:
  • Martin:
    • Work on IPMI networking
    • Procurements
    • Networking plans for capacity procurements
    • UPS bypass test and related activities
  • Ian:
    • Catching up
    • Worked on Quattor managed vobox config with Catalin
    • Planning for interventions later in January
    • Further updates to CIP configs in Quattor
  • James T:
    • Christmas catch up
    • Viglen 2008 disk progress check up.
      • All drives should be swapped out
      • Some servers were tested over Christmas; the rest started last week.
    • Quattorisation of disk servers
  • Jonathan:
    • compiled list of Fabric core services and servers for GRIDPP4 submission
    • cleared space in root directory on lcgsql0363 by removing many kernel RPMs
    • fixed problem with atlasbackup on lcgsql0363, lcgdb06, lcgvo0598
    • restarted enigma using new version of password cracker
    • fixed problem with exclude option of atlasbackup on lcgwms01
    • Nagios configuration updates
    • corrected problem with Quattor configuration of Nagios slaves
    • worked on Nagios active check of database status files on peaceful (to replace existing passive checks)
    • 1 day at home due to weather
  • James A:
    • Working from home due to weather:
      • Solved ARTEMIS CA issues and got mutual authentication and (internal) certificate issuing working.
      • Corrected job-slot numbers on SL5-SL06 WNs which were causing low job efficiencies. RT#54999
      • Created read-only overwatch account for Tier1 dashboard. RT#54960
      • Created database and handlers on thor for Building Management System SNMP trap logging; hooks for Nagios are possible.
      • Started sorting and checking user contact data for new database. RT#29217
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems (R 27).
    • gdss79 given back to castor. (Replaced 3 faulty drives)
    • gdss94 and 127 given back to castor.
    • Arranged collection of 30 boxes (parts that failed over Christmas/New Year) with Viglen.
    • lcgdb14: re-arranged engineer visit from Streamline.
    • gdss169 given back to castor.
    • Working on 2008 disk servers and worker nodes.
    • Working on gdss70, 282 and 364.

Absences

  • Jonathan (1 day, snow)

Operational Issues and Incidents

| Index | Description | Start | End | Severity | Affected VO(s) |
|  | EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays | Tuesday 6/Oct am | UPS issues to be fixed | Catastrophic | All |

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

| Component | Description | Start | End | Affected VO(s) | Type |

Development priorities

  • All:
  • Martin:
    • Minor procurements
    • complete IPMI networking
    • planning and change control activities for the pre-data-taking period
  • Ian:
    • Further work on vobox
    • glite and repository update configuration in Quattor
    • Work on Virtualisation platform for Tier1
  • James T:
    • Viglen 2008 disk progress check up.
    • Post-Christmas security status assessment with James A. (not done last week due to James not getting in).
    • Quattorisation of disk servers.
    • Fix various ganglia problems
    • Work on deploying WAN tuning, where not yet applied, for Brian.
  • Jonathan:
    • final checks of change to restrict SSH login on disk servers
    • implement active checking of database status on peaceful
    • complete work on installing Nagios slave server via Quattor
    • Nagios configuration updates
  • James A:
    • Try to get SINDES information dissemination functional.
    • Continue work on user contact database. RT#29217
    • Attempt security status assessment with James T.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • lcgdb14: replace faulty part/memory with engineer.
    • Continue decommissioning old batch systems (R 27).
    • Continue working on 2008 disk servers and worker nodes.
    • Continue working on gdss70, 282 and 364.

Absences

  • Martin: A/L Fri pm

Fabric On-Call

  • Ian primary on call all week.

Advance Warning of Requirements and Blocking Issues

  • Unable to proceed with Atlas TAG migration to 64-bit because the arrays are being used for 3D systems while the EMC kit is flaky.

Services Issues

  • Various requests for hardware.
    • Working on hardware provision for Services team testbeds.
