RAL Tier1 weekly operations Fabric 20100301


Summary of week gone

Developments

  • All:
  • Martin:
    • Minor procurements
    • Castor databases disaster planning
    • New hardware unpacking
  • Ian:
    • Worked on Quattor core castor server
    • Adapted Quattor disk server to use core quattor server base
    • Castor handover planning
  • James T:
    • eScience CA RA course
    • Quattor disk servers (problem eventually fixed by Ian)
    • Installed all 60 Viglen 08 disk servers with Quattor
    • Admin on Duty (2 days)
    • 5 x disk servers for deployment for ATLAS
    • Created CASTOR_PreProd ganglia instance
    • Fix for vdt_globus_data_server and GridFTP external kickstart install problems
  • Jonathan:
    • sorted out atlasbackup problems for several nodes
    • rebooted lcgui0358 (user front-end) to solve mount problem
    • replaced failed drive on afs3 and despatched to DNUK
    • Nagios configuration updates
  • James A:
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems. (R 27)
    • gdss211 reinstalling.
    • gdss295 given back to castor.
    • gdss364 replaced 16-port RAID card. (Borrowed from gdss338)
    • lcgce07/nc21 replaced system with spare twin system. (Streamline)
    • gdss128 and gdss403 given back to castor.
    • Castor servers (cdbc13/cdbd03) moved into test area. (Intervention)
    • Moved systems/parts to Atlas A5 lower machine room.
    • gdss160 given back to castor.
    • Working on gdss211 and 295.

Absences

  • Jonathan on partial retirement
    • medical appointment/annual leave Tuesday
    • sick leave Thursday

Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)
- | EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays | Tuesday 6/Oct am | UPS issues to be fixed | Catastrophic | All
- | gdss364 disk controller sick | Friday ~20:00 | Ongoing | Severe | CMS (FarmRead)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component | Description | Start | End | Affected VO(s) | Type

Development priorities

  • All
  • Martin:
    • Castor databases planning
    • Decommissioning A1 Upper
    • Moving development network hardware
  • Ian:
    • Second installation server
    • Deployment of Sindes with James A
    • Further work on Quattorisation of castor servers with Chris
  • James T:
    • Checking over of "lean" disk server with Chris and Ian
    • Tier1 Tour preparation
    • Deploy drained Viglen 06 to pre-prod (with re-configured arrays)
    • Helpdesk ticket blitz
  • Jonathan:
    • change controls for replacement Nagios slave servers and decommissioned web site
    • implement cron job with checks to run daily test restores of home filesystem
    • Nagios configuration updates
  • James A:
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • gdss347: replace 4 x 2 GB memory.
    • Clear Atlas A5 upper test area.
    • Continue decommissioning old batch systems. (R 27)
    • Continue working on gdss211 and 208.

Absences

Ian out on Thursday

Fabric On-Call

Ian: Primary on call Mon-Sun

Advanced Warning of Requirements and Blocking issues

  • Unable to proceed with Atlas TAG migration to 64-bit because the arrays are being used for 3D systems while the EMC kit is flaky.
    • Update (2010/03/01): new hardware is now on site and ready to be installed in the rack in R26 A5L.

Services Issues

  • Various requests for hardware.
    • Working on hardware provision for Services team testbeds.
