RAL Tier1 weekly operations Fabric 20100104

From GridPP Wiki

Summary of week gone

Developments

  • All:
  • Martin:
    • Minor procurements
    • Migrating hardware out of A1 Upper
    • Rebooted AFS servers
  • Ian:
    • A/L
  • James T:
    • Disk server kernel updates on 22 December.
    • Job plan updates and review.
    • Primary on call over Christmas.
    • Fabric on call (on site cover) for the rest of the break.
  • Jonathan:
    • checked web service on csfmove02 and removed (non-working) configuration for lc experiment
    • stopped export of /home/csf on csfnfs02
    • retrieved new host certificates for afs1, afs2, afs3, nfs1 (RT# 54098/6/7/5)
    • Nagios configuration updates
    • worked on an active method for checking the databases monitored by peaceful
    • shut down nincom
    • updated RPMs on the Nagios slave servers before they were moved to the A5 Lower machine room
  • James A:
    • Snow and A/L.
  • Kash:

Absences

  • Ian: A/L 21-24/12
  • Jonathan: A/L 22/12

Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)
 | EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays | Tuesday 6/Oct am | UPS issues to be fixed | Catastrophic | All

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Entries of type Down, At Risk or Cancelled that are in, or are planned to go into, GOCDB.

Component Description Start End Affected VO(s) Type

Development priorities

  • All:
  • Martin:
    • Minor procurements
    • Migrating hardware out of A1 Upper
    • GridPP4 work
    • UPS tests
  • Ian:
    • test new Quattor config for vobox with Catalin
    • Plan lcgbatch01 upgrade for next week
    • Assist James T with disk server deployment with Quattor
  • James T:
    • Catch up
    • Viglen 2008 disk progress check-up.
    • Post-Christmas security status assessment with James A.
    • Quattorisation of disk servers.
  • Jonathan:
    • final checks of the change to restrict SSH logins on disk servers
    • implement active checking of database status on peaceful
    • complete work on installing Nagios slave server via Quattor
    • Nagios configuration updates
  • James A:
    • Continue working to get SINDES operational before mid-February.
    • Scan of security incidents with JIT.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Catch up with James T.
    • Reporting faulty parts/drives that occurred during the Christmas/New Year holidays.
    • Arranging collection of faulty parts.
    • Continue decommissioning old batch systems (R26).
    • Continue work on 2008 disk servers and worker nodes.
    • Continue work on gdss70, 94, 127 and 282.

Absences

  • Martin: A/L Friday pm

Fabric On-Call

    • Ian all week

Advanced Warning of Requirements and Blocking issues

  • Unable to proceed with the Atlas TAG migration to 64-bit because the arrays are being used for the 3D systems while the EMC kit is flaky.

Services Issues

  • Various requests for hardware.
    • Working on hardware provision for Services team testbeds.
