RAL Tier1 weekly operations Fabric 20100510

From GridPP Wiki

Developments

  • All:
    • APRs
  • Martin:
  • Ian:
  • Tim:
  • Cheney:
    • Sort out problem with openssl and websites
    • Tweaking DB backups
    • Tweaks to nagios
    • Improve dmf backups
  • Jonathan:
    • Shut down and restarted netnag server
    • 3 Nagios configuration updates
  • James A:
    • Working on newish Atlas Software Server(s)
    • Investigating quattorfs from MS.
  • James T:
    • AoD Thursday
    • Drive changes for Kash
    • gdss397
    • Phone conference with Streamline. All machines bar two through vendor testing; acceptance testing starting this week.
    • APR
    • Applied for new certificates for gdss87-367 and gdss478-575
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems (R 27).
    • Daily hardware failures status of Streamline 2009 disk servers to James T.
    • gdss397 crashed with single drive failure (intervention).
    • APR.
    • Streamline Engineer service call.
    • Boston Engineers service call.

Absences

  • Jonathan on partial retirement (not in on Monday and Friday)
  • Jonathan absent 2 days due to personal and family sickness
  • Kashif Annual Leave (Wednesday and Thursday)

Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Entries of type Down, At Risk or Cancelled that are in, or planned to go into, GOCDB

Component | Description | Start | End | Affected VO(s) | Type

Development priorities

  • All:
  • Martin:
  • Ian:
  • Tim:
  • Cheney:
    • Improvements to dmf backups
    • Add in new tape servers to tsbn/sls
  • Jonathan:
    • start regular check restores of home filesystem
    • final checks of new Nagios slave and finally stop nagios01/05
    • continue investigations on setting up AFS directory as Atlas software server
    • Nagios configuration updates
  • James T:
    • Streamline '09 testing
    • APR/Job plan
    • Update certificates on disk servers where needed
    • Define disk server benchmarking procedure
    • Fill in for Kash on disk server maintenance
  • James A:
    • Duplicating software from current Atlas software server to new standby server.
    • Returning Streamline Storage Node networking to original configuration so acceptance testing can be started.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Continuing decommissioning of old batch systems (R 27).
    • Daily hardware failures status of Streamline 2009 disk servers to James T.

Absences

  • Jonathan on partial retirement (not in on Monday and Friday)
  • Jonathan 1 day A/L - Tuesday
  • Ian @CERN
  • Martin @CERN

Fabric On-Call

Advanced Warning of Requirements and Blocking issues

Services Issues


Category:RAL_Tier1