RAL Tier1 weekly operations Fabric 20100628

From GridPP Wiki
Revision as of 13:32, 1 July 2010 by Jonathan wheeler (Talk | contribs)

Developments

  • All:
    • At e-Science Away Day (except Ian at CERN)
  • Martin:
    • CRISTAL 1 (Monday)
    • CPU ITT
    • Desktop updates
  • Ian:
    • WLCG Multi-core & Virtualisation Workshop@CERN
    • Information exchange about Services Virtualisation with CERN IT
    • Met Castor team at CERN to discuss sharing Quattor configurations
  • Tim:
  • Jonathan:
    • updated tier1-sudo-config SVN source for lcgce03
    • 1 Nagios configuration update
    • finished presentation about AFS and gave it to Fabric Team
  • James A:
    • CRISTAL 1 Course
    • Annual Leave
  • James T:
    • Primary on call Monday - Thursday
    • Streamline 2009 testing
      • Progress meeting
      • Re-cabling machines to their head node.
    • Quattor changes for RFIO port problems on SL5 disk servers
    • Away day
    • STEM networking event
    • Updated lcg-CA on non-quattorised disk servers
  • Cheney:
    • entered suggestions to Mr Cameron's Spending Challenge website
    • testing of new amanda rpm
    • check infosec on hinode/solarb
    • fix tsrb01 array controller
    • set up sudo
    • install wireshark for srb
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems. (R 27)
    • gdss67 filesystem problem. Replaced RAID card. (Intervention)
    • gdss239 replaced 8x1GB memory. Back to Castor. (Fixed)
    • gdss474 faulty backplane; arranging engineer. (Waiting for parts)
    • gdss207 finally managed to install. Presently verifying array.
    • Streamline 2009 disk servers network cabling with James T.
    • gdss220 replaced 8x1GB memory. Back to Castor. (Fixed)
    • gdss420 low voltage on battery. (Waiting for battery)
    • Streamline/Areca disk servers crashed due to single faulty drive. (Ongoing)

Absences

  • Jonathan on partial retirement (not in on Monday and Friday)
  • Tim in Amsterdam Mon & Tues

Operational Issues and Incidents

Index Description Start End Severity Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component Description Start End Affected VO(s) Type

Development priorities

  • All:
  • Martin:
    • Disk ITT clarifications
    • CPU ITT finalisation
  • Ian:
    • Services virtualisation planning
    • Facilities Castor instance planning & implementation
    • Convening work on ATLAS software server
  • Tim:
  • Cheney:
    • new amanda rpm
  • Jonathan:
    • continue work on shutting down csfnfs58 (old NFS server)
    • Nagios configuration updates
    • work on replacement paging system
  • James T:
    • LHCb WAN tuning
    • RFIO port changes on SL5 disk servers
    • Support for Adaptec in TAVS (plus bug fixes)
    • Streamline 2009 testing
  • James A:
    • Cabling up Dell NC nodes.
    • Working on Database plans.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • gdss78 needs re-install and array created from scratch.
    • gdss380 run 7-day acceptance test.
    • gdss67 create array from scratch and install.
    • Continue decommissioning old batch systems. (R 27)

Absences

  • Jonathan on partial retirement (not in on Monday and Friday)
  • Tim in Amsterdam Mon & Tues
  • Ian out Thursday am
  • Kashif A/L Tuesday - Thursday

Fabric On-Call

Kashif fabric on-call Monday-Sunday

Advanced Warning of Requirements and Blocking issues

Services Issues


Category:RAL_Tier1