RAL Tier1 weekly operations Fabric 20100308

Summary of week gone

Developments

  • All:
  • Martin:
    • C300 procurement
    • Development and management network planning
    • Decommissioning R26 A1 Upper
    • Meetings
  • Ian:
    • Refined Quattor Disk server config
    • Refined prototype Quattor core castor server config
    • Provision of additional h/w for LFC boxes plus initial setup
    • Set up second repository server


  • James T:
    • Draft blog post on the Viglen 08 disk procurement saga (not to be published until we've agreed wording with Viglen, et al.)
    • Fixed some problems with older kickstarts
    • Worked on documentation for Quattor deployment of disk servers
    • Created second draft of Tier1 tour structure
    • Fixed /var on gdss208
  • Jonathan:
    • sorted out atlasbackup problems for several servers
    • wrote and added change control forms for planned changes
    • added install02 to NIS netgroup and Nagios
    • dropped MySQL database csf_monitor from lcgsql0363
    • started work on disposals from A1 Upper
    • started work on clearing out old filesystems on csfnfs58
    • deleted 8K+ old backups of AFS volumes from Datastore
    • Nagios configuration updates
  • James A:
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems (R27).
    • gdss211 given back to Castor.
    • gdss208: Viglen engineer replaced the power distribution board; fixed and given back to Castor.
    • Cleared A5 upper test room.
    • gdss347: replaced 4x2GB memory; fixed and back in production.
    • gdss135 given back to Castor.
    • Castor servers (cdbc13/cdbd03) still working. (Intervention)
    • Replaced drive in afs1. (Fixed)
    • Replaced memory in lcgftm0430. (Fixed)

Absences

  • Jonathan on partial retirement (not in on Monday and Friday)

Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)
- | EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays | Tuesday 6/Oct am | UPS issues to be fixed | Catastrophic | All

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component | Description | Start | End | Affected VO(s) | Type

Development priorities

  • All
  • Martin:
    • Networking
    • Completing C300 procurement
    • Installing kit in R26 A5L and R89 UPS room
  • Ian:
    • Bring second repository server into production
    • Further work on Quattorised Castor servers
    • Further assistance for Catalin on configuring LFC boxes
    • Researching better tests of RAID hardware in Nagios
  • James T:
    • Quattor
      • Finish deployment documentation after changes discussed with Chris
      • Updates to lean disk server with Chris and Ian
    • Tier1 tour planning
    • Viglen '09 disk testing
    • TOASTER prep
    • ATLAS WAN tuning
    • Ticket tidy up
  • Jonathan:
    • continue working on disposals
    • continue clearing out old filesystems on csfnfs58
    • implement cron job with checks to run daily test restores of home filesystem
    • Nagios configuration updates
  • James A:
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Continue decommissioning old batch systems (R27).
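
The daily test restores of the home filesystem planned under Jonathan's items could include a verification step along the following lines. This is only a minimal sketch: the restore itself is assumed to have been done already by whatever backup tool is in use, and the function name check_restore, the directory arguments and the sample list are all hypothetical. The exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL) so the result could also feed a Nagios check.

```python
import hashlib
from pathlib import Path

# Nagios-style plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_restore(live_dir, restored_dir, sample):
    """Compare a sample of relative paths between the live tree and a test restore.

    Hypothetical helper: returns a (status, message) pair.
    """
    missing, differing, verified = [], [], 0
    for rel in sample:
        live = Path(live_dir) / rel
        restored = Path(restored_dir) / rel
        if not restored.is_file():
            missing.append(rel)
        elif not live.is_file():
            continue  # removed from the live tree since the backup; nothing to compare
        elif sha256_of(live) != sha256_of(restored):
            differing.append(rel)
        else:
            verified += 1
    if missing:
        return CRITICAL, "CRITICAL: missing from restore: " + ", ".join(missing)
    if differing:
        return WARNING, "WARNING: differs from live copy: " + ", ".join(differing)
    return OK, "OK: %d file(s) verified" % verified
```

A differing file is reported only as a warning because it may simply have been edited since the backup was taken; a file that is absent from the restore is treated as critical. A wrapper run from cron could print the message and exit with the returned status.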

Absences

  • Jonathan on partial retirement (not in on Monday and Friday)

Fabric On-Call

Ian Mon-Sun

Advanced Warning of Requirements and Blocking issues

  • Unable to proceed with ATLAS TAG migration to 64-bit because the arrays are being used for 3D systems while the EMC kit is flaky.
    • Update (2010/03/01): new hardware is now on site and ready to be installed in the rack in R26 A5L.

Services Issues

  • Various requests for hardware.
    • Working on hardware provision for Services team testbeds.
