RAL Tier1 weekly operations Fabric 20100215

From GridPP Wiki
Revision as of 15:50, 15 February 2010 by James thorne (Talk | contribs)

Summary of week gone

Developments

  • All:
  • Martin:
    • Minor procurements
    • Preparation for deliveries
  • Ian:
    • Change control plan for WN OS update
    • Work on Castor-Fabric integration
    • Debugging filesystem problems on Nagios slave
    • Learning about Quattor core development
  • James T:
    • All Viglen 08 disk servers finished testing & re-installed with holding configuration.
    • Quattor
      • Script to create hardware templates and machine profiles for disk servers based on the hardware database and overwatch.
      • Added LSF accounts
    • Fixed Ganglia after network intervention.
  • Jonathan:
    • Administrator on Duty (Wednesday)
    • sorted out atlasbackup problems on several nodes
    • updated /etc/mail/local-host-names on pat
    • updated RPMs on Nagios slave servers and rebooted for new kernel
    • updated RPMs on core servers
    • Nagios configuration updates
    • new versions of RPM tier1-nrpe-config
    • worked on Quattor configuration of Nagios slave servers
  • James A:
    • Bulk of time spent preparing floor-space and networking for Viglen and Streamline deliveries.
    • Added Hardware database content for Quattorised disk-servers and new deliveries.
    • Some progress on SINDES integration.
    • Made some fixes to the hardware database.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems. (R 27)
    • gdss211 completed 7-day acceptance test.
    • gdss93 and gdss161 given back to Castor. (Fixed)
    • gdss77: replaced 4x1GB memory and two drives. (Fixed)
    • nc21 (lcg0280): found faulty memory. (Intervention)
    • lcgce07: faulty drive.
    • gdss130 and gdss172 given back to Castor.
    • gdss86: replaced 4x1GB memory and RAID card memory.
    • gdss364: replaced 16-port RAID card. (Fixed)
    • gdss294: kernel panic caused by faulty memory. (Intervention)
    • Cabling for new systems in HPD room with James A.
    • Working on 2008 disk servers and worker nodes.
    • Working on gdss77, gdss282 and gdss294.

Absences

  • Jonathan on partial retirement, worked Tuesday, Wednesday and Thursday

Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)
      | EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays | Tuesday 6 Oct (am) | UPS issues to be fixed | Catastrophic | All

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component Description Start End Affected VO(s) Type

Development priorities

  • All
  • Martin:
    • Minor procurements
    • More prep for deliveries
  • Ian:
    • TDG talk on Quattor (done)
    • Further work on integrating Castor HW in fabric workflow
    • Setting up Quattor testbed
  • James T:
    • First Aid course on Tuesday
    • Viglen 09 deliveries
    • Work on Tier1 tour for open day
    • Quattor
      • Fix python dependency problem during fresh installs
      • SL5 64-bit disk server build
  • Jonathan:
    • complete work on installing Nagios slave servers via Quattor
    • implement cron job with checks to run daily test restores of home filesystem
    • Nagios configuration updates
  • James A:
    • Continued preparation for deliveries.
    • Liaison with suppliers during installation.
    • Continue with SINDES integration.
    • Continue with work on Hardware database.
    • Blank and return loaned hardware.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • lcgce07 drive replacement. (Hot swap)
    • Continuing memory replacement work with Cheney.
    • Eight Viglen 2006 disk servers for decommissioning/preprod. (Label and configure)
    • Continuing to decommission old batch systems. (R 27)
    • Continuing work on 2008 disk servers and worker nodes.
    • Continuing work on gdss77, gdss282 and gdss294.

Absences

  • Jonathan working Tuesday, Thursday and Friday
  • James T out on a first aid course on Tuesday

Fabric On-Call

Ian Primary OnCall Mon-Thurs

Advanced Warning of Requirements and Blocking issues

  • Unable to proceed with Atlas TAG migration to 64-bit because the arrays are being used for 3D systems while the EMC kit is flaky.

Services Issues

  • Various requests for hardware.
    • Working on hardware provision for Services team testbeds.

Category:RAL_Tier1
