RAL Tier1 weekly operations Fabric 20090824

From GridPP Wiki

Summary of week gone

Developments

  • All
  • Martin:
    • Procurements
    • Installs for new Atlas3D databases
    • Work on resilience for LFC/FTS databases
    • Discussed future of gdss51 with Castor team
    • Security incident overview
  • Ian:
    • Primary on call most days
    • Work on Quattor filesystem configurations
    • Deployment of production SL5 batch server
    • Quattor FP7 bid planning
  • James T:
    • Updated disk and 3ware firmware on some of the new Viglen hardware and started testing
    • Met with Viglen to get a status update from their side and to give an update from ours.
    • Worked on Quattor gmond configuration
  • Jonathan:
    • sorted out atlasbackup problems on many systems
    • replied to a query mistakenly sent to security@gridpp.rl.ac.uk (RT #49177)
    • added sv-08 systems to NIS group csffarm
    • switched AFS glite-sw directory to read-only
    • obtained and installed renewed host certificates for csfnfs58 and gdss51
    • restarted ntpd on gdss250/289/360
    • created Tier1 team userids for new members of staff
    • Nagios configuration updates
    • issued new versions of RPMs tier1-nagios-plugins and tier1-nrpe-config
    • manually edited nrpe.cfg on lcgbatch01 after installation by Quattor
  • James A:
    • Monday: Brought batch farm back up after air-con failure.
    • Rest of week: Off Sick
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • gdss169 double disk failure (data lost). Fitted new drives. (Intervention)
    • gdss95, 151, 154, 168, 169, 214, 256, 280 and 288 have been given back to Castor.
    • lcgpx0620 moved to the Test Area (R89) for further intervention.
    • Working on 2008 disk servers and worker nodes.
    • Working on gdss67, 73, 152, 169, 243 and 202.

Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)
- | Kernel security problem | 13/08/09 | Ongoing | Critical | All

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Entries of type Down, At Risk or Cancelled that are in, or planned to go into, GOCDB

Component | Description | Start | End | Affected VO(s) | Type
UIs, user accessed hosts | Reboots | As soon as new kernels are available | When they've all been updated | All | Down Time (short)

Development priorities

  • All
  • Martin:
    • Procurements
  • Ian:
    • Further Quattor FP7 bid planning
    • Work with Michel Jouvin on QWG filesystem templates
    • Preparations to deploy new hardware as SL5 batch workers
  • James T:
    • Talk at TDG, 11.00
    • Restart acceptance testing
    • Liaise with Viglen on disk issues. Next meeting 11.00 on Tuesday.
    • Quattor
    • Primary on call Mon - Thurs
  • Jonathan:
    • add public SSH keys for new members of staff
    • Quattor work for Nagios
    • Nagios configuration updates
  • James A:
    • Compile patched kernels for worker nodes.
    • Deploy additional SL5 64-bit capacity.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • gdss87: add an additional RAID card.
    • Create graphs of drive failures (daily, weekly and monthly).
    • Continue working on 2008 disk servers and worker nodes.
    • Working on gdss67, 73, 152, 169, 243 and 202.

Absences

  • Jonathan, Kash:
    • A/L Tuesday

Fabric On-Call

  • Mon-Sun:

Advanced Warning of Requirements and Blocking issues

Services Issues

  • RT #44835 – non-capacity HW for testing (Services)
