RAL Tier1 weekly operations Fabric 20100118

From GridPP Wiki

Latest revision as of 14:30, 18 January 2010

Summary of week gone

Developments

  • All:
  • Martin:
    • Procurements
    • GridPP4
    • Networking plans for capacity procurements
    • Intervention on lcgdb14
  • Ian:
    • Work on Quattor config of vobox
    • Planning for batch server upgrade and other interventions
    • Planning update of Quattor server
  • James T:
    • Fixed two problems with Ganglia
      • The data sources for the Miscellaneous cluster had been decommissioned.
      • Workers_SL5 graphs were fluctuating wildly due to wrongly configured Workers_SL4.
    • Quattorisation of disk servers
      • fsprobe added
      • puppet added
      • Work on processing errata
    • Various system updates
    • Dry run of procedures prior to "Mega Intervention".
    • Progress meeting with Viglen on disk testing. All machines are now in testing, due to complete in mid-February.
  • Jonathan:
    • updated CSFadduser script (in /usr/local/sbin on wyatt) for new Tier1 home directory and added new userids for Castor evaluation
    • corrected backup problems on several nodes
    • followed up chkrootkit problem on afs2
    • updated RPMs on several nodes
    • investigated Callout problems on several nodes
    • Nagios configuration updates
    • 2 days out (home emergency)
  • James A:
    • Finalised plan for SINDES implementation.
    • Worked on new user contact database.
    • Moved castoradm2 and castoradm3 from A1 upper to A5 lower.
    • Began the last of the CASTOR rack IPMI cabling.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems (R 27).
    • lcgdb14: memory and motherboard replaced by engineer (fixed).
    • gdss134 given back to castor.
    • Produced graphs of hardware failures.
    • gdss105 and 171 given back to castor.
    • Working on 2008 disk servers and worker nodes.
    • Working on gdss66, 70, 282, 364 and 380.

Absences

  • Jonathan (2 days - home emergency)

Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)
- | EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays | Tuesday 6/Oct am | UPS issues to be fixed | Catastrophic | All

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component | Description | Start | End | Affected VO(s) | Type

Development priorities

  • All
  • Martin:
    • Minor procurements
    • Planning and change control activities for the pre-data-taking period
  • Ian:
    • Update Quattor server
    • Work with James A to deploy and test Sindes on Quattor server
    • Implement CIP config update on Thursday
    • Virtualisation platform planning
  • James T:
    • Document procedure for "mega intervention".
    • Ongoing quattorisation of disk servers.
    • CRISTAL2 support group.
    • ATLAS WAN tuning for Brian.
    • Progress meeting with Viglen.
    • Updates to some systems.
  • Jonathan:
    • work on test restore of home filesystem subdirectory
    • final checks of change to restrict SSH login on disk servers
    • complete work on installing Nagios slave server via Quattor
    • update RPMs on various servers
    • Nagios configuration updates
  • James A:
    • Rolling out SINDES.
    • Working on user contact database.
    • Finishing IPMI cabling.
    • Working on forwarding BMS alerts to Nagios.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • gdss380 given back to castor.
    • afs2 drive failure.
    • Continuing to decommission old batch systems (R 27).
    • Continuing work on 2008 disk servers and worker nodes.
    • Continuing work on gdss66, 70, 282, 364 and 380.

Absences

  • Kashif (Thursday - A/L)

Fabric On-Call

  • James T: Monday-Thursday
  • Ian: Friday-Sunday

Advanced Warning of Requirements and Blocking issues

  • Unable to proceed with the ATLAS TAG migration to 64-bit because the arrays are being used for the 3D systems while the EMC kit is flaky.

Services Issues

  • Various requests for hardware.
    • Working on hardware provision for Services team testbeds.

Category:RAL_Tier1

RAL Tier1 weekly operations fabric