RAL Tier1 weekly operations Fabric 20100201

Summary of week gone

Developments

All:
- Strategy meeting

Martin:
- Minor procurements

Ian:
- Upgraded CIP filesystem layouts
- Upgraded batch server binaries
- Upgraded kernels on SL5 WNs
- Planning for handover of fabric management for Castor systems

James T:
- "Mega intervention" preparation/documentation
- Mega Intervention
- First Viglen '08 disk servers out of testing.
- Ongoing quattorisation of disk servers.
- Primary on call

Jonathan:
- added new NIS groups and created new pool accounts
- checked SSH problem on lcgdb05; removed special userids oracle, lsfadmin, stage and corresponding groups oinstall, lsfadmin, st from NIS (NIS entries sometimes take precedence over local entries whatever the setting of /etc/nsswitch.conf; this can cause system problems)
- updated RPMs on core systems and rebooted where required
- reconfigured and restarted ntpd on lcgvo0425 (updating the ntp RPM can sometimes lose the local NTP configuration)
- Nagios configuration updates
- reinstalled and reconfigured nagios04 after disk replacement
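
The NIS precedence problem noted above could be caught early by listing names that are defined both locally and in NIS. The following is a minimal sketch, not the check used at RAL: the function names are hypothetical, and it assumes the `ypcat` client is installed and that both sources use the usual colon-separated passwd/group format.

```python
import subprocess


def account_names(passwd_text):
    """Return the set of names found in passwd- or group-format text."""
    names = set()
    for line in passwd_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and NIS compat markers such as "+::::::".
        if not line or line.startswith(("#", "+", "-")):
            continue
        names.add(line.split(":", 1)[0])
    return names


def shadowed_names(local_text, nis_text):
    """Return the sorted names that are defined both locally and in NIS."""
    return sorted(account_names(local_text) & account_names(nis_text))


def report_from_system(local_file="/etc/passwd", nis_map="passwd"):
    """Compare a local file against an NIS map (assumes ypcat is available)."""
    with open(local_file) as handle:
        local_text = handle.read()
    nis_text = subprocess.run(
        ["ypcat", nis_map], capture_output=True, text=True, check=True
    ).stdout
    return shadowed_names(local_text, nis_text)
```

Calling `report_from_system()` on a host would list candidate accounts such as oracle or lsfadmin; `report_from_system("/etc/group", "group")` does the same for groups. Any name it reports is worth reviewing, since the local entry may not be the one that is actually used.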

James A:
- Networking preparations ahead of mega-intervention.
- Added snapshotting feature to cacti weather-map.
- Finished cabling IPMI ports in castor racks B&E.
- Updated certificate on t1pg0373.
- Fixed bug in check_spma for handling rotated logs.

Kash:
- Drive replacement.
- Fixing broken WNs.
- Decommissioning old batch systems (R27).
- gdss211 running 7 days acceptance test.
- gdss70 given back to castor. (Fixed)
- gdss77 no display. (Found faulty memory) - Intervention
- gdss87 given back to castor for testing.
- nagios04 replaced drive.
- gdss170 given back to castor.
- Moved switches and cables from R27 with James A.
- Working on 2008 disk servers and worker nodes.
- Working on gdss77, 282 and 364.

Absences

Jonathan (1/2 day, domestic reasons)

Operational Issues and Incidents

Index	Description	Start	End	Severity	Affected VO(s)
	EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays	Tuesday 6/Oct am	UPS issues to be fixed	Catastrophic	All

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component	Description	Start	End	Affected VO(s)	Type

Development priorities

All

Martin:
- Minor procurements

Ian:
- Upgrading and reconfiguring CIPs
- Work with Catalin on Quattorising further grid services nodes
- Quattor documentation
- (Re-)Instituting steering group for Fabric automation project
- Researching Virtualisation platform options

James T:
- Ongoing quattorisation of disk servers.
- Install first Viglen '08 disk servers.
- Writing nagios checks
- Apply WAN tuning

Jonathan:
- implement cron job with checks to run daily test restores of home filesystem
- complete work on installing Nagios slave server via Quattor
- Nagios configuration updates
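
The planned daily test restore needs some way to decide whether a restore succeeded. One possible check, sketched here under assumptions, compares checksums of a sample directory against its restored copy. The restore step itself is not shown because the backup tool is not named in this report, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_restore(original_dir, restored_dir):
    """Return a list of files that are missing or differ in restored_dir."""
    original_dir = Path(original_dir)
    restored_dir = Path(restored_dir)
    problems = []
    for source in sorted(original_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(original_dir)
        restored = restored_dir / relative
        if not restored.is_file():
            problems.append(f"missing: {relative}")
        elif file_digest(source) != file_digest(restored):
            problems.append(f"differs: {relative}")
    return problems
```

A cron wrapper could restore a small, rarely changing sample into a scratch directory, call `compare_restore`, and raise an alert if the returned list is not empty. Files modified after the backup was taken would show up as differences, so a stable sample is the safer choice.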

James A:
- Two days of SINDES integration.
- Connect uplinks to CASTOR IPMI switches.
- Ensure IPMI on CASTOR boxes comes up.

Kash:
- Drive replacement.
- Fixing broken WNs.
- lcglb01 drive replacement (hot swap).
- Continuing work (memory replacement) with Cheney.
- Continuing decommissioning of old batch systems (R27).
- Continuing work on 2008 disk servers and worker nodes.
- Continuing work on gdss77, 282 and 364.

Absences

Jonathan - as from week beginning 8th February, changing work pattern to 3 days per week (normally Tuesday, Wednesday, Thursday)

Fabric On-Call

Ian Primary on call

Advanced Warning of Requirements and Blocking issues

Unable to proceed with Atlas TAG migration to 64-bit because the arrays are being used for 3D systems while the EMC kit is flaky.

Services Issues

Various requests for hardware.
- Working on hardware provision for Services team testbeds.
