RAL Tier1 weekly operations Fabric 20100301

Summary of week gone

Developments

All:

Martin:
- Minor procurements
- Castor databases disaster planning
- New hardware unpacking

Ian:
- Worked on Quattor core castor server
- Adapted Quattor disk server to use core quattor server base
- Castor handover planning

James T:
- eScience CA RA course
- Quattor disk servers (problem eventually fixed by Ian)
- Installed all 60 Viglen 08 disk servers with Quattor
- Admin on Duty (2 days)
- 5 x disk servers for deployment for ATLAS
- Created CASTOR_PreProd ganglia instance
- Fix for vdt_globus_data_server and GridFTP external kickstart install problems

Jonathan:
- sorted out atlasbackup problems for several nodes
- rebooted lcgui0358 (user front-end) to solve mount problem
- replaced failed drive on afs3 and despatched to DNUK
- Nagios configuration updates

James A:

Kash:
- Drive replacement.
- Fixing broken WNs.
- Decommissioning old batch systems (R 27).
- gdss211 reinstalling.
- gdss295 given back to castor.
- gdss364: replaced 16-port RAID card (borrowed from gdss338).
- lcgce07/nc21 replaced system with spare twin system. (Streamline)
- gdss128 and gdss403 given back to castor.
- Castor servers (cdbc13/cdbd03) moved into test area. (Intervention)
- Moved systems/parts to Atlas A5 lower machine room.
- gdss160 given back to castor.
- Working on gdss211 and 295.

Absences

Jonathan on partial retirement
- medical appointment/annual leave Tuesday
- sick leave Thursday

Operational Issues and Incidents

Index	Description	Start	End	Severity	Affected VO(s)
	EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays	Tuesday 6/Oct am	UPS issues to be fixed	Catastrophic	All
	gdss364 disk controller sick	Friday ~20:00	Ongoing	Severe	CMS (FarmRead)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component	Description	Start	End	Affected VO(s)	Type

Development priorities

All

Martin:
- Castor databases planning
- Decommissioning A1 Upper
- Moving development network hardware

Ian:
- Second installation server
- Deployment of Sindes with James A
- Further work on Quattorisation of castor servers with Chris

James T:
- Checking over of "lean" disk server with Chris and Ian
- Tier1 Tour preparation
- Deploy drained Viglen 06 to pre-prod (with re-configured arrays)
- Helpdesk ticket blitz

Jonathan:
- change controls for replacement Nagios slave servers and decommissioned web site
- implement cron job with checks to run daily test restores of home filesystem
- Nagios configuration updates

James A:

Kash:
- Drive replacement.
- Fixing broken WNs.
- gdss347: replace 4 x 2 GB memory.
- Clear Atlas A5 upper test area.
- Continue decommissioning old batch systems (R 27).
- Continue working on gdss211 and 208.

Absences

Ian out on Thursday

Fabric On-Call

Ian: Primary on call Mon-Sun

Advanced Warning of Requirements and Blocking issues

Unable to proceed with Atlas TAG migration to 64-bit because the arrays are being used for 3D systems while the EMC kit is flaky.
- Update (2010/03/01): new hardware is now on site and ready to be installed in the rack in R26 A5L.

Services Issues

Various requests for hardware.
- Working on hardware provision for Services team testbeds.

Category:RAL_Tier1
