RAL Tier1 weekly operations Fabric 20091207

Summary of week gone

Developments

All:

Martin:

Ian:
- Finalised new flex license servers for LSF
- Further Quattor tutorial for Cheney & Matt
- Provided physical hardware for CIP
  - Went through various configuration options

James T:
- Quattorisation of disk servers
- Primary on call Mon - Thurs
- Viglen disk swap out support
- Post-mortem on gdss138

Jonathan:
- maintained NIS netgroup
- corrected atlasbackup problems for a few hosts
- Administrator on Duty (Wednesday)
- unmounted /home/csf from lcg0617/618
- Nagios configuration updates
- system tuning of nagger to try to reduce scheduling queue
- installed the mrtg RPM on nagger and added configuration to collect performance statistics from Nagios (currently at http://nagger.gridpp.rl.ac.uk/mrtg/nagios-[a-n].html)

James A:
- Achieved a working preliminary SINDES server.
- Upgraded Cacti on thor.gridpp.rl.ac.uk

Kash:
- Drive replacement.
- Fixing broken WNs.
- gdss138 double disk failure. (Intervention)
- gdss149, 162, 163 and 367 given back to castor.
- gdss77 kernel panic, possibly faulty memory. (Intervention)
- gdss139 given back to castor.
- Moved 3 batch systems from R27 to HPD room (CV 2005 rack) with MJB.
- Working on 2008 disk servers and worker nodes.
- Working on gdss77, 138 and 282.

Absences

Jonathan: S/L Monday
Jonathan: A/L Thursday am

Operational Issues and Incidents

Index	Description	Start	End	Severity	Affected VO(s)
	EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays	Tuesday 6/Oct am	UPS issues to be fixed	Catastrophic	All
	gdss138 double disk failure: two drives failed in quick succession (within 30 minutes)	Monday 0530-0600	Ongoing	Severe	LHCb DST data. Data loss confirmed

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Entries of type Down, At Risk or Cancelled that are in, or planned to go into, GOCDB

Component	Description	Start	End	Affected VO(s)	Type
Cacti (http://cacti.gridpp.rl.ac.uk)	Upgrade Cacti software. Subject to team manager's approval of plan.	Tuesday 2009-12-08 13:00	Tuesday 2009-12-08 17:00	none	At Risk

Development priorities

All
- Work on evacuating A1 Upper (Castor LSF/FlexLM triplet)

Martin:

Ian:
- Reconfigure physical CIP again
- Implement second CIP with Quattor (T2K)
- Start work on Quattor-managed gLite 3.2 VOBOX with Catalin
- Assist with new disk servers as required
- Incorporation of latest QWG template updates

James T:
- Quattorisation of disk servers
- Remove nincom as Ganglia data source for Services_Monitoring
- Script to compare Overwatch with real CASTOR status
- TOASTER preparation

Jonathan:
- Quattor implementation for Nagios slave
- security updates to disk servers to prevent general user logins
- Nagios configuration updates

James A:
- Continue with SINDES.
- Upgrade Cacti on cacti.gridpp.rl.ac.uk, install plugins and apply internal patches.

Kash:
- Drive replacement.
- Fixing broken WNs.
- gdss339 kernel panic. (Intervention)
- gdss138 double disk failure. (Intervention)
- Decommissioning old batch systems with Production Team.
- Continuing work on 2008 disk servers and worker nodes.
- Continuing work on gdss77, 138, 282 and 339.

Absences

Kashif: A/L Wednesday

Fabric On-Call

Mon-Sun: Ian Primary on call

Advanced Warning of Requirements and Blocking issues

Unable to proceed with the ATLAS TAG migration to 64-bit because the arrays are being used for the 3D systems while the EMC kit is flaky.

Services Issues

Various requests for hardware.
- Working on various hardware requests for Services team.

Category:RAL_Tier1
