Difference between revisions of "RAL Tier1 weekly operations Fabric 20091130"

Latest revision as of 15:14, 1 December 2009

Summary of week gone

Developments

All:

Martin:
- Work on Viglen08 disk acceptance tests
- Work on test nodes for LFC database resilience tests
- Various finance and procurement issues

Ian:
- Wrapped up FP7 bid - submitted Tuesday
- Tested new features in Quattor for Nagios slave server
- Imported Quattor updates from QWG - fixed a couple of resulting issues

James T:
- CRISTAL2 preparation
- Set up ancillary network for Streamline08 disk servers with James A.
- Quattorisation of disk servers
- CRISTAL2 Wed - Fri
- Fabric on call Mon - Thurs
- Primary on call Fri - Sun

Jonathan:
- updated kernels on NIS servers and rebooted
- removed mount of /home/csf and added soft-links for Bfactory users for farm nodes
- wrote paper about backup policy, recovery etc for Tier1 review
- increased quota for LHCb AFS volume
- cleared up atlasbackup problems for some nodes
- created archive backups for ccsc07/15 for Richard
- Nagios configuration updates
- released new versions of RPMs and tier1-nrpe-config
- rebooted nagger for new kernel
- worked on Quattor configuration of Nagios slave (with help from Ian)

James A:
- Caught up after leave.
- Tried to focus on SINDES.
- Helped with general Quattor issues where needed.

Kash:
- Drive replacement.
- Fixing broken WNs.
- gdss163 double disk failure. (Finish test)
- gdss95 and gdss134 given back to Castor.
- Created graphs of drive failures for MJB.
- Working on 2008 disk servers and worker nodes.
- Working on gdss67, 163 and 282.

Operational Issues and Incidents

Index	Description	Start	End	Severity	Affected VO(s)
	EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays	Tuesday 6/Oct am	UPS issues to be fixed	Catastrophic	All
	gdss138 double disk failure: two drives failed in quick succession (30 minutes)	Monday 0530-0600	Ongoing	Severe	LHCb Dst data. Data loss confirmed

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component	Description	Start	End	Affected VO(s)	Type

Development priorities

All:
- Work on evacuating A1 Upper (Castor LSF/FlexLM triplet)

Martin:
- Viglen08 disk acceptance solution

Ian:
- Finalising new flex license servers for LSF
- Further Quattor tutorial for Cheney
- Assist with new disk servers

James T:
- Quattorisation of disk servers
- Decision on Viglen 2008 suggested solution
- Primary on call Mon - Thurs

Jonathan:
- Quattor implementation for Nagios slave
- security updates to disk servers to prevent general user logins
- Nagios configuration updates

James A:
- Continue with SINDES.
- Make some fixes to the hardware database for Kash.
- Update and make changes to Cacti.

Kash:
- Drive replacement.
- Fixing broken WNs.
- Return gdss67 to the Castor team after finishing test.
- gdss138 double disk failure. (Intervention)
- Continue working on 2008 disk servers and worker nodes.
- Continue working on gdss67, 138, 163 and 282.

Absences

Jonathan: S/L Monday
Jonathan: A/L Thursday am

Fabric On-Call

Mon-Thu: James T Primary on call
Fri-Sun: Ian Primary on call

Advanced Warning of Requirements and Blocking issues

Unable to proceed with Atlas TAG migration to 64-bit because the arrays are being used for 3D systems while the EMC kit is flaky.

Services Issues

Various requests for hardware.
- Working on various hardware requests for Services team.

Category:RAL_Tier1
