Difference between revisions of "RAL Tier1 weekly operations Fabric 20091012"

Latest revision as of 15:04, 12 October 2009

Summary of week gone

Developments

All

Martin:
- Disk procurement ITT evaluation
- Deployment of 3D databases onto old hardware because power feed problems were making the EMC arrays unstable
- Meeting with Seagate about disk problems

Ian:

James T:
- Viglen testing:
  - Meeting with
  - Drives swapped for a different batch in 10 machines (220 drives).
  - Logs captured on 2 October by Seagate showed further issues so they issued another updated firmware.
  - More logs captured from timed-out drives on Thursday 8th.
  - Tested racks with the functional earth removed - same problems.
- user_xattr mount option rolled out to all CASTOR disk servers.
- Created Storage_CASTOR_Gen ganglia cluster for Brian (former CASTOR team blocking issue).
- Cleaned up some fabric tickets.
- DNS request for repack server.
- HEPSYSMAN on Wednesday 7th (talked about Tier1 storage).

Jonathan:
- configured nagios@nagger.gridpp.rl.ac.uk as PBS operator
- worked on migration of user home filesystems to new server
- updated RPMs on core servers and rebooted where required
- updated wiki documentation to reflect the change of Nagios master server to nagger
- added new users to Tier1 and AFS
- added new top directory superb for Babar (RT #52070)
- Nagios configuration updates on servers and clients

James A:
- Lots of work on BatchWorkers in QUATTOR.
- Brought SL5 farm to 90% of KSI2K Capacity.
- Shrunk SL4 farm correspondingly.
- Made some minor progress with SINDES.
- Some changes to ARTEMIS for UPS room.
- Removed AtlasBackup from base machine template in QUATTOR

Kash:
- Drive replacement.
- Fixing broken WNs.
- gdss354 fixed and back in production.
- gdss218 backplane cables were connected the wrong way round (fixed).
- gdss126 double disk failure. Completed verifying array.
- 220 Seagate drives dispatched; handed to the Seagate engineer.
- Completed adding additional RAID cards in v06 (CASTOR disk servers).
- Working on 2008 disk servers and worker nodes.
- Working on gdss67, 86, 126 and 170.

Operational Issues and Incidents

Index	Description	Start	End	Severity	Affected VO(s)
	EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the CASTOR EMC arrays	Tuesday am	not in sight	Catastrophic	All

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component	Description	Start	End	Affected VO(s)	Type

Development priorities

All

Martin:
- Disk procurement ITT evaluation
- CPU procurement ITT clarifications

Ian:

James T:
- Assign machines for deployment.
- Send out requests for people to complete CRISTAL 2 feedback forms.
- Viglen testing:
  - Continue testing latest firmware.
  - Prepare to hand over to someone else.

Jonathan:
- work on migration of Tier1 home filesystem to new server
- work on installing Nagios slave servers using Quattor
- Nagios configuration updates as required

James A:
- Continue pushing forward with SINDES.
- Take over disk issues from James T.
- Integrate BMS alerts into ARTEMIS data stream.

Kash:
- Drive replacement.
- Fixing broken WNs.
- Continue working on 2008 disk servers and worker nodes.
- Continue working on gdss67, 86, 126 and 170.

Absences

James T
- James T on A/L from Thursday 15th until Monday November 2nd.

Fabric On-Call

Mon-Fri:

Advanced Warning of Requirements and Blocking issues

Services Issues

Various requests for hardware.

Category:RAL_Tier1
