Difference between revisions of "RAL Tier1 weekly operations Fabric 20100517"

Latest revision as of 14:33, 17 May 2010

Developments

All:

Martin:
- Visit to CERN for:
  - F2F meetings with Tim Bell, Olof Baring
  - Virtualisation Working Group F2F
  - May GDB
- Work on APRs
- Database infrastructure paper

Ian:
- Attended HEPiX Virtualisation Working Group F2F
- Met with services virtualisation admins at CERN
- Attended part of GDB
- Quattor documentation

Tim:
- More work on new tape servers
- Investigating repack checksum errors
- Finished exporting data for BOPCRIS
- DMF sorting out some bad tapes
- Library had two drives with stuck tapes

Cheney:
- investigate logrotate problem on tape servers (cron not running)
- sort out two blown PSUs that failed at the same time
- fix various backup glitches
- add servers to nagios and tweak various odds and ends
- add new VO to sls and tsbn
- testing of db backup restores (failed)
- temporary fixes for various problems with new tape servers

Jonathan:
- corrected atlasbackup problem
- investigated and solved intermittent “Connection refused” problems for lcgvo-02-21
- restored Somnus Oracle database from backups
- renamed AFS userid
- Completed APR

James A:
- Kept sv-08-16 up-to-date with current ATLAS software from lcg0617 ahead of switch-over.
- Re-cabled Streamline 2009 storage nodes into production network and removed part of testing network from rack.
- Finished APR.
- Started change-control request for upgrading ATLAS software server.

James T:
- Retrieved and installed new certificates on gdss87-367
- Got certificates for Viglen09 kit.
- Drive swapping in Kash's absence.
- APR
- Installed Streamline '09 kit via Quattor

Kash:
- Drive replacement.
- Fixing broken WNs.
- Decommissioning old batch systems. (R 27)
- Daily hardware failures status of Streamline 2009 disk servers to James T.
- gdss228, gdss229 and gdss232 given back to castor.
- gdss423 four faulty drives and probably faulty raid card. (Reported)
- Faxed dispatch note to Viglen for reference.
- gdss434 new drive not shown. (Needs reboot)
- gdss71 faulty memory. (Intervention)

Absences

Jonathan on partial retirement (not in on Monday and Friday)
Cheney leave Monday 24th to Friday 28th May

Operational Issues and Incidents

Index	Description	Start	End	Severity	Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component	Description	Start	End	Affected VO(s)	Type

Development priorities

All:

Martin:
- Database infrastructure plan document + costings
- Disk ITT
- APR + Plan stuff

Ian:
- Set up Redhat repositories for Quattor installation
- Help Tim with final tape server work
- Job plans
- Virtualisation strategy task work

Tim:
- Get remaining tape servers into production
- Prepare repack for CMS migration
- DMF sort out Peter Chui requirements
- Get CLF access to DMF

Cheney:
- Improvements to dmf backups
- srb tasks
- Fix patching

Jonathan:
- start regular check restores of home filesystem
- close nagios01/05 (old Nagios slave servers)
- stop exports for some old filesystems on csfnfs58
- continue investigations on setting up AFS directory as Atlas software server
- Nagios configuration updates

James T:
- Acceptance testing Streamline '09 kit
- Job Plan
- Change control request for moving to SL5 64-bit + XFS on disk servers
- Assigning Viglen '06 disk servers to preprod so that the Viglen '08 machines in preprod can be reclaimed to satisfy allocations
- Disk server benchmarking

James A:
- Working on change control for ATLAS software server upgrade.

Kash:
- Drive replacement.
- Fixing broken WNs.
- Continuing decommissioning of old batch systems. (R 27)
- Daily hardware failures status of Streamline 2009 disk servers to James T.

Absences

Jonathan on partial retirement (not in on Monday and Friday)
Cheney leave Monday 24th to Friday 28th May

Fabric On-Call

Ian all week

Advanced Warning of Requirements and Blocking issues

Services Issues

RAL Tier1 weekly operations fabric

Category:RAL_Tier1
