RAL Tier1 weekly operations Fabric 20100510

From GridPP Wiki

Developments

All:
- APRs

Martin:

Ian:

Tim:

Cheney:
- Sort out problem with openssl and websites
- Tweaking DB backups
- Tweaks to nagios
- Improve dmf backups

Jonathan:
- Shut down and restarted netnag server
- 3 Nagios configuration updates

James A:
- Working on newish Atlas Software Server(s)
- Investigating quattorfs from MS.

James T:
- AoD Thursday
- Drive changes for Kash
- gdss397
- Phone conference with Streamline. All machines bar two through vendor testing; acceptance testing starting this week.
- APR
- Applied for new certificates for gdss87-367 and gdss478-575

Kash:
- Drive replacement.
- Fixing broken WNs.
- Decommissioning old batch systems (R 27).
- Daily hardware failures status of Streamline 2009 disk servers to James T.
- gdss397 crashed with single drive failure (intervention).
- APR.
- Streamline Engineer service call.
- Boston Engineers service call.

Absences

Jonathan on partial retirement (not in on Monday and Friday)
Jonathan absent 2 days due to personal and family sickness
Kashif Annual Leave (Wednesday and Thursday)

Operational Issues and Incidents

Index	Description	Start	End	Severity	Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component	Description	Start	End	Affected VO(s)	Type

Development priorities

All:

Martin:

Ian:

Tim:

Cheney:
- Improvements to dmf backups
- Add in new tape servers to tsbn/sls

Jonathan:
- start regular check restores of home filesystem
- final checks of new Nagios slave and finally stop nagios01/05
- continue investigations on setting up AFS directory as Atlas software server
- Nagios configuration updates

James T:
- Streamline '09 testing
- APR/Job plan
- Update certificates on disk servers where needed
- Define disk server benchmarking procedure
- Fill in for Kash on disk server maintenance

James A:
- Duplicating software from current Atlas software server to new standby server.
- Returning Streamline Storage Node networking to original configuration so acceptance testing can be started.

Kash:
- Drive replacement.
- Fixing broken WNs.
- Continuing decommissioning of old batch systems (R 27).
- Daily hardware failures status of Streamline 2009 disk servers to James T.

Absences

Jonathan on partial retirement (not in on Monday and Friday)
Jonathan 1 day A/L - Tuesday
Ian @CERN
Martin @CERN

Fabric On-Call

Advanced Warning of Requirements and Blocking issues

Services Issues
