RAL Tier1 weekly operations Fabric 20100503
From GridPP Wiki
Developments
- All:
- APRs
- Martin:
- Ian:
- Work on VOboxes with Catalin and Andrew Lahiff
- Help with new ATLAS software server
- Tim:
- Install new tape servers and drives
- NDA talks with Oracle/Sun/STK on T10KC
- monthly stats
- Modify script to move tapes from free to VO tape pool
- Cheney:
- set up a new server to back up archives of database redo logs
- started looking at dmf backups optimisation
- set up more Nagios checks for disk arrays
- docco - castor
- learnt some basic quatting
- testing ahead of db changes
- Jonathan:
- fixed atlasbackup problems on several nodes
- cleaned up /tmp on lcgui01 by removing files that have not been accessed for 150 days
- stopped pacman web service on csfmove02 (RT #57936)
- increased home filesystem quota for user
- Nagios configuration updates
- issued new versions of tier1-nagios-plugins, tier1-nrpe-config and tier1-sudo-config
- James A:
- James T:
- AoD Thursday
- Disk server recovery docco
- Updated hardware RAID Nagios check to support Viglen '08 kit
- Log searches for security challenge
- Telecon with Streamline/Western Digital/LSI/Boston regarding Streamline '09 kit
- Handed SL5 disk server build to CASTOR team
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Decommissioning old batch systems (R 27).
- Daily hardware failures status of Streamline 2009 disk servers to James T.
- gdss290: verified RAID array. (Fixed)
- gdss312: replaced IPMI card. (Fixed)
- Streamline engineer service call. gdss490 taken by Graham.
- gdss420: replaced 24-port RAID controller card. (Fixed)
- Castor C2certdb received replacement drive.
- Wrong labels on Viglen 2009 disk servers. (Reported to James T)
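The /tmp clean-up on lcgui01 listed under Jonathan above (removing files not accessed for 150 days) could be sketched roughly as follows. This is only an illustration and not the script actually used: the function names `find_stale_files` and `remove_stale_files` are hypothetical, and the sketch assumes access times are being recorded on the filesystem (that is, it is not mounted with `noatime`).

```python
import os
import stat
import time

SECONDS_PER_DAY = 86400


def find_stale_files(root, max_age_days=150, now=None):
    """Return sorted paths of regular files under root whose last
    access time is more than max_age_days in the past."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(info.st_mode) and info.st_atime < cutoff:
                stale.append(path)
    return sorted(stale)


def remove_stale_files(root, max_age_days=150, dry_run=True, now=None):
    """Delete stale files under root and return the affected paths.
    With dry_run=True (the default) nothing is deleted."""
    affected = []
    for path in find_stale_files(root, max_age_days, now):
        if not dry_run:
            try:
                os.remove(path)
            except OSError:
                continue  # could not delete; leave it out of the report
        affected.append(path)
    return affected
```

Running `remove_stale_files("/tmp")` with the default `dry_run=True` only lists candidates, which allows the list to be reviewed before anything is deleted.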
Absences
- Jonathan on partial retirement (not in on Monday and Friday)
- Tim in London Thursday
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s)
---|---|---|---|---|---
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Entries of type Down, At Risk or Cancelled that are in, or planned to go into, GOCDB
Component | Description | Start | End | Affected VO(s) | Type
---|---|---|---|---|---
Development priorities
- All
- Martin:
- Ian:
- Visit to CERN
- Tim:
- Get T10KB migration going
- Plan install of Facilities Castor
- Cheney:
- get the backups working right - they lose sync
- patching
- dmf backups
- docco
- Jonathan:
- start regular check restores of home filesystem
- final checks of new Nagios slave
- continue investigations on setting up AFS directory as Atlas software server
- Nagios configuration updates
- James T:
- AoD Thursday
- Keep an eye on Streamline '09 testing
- APR
- Disk swaps for Kash
- James A:
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Continue decommissioning old batch systems (R 27).
- Viglen Engineers service call on Wednesday 28th April 2010.
- gdss290: filesystem errors and probable data loss. (Intervention)
- gdss312 and gdss337: replace IPMI card.
- gdss420: replace 24-port RAID controller card.
- Daily hardware failures status of Streamline 2009 disk servers to James T.
Absences
- Monday is Bank Holiday
- Jonathan on partial retirement (not in on Monday and Friday)