RAL Tier1 weekly operations Fabric 20100426
Developments
- All:
- Martin:
- @ HEPiX (virtually)
- @ HEPiX Virtualisation F2F (virtually)
- Ian:
- Attending HEPiX remotely
- Some work on Castor tape server & quattor
- Tim:
- CMS migration issue
- ADS hardware install finished.
- Talking to IBM about maintenance contracts for above
- DMF tape problems, draining some tapes.
- Configuring new tape server for T10KB testing. Now working OK
- Cheney:
- testing of new backup server for database backups
- docco - tsbn & sls instructions / added c2probe / hardware info errors fixed
- did some investigations for srb chaps
- installed nagios on aix (ads)
- mucho fiddling with nagios on aix to get it working
- set up config files for new backup server
- sorted out array crash on preprod db
- James T:
- Worked on adding SL4.8 to Quattor (for Viglen '09 disk)
- SL5 disk server build
- Met face to face with Streamline regarding disk problems
- Resolved kickstart problems (thanks to John K. for building a new RPM)
- Fixed tuning errors on cmsFarmRead
- SSC finance training
- Security group task planning
- Jonathan:
- fixed space problem on lcgui01 (/tmp) by arranging deletion of old files
- fixed load problems on install01/02 by killing old lftp processes
- fixed atlasbackup problem on several nodes
- updated iptables on several systems to fix connection problems for new Nagios slave
- fixed yum problem on enigma
- added AFS userid atlas147 and increased quota for volume atlassw for Atlas software installation testing
- updated NIS netgroup to add new batch workers
- created directory /scratch/jacksonj for James Jackson on lcgui02 for CMS development work
- Nagios configuration updates
- Oracle Finance Self-Service course
- James A:
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Decommissioning old batch systems (R 27).
- Daily hardware failures status of Streamline 2009 disk servers to James T.
- Cabling in HPD room with James A.
- Faxed Viglen copies of dispatch notes as proof of delivery.
- Castor preprod: replaced 1U power supply (fixed).
- Castor ccse03: replaced motherboard, memory and power distribution board; moved back into rack (fixed by engineer).
- Castor C2certdb reported faulty drive.
Absences
- Jonathan on partial retirement (not in on Monday and Friday)
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
Development priorities
- All
- APRs
- Martin:
- Interviewing
- Preparation work for Tier1 Supplier day (Thursday)
- Disk server specifications for ITT
- Presentations
- Ian:
- APR
- Quattor support for Castor
- Virtualisation work (HEPiX vwq and Tier1 services)
- Atlas SW server
- CMS vobox
- Tim:
- APR/Job plans
- Finish T10KB testing
- Install remaining new tape servers
- Cheney:
- APR
- backup server intervention
- more testing of backup server
- docco
- James T:
- Quattor SL4.8/SL5 disk servers
- Streamline '09 testing problems
- Telecon with Streamline, LSI and Western Digital on Wednesday
- AoD on Thursday
- APR stuff
- Security group
- Jonathan:
- Stop pacman mirror on csfmove02
- start regular check restores of home filesystem
- final checks of new Nagios slave
- continue to work on setting up AFS directory as Atlas software server
- APR
- Nagios configuration updates
- James A:
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Continuing decommissioning of old batch systems (R 27).
- Viglen Engineers service call on Wednesday 28th April 2010.
- gdss290: fs errors and probable data loss (intervention).
- gdss312 and gdss337: replace IPMI card.
- gdss420: replace 24-port RAID controller card.
- Daily hardware failures status of Streamline 2009 disk servers to James T.
Absences
- Jonathan on partial retirement (not in on Monday and Friday)
Fabric On-Call
Ian primary on call Monday-Sunday