RAL Tier1 weekly operations Fabric 20100712
From GridPP Wiki
Martin Bly
Latest revision as of 15:30, 12 July 2010
Developments
- All:
- Martin:
- CPU ITT finalisation
- First look at disk ITT returns
- Finance / spend planning
- WLCG workshop
- Ian:
- WLCG Collaboration workshop Weds-Friday
- Began configuring Hyper-V test server
- Tim:
- monitor repack progress
- DMF single copy for cedar
- DMF delete duplicate copies
- Jonathan:
- Administrator on Duty (Wednesday and Thursday)
- fixed atlasbackup problem on csfnfs58 by killing old processes
- stopped mysql server and mysqlhotcopy on lcgsql0365 after confirmation that MySQL database is no longer required
- worked on new Nagios slave server for batch workers
- issued version 1.0-54 of RPM tier1-sudo-config to correct problem with sudo sub-directory /etc/sudoers.d
- created 1 AFS userid
- 1 Nagios configuration update
- changed TCP tuning parameter on nagios06
- James A:
- Provided interim network cabling for Dell services nodes for Hyper-V evaluation.
- Added some basic Ganglia monitoring for Apache on quattor01.
- Set up rsync mirroring of Dell OpenManage tools and SL5.5 on install02.
- Connected some network cabling in LPD room for Cheney.
- Worked with Alastair to try to understand how the ATLAS software interacts with the software server and the Frontier squid.
- James T:
- WLCG Workshop
- Fixed a bug in ncm-etcservices
- Updated the quattor templates to set the RFIO port in /etc/services and tested on ATLAS nonProd disk servers.
- Helped Kash with gdss67
- Brainstormed alternative storage ideas.
- Cheney:
- reboot of most of the database servers following disk fault
- set up of spare acsls
- fix c2probe for sls
- investigate ssh key changes (it was a server rebuild)
- investigate ssh intrusions
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Decommissioning old batch systems (R 27).
- gdss67 running 7 days acceptance test.
- gdss78 running 7 days acceptance test.
- gdss207 crashed again. (Intervention)
- gdss474 replaced backplane. Another problem. (Intervention)
- Hardware failure stats/graphs.
- gdss231 & 420 low voltage on battery.
- Streamline/areca disk servers crashed due to single faulty drive. (ongoing)
Absences
- Jonathan on partial retirement (not in on Monday and Friday)
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
Development priorities
- All
- Martin:
- Final CPU ITT
- Begin work on Disk ITT evaluation
- RHEL5 for DB systems
- Ian:
- Work on Virtualisation testbed
- Investigation of ATLAS software server
- Facilities Castor instance
- Initial Quattor config for gLite 3.2 LFC
- Tim:
- get back to Facilities Castor planning
- Cheney:
- quat
- Jonathan:
- On leave all week
- James T:
- Streamline 09 testing
- Roll out LHCb WAN tuning
- Deploy fix for /etc/services RFIO port on SL5 disk servers
- Security strategy team stuff
- James A:
- Start porting Quattor server templates to SL5.5.
- Start planning for migration of ATLAS software server to a new Dell services node.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- gdss231 & 420: batteries received; need 2 more people to do the intervention.
- gdss474: send new logs to Viglen.
- gdss380: run 7 days acceptance test.
- Continue decommissioning old batch systems (R 27).
Absences
- Jonathan on partial retirement (not in on Monday and Friday)
- Jonathan on leave on Tuesday - Thursday (so out all week)
- Martin A/L Friday pm