RAL Tier1 weekly operations Fabric 20100524
From GridPP Wiki
Developments
- All:
- Martin:
- Ian:
- Tim:
- Installing new T10KB tape servers
- DMF admin
- Cheney:
- fixed log rotation in Quattor
- changed DMF backups
- replaced battery in EMC kit
- installed ACSLS on kiki
- investigated tape server problems
- fixed metrics problem
- investigated hinode security alerts
- Jonathan:
- updated NIS netgroup file to update list of disk servers and to add new top BDII servers
- updated iptables on enigma to allow Nagios checks via NRPE
- updated NIS group file to bring Tier1-Fabric and Tier1-Castor groups up to date
- worked on AFS solution for Atlas software server
- 4 Nagios configuration updates
- stopped Nagios process on nagios01/05; later issued poweroff command
- updated TWiki documentation for Nagios slave server change
- installed mysql.i386 RPM on nagger to solve problem with MySQL check
- new versions of tier1-nagios-plugins, tier1-sudo-config and tier1-nrpe-config RPMs
- James A:
- Change control proposal for ATLAS software server upgrade.
- Job Plan
- James T:
- Acceptance testing Streamline '09 kit
- Modified Quattor disk server installations to correctly detect and mount XFS data partitions
- SL5/XFS disk server change planning and documentation.
- Deployment allocations sliding block puzzle.
- Job plan
- Discovered networking problems on SL5 disk servers, caused by incorrect router ACLs; fixed after Martin raised an urgent networking ticket.
- Work on benchmarking disk servers for specification in the tender documents.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Decommissioning old batch systems (R 27).
- Daily hardware failures status of Streamline 2009 disk servers to James T.
- gdss380 crashed with single drive failure. (Intervention)
- gdss423 moved into test area with John. (Intervention)
- gdss71 replaced RAID card memory with John. (Fixed)
- gdss434 rebuild completed successfully. Back to castor.
- gdss332 probably has a faulty IPMI card. (Intervention)
Absences
- Jonathan on partial retirement (not in on Monday and Friday)
- Cheney on leave 24th to 28th
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Entries of type Down, At Risk or Cancelled that are in, or planned to go into, GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
Development priorities
- All:
- Martin:
- Ian:
- Tim:
- Work on CMS T10KB migration
- Plan install of Facilities Castor
- New user for DMF
- Sort out DMF backups
- Cheney:
- Sit in sun
- cold beer
- Jonathan:
- start regular check restores of home filesystem
- Job Plan
- stop exports for some old filesystems on csfnfs58
- continue investigations on setting up AFS directory as Atlas software server
- Nagios configuration updates
- James T:
- Acceptance testing Streamline '09 kit
- Allocations and deployment
- Benchmarking disk servers
- Job plan
- James A:
- Taking Streamline 2009 WNs out of Acceptance Testing.
- Benchmarking Streamline 2009 WNs.
- Inspecting Streamline 2009 WNs.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Continuing to decommission old batch systems (R 27).
- Daily hardware failures status of Streamline 2009 disk servers to James T.
- gdss380: sending logs to vendor. Received new RAID card.
Absences
- Jonathan on partial retirement (not in on Monday and Friday)
- Cheney on leave Monday 24th to Friday 28th May
- Jonathan on leave Tuesday 25th May
- Tim on leave 1st-4th June
- Ian on leave Friday pm
Fabric On-Call
Ian: Monday-Thursday; Kash: Friday-Monday
Advanced Warning of Requirements and Blocking issues
Services Issues
- Xen development environment not available for Hadoop testing due to network issues. Work on a resolution has started, but a fix is unlikely to be available for a while yet.