RAL Tier1 weekly operations Fabric 20100705
From GridPP Wiki
Latest revision as of 10:46, 5 July 2010
Developments
- All:
- Martin:
- Ian:
- Facilities Castor planning
- Setting up new Castor Dell boxes for acceptance testing
- prepared (and fixed) ncm-resolver to allow config of resolv.conf
- added quattor components for Castor to repositories
- Tim:
- Check test T10KB migrations went OK
- Start mass CMS T10KB migration
- DMF - set up "site bit" in DMF to control migration for user
- Finish planning for FaC instance
- Sort out STK maintenance contract
- Jonathan:
- worked on deleting old filesystems from csfnfs55/58 and gdss130/132
- updated RPMs on NIS servers and mail server
- added userid accounts for BADC testing of Castor (RT #61654)
- brought Quattor up to date for minor manual change to nagios06.gridpp.rl.ac.uk
- updated kernel RPMs on nagger and rebooted
- updated RPMs on nagios0[2-4] and rebooted for new kernel
- issued new version of RPM tier1-sudo-config (1.0-53)
- set kernel tuning parameter on nagios06
- James A:
- Cabled up several of the Dell services nodes; more remain to be done.
- Finished troubleshooting multipath configuration problems.
- Developed several new squid ganglia metrics with Alastair for monitoring the frontier service.
- James T:
- HDD replacements in Kash's absence
- TAVS (Tier1 Array Verify Scheduler)
- Added support for Adaptec cards
- Fixed bug in error handling (thanks to John for spotting that)
- Built RPM (still needs to be deployed)
- Change control for LHCb WAN tuning
- Streamline 2009 disk server problems
- Meeting with Gareth Glaccum
- Shipped machine to LSI in the USA for testing
- Cheney:
- database backup restore testing
- chasing down rogue emails
- document how to replace array disks
- turning off old ads arrays
- show ops details of using ipmi
- workaround for sls availability
- put live tsbn web version
- testing of new version amanda backup rpm
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Decommissioning old batch systems. (R 27)
- gdss67 filesystem problem. Replaced RAID card. (Intervention)
- gdss78 being re-installed.
- gdss207 given back to castor.
- gdss474 waiting for backplane.
- gdss486 collected by Gareth (Streamline) for testing.
- gdss512 shipped to LSI (USA) for testing.
- gdss420 low voltage on battery. (waiting for battery)
- Streamline/Areca disk servers crashed due to single faulty drive. (ongoing)
Absences
- Jonathan on partial retirement (not in on Monday and Friday)
Operational Issues and Incidents
| Index | Description | Start | End | Severity | Affected VO(s) |
|---|---|---|---|---|---|
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Entries of type Down, At Risk or Cancelled that are in, or planned to go into, GOCDB
| Component | Description | Start | End | Affected VO(s) | Type |
|---|---|---|---|---|---|
Development priorities
- All:
- Martin:
- Ian:
- Setting up test HyperV server for services virtualisation
- Facilities Castor instance planning
- WLCG Collaboration workshop
- Tim:
- Monitor dust levels
- New "check_tape_pools" script
- Cheney:
- change control docco improvements
- Jonathan:
- Administrator on Duty on Wednesday and Thursday (during WLCG collaboration workshop)
- continue work on shutting down old NFS servers
- create Nagios slave for worker nodes
- Nagios configuration updates
- James T:
- Fix bug in ncm-etcservices Quattor component
- LHCb WAN tuning
- File system creation and testing on gdss67
- WLCG workshop Wednesday to Friday
- James A:
- Continue cabling up Dell services nodes.
- Support systems while team members are at the WLCG Collaboration workshop.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- gdss78 needs re-install and array creation from scratch.
- gdss380: run 7-day acceptance test.
- Continue decommissioning old batch systems. (R 27)
Absences
- Jonathan on partial retirement (not in on Monday and Friday)
- Ian, James T and Martin at WLCG collaboration workshop in London Wednesday-Friday
Fabric On-Call
- Ian primary on call, Monday-Sunday, except Saturday