RAL Tier1 weekly operations Fabric 20091130
From GridPP Wiki
Latest revision as of 15:14, 1 December 2009
Summary of week gone
Developments
- All:
- Martin:
- Work on Viglen08 disk acceptance tests
- Work on test nodes for LFC database resilience tests
- Various finance and procurement issues
- Ian:
- Wrapped up FP7 bid - submitted Tuesday
- Tested new features in Quattor for Nagios slave server
- Imported Quattor updates from QWG - fixed a couple of resulting issues
- James T:
- CRISTAL2 preparation
- Set up ancillary network for Streamline08 disk servers with James A.
- Quattorisation of disk servers
- CRISTAL2 Wed - Fri
- Fabric on call Mon - Thurs
- Primary on call Fri - Sun
- Jonathan:
- updated kernels on NIS servers and rebooted
- removed mount of /home/csf and added soft-links for Bfactory users for farm nodes
- wrote paper about backup policy, recovery etc for Tier1 review
- increased quota for LHCb AFS volume
- cleared up atlasbackup problems for some nodes
- created archive backups for ccsc07/15 for Richard
- Nagios configuration updates
- released new versions of RPMs and tier1-nrpe-config
- rebooted nagger for new kernel
- worked on Quattor configuration of Nagios slave (with help from Ian)
- James A:
- Caught up after leave.
- Tried to focus on SINDES.
- Helped with general Quattor issues where needed.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- gdss163 double disk failure. (Finish test)
- gdss95 and gdss134 given back to Castor
- Created graphs of drive failures for MJB.
- Working on 2008 disk servers and worker nodes.
- Working on gdss67, 163 and 282.
Operational Issues and Incidents
| Index | Description | Start | End | Severity | Affected VO(s) |
|---|---|---|---|---|---|
| | EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays | Tuesday 6/Oct am | UPS issues to be fixed | Catastrophic | All |
| | Gdss138 double disk failure: two drives failed in quick succession (30 minutes) | Monday 0530-0600 | Ongoing | Severe | LHCb Dst data. Data loss confirmed |
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
| Component | Description | Start | End | Affected VO(s) | Type |
|---|---|---|---|---|---|
Development priorities
- All
- Work on evacuating A1 Upper (Castor LSF/FlexLM triplet)
- Martin:
- Viglen08 disk acceptance solution
- Ian:
- Finalising new flex license servers for LSF
- Further Quattor tutorial for Cheney
- Assist with new disk servers
- James T:
- Quattorisation of disk servers
- Decision on Viglen 2008 suggested solution
- Primary on call Mon - Thurs
- Jonathan:
- Quattor implementation for Nagios slave
- security updates to disk servers to prevent general user logins
- Nagios configuration updates
- James A:
- Continue with SINDES.
- Make some fixes to the hardware database for Kash.
- Update and make changes to Cacti.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Return gdss67 to the Castor team after finishing test.
- gdss138 double disk failure. (Intervention)
- Continue working on 2008 disk servers and worker nodes.
- Continue working on gdss67, 138, 163 and 282.
Absences
- Jonathan: S/L Monday
- Jonathan: A/L Thursday am
Fabric On-Call
- Mon-Thu: James T Primary on call
- Fri-Sun: Ian Primary on call
Advanced Warning of Requirements and Blocking issues
- Unable to proceed with Atlas TAG migration to 64-bit because the arrays are being used for 3D systems while the EMC kit is flaky.
Services Issues
- Various requests for hardware.
- Working on various hardware requests for Services team.