RAL Tier1 weekly operations Grid 20090713
From GridPP Wiki
Derek ross
Latest revision as of 13:54, 13 July 2009
Summary of Previous Week
Developments
- Derek
- CE Services restarted
  - Quattorising Maui configuration; updating Torque server profile to use the new QWG release
  - Listened in to the GDB meeting
- CA updates on Grid Services nodes
- Moved lcg-support alias to point at Support queue
- Removed ops publishing from LHCb LFC
Operational Issues and Incidents
Description | Start | End | Affected VO(s) | Severity |
---|---|---|---|---|
Production pool account at 32k subdirectory limit | 2009-06-03 | Ongoing | ATLAS | High |
LB01 RAID failure | 2009-06-17 | Ongoing | All | Low |
lfc0448 - SMART errors detected | 2009-06-15 | Ongoing | All | Low |
lcgpx0619 - RAID failure | 2009-07-03 | Ongoing | All | Low |
helpdesk DB tables not backed up | 2009-07-01 | Ongoing | none | Medium |
Plans for Week(s) Ahead
Development Priorities
- Derek
- Continue quattorising torque server
- Schedule FTS drain before Castor downtime
- Implement OPN ticket merging in Notifications queue
- Update blog versions
Resource Requests
Downtimes
Description | Start | End | Affected VO(s) |
---|---|---|---|
LFC ATLAS separation | 2009-07-20 08:00 | 2009-07-20 17:00 | All |
Requirements and Blocking Issues
Description | Required By | Priority | Status |
---|---|---|---|
SL5 Worker Node Kickstart | | High | Post-kickstart configuration needed; not yet suitable for bulk deployment |
LB01 RAID failure | | Medium | Testing hotswap configuration |
lfc0448 disk failures | | Medium | Disk replacement needed |
Non-capacity HW for testing | | Medium | Still using the old HW |
Hardware for PPS | | Medium | May need to deploy imminently |
OnCall/AoD Cover
- Primary OnCall
- Catalin (Wed-)
- Grid Oncall
- Derek (Mon,Tue)
- AoD