RAL Tier1 weekly operations Grid 20090622
Summary of Previous Week
Developments
- Catalin
- Work on LFC streaming
- WMS service draining
- Tests on SL5 WNs
- Derek
- YII Objectives
- Wrote Nagios test for CREAM CE issue
- Investigating possible solutions to LCG CE file limit problem
- Deployed test Quattor configuration for site and top-level BDIIs
- Matt
- Completed R89 Rack Migration templates (whole team).
- Migrated MyProxy service to non-migrating rack.
Operational Issues and Incidents
Description | Start | End | Affected VO(s) | Severity |
---|---|---|---|---|
Production pool account at 32k subdirectory limit | 2009-06-03 | Ongoing | ATLAS | High |
LB01 RAID failure | 2009-06-17 | Ongoing | All | Low |
lfc0448 - SMART errors detected | 2009-06-15 | Ongoing | All | Low |
ce.ngs - SAN problems | 2009-06-16 | 2009-06-17 (Done) | EGEE test VO | Low |
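The 32k subdirectory limit in the table above is the kind of condition that can be watched before it is hit. The following is a minimal sketch of a Nagios-style check, not the test actually deployed at the site. It assumes the pool account area sits on an ext3 filesystem (where a directory can hold at most 31998 subdirectories); the function names and the warning/critical thresholds are hypothetical and chosen only for illustration.

```python
import os
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumption: ext3, which allows 32000 links per inode and therefore
# at most 31998 subdirectories in a single directory.
EXT3_SUBDIR_LIMIT = 31998


def count_subdirs(path):
    """Count the immediate subdirectories of path (symlinks are not followed)."""
    with os.scandir(path) as entries:
        return sum(1 for entry in entries if entry.is_dir(follow_symlinks=False))


def check(path, warn=0.80, crit=0.95, limit=EXT3_SUBDIR_LIMIT):
    """Return (exit_code, message) for the subdirectory usage of path.

    warn and crit are fractions of limit; both are illustrative defaults.
    """
    try:
        n = count_subdirs(path)
    except OSError as exc:
        return UNKNOWN, f"UNKNOWN: cannot read {path}: {exc}"

    frac = n / limit
    detail = f"{n}/{limit} subdirectories ({frac:.0%}) in {path}"
    if frac >= crit:
        return CRITICAL, f"CRITICAL: {detail}"
    if frac >= warn:
        return WARNING, f"WARNING: {detail}"
    return OK, f"OK: {detail}"


def main(argv):
    """Plugin entry point: print one status line and return the exit code."""
    if len(argv) != 2:
        print(f"UNKNOWN: usage: {argv[0]} <directory>")
        return UNKNOWN
    code, message = check(argv[1])
    print(message)
    return code
```

To run it as a plugin, add `sys.exit(main(sys.argv))` at the bottom of the file and pass the directory to monitor as the only argument.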
Plans for Week(s) Ahead
Downtimes
Description | Start | End | Affected VO(s) |
---|---|---|---|
WMS drain ahead of R89 move | 2009-06-17 10:00 | 2009-06-26 12:00 | All |
Development Priorities
- Catalin
- support the R89 move (if needed)
- finalise recovery documentation
- debug the LFC streaming (with Carmine)
- Derek
- Continue investigating solutions to the LCG CE 32k file limit
- Refine YII Objectives
- Quattorise test LFC
- Matt
- Plan SL4 to SL5 migration.
- Migrate MyProxy service to R89 CPU rack.
- R89 late rota cover Thu/Fri.
Requirements and Blocking Issues
Description | Required By | Priority | Status |
---|---|---|---|
SL5 Worker Node Kickstart | | High | Post-kickstart configuration needed; not yet suitable for bulk deployment |
LB01 RAID failure | | Medium | Disk replacement needed |
lfc0448 disk failures | | Medium | Disk replacement needed |
Non-capacity HW for testing | | Medium | Still using the old HW |
Hardware for PPS | | Medium | May need to deploy imminently |
OnCall/AoD Cover
- Primary OnCall
- Catalin: Fri-Sun
- Grid Oncall
- Derek: Mon-Thu
- AoD
- None