RAL Tier1 weekly operations Grid 20090803
From GridPP Wiki
Revision as of 07:55, 4 August 2009 by Matt hodges (Talk | contribs)
Summary of Previous Week
Developments
- Catalin
  - Tune WMS/LB servers
  - Prepare documentation about the LFC separation
  - Intervention on lcgmon01 (with Kashif and JamesA)
- Derek
  - Finished quattorising torque server config
  - Started quattorising worker nodes (WN)
  - Tested helpdesk database dump speed
- Matt
  - PPS/CASTOR Pre-Prod interviews (Tuesday)
  - Update SL4/SL5 migration plan (distribute to VOs)
  - LFC ATLAS plan finalised (with Catalin)
  - Checked Derek's quattor-generated torque configuration
  - Track down CMS out-of-memory (OoM) jobs (Derek banned user at CE level)
Operational Issues and Incidents
Description | Start | End | Affected VO(s) | Severity |
---|---|---|---|---|
lfc0448 - SMART errors detected | 2009-06-15 | Ongoing | All | Medium |
helpdesk DB tables not backed up | 2009-07-01 | Ongoing | None | Low |
lcgmon01 - SMART errors detected | 2009-07-18 | 2009-07-30 | None | Medium |
Plans for Week(s) Ahead
Development Priorities
- Catalin
- Derek
  - Continue quattorising worker node
  - Document helpdesk installation procedure
- Matt
  - SL4/SL5 Migration
    - Get final SL4/SL5 VO requirements
    - Test torque submit filter scripts (for directing jobs to nodes with sl4 or sl5 properties)
  - LFC:
    - ATLAS front-end separation (DNS alias, GOCDB, IS changes)
    - ATLAS back-end separation planning (depends on timing information for DB cleanup, and final plans for folding in resilience upgrades)
  - FTS:
    - Document procedure to add domain to CMS cloud (added Ukraine to CERN cloud)
    - Document procedure to deal with site name changes (BNL will change soon)
  - WLCG accounting
  - Test deployment of gLite 3.2 (SL5) UI using Quattor
  - Update gLite middleware on SL4 UI
  - SL4/SL5 Migration
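
For the torque submit filter item above, the following is a minimal sketch of what such a filter might look like. It is not the script under test at RAL. It relies on the usual Torque arrangement in which qsub passes the job script to the filter on stdin and submits whatever the filter writes to stdout. The default property (sl5), the function names, and the policy of tagging every untagged job are assumptions made for illustration; the report does not say how jobs are to be split between sl4 and sl5 nodes.

```python
import re
import sys

# Assumption: untagged jobs default to SL5 nodes. The real policy is not stated.
DEFAULT_PROPERTY = "sl5"
# Node properties that count as an explicit OS choice by the user.
OS_PROPERTIES = {"sl4", "sl5"}

# Matches the nodes specification inside a "#PBS -l ..." directive,
# stopping at a comma so other resources (walltime etc.) are untouched.
NODES_SPEC = re.compile(r"(nodes=)([^,\s]+)")


def _tag_spec(match, prop):
    """Append prop to each '+'-separated chunk that names no OS property."""
    tagged = []
    for chunk in match.group(2).split("+"):
        if OS_PROPERTIES.intersection(chunk.split(":")):
            tagged.append(chunk)            # user already chose sl4 or sl5
        else:
            tagged.append(f"{chunk}:{prop}")
    return match.group(1) + "+".join(tagged)


def add_node_property(script, prop=DEFAULT_PROPERTY):
    """Return the job script with an OS node property on every nodes request.

    If the script has no '#PBS ... nodes=' directive, one is inserted after
    the shebang line (or at the top when there is no shebang).
    Resource requests given on the qsub command line are not handled here.
    """
    out = []
    found = False
    for line in script.splitlines():
        if line.startswith("#PBS") and NODES_SPEC.search(line):
            found = True
            line = NODES_SPEC.sub(lambda m: _tag_spec(m, prop), line)
        out.append(line)
    if not found:
        insert_at = 1 if out and out[0].startswith("#!") else 0
        out.insert(insert_at, f"#PBS -l nodes=1:{prop}")
    return "\n".join(out) + "\n"


def main():
    """Filter entry point: job script in on stdin, rewritten script out on stdout."""
    sys.stdout.write(add_node_property(sys.stdin.read()))
    return 0
```

An installed filter would call main() under the usual `if __name__ == "__main__":` guard. A quick offline check is to pipe a sample job script through it and confirm that a request such as `nodes=2:ppn=4` comes out as `nodes=2:ppn=4:sl5`, while a job that already asks for `sl4` is left unchanged.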
Resource Requests
Downtimes
Description | Start | End | Affected VO(s) |
---|---|---|---|
LFC ATLAS front-end separation | August 3 | August 5 | ATLAS |
LFC ATLAS back-end separation | August 26 (TBC) | August 26 (TBC) | All |
Requirements and Blocking Issues
Description | Required By | Priority | Status |
---|---|---|---|
SL5 Worker Node Kickstart | | High | Post-kickstart configuration needed; not yet suitable for bulk deployment |
lfc0448 disk failures | | Medium | Disk replacement needed |
Non-capacity HW for testing | | Medium | Still using the old HW |
Hardware for PPS | | Medium | We have made a commitment to test PPS pre-releases, and have no hardware dedicated for this. |
OnCall/AoD Cover
- Primary OnCall
- Grid OnCall: Matt
- AoD: reschedule Wednesday