RAL Tier1 weekly operations Grid 20091005
Revision as of 14:43, 5 October 2009 by Andrew lahiff
Summary of Previous Week
Developments
- Andrew
- Deployed pbslogs2mysql and gmetric-eff.pl on lcgbatch01 using Quattor
- Deployed October CPU allocations using Quattor
- Updated PhEDEx to 3_2_6; wrote documentation
- Changed number of streams for CLOUDCMSUS-RALLCG2 FTS channel
- Preparations for testing CMS skimming without LazyDownload
- Investigated very low efficiencies for bio
- Adjustments to LFC ganglia monitoring
- Training: manual handling; Quattor; APR
- Catalin
- Re-installed WMS02 and made it hotswappable; updated documentation
- Quattor and Castor training
- Some work on WMS purging ('held jobs' issue)
- Derek
- n/a
- Matt
- Reviewed Grid Services installation/recovery documentation
- BDII reconfigured for CIP upgrade
- Generated disk deployment requests for Q4/09 allocations
- Richard
- Installed BDII using Quattor; Quattor training
- Increased use of RT reports
Operational Issues and Incidents
Description | Start | End | Affected VO(s) | Severity |
---|---|---|---|---|
lcgce07 partition failure | 18/09/09 | | none yet (potentially alice, cms, lhcb lose resilience) | medium |
Plans for Week(s) Ahead
Development Priorities
- Alastair
- Induction
- Andrew
- Updates to September PBS jobs database
- Testing CMS skimming without LazyDownload
- Delete some CMS "dark" data
- Add monitoring of PhEDEx watchdog agents to Nagios
- Training: CASTOR; welcome to library; display screen equipment
- Catalin
- Review the requests for Frontier deployment for ATLAS
- Chase ALICE SW area issue
- Work on WMS purging
- Derek
- n/a
- Matt
- Disaster recovery planning
- Richard
- n/a
Resource Requests
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s) |
---|---|---|---|---|---|
WMS02 hotswappable | lcgwms02 | Scheduled Outage | Sep 22 (16:00) | Sep 30 (17:00) | LHC |
Oracle ASM patching | FTS, FTM, LFCs | Scheduled At Risk | Oct 01 (13:30) | Oct 01 (16:30) | All |
Requirements and Blocking Issues
Description | Required By | Priority | Status |
---|---|---|---|
Non-capacity HW for testing | | Medium | Still using the old HW |
Hardware for PPS | | Medium | We have made a commitment to test PPS pre-releases, and have no hardware dedicated for this. |
Hardware for testing LFC/FTS resilience | | Medium | DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience |
OnCall/AoD Cover
- Primary OnCall: Catalin (Fri-Sun)
- Grid OnCall: Catalin (Mon), Matt (Tue-Thu)
- AoD: