RAL Tier1 weekly operations Grid 20091026
From GridPP Wiki
Latest revision as of 09:54, 2 November 2009
Summary of Previous Week
Developments
- Alastair
  - Perform security audit
  - Learn how to deploy disk servers for ATLAS
  - Discuss job plan with Matt
  - Discuss allocation of ATLAS disk space with Brian Davies and Stephen Burke
  - Go to Shared Service training
- Andrew
  - FTS channel adjustments: timeouts doubled for STAR-FIHIPT2 & RALLCG2-CLOUDCMSITALY
  - Disk server deployment (5 servers to cmsFarmRead)
  - APEL & PBS comparisons for CREAM CE
  - Correcting PBS jobs MySQL table for October
  - Resolved problem with PhEDEx mss-remove agent
  - Upgraded PhEDEx to 3.2.9
  - Completed CMS "dark" data removal
  - Investigating consistency between the missing-file lists from PhEDEx and the CASTOR team
- Catalin
  - CRISTAL 1 course
  - Finished kickstarts for FronTier and SL5 VOBOX; now waiting for HW
  - Assisted the LFC ATLAS cleaning operation
  - Disk server deployment for ALICE
- Derek
  - Updating VO configuration in Quattor
  - Testing helpdesk backup
  - CRISTAL level 1
  - SSC Training
  - Out sick 1 day
- Matt
  - Determine LHCb service class requirements for new allocation
  - Disk deployment meeting
- Richard
  - ORACLE SSC Training
  - Further disk server deployments into Atlas NonProd (including updates to the TWiki instructions)
  - Continued work on BDII/Quattor task
  - CASTOR activities: read through SDW's training slides; work on new pre-prod instance
- Mayo
  - Worked on the new Metrics Gathering System
  - Thought Bubble website now in operation
  - Initial research into IPMI power control project
Operational Issues and Incidents
Description | Start | End | Affected VO(s) | Severity | Status |
---|---|---|---|---|---|
Plans for Week(s) Ahead
Plans
- Alastair
  - Finish security audit (if not already finished)
  - Go through gLite training
  - Go through CASTOR training slides
  - Learn about FTS and the outputs that I will take over from Brian
  - Update CPU efficiencies
- Andrew
  - Attend CMS Offline & Computing Workshop, CERN
- Catalin
  - Ready to deploy SL5 VOBOX for ALICE (waiting for HW)
  - Ready to deploy FronTier/squid for ATLAS (waiting for HW)
  - Finish ALICE disk server deployment
  - Start WMS03 drain
- Derek
  - Test helpdesk restore
  - Update Quattor VO configuration
  - Update CE documentation
- Matt
  - Check priorities for deploying Viglen 08 kit after it passes acceptance tests
  - VO requirements capture
  - Disaster recovery planning
- Richard
  - RPM packaging and installation for new BDII connection throttling script
  - RPM packaging and installation for new BDII monitoring script
  - Complete Quattor config/build for BDII servers
  - CASTOR activities: continue work on new pre-prod instance
- Mayo
  - Continue work on the new Metrics Gathering System
  - Begin Stage 2 of on-call documentation project
  - Continue research into IPMI power control project
Resource Requests
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s) |
---|---|---|---|---|---|
WMS03 hotswappable | lcgwms03.gridpp.rl.ac.uk | Scheduled Outage | Oct 30 (09:00) | Nov 05 (16:00) | non-LHC |
Requirements and Blocking Issues
Description | Required By | Priority | Status |
---|---|---|---|
HW for Squid deployment | ATLAS | High | request made via RT Fabric queue |
HW for FronTier deployment | ATLAS | High | request made via RT Fabric queue |
HW for SL5 64-bit VOBOX | ALICE | High | request made via RT Fabric queue |
Hardware for testing LFC/FTS resilience | | High | DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; request for HW made through RT Fabric queue |
Non-capacity HW for testing | | Medium | Still using the old HW |
Hardware for PPS | | Medium | We have made a commitment to test PPS pre-releases, and have no hardware dedicated for this. |
OnCall/AoD Cover
- Primary OnCall: Catalin (Mon-Thu)
- Grid OnCall:
- AoD: