RAL Tier1 weekly operations Grid 20100118
From GridPP Wiki
Summary of Previous Week
Developments
- Alastair
- Looked into the results of the HammerCloud test to understand Frontier performance.
- Made progress with getting ATLAS power users to run at the Tier 1.
- Updated RAL PP twiki with feedback from ATLAS meeting.
- Andrew
- Added checksum checking of migrated files to PhEDEx production instance
- Wrote a Nagios plugin for checking the user proxy on the CMS VOBOX
- Added a (hidden) option to the capacity & efficiency Ganglia pages for specifying units (KSI2K or HEP-SPEC06)
- Added a HEP-SPEC06 option to all UB schedule scripts
- Wrote documentation about adding a new VO to the UB schedule scripts
- Preparations for CMS Data Ops training
- Training: online display screen equipment course & self-assessment
- Catalin
- worked on SL5 LHCb VOBOX quattorised deployment
- closed the t2k.org issue (user error)
- WMS03 (non-LHC) update
- Derek
- Tested CREAM CE reinstallation instructions
- Created and tested a Quattor template to implement the BLParser service
- Added updated VOMS certificates to the YAIM config RPM
- Listened in on GDB
- Matt
- Tested R-GMA recovery (Flexible Archiver component)
- Worked with Carmine on LFC recovery plans
- Produced 2009/Q4 FTS metrics for quarterly report
- Richard
- 2 days A/L
- Finished plan for BDII changes
- Continued writing discussion document for DNS proposal
- Continued work on the CASTOR pre-prod instance
- Built a test machine as a BDII server to test quattor templates
- Worked with JK and GS on a script to check CASTOR checksums
- Mayo
- Encrypted passwords within the Metric system
- Added a change password feature to the metric system
- Fixed a bug within the Metric system
- Worked on the tape statistics spreadsheet project: converting Excel charts to HTML
Operational Issues and Incidents
Description | Start | End | Affected VO(s) | Severity | Status |
---|---|---|---|---|---|
FTS DB performance problems | 20100115 11:00 | 20100115 16:00 | LHC | High | Load on Orisa nodes redistributed across nodes by reconfiguring FTS agents. |
Plans for Week(s) Ahead
Plans
- Alastair
- Run (hopefully) final tests on the Frontier server (after Catalin has performed the servlet update) to confirm it is working well.
- Continue updating the RAL PP twiki.
- Complete version 1 of the Tier 1 VO requirements using information provided by Raja.
- Possibly away or working from home on Tuesday (depends on how long a hospital appointment takes).
- Andrew
- Joining CMS Data Ops - away at CERN for training
- Catalin
- finalise SL5 LHCb VOBOX deployment (hotswapping issues)
- follow up some post-reboot WMS issues with CERN
- work on tidying up the LFC schemas (with Carmine)
- exercise ALICE xrootd (manager + peer) re-installation (on old SL4 VOBOXes)
- Derek
- Implementing BLParser on lcgbatch01
- Completing testing of the CE and CREAM CE for intervention changes
- gLExec and SCAS on SL5
- Matt
- Finish Grid Services Disaster Recovery document
- Planning ATLAS/R89 co-hosting of Grid Services
- Provide test site BDII for CIP upgrade testing
- Richard
- Finish discussion document for DNS proposal
- Continue working on CASTOR pre-prod instance
- Further work on the Quattor templates for the BDII server
- Re-do existing STP time bookings and enter EGEE timesheets back to the starting date
- 2 days A/L
- Mayo
- Automating the Metric report system
- Adding charts to the Metric system
- Web interface and script to fetch data for the tape robot statistics spreadsheet project
Resource Requests
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s) |
---|---|---|---|---|---|
FTS DB problems | Orisa, FTS agents | Unscheduled | 20100115 11:00 | 20100115 16:00 | LHC |
Requirements and Blocking Issues
Description | Required By | Priority | Status |
---|---|---|---|
Hardware for testing LFC/FTS resilience | | High | DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; request for HW made through RT Fabric queue |
Hardware for Testbed | | High | Required for change validation, load testing, etc. Also for phased rollout (which replaces PPS). |
Hardware for SCAS servers | Feb 1 2010 | High | Hardware required for production SCAS servers - required to be in place by end of Feb |
Hardware for SL5 CREAM CE for Non LHC SL5 batch access | | Medium | Hardware required for CREAM CE for non-LHC VOs |
Pool accounts for Super B VO | | Medium | Required to enable the Super B VO on the batch farm |
OnCall/AoD Cover
- Primary OnCall: Catalin (Mon-Sun)
- Grid OnCall:
- AoD: Catalin (Wed)