RAL Tier1 weekly operations Grid 20091207
From GridPP Wiki
Matt hodges (Talk | contribs) |
Latest revision as of 08:24, 10 December 2009
Summary of Previous Week
Developments
- Alastair
- Add check_world_writable.sh to Nagios
- Make wiki page for Computing requirements
- Run tests for user analysis at RAL.
- Andrew
- Wrote script to generate my metrics from MySQL; script to do disk accounting consistency check with overwatch
- Applied December fairshares in Maui; started November accounting (waiting for tape info now)
- Added checksum checking for CMS tape migrations in PhEDEx; updated PhEDEx dev instance; testing with transfers from CERN
- Corrected check_pbs_efficiencies Nagios plugin
- Prepared & gave short presentation at FacOps meeting about IO testing
- Started work on CMSSW TTreeCache patches IO testing
- CMS computing model spreadsheet
- Catalin
- added 2nd ALICE VOBOX into production
- continued work on MySQL systems audit
- put Frontier fix in place; not yet confirmed
- svn'ed the ALICE xrootd installation docs
- still waiting for feedback from LFC@CERN on recovery and consistency checks
- Derek
- Continuing work on quattorising helpdesk frontend
- Matt
- Richard
- Quattor template(s) for a production CIP server
- Read through JW's Nagios slides as prep. for the upcoming NRPE class
- Tuned into GDB meeting
- Added a couple of items to the Fabric team's quattor documentation (https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/QuattorImplementationNotes)
- CASTOR activities:
- Added to quattor a set of templates for SLC 4.6 as a "dry run" for the SLC 4.8 version of the templates
- Worked with CERN folk to try and get a set of quattor templates for SLC 4.8
- Re-arranged some of the quattor templates for PPS instance to simplify config file handling
- Built further instances of the server types to check installation process
- Reported "incomplete build" quattor issue to mailing list and found others are seeing the same problem
- Mayo
- Worked on new metrics system: took feedback on newly added features and fixed the bugs that testing revealed
- Began work on admin interface for metrics system
- Had a meeting about extending the new metrics system to include GridPP users
- Worked on automating tape robot spreadsheet project
Operational Issues and Incidents
Description | Start | End | Affected VO(s) | Severity | Status |
---|---|---|---|---|---|
250 Atlas jobs deleted (GGUS #53813) | Wed 2 Dec 17:40 | Wed 2 Dec 18:00 | Atlas | Medium | Resolved - believed not to be a site issue |
Plans for Week(s) Ahead
Plans
- Alastair
- Deploy Disk server.
- Contact Ganga developers about adding better error information to ATLAS jobs for normal users running at RAL.
- Test poweruser analysis at RAL.
- Away on Wednesday.
- Andrew
- PhEDEx: complete checksum checking tests on dev instance; check if dev instance can be run from another VOBOX easily
- Complete November accounting
- Continue CMSSW TTreeCache IO testing
- Training: Nagios plugins
- Attend relevant meetings at CMS week
- Catalin
- follow up Frontier fix
- continue working on backup/recovery
- ready to start deployment of LHCb SL5 VOBOX
- plans to migrate MySQL server(s) to SL5 64-bit
- decommission SL4 ALICE VOBOXes
- Derek
- Change control process via RT
- Matt
- Swap in resilient CIP plugin on site BDIIs
- Tier-1 Review resilience talk
- Richard
- NRPE training
- CASTOR activities:
- Complete the "data configurator" tool to handle disk servers as well as other server types
- Progress the initial setup of databases on a new instance
- Continue activity on SLC 4.8 templates
- Mayo
- Work on metrics system admin interface and documentation
- Add Sarah Pearce to metrics system for testing with regard to the GridPP extension
- Continue working on automated spreadsheet project
- NRPE Nagios plugins training
Resource Requests
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s) |
---|---|---|---|---|---|
Requirements and Blocking Issues
Description | Required By | Priority | Status |
---|---|---|---|
LHCb SL5 64bit VOBOX deployment using Quattor | 25 Nov 2009 | Medium | Quattor recipe not yet available (RT#53392) |
Hardware for testing LFC/FTS resilience | | High | DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; request for HW made through RT Fabric queue |
Hardware for PPS | | High | We have made a commitment to test PPS pre-releases, and have no hardware dedicated for this. |
Hardware for Grid Services testbed | | Medium | |
Hardware for SL5 64-bit MySQL main server | | Medium | Plan to migrate to SL5 64-bit by mid January |
OnCall/AoD Cover
- Primary OnCall:
- Grid OnCall: Catalin
- AoD: