RAL Tier1 weekly operations Grid 20091221
From GridPP Wiki
Summary of Previous Week
Developments
- Alastair
- Deployed 2 Disk servers.
- Contacted Panda/Ganga developers to improve error information for ATLAS jobs at RAL.
- Tested poweruser analysis at RAL; found a problem with the CERN WMS.
- Andrew
- Submitted change request document for FTS channel timeout adjustment; applied change
- Updated check_pbs_efficiencies.pl Nagios script to allow automatic killing of low-efficiency jobs for selected VOs
- Resolved failing SRMv2-user CMS SAM test
- TTreeCache & read-coalescing IO testing on reco & skimming jobs
- Added Ganglia monitoring of CMS tape migrations (from PhEDEx logs, not CASTOR)
- Investigated various CMS issues
- Catalin
- worked on old/new ALICE VOBOXes
- no progress on LHCb VOBOX quattorising
- still waiting for feedback from LFC@CERN on recovery and consistency checks
- some work on MySQL migration
- attended various meetings
- Derek
- Moved change control system from dev helpdesk to prod helpdesk
- Produced metrics report
- Implemented cron jobs to back up lcgcenfs files to CEs
- Matt
- Tested new production CIP on test site BDII
- Tier-1 Review
- Richard
- Continued plan for proposed BDII changes during January
- Wrote a script to dump our DNS domain to simplify "which machine is that" type queries arising from monitoring alerts/emails
- CASTOR activities:
- Defined disk and tape servers to use with new pre-prod instance
- Mayo
- Created admin UI for metric system and wrote system user documentation
- Created user account for Sarah Pearce to enable testing with regard to the possible GridPP extension
- Attended Cheney's NRPE training
- Worked on automating tape robot spreadsheet project
Operational Issues and Incidents
Description | Start | End | Affected VO(s) | Severity | Status |
---|---|---|---|---|---|
Plans for Week(s) Ahead
Plans
- Alastair
- Try to fix poweruser issues
- Look into "slow" FTS rates in UK Cloud.
- Andrew
- On A/L next week
- Catalin
- continue work on MySQL migration
- follow up issue with t2k.org 'zero size' LFC entries
- minor issues with ALICE VOBOXes central monitoring
- decommission old SL4 ALICE VOBOXes
- Derek
- Document process for coping with catastrophic failure of lcgcenfs
- Document process for breaking helpdesk mail loops
- Matt
- Switch Site BDIIs to new CIPs
- GridPP4 input
- R-GMA Registry recovery testing
- Investigate APEL publishing problems (lcgbatch01)
- Richard
- Finish off the plan for proposed BDII changes during January
- Work with MB on getting a DNS zone delegated to Tier1
- Work with JA/DR on placing a link to "DNS dump" script on Tier1 web page
- CASTOR activities:
- Rebuild disk servers to be used in new pre-prod instance
- Update the software on tape server for new pre-prod instance
- Continue activity on SLC 4.8 templates
- Mayo
- Work on Metric system: adding change password feature for users / report printing features
- Work on possible extension of system to include GridPP
- Continue working on automated spreadsheet project
Resource Requests
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s) |
---|---|---|---|---|---|
Requirements and Blocking Issues
Description | Required By | Priority | Status |
---|---|---|---|
LHCb SL5 64bit VOBOX deployment using Quattor | 25 Nov 2009 | Medium | Quattor recipe not yet available (RT#53392) |
Hardware for testing LFC/FTS resilience | | High | DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; request for HW made through RT Fabric queue |
Hardware for PPS | | High | We have made a commitment to test PPS pre-releases, and have no hardware dedicated for this. |
Hardware for Grid Services testbed | | Medium | |
OnCall/AoD Cover
- Primary OnCall:
- Grid OnCall:
- AoD: