RAL Tier1 weekly operations Grid 20100208
Revision as of 14:26, 12 February 2010 by Andrew lahiff (Talk | contribs)
Operational Issues
Description | Start | End | Affected VO(s) | Severity | Status
---|---|---|---|---|---
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium |
Blocking Issues
Description | Requested Date | Required By Date | Priority | Status
---|---|---|---|---
Hardware for testing LFC/FTS resilience | | | High | DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; request for HW made through RT Fabric queue
Hardware for SCAS servers | 2010-02-01 | | High | Hardware required for production SCAS servers, required to be in place by end of Feb [Done]
Hardware for Testbed | | | Medium | Required for change validation, load testing, etc. Also for phased rollout (which replaces PPS). Have initial hardware.
Hardware for SL5 CREAM CE for non-LHC SL5 batch access | | | Medium | Hardware required for CREAM CE for non-LHC VOs [Done]
Developments/Plans
Highlights for Tier-1 Ops Meeting
- LHC schedule 2010/2011 (Alastair)
- Grid Services Team: Out of office all day Tuesday (CR03)
- CMS: RAL received some Commissioning10 cosmics data from CERN (~60 MB/s, 2 days, ~5 TB); 24/7 operation has started today
Highlights for Tier-1 VO Liaison Meeting
- SCAS/glexec deployment
Detailed Individual Reports
Alastair
- Continue work on computing requirements / Capacity Planning. [Ongoing]
- Write Nagios script to warn when space tokens are near full. [Done]
- Work with Brian + Chris on re-deploying/draining disk servers to ATLAS space tokens. [Ongoing]
- Look into ATLAS jobs hitting 3GB memory limit. [Ongoing]
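The space-token warning script above can be sketched as a standard Nagios-style check. This is a minimal illustration only: the token name, thresholds, and usage figures are hypothetical, and the real script would query the SRM for current per-token usage rather than taking hard-coded values.

```python
#!/usr/bin/env python
# Sketch of a Nagios-style space-token fullness check.
# Token names, thresholds, and usage figures here are illustrative;
# a real plugin would fetch current usage from the SRM and exit with
# the returned code so Nagios can raise an alert.

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL = 0, 1, 2

def check_token(name, used_tb, total_tb, warn=0.80, crit=0.95):
    """Return (exit_code, message) for one space token."""
    frac = used_tb / total_tb
    if frac >= crit:
        return CRITICAL, "CRITICAL: %s %.0f%% full" % (name, 100 * frac)
    if frac >= warn:
        return WARNING, "WARNING: %s %.0f%% full" % (name, 100 * frac)
    return OK, "OK: %s %.0f%% full" % (name, 100 * frac)

if __name__ == "__main__":
    # Example figures only (hypothetical token and numbers).
    code, msg = check_token("ATLASDATADISK", used_tb=86.0, total_tb=100.0)
    print(msg)
```

A plugin like this maps cleanly onto Nagios' OK/WARNING/CRITICAL convention, so the same check can serve both the dashboard and callout paging.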
Andrew
- Restarted backfill at RAL (re-reco on BeamCommissioning09 Cosmics)
- Investigated new CREAMCE monitoring issues
- Adding PhEDEx-CASTOR consistency Ganglia monitoring [Done]
- Test another new CMSSW I/O optimisation patch & report to developer [Done]
- Added monitoring of PhEDEx agent restarts [Done]
- Add warning for CMS files that get stuck in migration queue for weeks
- Complete document about automatic job killing [Ongoing]
Catalin
- Tested ALICE xrootd (manager + peer) re-installation (with Chris) [Done]
- Improved Nagios configuration knowledge [Ongoing]
- Frontier Nagios checks [Ongoing]
- Work on LFC schema tidy-up (with Carmine) [Ongoing]
- Quattorise additional LFC frontends (with Ian)
- Install APEL patches on MONbox (so that installed capacity is published correctly)
Derek
- Installing SL5 SCAS server
- Testing SL5 GLexec WN
Matt
- Plan ATLAS/R89 co-hosting of Grid Services
- Configure FTS for T2K and request a dedicated diskpool
- Test upgrade path from FTS2.1 to FTS2.2 on orisa
Richard
- Nagios plugin for checking rtcpclientd server logs on CASTOR stagers [Done]
- Writing a roadmap for completing the quattorisation of Grid Services machines
- Setting up a quattor template for a top-level BDII that works around issues in the stock QWG templates
- CASTOR items:
  - Completed setting up disk servers for use with pre-prod CASTOR instance [Done]
  - Waiting for resolution of powering off / crashing problem on ccse02
Mayo
- Create system for exporting Metrics report to spreadsheet [Done]
- Adding bar chart to Metric system [Done]
- Admin interface for Metric System [Done]
- Update documentation for Metric System
- Configure assigned NRPE Nagios plugins
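The NRPE plugin assignments above are site-specific, but for illustration, entries in an `nrpe.cfg` take this form (the plugin paths and thresholds below are generic examples, not RAL's actual configuration):

```
# /etc/nagios/nrpe.cfg -- illustrative command definitions only
command[check_load]=/usr/lib/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
command[check_disk]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
```

Each `command[name]` line exposes one local check that the central Nagios server can invoke via `check_nrpe`.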
VO Reports
ALICE
- Plans to use the 2nd VOBOX at sites in production if and only if the primary one is not behaving well
ATLAS
CMS
- Proposal to no longer support PhEDEx on SLC4 after March 1 (date not definite yet, to be discussed by FacOps)
- RAL is now the only Tier 1 with CREAMCE job monitoring issues (before downtime, only jobs killed by batch system affected; now all jobs are affected)
- RAL has started receiving some Commissioning10 data from CERN (~60 MB/s, 2 days, ~5 TB)
- Current/upcoming activities at RAL: backfill (in progress)
LHCb
OnCall/AoD Cover
- Primary OnCall: Catalin (Mon-Sun)
- Grid OnCall:
- AoD: