RAL Tier1 weekly operations Grid 20100208
Revision as of 14:26, 12 February 2010 by Andrew lahiff (Talk | contribs)
Operational Issues
Description | Start | End | Affected VO(s) | Severity | Status
---|---|---|---|---|---
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium |
Blocking Issues
Description | Requested Date | Required By Date | Priority | Status
---|---|---|---|---
Hardware for testing LFC/FTS resilience | | | High | DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; request for HW made through RT Fabric queue
Hardware for SCAS servers | 2010-02-01 | | High | Hardware required for production SCAS servers, required to be in place by end of Feb [Done]
Hardware for Testbed | | | Medium | Required for change validation, load testing, etc. Also for phased rollout (which replaces PPS). Have initial hardware.
Hardware for SL5 CREAM CE for non-LHC SL5 batch access | | | Medium | Hardware required for CREAM CE for non-LHC VOs [Done]
Developments/Plans
Highlights for Tier-1 Ops Meeting
- LHC schedule 2010/2011 (Alastair)
- Grid Services Team: Out of office all day Tuesday (CR03)
- CMS: RAL received some Commissioning10 cosmics data from CERN (~60 MB/s, 2 days, ~5 TB); 24/7 operation has started today
Highlights for Tier-1 VO Liaison Meeting
- SCAS/glexec deployment
Detailed Individual Reports
Alastair
- Continue work on computing requirements / Capacity Planning. [Ongoing]
- Write Nagios script to warn when space tokens are near full. [Done]
- Work with Brian + Chris on re-deploying/draining disk servers to ATLAS space tokens. [Ongoing]
- Look into ATLAS jobs hitting 3GB memory limit. [Ongoing]
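The space-token warning script above can be sketched as a standard Nagios-style check. This is a minimal illustration only: the token name, thresholds, and usage figures are hypothetical, and the real script would query the SRM for current per-token usage rather than taking hard-coded values.

```python
#!/usr/bin/env python
# Sketch of a Nagios-style space-token fullness check.
# Token names, thresholds, and usage figures here are illustrative;
# a real plugin would fetch current usage from the SRM and exit with
# the returned code so Nagios can raise an alert.

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL = 0, 1, 2

def check_token(name, used_tb, total_tb, warn=0.80, crit=0.95):
    """Return (exit_code, message) for one space token."""
    frac = used_tb / total_tb
    if frac >= crit:
        return CRITICAL, "CRITICAL: %s %.0f%% full" % (name, 100 * frac)
    if frac >= warn:
        return WARNING, "WARNING: %s %.0f%% full" % (name, 100 * frac)
    return OK, "OK: %s %.0f%% full" % (name, 100 * frac)

if __name__ == "__main__":
    # Example figures only (hypothetical token and numbers).
    code, msg = check_token("ATLASDATADISK", used_tb=86.0, total_tb=100.0)
    print(msg)
```

A plugin like this maps cleanly onto Nagios' OK/WARNING/CRITICAL convention, so the same check can serve both the dashboard and callout paging.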
Andrew
- Restarted backfill at RAL (re-reco on BeamCommissioning09 Cosmics)
- Investigated new CREAMCE monitoring issues
- Adding PhEDEx-CASTOR consistency Ganglia monitoring [Done]
- Test another new CMSSW I/O optimisation patch & report to developer [Done]
- Added monitoring of PhEDEx agent restarts [Done]
- Add warning for CMS files that get stuck in migration queue for weeks
- Complete document about automatic job killing [Ongoing]
Catalin
- Tested ALICE xrootd (manager + peer) re-installation (with Chris) [Done]
- Improved Nagios configuration knowledge [Ongoing]
- Frontier Nagios checks [Ongoing]
- Work on LFC schema tidy-up (with Carmine) [Ongoing]
- Quattorise additional LFC frontends (with Ian)
- Install APEL patches on MONbox (so that installed capacity is published correctly)
Derek
- Installing SL5 SCAS server
- Testing SL5 GLexec WN
Matt
- Plan ATLAS/R89 co-hosting of Grid Services
- Configure FTS for T2K and request a dedicated diskpool
- Test upgrade path from FTS2.1 to FTS2.2 on orisa
Richard
- Nagios plugin for checking rtcpclientd server logs on CASTOR stagers [Done]
- Writing a roadmap for completing the quattorisation of Grid Services machines
- Setting up a quattor template for a top-level BDII that works around issues in the stock QWG templates
- CASTOR items:
  - Completed setting up disk servers for use with pre-prod CASTOR instance [Done]
  - Waiting for resolution of powering off / crashing problem on ccse02
Mayo
- Create system for exporting Metrics report to spreadsheet [Done]
- Adding bar chart to Metric system [Done]
- Admin interface for Metric System [Done]
- Update documentation for Metric System
- Configure assigned NRPE Nagios plugins
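The NRPE plugin assignments above are site-specific, but for illustration, entries in an `nrpe.cfg` take this form (the plugin paths and thresholds below are generic examples, not RAL's actual configuration):

```
# /etc/nagios/nrpe.cfg -- illustrative command definitions only
command[check_load]=/usr/lib/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
command[check_disk]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
```

Each `command[name]` line exposes one local check that the central Nagios server can invoke via `check_nrpe`.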
VO Reports
ALICE
- Plans to use the 2nd VOBOX at sites in production if and only if the primary one is not behaving well
ATLAS
CMS
- Proposal to no longer support PhEDEx on SLC4 after March 1 (date not definite yet, to be discussed by FacOps)
- RAL is now the only Tier 1 with CREAMCE job monitoring issues (before downtime, only jobs killed by batch system affected; now all jobs are affected)
- RAL has started receiving some Commissioning10 data from CERN (~60 MB/s, 2 days, ~5 TB)
- Current/upcoming activities at RAL: backfill (in progress)
LHCb
OnCall/AoD Cover
- Primary OnCall: Catalin (Mon-Sun)
- Grid OnCall:
- AoD: