RAL Tier1 weekly operations Grid 20100301

From GridPP Wiki

Operational Issues

Description | Start | End | Affected VO(s) | Severity | Status
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; new CREAMCE version available soon
ATLAS s/w server overloaded | Sun 21 Feb 2010 | Thu 26 Feb 2010 | ATLAS | medium |

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)

Blocking Issues

Hardware for testing LFC/FTS resilience (Priority: High)
  • DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; the request for HW was made through the RT Fabric queue.
  • Production hardware will be available soon.
  • [2010-02-22] Test hardware available; some config tweaks needed.

Hardware for Testbed (Priority: Medium)
  • Required for change validation, load testing, etc. Also for phased rollout (which replaces PPS).
  • Have initial hardware.
  • [2010-02-22] More hardware expected by end of March.

Hardware for additional SL4 LFC frontends (Priority: Medium)
  • Required to improve the resilience of the existing LFC services.

Developments/Plans

Highlights for Tier-1 Ops Meeting

  • Confirming upgrade procedure for FTS2.1 to FTS2.2.
  • CMS: Cosmics data taking continued, then splash events over the weekend. 27 splashes from beam 1, 30 splashes from beam 2.

Highlights for Tier-1 VO Liaison Meeting

  • Disk deployments for ATLAS and LHCb. No overall change for ATLAS.
  • FTS2.2 upgrade path tested, and endpoint available for further testing.

Detailed Individual Reports

Alastair

  • Work with Brian on deploying disk servers for the new ATLAS space token requests.
  • Monitor the first ATLAS power users that have started to use the Tier 1
  • Monitor/investigate the ATLAS MC production and re-processing currently going on at RAL. [Ongoing]

Andrew

  • CMS backfill
    • Continued running at RAL; also started at ASGC
    • Will be responsible for running at IN2P3 from 1st March (as well as RAL)
  • Investigating why the skimming jobs currently running on Commissioning10 data have very low CPU efficiency
  • Disk server deployment (gdss119 to lhcbNonProd, gdss393,414,415 to atlasNonProd then atlasSimStrip) [Done]
  • Added a DN (for CMS) to the renewer/retriever host list on the RAL MyProxy [Done]
  • Deleted CMS data from /store/unmerged & /store/testfile-put-*.txt files [Done]
  • Made a number of adjustments to maui.cfg due to the ATLAS software disk problems [Done]
  • LHCb disk server deployment [To do]
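
The maui.cfg adjustments mentioned above are not detailed in this report. Purely as an illustration of the kind of change involved, throttling ATLAS batch load in Maui while the overloaded software server recovers might look like the following (group name and limits are hypothetical, not the values actually applied):

```
# Hypothetical Maui throttling example -- the actual adjustments made are
# not recorded in this report. Caps running jobs and processors for the
# atlas group to reduce load on the ATLAS software server:
GROUPCFG[atlas]   MAXJOB=400 MAXPROC=400
```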

Catalin

  • lcgce07 downtime - disk replacement, memory swap [done]
  • install APEL patches on CEs [ongoing]
  • work on LFC schema tidying up (w/ Carmine) [ongoing]
  • quattorise additional LFC frontends (w/ Ian; pending HW provisioning)
  • enable ngs.ac.uk on LFC catalog

Derek

  • A/L

Matt

  • Tier-1 talk
  • FTS2.2
    • Confirming upgrade procedure for FTS2.1 to FTS2.2.
    • Initial test of upgrade path from FTS2.1 to FTS2.2 on orisa [Done]
  • CA updates (again) on service nodes (including CEs in Derek's absence)
  • Test APEL publication with latest patches [Done]
  • Request dedicated diskpool for T2K (depends on allocation)

Richard

  • Checking behaviour of new/old BDII servers to ensure that important information is not being suppressed
  • Working on the Grid Services Quattorisation Roadmap
  • Working on a proposal on intra/inter-team communication to meet an action from the team awayday
  • Reviewing G/S process documentation
  • Further Nagios items from the to-do list (https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/NagiosTasksToDo)
  • CASTOR items:
    • Working on benchmarking plan to establish baseline performance before upgrading to new CASTOR release(s)
    • Adding support for lcg-cp command to stress testing suite

Mayo

  • TSBN spreadsheet web interface (first version) [Done]
  • TSBN spreadsheet backend script to copy data from castoradm1 to the TSBN spreadsheet
  • Create batch job to run the TSBN backend script and update the web interface automatically
  • Writing and configuring Nagios NRPE plugins
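
The NRPE plugins being written are not described here; as a generic sketch only, a Nagios plugin reports its result through the standard exit-code convention (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN) plus a one-line status message. The example below checks the 1-minute load average against hypothetical thresholds:

```python
#!/usr/bin/env python
# Hypothetical NRPE plugin sketch (illustration only; not one of the
# plugins referred to above). Nagios plugins signal status via their
# exit code: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
import os

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_load(load1, warn, crit):
    """Return (exit_code, status_line) for a 1-minute load average."""
    if load1 >= crit:
        return CRITICAL, "LOAD CRITICAL - load1=%.2f" % load1
    if load1 >= warn:
        return WARNING, "LOAD WARNING - load1=%.2f" % load1
    return OK, "LOAD OK - load1=%.2f" % load1

if __name__ == "__main__":
    # Demo run against this machine's 1-minute load average.
    code, message = check_load(os.getloadavg()[0], warn=4.0, crit=8.0)
    print(message)
    # A real plugin would finish with: sys.exit(code)
```

On the monitored host the script would then be registered in nrpe.cfg with a line of the form `command[check_load]=/path/to/check_load.py` (path hypothetical).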

VO Reports

ALICE

ATLAS

CMS

  • Problems with jobs failing (backfill and skimming) due to gdss364 RAID card issue
  • Cosmics data taking continued
  • Splash events over the weekend: 27 splashes from beam 1, 30 splashes from beam 2

LHCb

OnCall/AoD Cover

  • Primary OnCall:
  • Grid OnCall: Catalin (Mon, Wed-Sun)
  • AoD: