Difference between revisions of "RAL Tier1 weekly operations Grid 20100222"

Latest revision as of 13:25, 24 February 2010

Operational Issues

Description	Start	End	Affected VO(s)	Severity	Status
Job status monitoring from CREAMCE	2-Feb-2010		CMS	medium	[10-Feb-2010] WMS patch available soon; CREAMCE new version available soon
CRL issues for SL4 batch	Tue 16 Feb 2010	Wed 17 Feb 2010	non-LHC	medium	solved; CRLs updated on NFS server
ATLAS s/w server overloaded	Sun 21 Feb 2010	Ongoing	ATLAS	medium

Downtimes

Description	Hosts	Type	Start	End	Affected VO(s)
RAID and memory issues	lcgce07 and lcg0280	SD	Fri 19 Feb 2010 14:00	Tue 23 Feb 2010 16:00	CMS, Alice, LHCb

Blocking Issues

Description	Requested Date	Required By Date	Priority	Status
Hardware for testing LFC/FTS resilience			High	DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; request for HW made through RT Fabric queue. Production hardware will be available soon. [2010-02-22] Test hardware available; some config tweaks needed.
Hardware for Testbed			Medium	Required for change validation, load testing, etc. Also for phased rollout (which replaces PPS). Have initial hardware. [2010-02-22] More hardware expected by end of March.
Hardware for additional SL4 LFC frontends			Medium	Required to improve resilience of existing LFC services

Developments/Plans

Highlights for Tier-1 Ops Meeting

Ongoing load issues on ATLAS s/w server.
ATLAS 4GB jobs having minimal effect on blocked job starts (~1%).
FTS 2.2 released; starting to test upgrade path.
Disk deployment: 100TB requested for ATLAS to enable LHCb drain to commence; capacity needed in SimStrip, which was filled over the weekend.

Highlights for Tier-1 VO Liaison Meeting

Disk deployment: 100TB requested for ATLAS to enable LHCb drain to commence; capacity needed in SimStrip, which was filled over the weekend.
FTS2.2 testing ongoing; CNAF experiencing problems with upgrade.

Detailed Individual Reports

Alastair

Working with Brian and Chris on re-deploying disk servers to ATLAS space tokens. [Ongoing]
Write scripts to monitor effect of 4GB memory limit change on batch system. [Done]
Monitor/investigate ATLAS MC production and re-processing currently going on at RAL. [Ongoing]

Andrew

Running backfill at RAL (re-reco of BeamCommissioning09 Cosmics) [ongoing]
Ran a production workflow: reprocessing of a Summer09 MC sample (generated data is custodial) [Done]
Added ganglia monitoring of usage of CMS tape pools (per tape pool & combined stack plot) [Done]
Testing new disk server on CASTOR pre-prod instance with CMSSW (skimming & reconstruction) [ongoing]
Added .tr (for T2_TR_METU) to CLOUD-CMS-CERN FTS channel [Done]

Catalin

're-certified' ATLAS Frontier after 3D migration (with Alastair) [done]
install APEL patches on CEs [ongoing]
work on LFC schema tidying up (with Carmine) [ongoing]
quattorise additional LFC frontends (with Ian; pending HW provisioning)
lcgce07 downtime - disk replacement, memory swap

Derek

A/L

Matt

FTS2.2
- Look at GGUS bug regarding checksum scenarios [Done]
- Test upgrade path from FTS2.1 to FTS2.2 on orisa
Disk deployment: request 100TB for ATLAS to enable LHCb drain to commence [Done]
Tier-1 Open Day talk for Grid Services
Test FTS functionality for T2K [Done]
CA updates on service nodes (including CEs in Derek's absence) [Done]
Test APEL publication with latest patches
Request dedicated diskpool for T2K (depends on allocation)

Richard

Submitted change control request for rolling out quattorised BDII server [Done]
Now working with Ian C to "factorise" the template so that non-machine-specific items are distributed to the appropriate points in the hierarchy of templates
Working on the Grid Services Quattorisation Roadmap
Writing a proposal on intra-/inter-team communication to meet an action from the team awayday
Reviewing G/S process documentation
Further Nagios items from the to-do list (https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/NagiosTasksToDo)
CASTOR items:
- Working on benchmarking plan to establish baseline performance before upgrading to new CASTOR release(s)
- Set up a "Plan B" CASTOR LSF server in case the need arises [Done]

Mayo

Adding bar chart to Metric system [Done]
Admin interface for Metric System [Done]
TSBN spreadsheet web interface and backend automation script
writing and configuring Nagios nrpe plugins

VO Reports

ALICE

ATLAS

CMS

Problems over the past week: Oracle problems affecting transfers (x2); writes to cmsWanIn pending for too long causing transfers to fail (x4); tape migration; tape recall problems (one tape); gdss364 problems caused jobs to fail on 19th-20th Feb
Transfers to/from RAL over the past week:
- from CERN: 13.1 TB (Commissioning10 cosmics)
- from T2s: 2.4 TB
- to T1s: 19.5 TB
- to T2s: 20.4 TB
- migrated to tape: 25.5 TB
CPU usage over the past week:
- backfill (re-reco) & MC reprocessing: 7384 KSI2K days, CPU efficiency 92%
- skimming: 464 KSI2K days, CPU efficiency 51%
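
For a quick cross-check of the transfer figures above, the weekly volumes can be totalled per direction. This is only a minimal sketch: the grouping into "inbound" and "outbound" and the names used are assumptions made here for illustration, and the tape migration figure is left out because it is not a WAN transfer.

```python
# Weekly CMS transfer volumes at RAL in TB, copied from the list above.
# The inbound/outbound grouping is an assumption of this sketch.
inbound_tb = {"from CERN": 13.1, "from T2s": 2.4}
outbound_tb = {"to T1s": 19.5, "to T2s": 20.4}


def total_tb(volumes):
    """Sum a mapping of label -> TB, rounded to one decimal place."""
    return round(sum(volumes.values()), 1)


inbound_total = total_tb(inbound_tb)
outbound_total = total_tb(outbound_tb)
print(f"inbound: {inbound_total} TB, outbound: {outbound_total} TB")
```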

LHCb

OnCall/AoD Cover

Primary OnCall:
Grid OnCall: Matt (Mon-Sun)
AoD:
