RAL Tier1 weekly operations Grid 20100125


Summary of Previous Week

Developments

  • Alastair
    • Ran tests on the Frontier server to confirm it is working well. A small number (< 10) of errors are not yet understood.
    • Completed version 1 of the Tier 1 VO requirements document, incorporating information provided by Raja.
    • Helped deploy disk servers to the ATLAS scratch disk when it became full. Wrote a script to clean the scratch disk should it rapidly fill again (see the sketch after this list).
  • Andrew
    • CMS data ops training at CERN (ProdAgent; MC production; backfill; creating workflows)
  • Catalin
    • Handed over the SL5 LHCb VOBOX
    • Updated ATLAS Frontier to 3.22
    • Tested the ALICE xrootd (manager + peer) re-installation
    • Followed up some post-reboot WMS issues with CERN
  • Derek
    • Added BLParser to lcgbatch01
    • Wrote Quattor template for SL5 gLExec
    • Added extra settings to the YAIM configuration so the site is categorised correctly by GSTAT
    • Deployed and tested Staged Rollout SL5 64 bit top BDII
    • Completed Multi User pilot job questionnaire
  • Matt
    • Migrated FTS agents to warm standby host
    • Documented R-GMA recovery procedure
    • Finished draft of Grid Services Disaster Recovery document
    • Provided test site BDII for CIP upgrade testing, and tested CIP output
    • Provided input for Grid Team for GridPP4
  • Richard
    • 2 days A/L
    • Catch-up
    • Fire Safety Training
    • Attended "share out" meeting for Nagios/NRPE plugins
    • Re-built the machine sv-08-02 as a test BDII server for Jens' CIP activity
    • CASTOR items:
      • Re-installed castor303.ads as a CASTOR disk server
      • castor301 needs the same treatment but has memory problems
  • Mayo
    • Encrypted passwords within the Metric system
    • Added a change-password feature to the Metric system
    • Fixed a bug within the Metric system
    • Worked on the tape statistics spreadsheet project: converting Excel charts to HTML
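
For reference, a minimal sketch of the scratch-disk cleaner mentioned in Alastair's item above, assuming a simple watermark policy; the mount point, thresholds and age cut-off are illustrative assumptions, not the production values:

  #!/usr/bin/env python
  # Sketch of a scratch-disk cleaner: when the filesystem is nearly
  # full, delete the oldest files until usage drops below a target.
  # SCRATCH_PATH, the watermarks and the age cut are illustrative.
  import os
  import time

  SCRATCH_PATH = "/atlas/scratchdisk"   # hypothetical mount point
  HIGH_WATERMARK = 0.95                 # start cleaning above 95% used
  LOW_WATERMARK = 0.85                  # stop cleaning below 85% used
  MIN_AGE_SECONDS = 7 * 24 * 3600       # never touch files under a week old

  def usage_fraction(path):
      st = os.statvfs(path)
      return 1.0 - float(st.f_bavail) / st.f_blocks

  def candidate_files(path):
      """All files under path, oldest first, skipping recent ones."""
      now = time.time()
      files = []
      for dirpath, dirnames, filenames in os.walk(path):
          for name in filenames:
              full = os.path.join(dirpath, name)
              try:
                  mtime = os.stat(full).st_mtime
              except OSError:
                  continue  # file vanished while walking
              if now - mtime > MIN_AGE_SECONDS:
                  files.append((mtime, full))
      files.sort()
      return [f for (mtime, f) in files]

  def clean(path):
      if usage_fraction(path) < HIGH_WATERMARK:
          return
      for victim in candidate_files(path):
          os.remove(victim)
          if usage_fraction(path) < LOW_WATERMARK:
              break

  if __name__ == "__main__":
      clean(SCRATCH_PATH)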

Operational Issues and Incidents

Description | Start | End | Affected VO(s) | Severity | Status
Disk errors on FTS Agents host | 20100119 14:00 | 20100121 09:00 | LHC | Low | Scheduled migration of agents to standby host

Plans for Week(s) Ahead

Plans

  • Alastair
    • Understand the remaining errors from the HC tests.
    • Continue updating RAL PP twiki.
    • Prepare slides for presentation on computing requirements.
    • Write a Nagios script to warn when space tokens are near full (see the sketch after this list).
  • Andrew
    • I/O testing: will try the new CMSSW TTreeCache patch & compare to lazy-download
    • Investigate CMS job status reporting problems for my backfill jobs
    • Investigate my backfill jobs killed by batch system
    • Investigate my MC production jobs aborting at UK T2s/T3s
    • Write document about automatic job killing
  • Catalin
    • WMS01 and WMS02 upgrades
    • Kernel updates
    • Chase CERN about tidying up the LFC schemas
    • Test the ALICE xrootd (manager + peer) re-installation (with ChrisK)
  • Derek
    • Reinstalling lcgce08 with host swap config
    • Reconfiguring lcgce01
    • Continuing work on gLExec and SCAS
  • Matt
    • FTS drain and migration of front-ends back to somnus
    • Test upgrade path from FTS2.1 to FTS2.2 on orisa
    • Planning ATLAS/R89 co-hosting of Grid Services
    • Plan T2K configuration of FTS, and request dedicated diskpool
  • Richard
    • Manual Handling Training
    • Feed back results from Jens' CIP testing into Quattor profile for BDII server
    • CASTOR items:
      • Finish installing the suite of RPMs needed on castor303 (new disk server)
      • Re-install castor301 when memory has been fixed.
  • Mayo
    • Automating the Metric report system
    • Adding charts to the Metric system
    • Web interface and script to fetch data for the tape robot statistics spreadsheet project
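
As a companion to the Nagios item in Alastair's plans, a minimal sketch of a space-token check plugin. Only the exit-code convention shown (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN) is required by Nagios; query_space_token is a hypothetical stand-in for however the used/total bytes of a token are actually obtained (e.g. an SRM query):

  #!/usr/bin/env python
  # Sketch of a Nagios check for space-token fullness.
  # Nagios plugin convention: exit 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
  import sys

  WARN_FRACTION = 0.80   # warn at 80% full (illustrative threshold)
  CRIT_FRACTION = 0.90   # critical at 90% full (illustrative threshold)

  def query_space_token(token):
      """Return (used_bytes, total_bytes) for a space token.
      Hypothetical placeholder: a real plugin would query SRM here."""
      raise NotImplementedError("plug in a real SRM query")

  def main(token):
      try:
          used, total = query_space_token(token)
      except Exception as exc:
          print("UNKNOWN: could not query %s: %s" % (token, exc))
          return 3
      fraction = float(used) / total
      message = "%s is %.0f%% full (%d of %d bytes)" % (
          token, 100 * fraction, used, total)
      if fraction >= CRIT_FRACTION:
          print("CRITICAL: " + message)
          return 2
      if fraction >= WARN_FRACTION:
          print("WARNING: " + message)
          return 1
      print("OK: " + message)
      return 0

  if __name__ == "__main__":
      sys.exit(main(sys.argv[1]))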

VO Reports

ALICE

  • Status Report (21 Jan) - very stable behaviour, CREAM stability exceptional, "a bit of free resources" (re: RAL farm between week 53/2009 and week 02/2010)

ATLAS

  • ATLAS confirmed that the RAL WMS is not critical for UK operations.

CMS

  • RAL ranked 2nd in the T1 Site Readiness Ranking on 2010-01-25 (covering the last 2 weeks)
  • Only JobRobot and Backfill (re-reco) jobs running recently
  • High CMS network usage on Friday was due to lazy-download not being specified in the reco config file
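
For context, lazy-download is enabled in a CMSSW configuration through the AdaptorConfig service; a minimal sketch of the stanza that was missing from the reco config might look like the following (parameter names and values as commonly documented, not copied from the actual production file):

  import FWCore.ParameterSet.Config as cms

  process = cms.Process("RERECO")
  # Ask the storage layer to fetch file blocks on demand over the WAN
  # instead of streaming whole files; this is the kind of stanza that
  # was missing from the reco config. (Sketch only, not the actual file.)
  process.add_(cms.Service("AdaptorConfig",
      cacheHint = cms.untracked.string("lazy-download"),
      readHint = cms.untracked.string("auto-detect")))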

LHCb

  • Other Tier-1s (IN2P3, PIC) are also reporting low job efficiencies. LHCb user or application problems are suspected.
  • LHCb confirmed that the RAL WMS is not critical for UK operations.
  • A new release of DIRAC is out, so testing on the SL5 VOBOX should be able to proceed shortly.

Resource Requests

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)
Disk problems on FTS agents host | lcgfts01 | Scheduled | 20100121 07:00 | 20100121 09:00 | LHC

Requirements and Blocking Issues

Description | Required By | Priority | Status
Hardware for testing LFC/FTS resilience | - | High | DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; request for HW made through the RT Fabric queue
Hardware for Testbed | - | High | Required for change validation, load testing, etc. Also for phased rollout (which replaces PPS).
Hardware for SCAS servers | Feb 1 2010 | High | Hardware required for production SCAS servers; required to be in place by end of Feb
Hardware for SL5 CREAM CE for non-LHC SL5 batch access | - | Medium | Hardware required for a CREAM CE for non-LHC VOs
Pool accounts for Super B VO | - | Medium | Required to enable the Super B VO on the batch farm

OnCall/AoD Cover

  • Primary OnCall:
  • Grid OnCall: Derek (Mon-Sun)
  • AoD: