RAL Tier1 weekly operations Grid 20100208

Operational Issues

  • Job status monitoring from CREAM CE (start: 2-Feb-2010; affected VO(s): CMS; severity: medium)

Blocking Issues

  • Hardware for testing LFC/FTS resilience (priority: High). Data Services want to deploy a DataGuard configuration to test LFC/FTS resilience; request for hardware made through the RT Fabric queue.
  • Hardware for SCAS servers (2010-02-01; priority: High). Hardware required for production SCAS servers; required to be in place by end of February. [Done]
  • Hardware for Testbed (priority: Medium). Required for change validation, load testing, etc. Also for phased rollout (which replaces PPS). Have initial hardware.
  • Hardware for SL5 CREAM CE for non-LHC SL5 batch access (priority: Medium). Hardware required for a CREAM CE for non-LHC VOs. [Done]

Developments/Plans

Highlights for Tier-1 Ops Meeting

  • LHC schedule 2010/2011 (Alastair)
  • Grid Services Team: Out of office all day Tuesday (CR03)
  • CMS: RAL received some Commissioning10 cosmics data from CERN (~60 MB/s, 2 days, ~5 TB); 24/7 operation has started today

Highlights for Tier-1 VO Liaison Meeting

  • SCAS/glexec deployment

Detailed Individual Reports

Alastair

  • Continue work on computing requirements / Capacity Planning. [Ongoing]
  • Write Nagios script to warn when space tokens are near full. [Done] (See the sketch after this list.)
  • Work with Brian and Chris on re-deploying/draining disk servers to ATLAS space tokens. [Ongoing]
  • Look into ATLAS jobs hitting the 3 GB memory limit. [Ongoing]
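
A minimal sketch of the kind of space-token check mentioned above, using the Nagios exit-code convention (0 = OK, 1 = WARNING, 2 = CRITICAL). The thresholds and the idea of taking the usage figures from the command line are illustrative assumptions; the real script queries the storage system for the token's used and total space.

  #!/usr/bin/env python
  # Minimal Nagios-style check: warn when a space token is nearly full.
  # The used/total figures would normally come from an SRM or BDII query;
  # here they are taken from the command line to keep the sketch self-contained.
  import sys

  WARN = 0.90   # warn at 90% used (assumed threshold)
  CRIT = 0.95   # critical at 95% used (assumed threshold)

  def check(token, used, total):
      frac = float(used) / float(total)
      msg = "%s %.1f%% full (%s of %s TB)" % (token, 100 * frac, used, total)
      if frac >= CRIT:
          print("CRITICAL: " + msg)
          return 2
      if frac >= WARN:
          print("WARNING: " + msg)
          return 1
      print("OK: " + msg)
      return 0

  if __name__ == "__main__":
      # usage: check_space_token.py TOKEN USED_TB TOTAL_TB
      sys.exit(check(sys.argv[1], sys.argv[2], sys.argv[3]))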

Andrew

  • Restarted backfill at RAL (re-reco on BeamCommissioning09 Cosmics)
  • Investigated new CREAMCE monitoring issues
  • Added PhEDEx-CASTOR consistency Ganglia monitoring [Done]
  • Test another new CMSSW I/O optimisation patch & report to developer [Done]
  • Added monitoring of PhEDEx agent restarts [Done]
  • Add warning for CMS files that get stuck in the migration queue for weeks (see the sketch after this list)
  • Complete document about automatic job killing [Ongoing]
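
A sketch of the stuck-migration warning mentioned above, assuming the check is handed (filename, days queued) pairs by whatever lists the CASTOR migration queue; the paths, ages and the two-week threshold below are made-up examples.

  #!/usr/bin/env python
  # Sketch: flag CMS files sitting in the tape migration queue for too long.
  # How the queue is obtained is left open; the check just takes
  # (filename, days_queued) pairs, and the example data below is made up.
  MAX_DAYS = 14   # assumed threshold: warn after two weeks in the queue

  def stuck_files(queue):
      """Return the (filename, days_queued) entries older than MAX_DAYS."""
      return [(name, days) for (name, days) in queue if days > MAX_DAYS]

  if __name__ == "__main__":
      example_queue = [                       # illustrative values only
          ("/castor/cms/file_a.root", 3),
          ("/castor/cms/file_b.root", 21),
      ]
      for name, days in stuck_files(example_queue):
          print("WARNING: %s queued for %d days" % (name, days))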

Catalin

  • Tested ALICE xrootd (manager + peer) re-installation (with Chris) [Done]
  • Improved Nagios configuration knowledge [Ongoing]
  • Frontier Nagios checks [Ongoing]
  • Work on LFC schema tidying up (with Carmine) [Ongoing]
  • Quattorise additional LFC frontends (with Ian)
  • Install APEL patches on MONbox (for a correctly published installed capacity)

Derek

  • Installing SL5 SCAS server
  • Testing SL5 GLexec WN

Matt

  • Plan ATLAS/R89 co-hosting of Grid Services
  • Configure FTS for T2K, and request a dedicated disk pool
  • Test upgrade path from FTS2.1 to FTS2.2 on orisa

Richard

  • Nagios plugin for checking rtcpclientd server logs on CASTOR stagers [Done] (see the sketch after this list)
  • Writing a roadmap for completing the quattorisation of Grid Services machines
  • Setting up a quattor template for a top-level BDII that works around issues in the stock QWG templates
  • CASTOR items:
    • Completed setting up disk servers for use with the pre-prod CASTOR instance [Done]
    • Waiting for resolution of:
      • Powering off / crashing problem on ccse02
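
A minimal sketch of a log-scanning plugin of the kind described in the first item of this list; the log path, error pattern and thresholds are assumptions for illustration, not the details of the deployed plugin.

  #!/usr/bin/env python
  # Minimal Nagios-style sketch: count error lines in an rtcpclientd log.
  # The log path, error pattern and thresholds below are assumptions.
  import re
  import sys

  LOGFILE = "/var/log/castor/rtcpclientd.log"   # assumed location
  PATTERN = re.compile(r"\b(ERROR|FATAL)\b")    # assumed error marker
  WARN, CRIT = 1, 10                            # assumed thresholds

  def main():
      try:
          lines = open(LOGFILE).readlines()
      except IOError:
          print("UNKNOWN: cannot read " + LOGFILE)
          return 3
      hits = [l for l in lines if PATTERN.search(l)]
      msg = "%d error lines in %s" % (len(hits), LOGFILE)
      if len(hits) >= CRIT:
          print("CRITICAL: " + msg)
          return 2
      if len(hits) >= WARN:
          print("WARNING: " + msg)
          return 1
      print("OK: " + msg)
      return 0

  if __name__ == "__main__":
      sys.exit(main())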

Mayo

  • Create system for exporting Metrics report to spreadsheet [Done] (see the sketch after this list)
  • Add bar chart to Metric System [Done]
  • Admin interface for Metric System [Done]
  • Update documentation for Metric System
  • Configure assigned NRPE Nagios plugins
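
A minimal sketch of a spreadsheet export of the kind described in the first item of this list, writing CSV (which spreadsheet applications open directly); the metric names, values and output filename are illustrative assumptions, since the real Metric System takes its figures from its own database.

  #!/usr/bin/env python
  # Sketch: dump a metrics report to a CSV file that a spreadsheet can open.
  # The metric names, values and filename below are made up for illustration.
  import csv

  def export_metrics(rows, path):
      """Write (metric, value) rows to a CSV file with a header line."""
      out = open(path, "w")
      writer = csv.writer(out)
      writer.writerow(["Metric", "Value"])
      for name, value in rows:
          writer.writerow([name, value])
      out.close()

  if __name__ == "__main__":
      export_metrics([("jobs_completed", 12345), ("availability", 0.98)],
                     "metrics_report.csv")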

VO Reports

ALICE

  • Plan to use the second VOBOX at sites in production only if the primary one is misbehaving

ATLAS

CMS

  • Proposal to no longer support PhEDEx on SLC4 after March 1 (date not definite yet, to be discussed by FacOps)
  • RAL is now the only Tier 1 with CREAM CE job-monitoring issues (before the downtime only jobs killed by the batch system were affected; now all jobs are affected)
  • RAL has started receiving some Commissioning10 data from CERN (~60 MB/s for 2 days, ~5 TB)
  • Current/upcoming activities at RAL: backfill (in progress)

LHCb

OnCall/AoD Cover

  • Primary OnCall: Catalin (Mon-Sun)
  • Grid OnCall:
  • AoD: