RAL Tier1 weekly operations Grid 20100329

From GridPP Wiki

Operational Issues

Description | Start | End | Affected VO(s) | Severity | Status
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)

Blocking Issues

Description | Requested Date | Required By Date | Priority | Status
Hardware for Testbed | | | Medium | Required for change validation, load testing, etc. Also for phased rollout (which replaces PPS). Have initial hardware. [2010-02-22] More hardware expected by end of March.
HW for SL5 CMS Phedex Vobox | | | High | Required to replace the existing SL4 machine

Developments/Plans

Highlights for Tier-1 Ops Meeting

  • Support survey sent out to all VOs which use RAL Tier-1. Responses requested by 23rd April.
  • Talks from Batch system training
  • CMS will start using GGUS team tickets instead of Savannah tickets for Tier-1s

Highlights for Tier-1 VO Liaison Meeting

  • Support survey sent out to all VOs which use RAL Tier-1. Responses requested by 23rd April.
  • lcgce05 deployed for non-LHC VO access to SL5 WNs

Detailed Individual Reports

Alastair

  • Watch for ATLAS problems during LHC first collisions.
  • Add extra diskservers to ATLASGROUPDISK space token and set this up in TiersofATLAS.
  • Change FTS transfer settings for Tier 2 channels.

Andrew

  • Sent out support survey to VOs (responses requested by April 23rd) [Done]
  • Added per-VO job monitoring of lcgce01 [Done]
  • Sorting out gaps & problems in APEL publishing [Ongoing]
  • Installing & setting up FTS monitor, including DN restriction
  • I/O tests of official version of patches to go into CMSSW (skimming & reconstruction) [Done]
  • CMS data ops
    • Started running backfill at FNAL and CNAF [Ongoing]
    • Cleaned up some old ProdAgent instances, installed some new ones at 0_12_17_patch3
  • PPD staff meeting; batch-system training

Catalin

  • non-LHC LFC schema clean up (w/ Carmine) [done]
  • work on various Nagios checks on grid services hosts [ongoing]
  • work on Dataguard replication (w/ Carmine) [ongoing]
  • quattorise additional LFC frontends (w/ Ian) [ongoing]
  • various grid services yum updates [ongoing]
  • install squid on LHCb VOBOX

Derek

  • Enabling new VO on ce.ngs host
  • Publishing lcgce05
  • Batch system training [Done]
  • Writing Open day talk

Matt

  • Write up production plans.

Richard

  • Re-working the Grid Services Quattorisation Roadmap as a WIKI page [done]
  • Working on a proposal on intra-/inter-team communication to meet an action from the team away day
  • Reviewing G/S process documentation
  • Further Nagios items from the to-do list (https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/NagiosTasksToDo)
  • CASTOR items:
    • Deployed one set of diskservers into the lhcbNonProd s/class and the other into cmsNonProd [done]
    • Updating benchmarking tool to meet requirements of pre-prod stress testing

Mayo

  • TSBN spreadsheet backend script to copy data from castoradm1 to TSBN spreadsheet [Done]
  • Create batch job to run TSBN backend script and update web interface automatically [Done]
  • implement feedback into TSBN web interface
  • Set up scripts that update TSBN interface to run as scheduled jobs on a Windows machine
  • write user experience report on NGS certificate wizard project [Done]
  • writing and configuring Nagios NRPE plugins

VO Reports

ALICE

ATLAS

CMS

  • CMS will start using GGUS team tickets instead of Savannah tickets for Tier-1s

LHCb

OnCall/AoD Cover

  • Primary OnCall:
  • Grid OnCall: Catalin (Mon-Wed), Derek (Thu-Sun)
  • AoD: