RAL Tier1 weekly operations Grid 20100322

Operational Issues

Description | Start | End | Affected VO(s) | Severity | Status
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)
non-LHC LFC schema clean-up | lfc.gridpp.rl.ac.uk, Somnus RAC | SD | Thu 25 Mar 09:00 | Thu 25 Mar 14:00 | non-LHC

Blocking Issues

Description | Requested Date | Required By Date | Priority | Status
Hardware for Testbed | | | Medium | Required for change validation, load testing, etc. Also for phased rollout (which replaces PPS). Have initial hardware. [2010-02-22] More hardware expected by end of March.

Developments/Plans

Highlights for Tier-1 Ops Meeting

  • Added per-VO CE job monitoring to Ganglia
  • non-LHC LFC schema clean-up (w/ Carmine)
  • Deploying SCAS servers and glexec
  • LHC status: preparations for 7 TeV collisions continuing; 900 GeV collisions are possibly planned but not scheduled yet

Highlights for Tier-1 VO Liaison Meeting

Detailed Individual Reports

Alastair

  • Investigate ways of installing ATLAS software in a new AFS test area.
  • Monitor ATLAS MC production and re-processing currently going on at RAL. [Ongoing]
  • Continue ATLAS disk deployment.

Andrew

  • Wrote script and set up cron job to check APEL-PBS consistency daily [Done]
  • Removed a host from renewers & retrievers host list [Done]
  • Modified PhEDEx debug/dev instances then prod instance to use CERN FTS 2.2 server [Done]
  • Disk server deployment: gdss114-117 to cmsNonProd then cmsWanIn [Done]
  • Added per-VO CE job monitoring to Ganglia [Done, except for lcgce01 which is ongoing]
  • Working out how to install FTS monitor on lcgwww [Ongoing]
  • CMS data ops
    • Running backfill at RAL and IN2P3 [Ongoing]
    • Ran more MC production at RAL (using cmsHotDisk) [Done]

Catalin

  • non-LHC LFC schema clean-up (w/ Carmine)
  • work on various Nagios checks on grid services hosts
  • work on Dataguard replication (w/ Carmine) [ongoing]
  • quattorise additional LFC frontends (w/ Ian) [ongoing]
  • various grid services yum updates [ongoing]

Derek

  • Deploying SCAS servers and glexec
  • Change Control for lcgce05 [Done]
  • Deploying infrastructure hosts for testbed
  • Writing talks for batch system training
  • Enabling new VO on ce.ngs host

Matt

  • Revise Tier-1 talk.
  • Write up production plans.
  • Write batch system training material.
  • Upgrade FTS to 2.2.3. [Done]
  • Update resource profiles for Q2/10. [Done]

Richard

  • Using stress-testing script developed for CASTOR to test behaviour of new BDII server
  • Re-working the Grid Services Quattorisation Roadmap as a wiki page
  • Working on a proposal on intra-/inter-team communication to meet an action from the team awayday
  • Reviewing G/S process documentation
  • Further Nagios items from the to-do list (https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/NagiosTasksToDo)
  • CASTOR items:
    • Working on benchmarking plan to establish baseline performance before upgrading to new CASTOR release(s)

Mayo

  • TSBN spreadsheet backend script to copy data from castoradm1 to TSBN spreadsheet [Done]
  • Create batch job to run TSBN backend script and update web interface automatically [Done]
  • Implement feedback into TSBN web interface
  • Set up scripts that update TSBN interface to run as scheduled jobs on a Windows machine
  • Write user experience report on NGS certificate wizard project [Done]
  • Writing and configuring Nagios NRPE plugins

VO Reports

ALICE

ATLAS

CMS

  • The TTreeCache patches will be put into a patched version of CMSSW soon. This should mean that there will be no problems when lazy-download is not used. RAL has been chosen as the CASTOR site to be tested. It is not yet known when this will take place.

LHCb

OnCall/AoD Cover

  • Primary OnCall:
  • Grid OnCall: Derek (Mon - Sun excl Wed), Catalin (Wed)
  • AoD: Catalin (Wed)