RAL Tier1 weekly operations Grid 20100315

From GridPP Wiki

Operational Issues

Description | Start | End | Affected VO(s) | Severity | Status
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon

Downtimes

Description Hosts Type Start End Affected VO(s)

Blocking Issues

Description | Requested Date | Required By Date | Priority | Status
Hardware for Testbed | | | Medium | Required for change validation, load testing, etc. Also for phased rollout (which replaces PPS). Have initial hardware. [2010-02-22] More hardware expected by end of March.

Developments/Plans

Highlights for Tier-1 Ops Meeting

  • CMS: RAL will be getting the data (custodial) for the possible 900 GeV collisions
  • FTS2.2 upgrade on Wednesday
  • Disk Deployment meeting Tuesday at 10:00; small number of ongoing issues; moved to F51.

Highlights for Tier-1 VO Liaison Meeting

  • FTS2.2 upgrade done.
  • Announced plans to non-LHC VOs to support them on SL5 batch.
  • Deploying glexec on WN.

Detailed Individual Reports

Alastair

  • Investigate ways of installing ATLAS software in a new AFS test area.
  • Monitor ATLAS MC production and re-processing currently going on at RAL. [Ongoing]
  • Continue ATLAS disk deployment.

Andrew

  • February accounting [Done]
  • Renewed certificates for lcgvo0598 & lcgvo0599 [Done]
  • Sent draft Tier-1 VO survey questions to Glenn for comment [Done]
  • CMS data ops
    • Ran 2 MC reprocessing workflows at RAL [Done]
    • Ran 1 rereco MC preproduction workflow at IN2P3 [Done]
    • Installed and set up ProdAgent on new SLC5 CMS VOBOX (at CERN) [Done]
    • Re-started backfill at RAL and IN2P3 [Ongoing]

Catalin

  • tidying up Nagios configurations (ALICE VOBOX, CE, SCAS) [done]
  • LHCb LFC re-configuration [done]
  • work on LFC schema tidying up (w/ Carmine) [ongoing]
  • work on Dataguard replication (w/ Carmine) [ongoing]
  • quattorise additional LFC frontends (w/ Ian) [ongoing]
  • various grid services updates (following TOAST)

Derek

  • Change Control and Deploying SCAS servers and glexec
  • Deploying SL5 CREAMCE for non-LHC VOs
  • Deploying infrastructure host for testbed
  • Writing talks for batch system training

Matt

  • Tier-1 talk.
  • FTS2.2:
    • Submit Change Control request. [Done]
    • Fix t2k/t2k.org configuration problems. [Done]
    • Upgrade confirmed for Wednesday.
  • Test SL5 CREAM CE installation. [Done]
  • Disk deployment meeting.
  • Update resource profiles for Q2/10.
  • Organise testbed strategy strand meetings. [Done]

Richard

  • Using stress-testing script developed for CASTOR to test behaviour of new BDII server
  • Re-working the Grid Services Quattorisation Roadmap as a WIKI page
  • Working on a proposal on intra-/inter-team communication to meet an action from the team awayday
  • Reviewing G/S process documentation
  • Further Nagios items from the to-do list (https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/NagiosTasksToDo)
  • CASTOR items:
    • Working on benchmarking plan to establish baseline performance before upgrading to new CASTOR release(s)

Mayo

  • Uploaded new Metric system Documentation to the Tier1 wiki [Done]
  • Fixed Bug in Metric system pie charts [Done]
  • TSBN spreadsheet backend script to copy data from castoradm1 to TSBN spreadsheet [Done]
  • Create Batch job to run TSBN backend script and update web interface automatically [Done]
  • Implement feedback into TSBN web interface
  • Set up scripts that update TSBN interface to run as scheduled jobs on a Windows machine
  • Begin collaboration with SCT on NGS certificate wizard project
  • writing and configuring Nagios nrpe plugins

VO Reports

ALICE

ATLAS

CMS

  • RAL will be getting the data (custodial) for the possible 900 GeV collisions
  • Unforeseen collisions at 2.36 TeV for 40 minutes on 14th March

LHCb

OnCall/AoD Cover

  • Primary OnCall: Catalin
  • Grid OnCall:
  • AoD: