RAL Tier1 weekly operations Grid 20100308

From GridPP Wiki
Revision as of 13:20, 10 March 2010 by Matt hodges


Operational Issues

Description | Start | End | Affected VO(s) | Severity | Status
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon

Downtimes

Description Hosts Type Start End Affected VO(s)

Blocking Issues

Description | Requested Date | Required By Date | Priority | Status
Hardware for testing LFC/FTS resilience | | | High | DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; request for HW made through RT Fabric queue. Production hardware will be available soon. [2010-02-22] Test hardware available; some config tweaks needed.
Hardware for Testbed | | | Medium | Required for change validation, load testing, etc. Also for phased rollout (which replaces PPS). Have initial hardware. [2010-02-22] More hardware expected by end of March.

Developments/Plans

Highlights for Tier-1 Ops Meeting

  • ATLAS test of CERN SRMs tomorrow (08:00 to 15:00).
  • Submit FTS2.2 change control.

Highlights for Tier-1 VO Liaison Meeting

  • FTS2.2 upgrade scheduling.
  • Testing CREAMCE for non-LHC VOs.

Detailed Individual Reports

Alastair

  • Investigate ways of installing ATLAS software in a new AFS test area.
  • Monitor ATLAS MC production and re-processing currently going on at RAL. [Ongoing]
  • Continue ATLAS disk deployment.

Andrew

  • Installed GSL rpms (incl devel) onto lcgui01 (LHCb request) [Done]
  • Disk server deployment: 10 from lhcbNonProd to lhcbDst; 1 from lhcbNonProd to lhcbRawRdst [Done]
  • Changed PhEDEx debug instance to use the test FTS 2.2 endpoint; FTS 2.2 adjustments for Chris Brew [Done]
  • CMS data ops
    • Ran reprocessing of some Commissioning10 cosmics data (prompt reco failed at Tier-0, reprocessing required at RAL) [Done]
    • Ran two MC reprocessing workflows [Ongoing]
    • Continued backfill at RAL; completed backfill at ASGC [Ongoing]
  • Next week
    • Do the February 2010 UB schedule
    • Prepare draft questions for RAL Tier-1 VO survey
    • Add per-VO ganglia monitoring for CEs
    • Add a page showing additional channel information for FTS 2.2

Catalin

  • enabled ngs.ac.uk on LFC catalog [done]
  • work on LFC schema tidying up (w/ Carmine) [ongoing]
  • work on Dataguard replication (w/ Carmine) [ongoing]
  • quattorise additional LFC frontends (w/ Ian) [ongoing]
  • tidying up Nagios configurations (ALICE VOBOX, CE)

Derek

  • Debugged problem with magic job submissions
  • Deploying SL5 CREAMCE for non-LHC VOs

Matt

  • Tier-1 talk.
  • FTS2.2:
    • Change Control.
    • t2k.org configuration problems.
    • Confirming upgrade procedure for FTS2.1 to FTS2.2. [Done]
  • SL5 CREAM CE installation.
  • Update resource profiles for Q2/10.

Richard

  • Checking behaviour of new/old BDII servers to ensure that important information is not being suppressed
  • Working on the Grid Services Quattorisation Roadmap
  • Working on a proposal on intra-/inter-team communication to meet an action from the team awayday
  • Reviewing G/S process documentation
  • Further Nagios items from the to-do list (https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/NagiosTasksToDo)
  • CASTOR items:
    • Working on benchmarking plan to establish baseline performance before upgrading to new CASTOR release(s)
    • Adding the ability to export results of the new benchmarking tool to a spreadsheet

Mayo

  • TSBN spreadsheet backend script to copy data from castoradm1 to TSBN spreadsheet [Done]
  • Create Batch job to run TSBN backend script and update web interface automatically [Done]
  • Implement feedback into TSBN web interface
  • Set up scripts that update TSBN interface to run as scheduled jobs on a Windows machine
  • Begin collaboration with SCT on NGS certificate wizard project
  • Writing and configuring Nagios NRPE plugins

VO Reports

ALICE

ATLAS

CMS

  • Continuous data taking continued
  • Reprocessing of some Commissioning10 cosmics data completed
  • CASTOR problems with MC reprocessing due to "hot" files
    • Resolved by setting up the cmsHotDisk service class
  • Backfill, rereco & MC reprocessing jobs over the past week: 20575 jobs; 5616 KSI2K days CPU time; CPU efficiency 80% (low due to problem with "hot" files)
  • Transfers to/from RAL over the past week:
    • from CERN: 4.6 TB
    • from T2s: 7.0 TB
    • from T1s: 1.4 TB
    • to T1s: 5.0 TB
    • to T2s: 3.9 TB
    • migrated to tape: 15.6 TB
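
The CMS figures above can be tallied with a short script. This is only a sketch: the numbers are copied from the list, the variable and function names are made up for illustration, and the wall-clock figure assumes that CPU efficiency here means CPU time divided by wall-clock time, which the report does not state.

```python
# Weekly CMS transfer volumes at RAL in TB, copied from the list above.
transfers_tb = {
    "from CERN": 4.6,
    "from T2s": 7.0,
    "from T1s": 1.4,
    "to T1s": 5.0,
    "to T2s": 3.9,
}


def total_tb(prefix):
    """Sum the volumes whose label starts with the given direction prefix."""
    total = sum(v for k, v in transfers_tb.items() if k.startswith(prefix))
    # Round to one decimal place, matching the precision of the report.
    return round(total, 1)


inbound_tb = total_tb("from")
outbound_tb = total_tb("to")

# Assumption (not stated in the report): efficiency = CPU time / wall-clock time.
cpu_ksi2k_days = 5616
cpu_efficiency = 0.80
implied_wall_ksi2k_days = round(cpu_ksi2k_days / cpu_efficiency, 1)

print(f"inbound: {inbound_tb} TB, outbound: {outbound_tb} TB")
print(f"implied wall-clock: {implied_wall_ksi2k_days} KSI2K days")
```

Under these assumptions the week comes to 13.0 TB inbound and 8.9 TB outbound, with roughly 7020 KSI2K days of wall-clock time behind the 5616 KSI2K days of CPU time.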

LHCb

OnCall/AoD Cover

  • Primary OnCall:
  • Grid OnCall: Derek
  • AoD: