RAL Tier1 weekly operations Grid 20100301

From GridPP Wiki

Operational Issues

Description | Start | End | Affected VO(s) | Severity | Status
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; new CREAMCE version available soon
ATLAS s/w server overloaded | Sun 21 Feb 2010 | Thu 26 Feb 2010 | ATLAS | medium |

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)

Blocking Issues

Hardware for testing LFC/FTS resilience (Priority: High)
  • DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; the request for HW was made through the RT Fabric queue.
  • Production hardware will be available soon.
  • [2010-02-22] Test hardware available; some config tweaks needed.

Hardware for Testbed (Priority: Medium)
  • Required for change validation, load testing, etc. Also for phased rollout (which replaces PPS).
  • Have initial hardware.
  • [2010-02-22] More hardware expected by end of March.

Hardware for additional SL4 LFC frontends (Priority: Medium)
  • Required to improve the resilience of the existing LFC services.

Developments/Plans

Highlights for Tier-1 Ops Meeting

  • Confirming upgrade procedure for FTS2.1 to FTS2.2.
  • CMS: Cosmics data taking continued, then splash events over the weekend. 27 splashes from beam 1, 30 splashes from beam 2.

Highlights for Tier-1 VO Liaison Meeting

  • Disk deployments for ATLAS and LHCb. No overall change for ATLAS.
  • FTS2.2 upgrade path tested, and endpoint available for further testing.

Detailed Individual Reports

Alastair

  • Work with Brian on deploying disk servers for the new ATLAS space token requests.
  • Monitor the first ATLAS power users that have started to use the Tier 1
  • Monitor/investigate the ATLAS MC production and re-processing currently going on at RAL. [Ongoing]

Andrew

  • CMS backfill
    • Continued running at RAL; also started at ASGC
    • Will be responsible for running at IN2P3 from 1st March (as well as RAL)
  • Investigating why the skimming jobs currently running on Commissioning10 data have very low CPU efficiency
  • Disk server deployment (gdss119 to lhcbNonProd, gdss393,414,415 to atlasNonProd then atlasSimStrip) [Done]
  • Added a DN (for CMS) to the renewer/retriever host list on the RAL MyProxy [Done]
  • Deleted CMS data from /store/unmerged & /store/testfile-put-*.txt files [Done]
  • Made a number of adjustments to maui.cfg due to the ATLAS software disk problems [Done]
  • LHCb disk server deployment [To do]
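
The maui.cfg adjustments mentioned above are not detailed in this report. Purely as an illustration of the kind of change involved, throttling ATLAS batch load in Maui while the overloaded software server recovers might look like the following (group name and limits are hypothetical, not the values actually applied):

```
# Hypothetical Maui throttling example -- the actual adjustments made are
# not recorded in this report. Caps running jobs and processors for the
# atlas group to reduce load on the ATLAS software server:
GROUPCFG[atlas]   MAXJOB=400 MAXPROC=400
```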

Catalin

  • lcgce07 downtime - disk replacement, memory swap [done]
  • install APEL patches on CEs [ongoing]
  • work on LFC schema tidying up (w/ Carmine) [ongoing]
  • quattorise additional LFC frontends (w/ Ian; pending HW provisioning)
  • enable ngs.ac.uk on LFC catalog

Derek

  • A/L

Matt

  • Tier-1 talk
  • FTS2.2
    • Confirming upgrade procedure for FTS2.1 to FTS2.2.
    • Initial test of upgrade path from FTS2.1 to FTS2.2 on orisa [Done]
  • CA updates (again) on service nodes (including CEs in Derek's absence)
  • Test APEL publication with latest patches [Done]
  • Request dedicated diskpool for T2K (depends on allocation)

Richard

  • Checking behaviour of new/old BDII servers to ensure that important information is not being suppressed
  • Working on the Grid Services Quattorisation Roadmap
  • Working on a proposal on intra/inter-team communication to meet an action from the team awayday
  • Reviewing G/S process documentation
  • Further Nagios items from the to-do list (https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/NagiosTasksToDo)
  • CASTOR items:
    • Working on benchmarking plan to establish baseline performance before upgrading to new CASTOR release(s)
    • Adding support for lcg-cp command to stress testing suite

Mayo

  • TSBN spreadsheet web interface (first version) [Done]
  • TSBN spreadsheet backend script to copy data from castoradm1 to the TSBN spreadsheet
  • Create batch job to run the TSBN backend script and update the web interface automatically
  • Writing and configuring Nagios NRPE plugins
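
The NRPE plugins being written are not described here; as a generic sketch only, a Nagios plugin reports its result through the standard exit-code convention (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN) plus a one-line status message. The example below checks the 1-minute load average against hypothetical thresholds:

```python
#!/usr/bin/env python
# Hypothetical NRPE plugin sketch (illustration only; not one of the
# plugins referred to above). Nagios plugins signal status via their
# exit code: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
import os

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_load(load1, warn, crit):
    """Return (exit_code, status_line) for a 1-minute load average."""
    if load1 >= crit:
        return CRITICAL, "LOAD CRITICAL - load1=%.2f" % load1
    if load1 >= warn:
        return WARNING, "LOAD WARNING - load1=%.2f" % load1
    return OK, "LOAD OK - load1=%.2f" % load1

if __name__ == "__main__":
    # Demo run against this machine's 1-minute load average.
    code, message = check_load(os.getloadavg()[0], warn=4.0, crit=8.0)
    print(message)
    # A real plugin would finish with: sys.exit(code)
```

On the monitored host the script would then be registered in nrpe.cfg with a line of the form `command[check_load]=/path/to/check_load.py` (path hypothetical).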

VO Reports

ALICE

ATLAS

CMS

  • Problems with jobs failing (backfill and skimming) due to gdss364 RAID card issue
  • Cosmics data taking continued
  • Splash events over the weekend: 27 splashes from beam 1, 30 splashes from beam 2

LHCb

OnCall/AoD Cover

  • Primary OnCall:
  • Grid OnCall: Catalin (Mon, Wed-Sun)
  • AoD: