RAL Tier1 weekly operations Grid 20100412

From GridPP Wiki

Operational Issues

Description | Start | End | Affected VO(s) | Severity | Status
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon. [07-Apr-2010] CMS tests have shown that WMS patches resolve the problem; still waiting for patch to be installed on the production WMSs in Italy.
LFC/FTS downtime | 12-Apr-2010 ~11:30 | 12-Apr-2010 ~14:30 | all | high | Failure of one RAC node in the database behind the LFC and FTS services

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)

Blocking Issues

Description | Requested Date | Required By Date | Priority | Status
Hardware for Testbed | | | Medium | Required for change validation, load testing, etc. Also for phased rollout (which replaces PPS). Have initial hardware. [2010-02-22] More hardware expected by end of March.
HW for SL5 CMS Phedex Vobox | | | High | Required to replace the existing SL4 machine

Developments/Plans

Highlights for Tier-1 Ops Meeting

  • CMS started MC reprocessing at all Tier-1s (~500 workflows) on 9th April. No problems at RAL so far.

Highlights for Tier-1 VO Liaison Meeting

Detailed Individual Reports

Alastair

  • All disk servers for ATLAS deployed or back with Fabric.
  • Working on ATLAS software server upgrade
  • Working on setting up and testing ATLASGROUP disk at RAL.

Andrew

  • Deployed gdss121,136,137 from cmsNonProd to cmsWanOut [Done]
  • Upgraded prod & debug instances of PhEDEx to 3_3_0 [Done]
  • Added new DESY VOMS certificate to tier1-yaim-config [Done]
  • Added ganglia CPU efficiency plots with 10-minute time resolution; added new gmetric to lcgbatch01 [Done]
  • Added CE metrics to the grid services "special pages" in ganglia [Done]
  • March accounting [Done]
  • Made corrections to various scripts involving KSI2K to/from HEP-SPEC06 [Done]
  • Deleted some CMS files from /store/unmerged [Done]
  • VO support survey
    • 3 responses so far, one saying that they don't use RAL
    • Started preparing results document
  • CMS data ops
    • Continued running backfill at FNAL and CNAF (moved to PA 0_12_17_patch3)
    • Running MC reprocessing workflows at FNAL and CNAF [Ongoing]
  • From previous week:
    • Added new IN2P3 hosts to renewers/retrievers list [Done]
    • Converted fts script to use Oracle DB rather than webpage [Done]
    • Updated tier1-vobox-config on the 3 CMS VOBOXs [Done]
    • Added per-VO job monitoring to lcgce05 [Done]
    • Upgraded PhEDEx dev instance to 3_3_0 (clean install) [Done]
    • Updated eff-stats.pl to generate eff-stats.csv in HEP-SPEC06 [Done]
    • CMS data ops
      • Ran urgent MC rereco at FNAL for media event [Done]
      • Continued backfill at FNAL & CNAF [Done]
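The KSI2K to/from HEP-SPEC06 corrections listed above come down to applying one scaling factor consistently in both directions. A minimal sketch of such a conversion follows, assuming the commonly quoted WLCG factor of 4 HEP-SPEC06 per kSI2K; the factor and the function names here are illustrative assumptions and are not taken from the Tier-1 scripts themselves.

```python
# Assumption: 1 kSI2K = 4 HEP-SPEC06 (commonly quoted WLCG conversion factor).
# The factor actually used in the Tier-1 scripts is not stated in this report.
HS06_PER_KSI2K = 4.0


def ksi2k_to_hs06(ksi2k: float) -> float:
    """Convert a capacity or work figure from kSI2K to HEP-SPEC06."""
    return ksi2k * HS06_PER_KSI2K


def hs06_to_ksi2k(hs06: float) -> float:
    """Convert a capacity or work figure from HEP-SPEC06 to kSI2K."""
    return hs06 / HS06_PER_KSI2K


if __name__ == "__main__":
    # Hypothetical accounting figure, for illustration only.
    old_value_ksi2k = 2.5
    print(ksi2k_to_hs06(old_value_ksi2k), "HEP-SPEC06")
```

Keeping the factor in a single named constant means a script that reports in HEP-SPEC06 and one that still reads kSI2K input cannot drift apart.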

Catalin

  • work on various Nagios checks on grid services hosts (new BDII, NOCALLOUT checks) [ongoing]
  • work on Dataguard replication (w/ Carmine) [ongoing]
  • install squid on LHCb VOBOX
  • GridPP24 (Wed, Thu)

Derek

  • Deployed 5 new top-level BDIIs [Done]
  • Stopped publishing lhcb on grid1000M queue [Done]
  • Testing config changes to CE for mapping updates for glexec
  • Testing queue addition to batch system for atlas
  • At GridPP + Deployment Board (Wed-Fri)

Matt

  • Write up production plans.

Richard

  • 2 days at GridPP meeting (Wed-Thu)
  • Applied latest OpenLDAP RPMs to quattor-built BDII -- now watching this machine to check behaviour
  • Re-working the Grid Services Quattorisation Roadmap as a wiki page [Done]
  • Working on a proposal on intra/inter-team communication to meet an action from the team away day
  • Reviewing Grid Services process documentation
  • Further Nagios items from the to-do list (https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/NagiosTasksToDo)
  • CASTOR items:
    • Updating benchmarking tool to meet requirements of pre-prod stress testing [Done]
    • Found a database performance problem when running stress tests on pre-prod (missing table indexes)
    • Restart the pre-prod stress-testing run

Mayo

  • Implement feedback into TSBN web interface
  • Set up scripts that update TSBN interface to run as scheduled jobs on a Windows machine
  • Writing and configuring Nagios NRPE plugins [Done]
  • Certificate viewer for NGS cert wizard
  • Write PDU power controller query script

VO Reports

ALICE

  • Upgrade AliEn v2.18 on VOBOXes

ATLAS

CMS

  • For MinimumBias during 7 TeV stable beams, average event size: 152 kB RAW, 125 kB RECO
  • (8th April) Request for tape families to be set up for custodial & non-custodial data for upcoming real high energy physics runs
  • (9th April) Started reprocessing of all Summer09 MC samples at all Tier-1s. No problems at RAL so far.
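The average MinimumBias event sizes quoted above can be turned into a rough storage estimate. The sketch below shows the arithmetic only: the event count is a hypothetical figure chosen for illustration, and decimal units (1 GB = 10^6 kB) are assumed.

```python
# Average event sizes from the CMS report above (MinimumBias, 7 TeV stable beams).
RAW_KB_PER_EVENT = 152
RECO_KB_PER_EVENT = 125

KB_PER_GB = 1e6  # Assumption: decimal units, 1 GB = 10^6 kB.


def dataset_size_gb(n_events: int, kb_per_event: float) -> float:
    """Estimate the size in GB of n_events at the given average event size."""
    return n_events * kb_per_event / KB_PER_GB


if __name__ == "__main__":
    n_events = 1_000_000  # Hypothetical event count, not taken from the report.
    print("RAW :", dataset_size_gb(n_events, RAW_KB_PER_EVENT), "GB")
    print("RECO:", dataset_size_gb(n_events, RECO_KB_PER_EVENT), "GB")
```

An estimate of this kind is only as good as the average event size, which varies with trigger and beam conditions.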

LHCb

OnCall/AoD Cover

  • Primary OnCall: Catalin (Mon-Sun excl Wed)
  • Grid OnCall:
  • AoD: