RAL Tier1 weekly operations Grid 20101220

From GridPP Wiki

Operational Issues

Description Start End Affected VO(s) Severity Status

Downtimes

Description Hosts Type Start End Affected VO(s)

Blocking Issues

Description Requested Date Required By Date Priority Status
DNS change request for ATLAS squids 07 Dec 2010 RT#70487

Developments/Plans

Highlights for Tier-1 Ops Meeting

Highlights for Tier-1 VO Liaison Meeting

Detailed Individual Reports

Alastair

  • ATLAS TaskForce [ongoing]
  • Draining SL08 disk servers deployed to ATLAS service classes.
  • Working on ATLAS permission change. [On hold]

Andrew

  • Capacity planning system [Ongoing]
  • Removed ATLAS/LHCb disk caches from UB schedule scripts [Done]
  • Wrote Nagios plugin to monitor CMS job monitoring [Done]
  • Updates to CMS job monitoring XML file format [Done]
  • Dealing with corrupt files
  • CMS data ops
    • Rereco, skims at RAL, IN2P3, KIT [Ongoing]
    • Dec4 rereco postmortem

Catalin

  • test squid deployments for ATLAS [done]
  • finalise quattor templates for ATLAS squid machines [ongoing]
  • work on Tier1 DB migration plans [ongoing]

Derek

  • Deploying testbed batch system [ongoing]
  • Debugging issue with Magic jobs [ongoing]
  • Initial rollout of setting the operating system config on pbs mom on batch workers to SL5 [ongoing]
  • Removed the reservation and increased the job limit for atlassgm to 10 to allow more cvmfs validation jobs over the holiday

Matt

  • T2K FTS configuration. [Done]
  • Prep for A/L. [Done]
  • Quattorisation of FTM. [Done]
  • Deploying PBS JobMon monitoring tools. [Stalled]
  • Test FTS SRM/GridFTP ratio configuration. [Stalled]

Richard

  • Wrote a gmetric tool to measure the Quattor deploy hit rate, i.e. the percentage of deploys (as found in the SVN repo) that were "seen" by a machine [Done]
  • Working prototype of a tool for automatically checking middleware baselines now in place [Done]
  • Developing a set of Quattor templates for an ARGUS server [Ongoing]
  • Developing a "pseudo-update" to apply gLite update 19 to BDIIs [Ongoing]
  • Working on the "team status page" being developed as an action from team awayday [Ongoing]
  • Reviewing G/S process documentation [Ongoing]
  • CASTOR items:
    • Added an LSF server to the "cert-in-a-box" cluster. [Ongoing]
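
The Quattor deploy hit rate mentioned in Richard's report above can be read as a simple ratio. The following is a minimal sketch of that calculation only, not the actual tool: the function name, the use of revision numbers as deploy identifiers, and the sample values are all assumptions made for illustration. Returning 0.0 when no deploys exist is also an arbitrary choice.

```python
def deploy_hit_rate(deployed, seen):
    """Return the percentage of deploys that a machine has seen.

    deployed: identifiers of deploys found in the SVN repo (hypothetical input)
    seen:     identifiers of deploys the machine reports having received
    """
    deployed = set(deployed)
    if not deployed:
        # Assumption: report 0.0 rather than raising when nothing was deployed.
        return 0.0
    # Only deploys that exist in the repo count as hits.
    hits = deployed & set(seen)
    return 100.0 * len(hits) / len(deployed)


if __name__ == "__main__":
    # Made-up sample data: four deploys in the repo, three of them seen.
    svn_deploys = [1201, 1202, 1203, 1204]
    seen_by_host = [1201, 1203, 1204, 1190]
    print(f"{deploy_hit_rate(svn_deploys, seen_by_host):.1f}")
```

The resulting number is the kind of value that could then be published as a metric per machine.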

VO Reports

ALICE

ATLAS

CMS

LHCb

OnCall/AoD Cover

OnCall Rota

  • Primary OnCall:
  • Grid OnCall: Catalin (Mon-Thu), Derek (Sat-Sun)
  • AoD: