RAL Tier1 weekly operations Grid 20101213

From GridPP Wiki

Operational Issues

Description Start End Affected VO(s) Severity Status

Downtimes

Description Hosts Type Start End Affected VO(s)

Blocking Issues

Description Requested Date Required By Date Priority Status
DNS change request for ATLAS squids 07 Dec 2010 RT#70487

Developments/Plans

Highlights for Tier-1 Ops Meeting

Highlights for Tier-1 VO Liaison Meeting

Detailed Individual Reports

Alastair

  • ATLAS TaskForce [Ongoing]
  • Draining SL08 disk servers deployed to ATLAS service classes.
  • Working on ATLAS permission change. [On hold]

Andrew

  • Capacity planning system project [Ongoing]
  • Wrote new script for generating eff-stats.csv [Done]
  • SL09 disk server deployment for CMS to nonprod (deployment to prod delayed due to Overwatch being unavailable)
  • CMS data ops
    • Rereco at RAL, IN2P3, KIT [Ongoing]

Catalin

  • Test squid deployments for ATLAS [Done]
  • Finalise Quattor templates for ATLAS squid machines [Ongoing]
  • Work on Tier1 DB migration plans [Ongoing]

Derek

  • Investigating solutions for whole node scheduling [Ongoing]
  • Deploying testbed batch system [Ongoing]

Matt

  • T2K FTS configuration. [Done]
  • Prep for A/L. [Done]
  • Quattorisation of FTM. [Done]
  • Deploying PBS JobMon monitoring tools. [Stalled]
  • Test FTS SRM/GridFTP ratio configuration. [Stalled]

Richard

  • Wrote a gmetric tool to measure the Quattor deploy hit rate, i.e. the percentage of deploys (as found in the SVN repo) that were "seen" by a machine [Done]
  • Working prototype of a tool for the automatic checking of middleware baselines now in place [Done]
  • Developing a set of Quattor templates for an ARGUS server [Ongoing]
  • Developing a "pseudo-update" to apply gLite update 19 to BDIIs [Ongoing]
  • Working on the "team status page" being developed as an action from the team away day [Ongoing]
  • Reviewing G/S process documentation [Ongoing]
  • CASTOR items:
    • Added an LSF server to the "cert-in-a-box" cluster. [Ongoing]
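
The deploy hit rate mentioned in Richard's first item can be illustrated with a minimal sketch. This is not the actual gmetric tool; the function name, the revision identifiers and the treatment of an empty repository are all assumptions made for illustration. It only shows the arithmetic: the share of deploys recorded in the repository that a given machine also reports having seen.

```python
def deploy_hit_rate(repo_deploys, seen_deploys):
    """Percentage of deploys in the repository that a machine has "seen".

    Both arguments are iterables of deploy identifiers (hypothetical
    revision labels here). Deploys the machine reports that are not in
    the repository are ignored.
    """
    repo = set(repo_deploys)
    if not repo:
        # Assumption: with no deploys recorded, nothing was missed.
        return 100.0
    seen = repo & set(seen_deploys)
    return 100.0 * len(seen) / len(repo)


# Hypothetical example: four deploys in the repo, the machine saw three.
repo = ["r101", "r102", "r103", "r104"]
seen = ["r101", "r103", "r104", "r099"]
print(deploy_hit_rate(repo, seen))  # 75.0
```

A value computed this way could then be published per host as a Ganglia metric; how the real tool gathers the two lists is not described here.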

VO Reports

ALICE

ATLAS

CMS

  • CMS will be busy at the Tier-1s over the Christmas break
  • More network problems prevented squid access, causing production jobs to fail (Wednesday night, Friday night)
  • Some production jobs failing due to exceeding the 2 GB memory limit

LHCb

OnCall/AoD Cover

OnCall Rota

  • Primary OnCall:
  • Grid OnCall: Catalin (Mon-Thu), Derek (Sat-Sun)
  • AoD: