RAL Tier1 weekly operations Grid 20101206

From GridPP Wiki
Revision as of 15:23, 13 December 2010 by Alastair dewhurst

Operational Issues

Description Start End Affected VO(s) Severity Status

Downtimes

Description Hosts Type Start End Affected VO(s)

Blocking Issues

Description Requested Date Required By Date Priority Status

Developments/Plans

Highlights for Tier-1 Ops Meeting

Highlights for Tier-1 VO Liaison Meeting

Detailed Individual Reports

Alastair

  • ATLAS TaskForce
  • Keeping ATLAS up to date with the CASTOR upgrade.
  • Helping Lancaster with file loss.
  • Working on ATLAS permission change.

Andrew

  • Capacity planning system project [Ongoing]
  • Resource review meeting [Done]
  • November accounting [Done apart from CSV file]
  • Update CMS squids to latest version; updated Nagios plugin [Done]
  • FTS adjustments ATLAS and T2K [Done]
  • Investigated networking problems on Monday affecting CMS jobs
  • CMS data ops
    • WMAgent testing
    • Rereco at RAL, IN2P3, KIT [Ongoing]

Catalin

  • test squid deployments for ATLAS [ongoing]
  • work on (x)ROOT(d); deploy test infrastructure [ongoing]
  • kernel updates and last errata applied on various systems [done]
  • apply latest updates (squid, frontier) on ATLAS Frontier node [done]
  • decommission various old systems [done]
  • work on Tier1 DB migration plans [ongoing]
  • work on WMS monitoring [stalled]

Derek

  • Investigation of secure deployment of ssh keys to hosts [ongoing]
  • Reinstalling lcgce08 [Done]
  • Investigating solutions for whole node scheduling [ongoing]
  • A/L (29th-3rd)

Matt

  • T2K FTS configuration. [New]
  • Handover for A/L. [New]
  • Quattorisation FTM. [Ongoing]
  • Tier-1 Resources meeting prep. [Done]
  • Deploying PBS JobMon monitoring tools. [Stalled]
  • Test FTS SRM/GridFTP ratio configuration. [Stalled]
  • Deploy top BDII on EC2. [Done]
  • Blog top BDII on EC2. [Done]

Richard

  • Wrote a gmetric tool to measure Quattor deploy hitrate (i.e. percentage of deploys (as found in SVN repo) that were "seen" by a machine) [Done]
  • Working prototype of a tool for automatically checking middleware baselines now in place [Done]
  • Developing a set of Quattor templates for an ARGUS server [Ongoing]
  • Developing a "pseudo-update" to apply gLite update 19 to BDIIs [Ongoing]
  • Working on the "team status page" being developed as an action from team awayday [Ongoing]
  • Reviewing G/S process documentation [Ongoing]
  • CASTOR items:
    • Added an LSF server to the "cert-in-a-box" cluster. [Ongoing]
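
The deploy hitrate described in Richard's first item (the percentage of deploys found in the SVN repo that a machine actually "saw") can be sketched as below. This is a minimal illustration and not the actual tool: the revision numbers, the function names and the metric name `quattor_deploy_hitrate` are all assumptions, and the gmetric command line is only built, not executed.

```python
def deploy_hitrate(deployed, seen):
    """Return the percentage of deploys in `deployed` that also appear in `seen`."""
    deployed = set(deployed)
    if not deployed:
        # No deploys recorded in the repo: report 0 rather than divide by zero.
        return 0.0
    return 100.0 * len(deployed & set(seen)) / len(deployed)


def gmetric_command(value, name="quattor_deploy_hitrate"):
    """Build (but do not run) a Ganglia gmetric command line for the value.

    The metric name is hypothetical; the original report does not give one.
    """
    return ["gmetric", "--name", name, "--value", f"{value:.1f}",
            "--type", "float", "--units", "percent"]


# Hypothetical data: deploy revisions in the SVN repo vs. those one host saw.
repo_deploys = [101, 102, 103, 104]
seen_by_host = [101, 103, 104, 99]

rate = deploy_hitrate(repo_deploys, seen_by_host)
print(" ".join(gmetric_command(rate)))
```

Deploys seen by the host but absent from the repo list (99 above) are ignored, so the figure stays within 0 to 100.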

VO Reports

ALICE

ATLAS

CMS

  • Deleted ~ 80 TB old data last week
  • Last week's CMS problems:
    • 2 x proxy renewal problems on the CMS VOBOX, causing ~ 1 hour of failed transfers to RAL. The restarter did not seem to restart it successfully.
    • Failing transfers (mainly outgoing) and SAM tests on Sunday
    • A cmssgm job was in the running state but had been forgotten by the batch system, preventing a new software release (required for current reprocessing) from being installed. This delayed the start of reprocessing at RAL by 1.5 days.
    • Network/DNS issues
      • Squids denying access from some worker nodes, causing some reprocessing jobs to fail because they couldn't fail over to CERN
      • Central CMS monitoring of squids had this at 2010-12-06 05:20: "Skipping host lcgsquid01.gridpp.rl.ac.uk as it does not resolve to an IPv4 address"
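
A check of the kind behind the monitoring message above (does a squid host name resolve to an IPv4 address?) could look roughly like this. It is a sketch using only the Python standard library; the function names are made up for illustration and this is not the central CMS monitoring code.

```python
import socket


def resolves_to_ipv4(hostname):
    """Return True if `hostname` resolves to at least one IPv4 address."""
    try:
        # Restricting the lookup to AF_INET asks for IPv4 results only.
        infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    except (socket.gaierror, UnicodeError):
        # Lookup failed, or the name could not be encoded as a host name.
        return False
    return len(infos) > 0


def unresolvable(hosts):
    """Return the hosts from `hosts` that have no IPv4 address."""
    return [h for h in hosts if not resolves_to_ipv4(h)]
```

Calling `unresolvable(["lcgsquid01.gridpp.rl.ac.uk"])` from a worker node would then show whether that node sees the same DNS problem as the central monitoring.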

LHCb

OnCall/AoD Cover

OnCall Rota

  • Primary OnCall: Catalin (Mon-Sun)
  • Grid OnCall:
  • AoD: