RAL Tier1 weekly operations Grid 20101129

From GridPP Wiki

Latest revision as of 15:22, 13 December 2010

Operational Issues

Description | Start | End | Affected VO(s) | Severity | Status
lcgwms03 unresponsive | Fri 26 Nov ~18:05 | Mon 29 Nov 09:00 | non LHC | High | machine rebooted; back into production

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)

Blocking Issues

Description | Requested Date | Required By Date | Priority | Status

Developments/Plans

Highlights for Tier-1 Ops Meeting

Highlights for Tier-1 VO Liaison Meeting

Detailed Individual Reports

Alastair

  • At CERN for ATLAS software week

Andrew

  • Capacity planning system project [Ongoing]
  • Wrote Python script to generate XML for CMS batch system monitoring; wrote & submitted change control form; deployed [Done]
  • Wrote & deployed CMS tape usage monitoring script [Done]
  • Investigated disappearance of files from gdss381 (cmsTemp) [Done]
  • Updated errata on lcgvo-02-21, lcg0677, lcg0678, lcgapel0676 [Done]
  • CMS data ops
    • Skims at FNAL [Done]
    • MC rereco at PIC, KIT [Done]
    • Data rereco preprod at RAL, IN2P3 [Ongoing]
    • WMAgent testing at all 7 CMS Tier-1s [Ongoing]
  • Upgrade CMS squids to frontier-squid-2.7.STABLE9-5 [To do]

Catalin

  • work on (x)ROOT(d); deploy test infrastructure [ongoing]
  • kernel updates and latest errata applied on various systems [ongoing]
  • apply latest updates (squid, frontier) on Atlas Frontier node
  • decommission various old systems
  • update glite-WMS [done]
  • work on Tier1 DB migration plans [ongoing]
  • work on WMS monitoring [stalled]

Derek

  • Investigation of secure deployment of ssh keys to hosts [ongoing]
  • Reinstalling lcgce08 [Done]
  • Investigating solutions for whole node scheduling [ongoing]
  • Annual leave (29th-3rd)

Matt

  • Tier-1 Resources meeting prep. [New]
  • Deploy top BDII on EC2. [Ongoing]
  • Quattorisation of FTM. [Ongoing]
  • Deploying PBS JobMon monitoring tools. [Stalled]
  • Test FTS SRM/GridFTP ratio configuration. [Stalled]
  • Switch to gLite 3.2 FTS frontends (November 24). [Done]
  • Reprofile disk capacity. [Done]
  • Writing storage testbed proposal. [Done]

Richard

  • Applied kernel + OS updates to CIP and site and top level BDIIs (including those in testbed) [Done]
  • Wrote a gmetric tool to measure Quattor deploy hitrate (i.e. percentage of deploys (as found in SVN repo) that were "seen" by a machine) [Done]
  • Updated the ShowQuattorChanges CGI script to show the deploy list [Done]
  • Working on the tool for automatically checking middleware baselines [Ongoing]
  • Developing a set of Quattor templates for an ARGUS server [Ongoing]
  • Developing a "pseudo-update" to apply gLite update 19 to BDIIs [Ongoing]
  • Final touches to the CGI script before releasing initial version [Done]
  • Working on the "team status page" being developed as an action from team awayday [Ongoing]
  • Reviewing G/S process documentation [Ongoing]
  • CASTOR items:
    • Built a new cluster within Quattor for building "cert-in-a-box". Stager server built -- now adding other headnode types. [Ongoing]
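
The Quattor deploy hit rate mentioned in Richard's report above (the percentage of deploys found in the SVN repo that were "seen" by a machine) can be illustrated with a minimal sketch. This is not the actual gmetric tool: the function name, the use of revision numbers to identify deploys, the sample data, and the choice to report 0% when there were no deploys are all assumptions made for illustration.

```python
def deploy_hit_rate(deployed_revisions, seen_revisions):
    """Return the percentage of deployed revisions that a machine saw.

    Both arguments are iterables of identifiers (here: hypothetical SVN
    revision numbers). Anything the machine saw that was never a deploy
    is ignored.
    """
    deployed = set(deployed_revisions)
    if not deployed:
        # Assumption: report 0% rather than fail when nothing was deployed.
        return 0.0
    seen = deployed & set(seen_revisions)
    return 100.0 * len(seen) / len(deployed)


# Hypothetical data: four deploys recorded in SVN, three of them picked up
# by the host (1199 is not a deploy in this window and is ignored).
deploys_in_svn = [1201, 1202, 1203, 1204]
seen_by_host = [1201, 1203, 1204, 1199]

hit_rate = deploy_hit_rate(deploys_in_svn, seen_by_host)
print(f"{hit_rate:.1f}")  # prints 75.0
```

A value computed this way is a plain float, so it could be published per host as a Ganglia metric; how the real tool collects the two lists is not described in the report.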

VO Reports

ALICE

ATLAS

CMS

  • CMS CASTOR Job Manager outage on 2010-11-27, 22:59-23:34
  • Large number of FTS timeouts in outgoing transfers, ongoing since the CASTOR upgrade
  • RAL has been the worst-performing CMS Tier-1 over the past 2 weeks (only 30% readiness)

LHCb

OnCall/AoD Cover

OnCall Rota

  • Primary OnCall: Catalin (Mon-Sun)
  • Grid OnCall:
  • AoD: