RAL Tier1 weekly operations Grid 20100809

From GridPP Wiki

Operational Issues

Description | Start | End | Affected VO(s) | Severity | Status
Job status monitoring from CREAM CE | 2-Feb-2010 | | CMS | Medium | [10-Feb-2010] WMS patch available soon; new CREAM CE version available soon. [07-Apr-2010] CMS tests have shown that the WMS patches resolve the problem; still waiting for the patch to be installed on the production WMSs in Italy. [13-Jul-2010] CNAF WMSs have been updated; testing using backfill is in progress. [19-Jul-2010] So far everything looks good.
FTS02 | 21-Jul-2010 | | All | High | SMART errors on both FTS02 disks; Fabric have replacements and wish to arrange a swap-out.

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)

Blocking Issues

Description | Requested Date | Required By Date | Priority | Status
HW needed to test Dataguard technology for LFC/FTS | 19 May 2010 | 15 June 2010 | Medium | [24-05-2010] HW available; needs to be deployed by Fabric and then handed over to Dataservices

Developments/Plans

Highlights for Tier-1 Ops Meeting

  • Baseline updates (WMS, BDII, CE)
  • Quattor development for FEs (LFC, FTS)
  • Comparison between CIP, Overwatch, and Grid disk accounting
  • Testing FTS new timeout parameters

Highlights for Tier-1 VO Liaison Meeting

Detailed Individual Reports

Alastair

  • Working on ATLAS software server [ongoing]
  • Working on ATLAS Frontier service, monitoring and backup.
  • Working on testing FTS timeout limits.
  • Working on ATLAS B-Physics software code.

Andrew

  • Updated PhEDEx prod & debug instances to 3_3_2 [Done]
  • CMS CASTOR 2.1.9 testing, including xroot with CMSSW [Ongoing]
  • DAC-Overwatch-BDII disk capacity consistency web pages
  • July accounting [Done]
  • Annual leave from 10th August; back on 20th August

Catalin

  • submitted change control request for glite-WMS update [done]
  • ATLAS Frontier monitoring [ongoing]
  • test LFC Quattor profiles (SL4 and SL5) [ongoing]
  • prepare gLite updates for WMS03
  • work on improving Ganglia monitoring for Grid Services

Derek

  • Writing Strawman Cloud strategy [ongoing]
  • CREAM CE Quattor profile [ongoing]
  • Investigating CREAM CE instability [ongoing]
  • Handed over blog maintenance to production team

Matt

  • Build gLite3.2 FTS test node
  • Add timeout configuration to local FTS information (SVN)
  • Audit WLCG pledges vs. deployed disk
  • Finish first pass of ASCII FTS docs; look at build system

Richard

  • Submitted change control request for software update on RAL top-level BDIIs [done]
  • Working on the "team status page" being developed as an action from team awayday [ongoing]
  • Reviewing G/S process documentation [ongoing]
  • Demonstrated a prototype of a tool for automating the wiki page on grid middleware versions
  • Testing a pair of quattorised site-level BDIIs
  • Preparing a talk on how to write a Quattor component
  • CASTOR items:
    • Finishing the 2.1.9 functional test run on the pre-prod instance

VO Reports

ALICE

  • Still happy with the 1250-job limit

ATLAS

CMS

LHCb

OnCall/AoD Cover

  • Primary OnCall:
  • Grid OnCall: Derek
  • AoD: