RAL Tier1 weekly operations Grid 20100816

From GridPP Wiki

Operational Issues

Description | Start | End | Affected VO(s) | Severity | Status
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | Medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon. [07-Apr-2010] CMS tests have shown that WMS patches resolve the problem; still waiting for patch to be installed on the production WMSs in Italy. [13-Jul-2010] CNAF WMSs have been updated; testing using backfill is in progress. [19-Jul-2010] So far everything looks good.
FTS02 | 21-Jul-2010 | | All | High | SMART errors on both FTS02 disks; Fabric have replacements and wish to arrange a swap-out.

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)

Blocking Issues

Description | Requested Date | Required By Date | Priority | Status
HW needed to test Dataguard technology for LFC/FTS | 19 May 2010 | 15 June 2010 | Medium | [24-05-2010] HW available; needs to be deployed by Fabric and then handed over to Dataservices

Developments/Plans

Highlights for Tier-1 Ops Meeting

  • Baseline updates (WMS, BDII, CE)
  • Quattor development for FEs (LFC, FTS)
  • Comparison between CIP, Overwatch, and Grid disk accounting
  • Testing FTS new timeout parameters

Highlights for Tier-1 VO Liaison Meeting

Detailed Individual Reports

Alastair

  • Working on ATLAS software server, testing CVMFS [ongoing]
  • Working on testing FTS timeout limits
  • Working on HammerCloud test of CASTOR 2.1.9

Andrew

  • Updated PhEDEx prod & debug instances to 3_3_2 [Done]
  • CMS CASTOR 2.1.9 testing, including xroot with CMSSW [Ongoing]
  • DAC-Overwatch-BDII disk capacity consistency web pages
  • July accounting [Done]
  • A/L from 10th August, back on 20th August

Catalin

  • gLite updates WMS03 non-LHC [ongoing]
  • ATLAS frontier monitoring [ongoing]
  • test LFC quattor profiles (SL4 and SL5) [ongoing]
  • work on improving ganglia monitoring for Grid Services

Derek

  • Writing Strawman Cloud strategy [done]
  • CREAM CE quattor profile [ongoing]
  • Investigating CREAM CE instability [ongoing]

Matt

  • Build gLite3.2 FTS test node
  • Audit WLCG pledges vs. deployed disk
  • Look at asciidoc build system
  • Develop better CRL checking Nagios plugin [Done]
  • Add timeout configuration to local FTS information (SVN) [Done]
  • Finish first pass of asciidoc FTS docs [Done]

Richard

  • Submitted c/c request for s/w update on RAL top-level BDIIs [done]
  • Working on the "team status page" being developed as an action from team awayday [ongoing]
  • Reviewing G/S process documentation [ongoing]
  • Preparing for Tue/16th upgrade of RAL top-level BDIIs
  • CASTOR items:
    • Write up observations and issues from process of running 2.1.9 functional tests
    • Chase up last few "non-runners" in 2.1.9 tests

VO Reports

ALICE

ATLAS

CMS

LHCb

OnCall/AoD Cover

  • Primary OnCall:
  • Grid OnCall: Catalin
  • AoD: