RAL Tier1 weekly operations Grid 20100802

From GridPP Wiki

Operational Issues

Description | Start | End | Affected VO(s) | Severity | Status
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | Medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon. [07-Apr-2010] CMS tests have shown that WMS patches resolve the problem; still waiting for patch to be installed on the production WMSs in Italy. [13-Jul-2010] CNAF WMSs have been updated; testing using backfill is in progress. [19-Jul-2010] So far everything looks good.
FTS02 | 21-Jul-2010 | | All | High | SMART errors on both FTS02 disks; Fabric have replacements and wish to arrange a swap-out.

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)

Blocking Issues

Description | Requested Date | Required By Date | Priority | Status
HW needed to test Dataguard technology for LFC/FTS | 19 May 2010 | 15 June 2010 | Medium | [24-05-2010] HW available; needs to be deployed by Fabric and then handed over to Dataservices
#61658: HW request for CMS Squid VOBOX | 30 June 2010 | | Medium | [30-06-2010] Request made

Developments/Plans

Highlights for Tier-1 Ops Meeting

  • Working on testing FTS timeout limits.
  • Understanding CPU/disk capacities
  • Build gLite3.2 FTS test node

Highlights for Tier-1 VO Liaison Meeting

Detailed Individual Reports

Alastair

  • Working on ATLAS software server [ongoing]
  • Working on ATLAS Frontier service, monitoring and backup.
  • Working on testing FTS timeout limits.
  • Working on ATLAS B-Physics software code.

Andrew

  • Implementing & testing CMS t1production role [Done]
  • CMS storage consistency check; deleted 7 TB of dark data [Done]
  • Added VOBOX proxy renewal daemon restarter to SL5 VOBOX [Done]
  • Updated PhEDEx Dev instance to 3_3_2 [Done]
  • Checking checksums (with PhEDEx) of the last file written to each T10KB tape [Done]
  • WMS bulk submission testing - missing jobs on CREAM CEs
  • Understanding CPU/disk capacities
  • CMS Data Ops
    • Data rereco & skims at RAL & FNAL [Done]
  • Update PhEDEx prod & debug instances to 3_3_2 [To do]
  • CMS CASTOR 2.1.9 testing [To do]

Catalin

  • ATLAS Frontier monitoring [ongoing]
  • LFC Quattor profiles (SL4 and SL5) [ongoing]
  • Preparing various gLite updates

Derek

  • Configured CMS t1 prod role on CEs [Done]
  • Applied 1250 job limit to ALICE in Maui [Done]
  • Investigated CREAM CE requirements functionality as a way to limit job CPU times per VO [Done]
  • Configured NGS-UEE publishing for lcgce05
  • Writing Strawman Cloud strategy [ongoing]
  • CREAM CE Quattor profile [ongoing]
  • Investigating CREAM CE instability

Matt

  • Build gLite3.2 FTS test node
  • Add timeout configuration to local FTS information (SVN)
  • Audit WLCG pledges vs. deployed disk
  • Finish first pass of ascii FTS docs; look at build system

Richard

  • Implemented change on RAL top-level BDIIs [Done]
  • Added pre-prod CIP into site BDII
  • Working on the "team status page" being developed as an action from team awayday [ongoing]
  • Reviewing G/S process documentation [ongoing]
  • Further work on the tool to help automate the wiki page on grid middleware versions [Done]
  • CASTOR items:
    • Continue trying to get 2.1.9 functional tests running on pre-prod
    • Update the pre-prod resources being published to support VO testing

VO Reports

ALICE

ATLAS

CMS

LHCb

OnCall/AoD Cover

  • Primary OnCall: Catalin
  • Grid OnCall:
  • AoD: