RAL Tier1 weekly operations Grid 20100524

From GridPP Wiki

Operational Issues

Description | Start | End | Affected VO(s) | Severity | Status
Job status monitoring from CREAM CE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; new CREAM CE version available soon. [07-Apr-2010] CMS tests have shown that the WMS patches resolve the problem; still waiting for the patch to be installed on the production WMSs in Italy.

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)

Blocking Issues

Description | Requested Date | Required By Date | Priority | Status
HW needed to test Dataguard technology for LFC/FTS | 19 May 2010 | 15 June 2010 | Low | [24-05-2010] HW available; needs to be deployed by Fabric and then handed over to Dataservices

Developments/Plans

Highlights for Tier-1 Ops Meeting

  • Installed FTS2.2.4 pre-release on test endpoint for functional testing by ATLAS
  • Testing FTS Group configuration (to replace cloud configuration)
  • Added service owner info to oncall alarm response document
  • VOBOXes: squid service for LHCb; migration of PhEDEx for CMS

Highlights for Tier-1 VO Liaison Meeting

  • Fixed FTS configuration to impose per-VO file limits; default changed in FTS2.2.3.
  • Expect all disk servers needed to meet WLCG pledges to be in NonProd by end of week.

Detailed Individual Reports

Alastair

  • Working on ATLAS software server upgrade [ongoing]
  • Looking into ATLAS PFC (Pool File Catalogue) problems.
  • Testing FTS and checksumming at RAL.
  • Deploying 22 disk servers into NonProd.

Andrew

  • Job plan
  • APEL consistency checking [Done]
  • Installing & setting up PhEDEx on SL5 VOBOX [Ongoing]
  • Migration to use of FTS groups in FTS "cloud" channels [Ongoing]
  • Started V09 disk server deployment into cmsNonProd [Ongoing; delay due to SL5 LSF issues]
  • A few FTS channel adjustments for Bristol & Estonia [Done]
  • CMS data ops
    • Backfill at RAL & PIC [Ongoing]
    • Started MC production workflow at RAL, PIC, CNAF (52472 jobs)

Catalin

  • ATLAS Frontier server updates [done]
  • ATLAS Frontier documentation in SVN [done]
  • work on CMS PhEDEx and blparser Nagios monitoring [ongoing]
  • configure squid on LHCb VOBOX [ongoing]
  • gLite updates on LHCb VOBOX [done]
  • LFC/FTS replication (w/ Carmine) [ongoing]
  • job plans [ongoing]

Derek

  • Intervention on lcgce06 for glexec [Done]
  • Intervention on lcgce07 for glexec
  • Sync of templates with QWG for glite 3.1 and 3.2 [done]
  • Testing CREAM CE 1.6

Matt

  • Job Plans
  • Adjust FTS channel config policies that lead to opportunistic use of empty slots by other VOs
  • Investigate problem with FTS file limits being exceeded [Done]
  • Install FTS2.2.4 pre-release on test endpoint [Done]
  • Add service owner info to oncall alarm response document [Done]
  • Team development talk

Richard

  • APR-Signoff [Done]
  • Entered Job Plan info SSC [Done]
  • Worked with Jonathan to bring NIS netgroups up to date (partly for the convenience of having ~ mounted when logging into machines, but also to reduce the number of messages that the Production Team has to wade through)
  • Worked on the "missing CIP" problem
  • Built an additional top-level BDII server on testbed machine (lcg0628) to test behaviour on removing "schemacheck off" directive from /opt/bdii/etc/bdii-slapd.conf
  • Looking at the site-bdii timeout problem
  • Working on a proposal for intra-/inter-team communication to meet an action from the team awayday
  • Reviewing G/S process documentation
  • Further Nagios items from the to-do list (https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/NagiosTasksToDo)
  • CASTOR items:
    • Wrote up results from p/p stress tests [Done]
    • Ran functional test suite on p/p [Done]

Mayo

  • Implement David Meredith's feedback into Certificate viewer [Done]
  • Integrate certificate viewer module with existing NGS certificate wizard code
  • Write script to control ports on multiple PDUs
  • Create handover documentation for finished projects [ongoing]
  • Enter job plan into SSC

VO Reports

ALICE

  • waiting for CREAM-CE 1.6 deployment at RAL
  • asked about CASTOR@RAL status and plans

ATLAS

CMS

  • Now using 8 primary datasets. Every CMS T1 site now custodially receives one primary dataset.

LHCb

OnCall/AoD Cover

  • Primary OnCall: Catalin (Mon-Fri)
  • Grid OnCall: Derek (Fri-Sun)
  • AoD: Catalin (Wed)