RAL Tier1 weekly operations Grid 20100222

From GridPP Wiki

Operational Issues

Description | Start | End | Affected VO(s) | Severity | Status
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon
CRL issues for SL4 batch | Tue 16 Feb 2010 | Wed 17 Feb 2010 | non-LHC | medium | Solved; CRLs updated on NFS server
ATLAS s/w server overloaded | Sun 21 Feb 2010 | Ongoing | ATLAS | medium |

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)
RAID and memory issues | lcgce07 and lcg0280 | SD | Fri 19 Feb 2010 14:00 | Tue 23 Feb 2010 16:00 | CMS, Alice, LHCb


Blocking Issues

Description | Requested Date | Required By Date | Priority | Status
Hardware for testing LFC/FTS resilience | | | High | DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; request for HW made through RT Fabric queue. Production hardware will be available soon. [2010-02-22] Test hardware available; some config tweaks needed.
Hardware for Testbed | | | Medium | Required for change validation, load testing, etc. Also for phased rollout (which replaces PPS). Have initial hardware. [2010-02-22] More hardware expected by end of March.
Hardware for additional SL4 LFC frontends | | | Medium | Required to improve resilience of existing LFC services.

Developments/Plans

Highlights for Tier-1 Ops Meeting

  • Ongoing load issues on ATLAS s/w server.
  • ATLAS 4GB jobs having minimal effect on blocked job starts (~1%).
  • FTS 2.2 released; starting to test upgrade path.
  • Disk deployment: 100TB requested for ATLAS to enable LHCb drain to commence; capacity needed in SimStrip, which was filled over the weekend.

Highlights for Tier-1 VO Liaison Meeting

  • Disk deployment: 100TB requested for ATLAS to enable LHCb drain to commence; capacity needed in SimStrip, which was filled over the weekend.
  • FTS 2.2 testing ongoing; CNAF experiencing problems with the upgrade.

Detailed Individual Reports

Alastair

  • Work with Brian and Chris on re-deploying disk servers to ATLAS space tokens. [Ongoing]
  • Write scripts to monitor effect of 4GB memory limit change on batch system. [Done]
  • Monitor/investigate ATLAS MC production and re-processing currently going on at RAL. [Ongoing]
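The monitoring scripts mentioned above could, in minimal form, just compute the fraction of job starts blocked by the 4GB memory limit (the ~1% figure reported in the highlights). A hypothetical sketch — the record format and field name are assumptions, not the actual RAL batch-log layout:

```python
# Hypothetical sketch: estimate the fraction of job starts blocked by the
# 4GB memory limit, given per-job records. The 'blocked_by_mem_limit' field
# is an assumed name, not the real RAL batch-system log format.

def blocked_fraction(jobs):
    """jobs: iterable of dicts with a boolean 'blocked_by_mem_limit' key."""
    jobs = list(jobs)
    if not jobs:
        return 0.0
    blocked = sum(1 for j in jobs if j["blocked_by_mem_limit"])
    return blocked / len(jobs)

# Illustrative sample: 1 blocked start out of 100 job starts
sample = [{"blocked_by_mem_limit": i < 1} for i in range(100)]
print(f"{blocked_fraction(sample):.0%}")  # prints "1%"
```

In practice the records would be parsed out of the batch-system accounting logs rather than constructed in memory.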

Andrew

  • Running backfill at RAL (re-reco of BeamCommissioning09 Cosmics) [Ongoing]
  • Ran a production workflow: reprocessing of a Summer09 MC sample (generated data is custodial) [Done]
  • Added ganglia monitoring of usage of CMS tape pools (per tape pool & combined stack plot) [Done]
  • Testing new disk server on CASTOR pre-prod instance with CMSSW (skimming & reconstruction) [Ongoing]
  • Added .tr (for T2_TR_METU) to CLOUD-CMS-CERN FTS channel [Done]

Catalin

  • Re-certified ATLAS Frontier after the 3D migration (with Alastair) [Done]
  • Install APEL patches on CEs [Ongoing]
  • Work on tidying up the LFC schema (with Carmine) [Ongoing]
  • Quattorise additional LFC frontends (with Ian; pending HW provisioning)
  • lcgce07 downtime: disk replacement, memory swap

Derek

  • A/L

Matt

  • FTS2.2
    • Look at GGUS bug regarding checksum scenarios [Done]
    • Test upgrade path from FTS2.1 to FTS2.2 on orisa
  • Disk deployment: request 100TB for ATLAS to enable LHCb drain to commence [Done]
  • Tier-1 Open Day talk for Grid Services
  • Test FTS functionality for T2K [Done]
  • CA updates on service nodes (including CEs in Derek's absence) [Done]
  • Test APEL publication with latest patches
  • Request dedicated diskpool for T2K (depends on allocation)

Richard

  • Submitted change control request for rolling out quattorised BDII server [Done]
  • Now working with Ian C to "factorise" the template so that non-machine specific items are distributed to the appropriate points in the hierarchy of templates
  • Working on the Grid Services Quattorisation Roadmap
  • Writing a proposal on intra-/inter-team communication to meet an action from the team away day
  • Reviewing G/S process documentation
  • Further Nagios items from the to-do list (https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/NagiosTasksToDo)
  • CASTOR items:
    • Working on benchmarking plan to establish baseline performance before upgrading to new CASTOR release(s)
    • Set up a "Plan B" CASTOR LSF server in case the need arises [Done]

Mayo

  • Adding bar chart to Metric system [Done]
  • Admin interface for Metric System [Done]
  • TSBN spreadsheet web interface and backend automation script
  • Writing and configuring Nagios NRPE plugins
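For context on the NRPE plugin work above: a Nagios/NRPE check plugin prints one status line (optionally with performance data after a `|`) and exits with the standard Nagios code (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN). A minimal sketch — the check and thresholds are illustrative, not one of the actual RAL plugins:

```python
# Minimal sketch of a Nagios/NRPE check plugin: report the 1-minute load
# average against warning/critical thresholds. Thresholds are illustrative.
import os

# Standard Nagios exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_load(warn=4.0, crit=8.0):
    """Return (exit_code, status_line) for the 1-minute load average."""
    load1, _, _ = os.getloadavg()
    perfdata = f"load1={load1:.2f}"
    if load1 >= crit:
        return CRITICAL, f"CRITICAL - load {load1:.2f} | {perfdata}"
    if load1 >= warn:
        return WARNING, f"WARNING - load {load1:.2f} | {perfdata}"
    return OK, f"OK - load {load1:.2f} | {perfdata}"

code, message = check_load()
print(message)  # e.g. "OK - load 0.42 | load1=0.42"
# A real plugin would finish with: sys.exit(code)
```

NRPE then relays the status line and exit code back to the Nagios server when the check is invoked remotely.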

VO Reports

ALICE

ATLAS

CMS

  • Problems over the past week:
    • Oracle problems affecting transfers (x2)
    • Writes to cmsWanIn pending for too long, causing transfers to fail (x4)
    • Tape migration
    • Tape recall problems (one tape)
    • gdss364 problems caused jobs to fail on 19th-20th Feb
  • Transfers to/from RAL over the past week:
    • from CERN: 13.1 TB (Commissioning10 cosmics)
    • from T2s: 2.4 TB
    • to T1s: 19.5 TB
    • to T2s: 20.4 TB
    • migrated to tape: 25.5 TB
  • CPU usage over the past week:
    • backfill (re-reco) & MC reprocessing: 7384 KSI2K days, CPU efficiency 92%
    • skimming: 464 KSI2K days, CPU efficiency 51%
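The efficiency figures above are the ratio of CPU time to wall-clock time. A minimal sketch of the arithmetic — the wall-time value below is illustrative, not a measured RAL number:

```python
# Sketch of the CPU-efficiency arithmetic: CPU time divided by wall-clock
# time, both expressed in the same units (KSI2K-days here). The example
# inputs are illustrative, not the actual measured values.

def cpu_efficiency(cpu_ksi2k_days, wall_ksi2k_days):
    """Return CPU/wall time ratio as a fraction."""
    return cpu_ksi2k_days / wall_ksi2k_days

# e.g. 92 KSI2K-days of CPU delivered over 100 KSI2K-days of wall time
print(f"{cpu_efficiency(92, 100):.0%}")  # prints "92%"
```

The low skimming efficiency (51%) would correspond to jobs spending roughly half their wall time waiting on I/O rather than computing.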

LHCb

OnCall/AoD Cover

  • Primary OnCall:
  • Grid OnCall: Matt (Mon-Sun)
  • AoD: