RAL Tier1 weekly operations Grid 20100215

Operational Issues

Description | Start | End | Affected VO(s) | Severity | Status
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon

Blocking Issues

Description | Requested Date | Required By Date | Priority | Status
Hardware for testing LFC/FTS resilience | | | High | DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; request for HW made through RT Fabric queue. Production hardware will be available soon.
Hardware for Testbed | | | Medium | Required for change validation, load testing, etc. Also for phased rollout (which replaces PPS). Have initial hardware.
Hardware for additional SL4 LFC frontends | | | Medium | Required to improve resilience of existing LFC services.

Developments/Plans

Highlights for Tier-1 Ops Meeting

  • CMS: 158 files lost from CASTOR last week due to the two bad tapes. All were MC data. The files have been invalidated, and one dataset will be globally invalidated because of the number of missing files; Shaun is double-checking the list of lost files.
    • Note: for real data, CMS may request that tapes be sent to a data recovery service
  • ATLAS: problems with transfers (LSF) and with transfer slots being blocked. Will be mitigated to some extent by FTS 2.2.
  • CMS: problem with skimming jobs not running with lazy download, causing high load on the link between cmsFarmRead and the WNs. Possibly the reason for the problems on the batch service?
    • At a CMS meeting today it was agreed to make LazyDownload site-configurable
  • CMS: some reprocessing of MC data will take place at RAL soon
  • Disk deployment meeting Tuesday 10:00

Highlights for Tier-1 VO Liaison Meeting

  • Alastair putting in place monitoring to check impact of 4GB ATLAS jobs.
  • Disk deployment: prioritising deployment to enable drain of existing LHCb servers.
  • Seeking clarification on FTS2.2.3 release, and details of upgrade path.

Detailed Individual Reports

Alastair

  • Write Nagios script to warn when space tokens are near full (see the sketch after this list). [To be implemented]
  • Work with Brian + Chris in re-deploying disk servers to ATLAS space tokens. [Ongoing]
  • Write scripts to monitor effect of 4GB memory limit change on batch system.
  • Investigate low efficiency ATLAS pilot jobs. [Ongoing]
  • Monitor/investigate ATLAS MC production and re-processing currently going on at RAL. [Ongoing]
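
A Nagios check of this kind would normally follow the standard plugin convention (exit 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN). The sketch below is only illustrative: the script name, the thresholds and the way the used/total figures are obtained (here simply passed on the command line, e.g. from an SRM or BDII query) are assumptions, not the script to be implemented.

  #!/usr/bin/env python
  # check_space_token.py -- hypothetical sketch of a Nagios check that warns
  # when a space token is nearly full.  Exit codes follow the Nagios
  # convention: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
  # The token name, thresholds and usage figures are illustrative; here they
  # are passed on the command line rather than queried from the SRM/BDII.
  import sys

  def main():
      if len(sys.argv) < 4:
          print("UNKNOWN - usage: check_space_token.py TOKEN USED_TB TOTAL_TB [WARN_PCT] [CRIT_PCT]")
          return 3
      token = sys.argv[1]
      used = float(sys.argv[2])
      total = float(sys.argv[3])
      warn = float(sys.argv[4]) if len(sys.argv) > 4 else 90.0
      crit = float(sys.argv[5]) if len(sys.argv) > 5 else 95.0
      if total <= 0:
          print("UNKNOWN - %s reports zero total capacity" % token)
          return 3
      pct = 100.0 * used / total
      msg = "%s %.1f%% full (%.1f/%.1f TB)" % (token, pct, used, total)
      if pct >= crit:
          print("CRITICAL - " + msg)
          return 2
      if pct >= warn:
          print("WARNING - " + msg)
          return 1
      print("OK - " + msg)
      return 0

  if __name__ == "__main__":
      sys.exit(main())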

Andrew

  • Running backfill at RAL (rereco of BeamCommissioning09 Cosmics) [ongoing]
  • Further investigations into CMS CREAMCE issues [Done]
  • YAIM configuration for Super B; updates to "add new VO" documentation [Done]
  • Deleted old backfill data (~ 117TB) [Done]
  • Investigated & resolved CMS CE SAM test failures (incl. GGUS ticket) [Done]
  • Set up cron to check for stalled cmssgm jobs (see the sketch after this list) [Done]
  • Set up monthly disk-accounting consistency check cron [Done]
  • Update CMS Squid configuration for Brunel, new IP address range [Done]
  • Removed gdss119 (atlasSimStrip) from CASTOR [Done]
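
The cron items above follow a common pattern: a small script run from cron that raises an alert (e.g. by e-mail) when the check trips. The sketch below is purely illustrative; the batch-system query, the "stalled" criterion and the addresses are placeholder assumptions, not the deployed configuration.

  #!/usr/bin/env python
  # Illustrative sketch of a cron-driven check (not the script actually
  # deployed): run a query, and if anything is flagged, e-mail the output
  # to the operators.  The query command, the "stalled" criterion and the
  # addresses below are placeholder assumptions.
  #
  # Hypothetical crontab entry (hourly):
  #   0 * * * *  /usr/local/bin/check_stalled_cmssgm.py
  import subprocess
  import smtplib
  from email.mime.text import MIMEText

  QUERY = ["qstat", "-u", "cmssgm"]          # assumed batch-system listing
  MAIL_FROM = "root@gridserver.example"      # placeholder address
  MAIL_TO = "grid-admins@example.org"        # placeholder address

  def flagged_lines(listing):
      # Placeholder criterion: a real check would parse each job's state and
      # elapsed time and keep only the jobs that look stalled.
      return [l for l in listing.splitlines() if " cmssgm " in l]

  def main():
      proc = subprocess.Popen(QUERY, stdout=subprocess.PIPE, universal_newlines=True)
      listing = proc.communicate()[0]
      stalled = flagged_lines(listing)
      if not stalled:
          return
      msg = MIMEText("Possible stalled cmssgm jobs:\n\n" + "\n".join(stalled))
      msg["Subject"] = "[cron] stalled cmssgm jobs?"
      msg["From"] = MAIL_FROM
      msg["To"] = MAIL_TO
      s = smtplib.SMTP("localhost")
      s.sendmail(MAIL_FROM, [MAIL_TO], msg.as_string())
      s.quit()

  if __name__ == "__main__":
      main()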

Catalin

  • installed APEL patches on MONbox (so that installed capacity is published correctly) [Done]
  • deployed SL5 LFC frontend (for tests) [Done]
  • install APEL patches on CEs
  • work on LFC schema tidying up (with Carmine) [ongoing]
  • quattorise additional LFC frontends (with Ian; pending HW provisioning)

Derek

  • Installing SL5 SCAS server
  • Testing SL5 GLexec WN

Matt

  • Tier-1 Open Day talk for Grid Services
  • Test upgrade path from FTS2.1 to FTS2.2 on orisa
  • Plan ATLAS/R89 co-hosting of Grid Services [Done]
  • T2K configuration of FTS [Done]
  • Test APEL publication with latest patches
  • Request dedicated diskpool for T2K

Richard

  • Completed quattor template for top-level BDII server
  • Now working with Ian C to "factorise" the template so that non-machine specific items are distributed to the appropriate points in the hierarchy of templates
  • Finalising testing document and C/C request.
  • CASTOR items:
    • Reconfigured rtcpclientd Nagios plugin for stager servers to get around a problem with argument passing between shell scripts [Done]
    • Currently using the CERN stress tests to test new pre-prod instance

Mayo

  • Adding bar chart to Metric system [Done]
  • Admin interface for Metric System [Done]
  • TSBN spreadsheet web interface and backend automation script
  • writing and configuring Nagios nrpe plugins
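
NRPE plugins are local executables that report via the standard Nagios exit codes and are mapped to commands in nrpe.cfg. The following is a hypothetical example of such a plugin (a simple load-average check); the thresholds and the nrpe.cfg command name are illustrative assumptions rather than the plugins actually being written.

  #!/usr/bin/env python
  # Hypothetical example of the kind of local plugin NRPE executes: a load
  # average check reporting via the Nagios exit-code convention
  # (0=OK, 1=WARNING, 2=CRITICAL).  Thresholds and the nrpe.cfg command name
  # are illustrative assumptions.
  #
  # Example nrpe.cfg wiring (placeholder path and name):
  #   command[check_load_py]=/usr/lib/nagios/plugins/check_load.py 8 16
  import os
  import sys

  def main():
      warn = float(sys.argv[1]) if len(sys.argv) > 1 else 8.0
      crit = float(sys.argv[2]) if len(sys.argv) > 2 else 16.0
      load1, load5, load15 = os.getloadavg()
      msg = "load average %.2f, %.2f, %.2f" % (load1, load5, load15)
      if load1 >= crit:
          print("CRITICAL - " + msg)
          return 2
      if load1 >= warn:
          print("WARNING - " + msg)
          return 1
      print("OK - " + msg)
      return 0

  if __name__ == "__main__":
      sys.exit(main())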

VO Reports

ALICE

News about the LHC startup for 2010

  • For the next two years, the accelerator will run from mid-February to the end of November
  • Pb-Pb collisions expected in November
  • In principle the run stops after 1 fb^-1, but we should plan to run for a full 2 years
    • 182 days of pp and 28 days of HI running in each of 2010 and 2011
    • 7.9 Ms pp + 1.2 Ms HI each year assuming 50% availability
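
The quoted live-time figures follow directly from the day counts and the 50% availability assumption; a quick arithmetic check:

  # Quick check of the quoted live-time figures (50% availability assumed).
  SECONDS_PER_DAY = 86400
  pp_seconds = 182 * SECONDS_PER_DAY * 0.5   # 7,862,400 s  ->  ~7.9 Ms pp
  hi_seconds = 28 * SECONDS_PER_DAY * 0.5    # 1,209,600 s  ->  ~1.2 Ms HI
  print("pp: %.2g s, HI: %.2g s" % (pp_seconds, hi_seconds))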

ATLAS

CMS

  • CMS closed on Feb 5th; beam pipe pump-down started on Feb 6th
  • 24/7 operation started on Feb 8th; magnet ramped to 3T & cosmics data taking with all detectors started on Feb 10th
  • Commissioning10 cosmics data transferring from CERN to RAL for the past 7 days. Average 40 MB/s, max 90 MB/s.

LHCb

OnCall/AoD Cover

  • Primary OnCall:
  • Grid OnCall: Catalin (Mon-Sun)
  • AoD: