RAL Tier1 weekly operations Grid 20091130

From GridPP Wiki

Summary of Previous Week

Developments

  • Alastair
    • Away
  • Andrew
    • Added documentation for adding a VO to LFC and WMS; improvements to YAIM
    • Completed tests on CMS efficiencies/network traffic for different read/cache hints and different numbers of simultaneous running jobs
    • Updated CMS VOBOX SLA
    • Started work on CMS computing model spreadsheet
    • Started preparing plan for KSI2K-HEPSPEC06 migration
  • Catalin
    • Worked on systems audit - 1 (backup, recovery)
    • Worked on systems audit - 2 (90-day log retention policy)
    • t2k -> t2k.org migration on LFC, WMS
    • Frontier checks (Java, Squid)
    • Frontier issues on slow ATLAS 3D DB access
  • Derek
    • Continuing work on quattorising helpdesk frontend
    • Updated lcgce02 for T2K name change
    • Examined gstat2 errors for RAL-LCG2 CEs
  • Matt
  • Richard
    • CASTOR activities: Worked with CK and the database team to script the database setup for the new pre-prod instance; also looking at using custom ncm- components for configuration
    • Built two 64-bit flavours of a top BDII server (for different revisions of gLite)
  • Mayo
    • Worked on New Metrics system: added new features in preparation for November results entry
    • Wrote a report on the feasibility and possible issues of extending the new Metrics system to include GridPP users
    • Worked on automating tape robot spreadsheet project

Operational Issues and Incidents

Description | Start | End | Affected VO(s) | Severity | Status
WMS Jobdirs full | Wed 18 Nov | Thu 19 Nov | All | Medium | Resolved
FroNTier crash | Wed 11 Nov | Fri 20 Nov | ATLAS | Low | Resolved

Plans for Week(s) Ahead

Plans

  • Alastair
    • Add check_world_writable.sh to Nagios
    • Make wiki page for Computing requirements
    • Run tests for user analysis at RAL.
  • Andrew
    • Complete November accounting (& update docs where necessary); apply December fairshares
    • Complete CMS computing model spreadsheet
    • Add checksum checking into PhEDEx for files being migrated to tape
  • Catalin
    • Fix Frontier issue on slow ATLAS 3D DB access
    • Continue working on backup/recovery
    • Install 2nd ALICE SL5 VOBOX
    • Ready to start deployment of LHCb SL5 VOBOX (waiting for "Quattor ready to go")
  • Derek
    • Test SCAS
    • Work on helpdesk end-to-end restore
    • Change control process via RT
  • Matt
  • Richard
    • CASTOR activities: Setting up Quattor templates for SLC 4.8 plus misc updates to pps templates
    • Quattor template(s) for a production CIP server
  • Mayo
    • Collect feedback on recent changes to the new Metrics system
    • Work on possible extension of the system to include GridPP
    • Continue working on automated spreadsheet project
    • Continue working on importing Nagios alarm data into svn

Resource Requests

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)

Requirements and Blocking Issues

Description | Required By | Priority | Status
LHCb SL5 64-bit VOBOX deployment using Quattor | 25 Nov 2009 | Medium | Quattor recipe not yet available (RT#53392)
Hardware for testing LFC/FTS resilience |  | High | DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; request for HW made through RT Fabric queue
Hardware for PPS |  | High | We have made a commitment to test PPS pre-releases, and have no hardware dedicated to this.
Hardware for Grid Services testbed |  | Medium |

OnCall/AoD Cover

  • Primary OnCall:
  • Grid OnCall:
  • AoD: