RAL Tier1 weekly operations Grid 20100419


Latest revision as of 14:51, 19 April 2010

Operational Issues

Description | Start | End | Affected VO(s) | Severity | Status
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; new CREAMCE version available soon. [07-Apr-2010] CMS tests have shown that the WMS patches resolve the problem; still waiting for the patch to be installed on the production WMSs in Italy.
LFC/FTS downtime | 12-Apr-2010 ~11:30 | 12-Apr-2010 ~14:30 | all | high | Failure of one RAC node in the database behind the LFC and FTS services
APEL publishing problem | | | all | low | APEL publishing isn't working

Downtimes

Description Hosts Type Start End Affected VO(s)

Blocking Issues

Description Requested Date Required By Date Priority Status
Hardware for Testbed Medium Required for change validation, load testing, etc. Also for phased rollout (which replaces PPS).

Have initial hardware.

[2010-02-22] More hardware expected by end of March.

HW for SL5 CMS Phedex Vobox 19-Mar-2010 High Required to replace the existing SL4 machine

Developments/Plans

Highlights for Tier-1 Ops Meeting

Highlights for Tier-1 VO Liaison Meeting

Detailed Individual Reports

Alastair

  • Working on ATLAS software server upgrade (testing with Jonathan starting tomorrow)
  • Working on setting up and testing ATLASGROUP disk at RAL.
  • Working with B-Physics Group on group analysis requirements (TAG based analysis).
  • Looking into ATLAS PFC (Pool File Catalogue) problems.

Andrew

  • FTS monitor 1.3 now publicly available on lcgwww (modified code so that DNs are not displayed)
  • Attended GridPP 24 on Wed
  • CMS data ops
    • Running Spring10 MC redigi/rereco workflows at FNAL & CNAF
  • At Bristol on Wed 21st for UK CMS computing F2F
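
The DN-hiding change mentioned for the FTS monitor could look roughly like the sketch below. This is not the monitor's actual code: the row layout, the `user_dn` field name and the placeholder text are all assumptions made for illustration.

```python
def mask_dn_fields(rows, dn_keys=("user_dn",), placeholder="[hidden]"):
    """Return copies of the rows with any DN-carrying field replaced.

    `rows` is assumed to be a list of dicts, one per transfer record.
    `dn_keys` names the fields that hold certificate DNs (hypothetical
    field name; the real monitor's schema is not known here).
    """
    masked = []
    for row in rows:
        # Build a new dict so the original records are left untouched.
        masked.append(
            {key: (placeholder if key in dn_keys else value)
             for key, value in row.items()}
        )
    return masked


# Illustrative record only; not real monitoring data.
example_rows = [
    {"user_dn": "/C=UK/O=eScience/CN=example user", "state": "Done"},
]
public_rows = mask_dn_fields(example_rows)
```

Masking at the point where records are prepared for display, rather than in the page template, keeps the DN out of anything rendered publicly.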

Catalin

  • GridPP24 (last Wed-Thu) [done]
  • GOCDB tidying-up (VOBOXes) [done]
  • Install and configure squid on LHCb VOBOX [ongoing]
  • Work on various Nagios checks on grid services hosts (NOCALLOUT checks) [ongoing]
  • ALICE VOBOX updates (AliEn, gLite, OS) [ongoing]

Derek

  • Consulted about effects due to upcoming uid change for pilot roles [Done]
  • Wrote change control requests for CE reconfiguration [Done]
  • Testing config changes to CE for mapping updates for glexec [Done]
  • Testing queue addition to batch system for ATLAS [Done]
  • At GridPP + Deployment Board (Wed-Fri) [Done]
  • Evaluating cloud technology for Grid Services testbed use
  • Documenting Grid Services testbed

Matt

  • Look at draft User Board allocations; update CPU/disk capacity profiles
  • Write up production plans [Done]

Richard

  • 2 days at GridPP meeting (Wed-Thu)
  • Applied latest OpenLDAP RPMs to quattor-built BDII -- currently running timings on this machine to check behaviour
  • Working on a proposal on intra-/inter-team communication to meet an action from the team awayday
  • Reviewing G/S process documentation
  • Further Nagios items from the to-do list (https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/NagiosTasksToDo)
  • CASTOR items:
    • Reviewed CASTOR Server Recovery document with Cheney
    • P/P Stress testing found Oracle error "ORA-00257: archiver error. Connect internal only, until freed" (caused by lack of disk space on Oracle server)
    • Continuing p/p stress testing
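
Since the ORA-00257 error above was traced to lack of disk space on the Oracle server, a free-space check in the style of a Nagios plugin might catch the condition earlier. The following is a minimal sketch only: the thresholds and message format are assumptions, and the path to monitor would have to be chosen for the actual server. The exit codes follow the standard Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_free_space(free_bytes, total_bytes, warn_pct=20.0, crit_pct=10.0):
    """Map free/total bytes to a (exit_code, message) pair.

    warn_pct and crit_pct are illustrative thresholds, not site policy.
    """
    if total_bytes <= 0:
        return UNKNOWN, "DISK UNKNOWN - total size not available"
    free_pct = 100.0 * free_bytes / total_bytes
    if free_pct < crit_pct:
        return CRITICAL, f"DISK CRITICAL - {free_pct:.1f}% free"
    if free_pct < warn_pct:
        return WARNING, f"DISK WARNING - {free_pct:.1f}% free"
    return OK, f"DISK OK - {free_pct:.1f}% free"


def check_path(path, warn_pct=20.0, crit_pct=10.0):
    """Check the filesystem holding `path` (e.g. an archive log area)."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    return classify_free_space(usage.free, usage.total, warn_pct, crit_pct)
```

A wrapper script would print the message returned by `check_path` and exit with the returned code, so that it can be run through NRPE like any other plugin.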

Mayo

  • Implement feedback into TSBN web interface
  • Set up the scripts that update the TSBN interface to run as scheduled jobs on a Windows machine
  • Writing and configuring Nagios NRPE plugins [Done]
  • Certificate viewer for NGS cert wizard
  • Write PDU power controller query script [Done]
  • Write a script to turn PDU ports off

VO Reports

ALICE

VOBOX sites updated to AliEn v2.18

RAL has been reconfigured to use both available VOBOXes in failover mode (no more use of lcg-CE, only CREAM-CE)

ATLAS

  • Not heard anything about LHC technical stop. (Maybe I missed something/it will be announced tomorrow)
  • LHC continuing to slowly increase luminosity.
  • ATLAS Software and Computing week (more chaotic than usual due to the volcano).
  • Fast 'fast re-processing' could start this week.
  • ATLAS announced that it would like all disk space in production by June 1st.

CMS

  • Spring10 reprocessing of Summer09 MC samples almost complete (all Tier-1s involved)
  • This week RAL has the number 1 position in the CMS site readiness ranking

LHCb

OnCall/AoD Cover

  • Primary OnCall:
  • Grid OnCall: Derek (Mon-Sun excl Wed), Catalin (Wed)
  • AoD: