RAL Tier1 weekly operations Grid 20100426

From GridPP Wiki

Operational Issues

Description Start End Affected VO(s) Severity Status
Job status monitoring from CREAMCE 2-Feb-2010 CMS medium [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon. [07-Apr-2010] CMS tests have shown that the WMS patches resolve the problem; still waiting for the patches to be installed on the production WMSs in Italy.
APEL publishing problem all low APEL publishing isn't working

Downtimes

Description Hosts Type Start End Affected VO(s)

Blocking Issues

Description Requested Date Required By Date Priority Status
Hardware for Testbed Medium Required for change validation, load testing, etc. Also for phased rollout (which replaces PPS).

Have initial hardware.

[2010-02-22] More hardware expected by end of March.

HW for SL5 CMS PhEDEx Vobox 19-Mar-2010 High Required to replace the existing SL4 machine [2010-04-21] PhEDEx is no longer supported on SL4

Developments/Plans

Highlights for Tier-1 Ops Meeting

Highlights for Tier-1 VO Liaison Meeting

Detailed Individual Reports

Alastair

  • Working on ATLAS software server upgrade (testing with Jonathan starting tomorrow)
  • Working on setting up and testing ATLASGROUP disk at RAL.
  • Working with B-Physics Group on group analysis requirements (TAG based analysis).
  • Looking into ATLAS PFC (Pool File Catalogue) problems.

Andrew

  • Fixes to CMS tape pool and to high-time-resolution CPU efficiency Ganglia plots [Done]
  • Changed PhEDEx custom CASTOR stager agent to generic stager agent; limited staging to 150 files per 10 minutes. [Done]
  • CMS data ops
    • Dealing with last processing & merge jobs from remaining workflows
    • Re-started backfill at CNAF for testing changes to their storage system
  • Attended CMS UK computing meeting in Bristol on Wednesday
  • Writing script to compare checksums of random files from specific files in CASTOR with PhEDEx [Ongoing]
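
A minimal sketch of the kind of spot check described in the last item, not the actual script. It assumes the checksums have already been fetched into two plain dictionaries (file name to hex checksum), one standing in for CASTOR and one for PhEDEx; how they are fetched is not shown. The function names and the use of Adler-32 are assumptions for illustration only.

```python
import random
import zlib


def adler32_hex(data: bytes) -> str:
    """Return the Adler-32 checksum of data as 8 lowercase hex digits."""
    return format(zlib.adler32(data) & 0xFFFFFFFF, "08x")


def compare_sample(storage, catalogue, sample_size, seed=None):
    """Compare checksums for a random sample of files.

    storage   -- dict of file name -> checksum as recorded by the storage system
    catalogue -- dict of file name -> checksum as recorded by the catalogue
    Returns a list of (name, storage_checksum, catalogue_checksum) tuples for
    sampled files that are missing from the catalogue or whose checksums differ.
    """
    rng = random.Random(seed)
    names = sorted(storage)  # sorted so a fixed seed gives a repeatable sample
    sample = rng.sample(names, min(sample_size, len(names)))
    mismatches = []
    for name in sample:
        actual = storage[name]
        expected = catalogue.get(name)
        if expected is None or expected.lower() != actual.lower():
            mismatches.append((name, actual, expected))
    return mismatches
```

Calling `compare_sample(storage, catalogue, 100)` would check up to 100 randomly chosen files; an empty list means no mismatch was found in that sample, not that every file agrees.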

Catalin

  • ALICE VOBOXes gLite updates [done]
  • various OS updates [done]
  • Self Service Tools training [done]
  • APR [ongoing]
  • ATLAS, ALICE phone calls
  • install and configure squid on LHCb VOBOX [ongoing]

Derek

  • Investigating scheduler avoidance of new WNs [Ongoing]
  • Evaluating cloud technology for Grid Services testbed use [Ongoing]
  • APR [Ongoing]
  • SSC Training [Done]
  • Requested renewal of 3 CE certificates

Matt

  • Arrange meeting to discuss 09 disk deployment [Done]
  • Add FTS check for jobs stuck in Preparing state [Done]
  • Review batch system change controls [Done]
  • Look at draft User Board allocations; update CPU/disk capacity profiles [ongoing]

Richard

  • 1/2 days Oracle/SSC Training (Thu)
  • Drafted a Change Control request to move some of the BDII servers to the Atlas building for greater resilience
  • Working on proposal on intra/inter-team communication to meet an action from the team awayday
  • Reviewing G/S process documentation
  • Further Nagios items from the to-do list (https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/NagiosTasksToDo)
  • CASTOR items:
    • Continuing p/p stress testing

Mayo

  • Implement feedback into TSBN web interface
  • Set up scripts that update TSBN interface to run as scheduled jobs on a Windows machine
  • Writing and configuring Nagios nrpe plugins [Done]
  • Certificate viewer for NGS cert wizard
  • Write PDU power controller query script [Done]
  • Write a script to turn PDU ports off

VO Reports

ALICE

Would like CREAM-CE v1.6 to be installed ASAP.

ATLAS

  • Not heard anything about LHC technical stop. (Maybe I missed something/it will be announced tomorrow)
  • LHC continuing to slowly increase luminosity.
  • ATLAS Software and computing week. (More chaotic than usual due to the volcano)
  • Fast 'fast re-processing' could start this week.
  • ATLAS announced that it would like all disk space in production by June 1st.

CMS

  • PhEDEx on SL4 is no longer supported. 3.3.1 has just been released for SL5 only.

LHCb

OnCall/AoD Cover

  • Primary OnCall:
  • Grid OnCall: Derek
  • AoD: