RAL Tier1 weekly operations Grid 20100503

From GridPP Wiki

Operational Issues

Description: Job status monitoring from CREAMCE
Start: 2-Feb-2010
End:
Affected VO(s): CMS
Severity: medium
Status:
[10-Feb-2010] WMS patch available soon; CREAMCE new version available soon
[07-Apr-2010] CMS tests have shown that WMS patches resolve the problem; still waiting for patch to be installed on the production WMSs in Italy

Downtimes

Description Hosts Type Start End Affected VO(s)

Blocking Issues

Description Requested Date Required By Date Priority Status
Hardware for Testbed Medium Required for change validation, load testing, etc. Also for phased rollout (which replaces PPS).

Have initial hardware.

[2010-02-22] More hardware expected by end of March.

Developments/Plans

Highlights for Tier-1 Ops Meeting

Highlights for Tier-1 VO Liaison Meeting

Detailed Individual Reports

Alastair

  • Working on ATLAS software server upgrade (testing with Jonathan starting tomorrow)
  • Working on setting up and testing ATLASGROUP disk at RAL.
  • Working with B-Physics Group on group analysis requirements (TAG based analysis).
  • Looking into ATLAS PFC (Pool File Catalogue) problems.

Andrew

  • APR
  • Updated PBS gmetric scripts to include pilot accounts [Done]
  • Added DNs back into FTS monitor, but they are now anonymized [Done]
  • CMS data ops
    • More MC reprocessing at FNAL; pile-up reprocessing at CNAF
  • Installing & setting up PhEDEx on new VOBOX
  • Script to check checksums in CASTOR of random files from specific datasets [Done]

Catalin

  • ALICE VOBOXes gLite updates [done]
  • various OS updates [done]
  • Self Service Tools training [done]
  • APR [ongoing]
  • ATLAS, ALICE phone calls
  • install and configure squid on LHCb VOBOX [ongoing]

Derek

  • Investigating scheduler avoidance of new WNs [Ongoing]
  • Evaluating cloud technology for Grid Services testbed use [Ongoing]
  • APR [Ongoing]
  • SSC Training [Done]
  • Requested renewal of 3 CE certificates

Matt

  • APRs
  • Distribute notes for deployment of 09 disk capacity meeting [Done]
  • Catch up on change controls (batch related) [Done]
  • Add Maui reservations to test CASTOR 32-bit libraries [Done]
  • Fix CPU capacity plots (Viglen09) [Done]
  • Look at draft User Board allocations [Done]
  • Update short-range CPU/disk capacity profiles [Done]

Richard

  • 1/2 days Oracle/SSC Training (Thu)
  • Drafted a Change Control request to move some of the BDII servers to the Atlas building for greater resilience
  • Working on a proposal on intra/inter-team communication to meet an action from the team awayday
  • Reviewing G/S process documentation
  • Further Nagios items from the to-do list (https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/NagiosTasksToDo)
  • CASTOR items:
    • Continuing p/p stress testing

Mayo

  • Implement feedback into TSBN web interface
  • Set up scripts that update the TSBN interface to run as scheduled jobs on a Windows machine
  • Writing and configuring Nagios nrpe plugins [Done]
  • Certificate viewer for NGS cert wizard
  • Write PDU power controller query script [Done]
  • Write a script to turn PDU ports off

VO Reports

ALICE

Would like CREAM-CE v1.6 to be installed ASAP.

ATLAS

  • Not heard anything about LHC technical stop. (Maybe I missed something/it will be announced tomorrow)
  • LHC continuing to slowly increase luminosity.
  • ATLAS Software and Computing week. (More chaotic than usual due to the volcano.)
  • Fast 'fast re-processing' could start this week.
  • ATLAS announced that it would like all disk space in production by June 1st.

CMS

  • PhEDEx on SL4 is no longer supported. 3.3.1 has just been released for SL5 only.

LHCb

OnCall/AoD Cover

  • Primary OnCall: Catalin
  • Grid OnCall:
  • AoD: