RAL Tier1 weekly operations Grid 20100621

From GridPP Wiki

Operational Issues

Description | Start | End | Affected VO(s) | Severity | Status
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon. [07-Apr-2010] CMS tests have shown that WMS patches resolve the problem; still waiting for patch to be installed on the production WMSs in Italy.

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)

Blocking Issues

Description | Requested Date | Required By Date | Priority | Status
HW needed to test Dataguard technology for LFC/FTS | 19 May 2010 | 15 June 2010 | Medium | [24-05-2010] HW available; needs to be deployed by Fabric and then handed over to Dataservices

Developments/Plans

Highlights for Tier-1 Ops Meeting

  • FTS upgraded: since the upgrade, no blocking transfers have been seen, and no resulting high load on the SRMs.
  • Progressing: SL4 batch shutdown, deployment of second ALICE CREAM CE.
  • CPU accounting pages switched from KSI2K to HEP-SPEC06.

Highlights for Tier-1 VO Liaison Meeting

  • FTS upgraded: since the upgrade, no blocking transfers have been seen, and no resulting high load on the SRMs.
  • Progressing: SL4 batch shutdown, deployment of second ALICE CREAM CE.
  • CPU accounting pages switched from KSI2K to HEP-SPEC06.

Detailed Individual Reports

Alastair

  • Working on ATLAS software server on /afs
  • Group production work at RAL.
  • Working to improve pbsjobs database to allow easier monitoring of production work.
  • Work on ATLAS Frontier service, monitoring and backup.

Andrew

  • Accounting change from KSI2K to HEP-SPEC06
    • Ganglia capacity monitoring and accounting pages (eff-stats.pl) updated [Done]
    • UB schedule scripts updated [Done]
  • FTS
    • Downgraded lcgfts01 from 2.2.4 to 2.2.3 [Done]
    • Added MICE to test endpoint [Done]
    • Updated services.xml & added file limits for MICE to production endpoint [Done]
    • Various file limit & timeout changes [Done]
  • Updated RGMA ACL [Done]
  • CMS data ops
    • Running rereco & skimming at RAL, IN2P3, FNAL
    • Running MC rereco at RAL, CNAF, IN2P3
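
The KSI2K to HEP-SPEC06 accounting change noted above comes down to rescaling every capacity and usage figure by a fixed factor. The sketch below is illustrative only: it assumes the commonly quoted WLCG conversion of 4 HEP-SPEC06 per kSI2K, and the function names are hypothetical. The factor and code actually used in eff-stats.pl and the UB schedule scripts are not given on this page.

```python
# Assumption: 4 HEP-SPEC06 per kSI2K (the commonly quoted WLCG factor).
# The value used by the real accounting scripts is not stated on this page.
HS06_PER_KSI2K = 4.0


def ksi2k_to_hs06(ksi2k):
    """Convert a capacity figure from kSI2K to HEP-SPEC06."""
    return ksi2k * HS06_PER_KSI2K


def ksi2k_hours_to_hs06_hours(ksi2k_hours):
    """Convert accumulated usage from kSI2K-hours to HS06-hours."""
    return ksi2k_hours * HS06_PER_KSI2K


def format_capacity(ksi2k):
    """Render a capacity value in the new unit for an accounting page."""
    return f"{ksi2k_to_hs06(ksi2k):.1f} HS06"
```

Keeping the factor in one named constant means historical kSI2K figures can be redisplayed consistently if the agreed factor is ever revised.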

Catalin

  • test LFC deployment using quattor [ongoing]
  • configure squid on LHCb VOBOX [ongoing]
  • job plans into Oracle [ongoing]

Derek

  • Testbed Strategy [ongoing]
  • E-mailing experiment contacts about SL4 shutdown [done]
  • Setting up NGS UEE on worker nodes
  • Change control for deploying lcgce03 [ongoing]
  • Testing glexec update
  • Configuring pool accounts in quattor [ongoing]
  • Fixed corrupt ICE database on lcgwms02

Matt

  • Produce FTS training material
  • Talk on ongoing SVN work for OnCall meeting
  • Upgrade FTS to 2.2.4 [Done]
  • Change Control workflow [Done]

Richard

  • Further work on the "team status page" being developed as an action from team awayday
  • Reviewing G/S process documentation
  • Developed a tool to help with automating the wiki page on grid middleware versions
  • Adding a Nagios check to look for the error that gave rise to the weekend's BDII problems
  • CASTOR items:
    • Carried out latest phase in pre-prod upgrade
  • Next Week
    • Finishing off 2.1.7 metrics documentation
    • Run functional tests on pre-prod
    • Run stress tests on pre-prod
  • 1 day Tier1 AwayDay
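
A Nagios check of the kind mentioned above (looking for a known error string) can be sketched as a small plugin that scans a log and reports through the standard Nagios exit codes. This is a minimal sketch, not the check that was deployed: the log path and error pattern are passed in as arguments because the actual BDII error is not named on this page, and the function names are hypothetical.

```python
import re
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_log(lines, pattern):
    """Return (status, message) after scanning lines for a regex pattern."""
    regex = re.compile(pattern)
    matches = [line.rstrip("\n") for line in lines if regex.search(line)]
    if matches:
        message = (
            f"CRITICAL - {len(matches)} matching line(s); last: {matches[-1]}"
        )
        return CRITICAL, message
    return OK, "OK - no matching lines found"


def main(argv):
    """Entry point: argv is [program, logfile, pattern]."""
    if len(argv) != 3:
        print("UNKNOWN - usage: check_log_error.py LOGFILE PATTERN")
        return UNKNOWN
    try:
        with open(argv[1], errors="replace") as handle:
            status, message = check_log(handle, argv[2])
    except OSError as exc:
        print(f"UNKNOWN - cannot read {argv[1]}: {exc}")
        return UNKNOWN
    print(message)
    return status

# When installed as a plugin, finish the script with:
#     sys.exit(main(sys.argv))
```

Nagios reads the first line of output as the status text and the process exit code as the state, so an unreadable log file is reported as UNKNOWN rather than as a false OK.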

Mayo

  • Implement David Meredith's feedback into Certificate viewer [Done]
  • integrate certificate viewer module with existing NGS certificate wizard code
  • Write script to control ports on multiple PDUs
  • Create Handover Documentation for finished projects [ongoing]
  • Enter job plan into ssc

VO Reports

ALICE

  • waiting for CREAM-CE 1.6 deployment at RAL
  • cannot roll out new xrootd version (20100510-1509_dbg) on Castor 2.1.7

ATLAS

CMS

  • Major disruption to data and MC reprocessing at all T1s due to central WMS problems (CMS normally only use CNAF WMSs for production jobs). Started to use some CERN and RAL WMSs in addition to CNAF.

LHCb

OnCall/AoD Cover

  • Primary OnCall:
  • Grid OnCall: Derek (Mon-Sun)
  • AoD: