RAL Tier1 weekly operations Grid 20100628

From GridPP Wiki

Operational Issues

Description | Start | End | Affected VO(s) | Severity | Status
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon. [07-Apr-2010] CMS tests have shown that WMS patches resolve the problem; still waiting for patch to be installed on the production WMSs in Italy.

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)

Blocking Issues

Description | Requested Date | Required By Date | Priority | Status
HW needed to test Dataguard technology for LFC/FTS | 19 May 2010 | 15 June 2010 | Medium | [24-05-2010] HW available; needs to be deployed by Fabric and then handed over to Dataservices

Developments/Plans

Highlights for Tier-1 Ops Meeting

  • Testing updates to WN/glexec (to meet baseline requirements).
  • Finalising Change Control for second ALICE CE (CE03).
  • Ongoing work to finalise closure of the SL4 batch service.
  • Looking at SARA-RAL transfer problems.

Highlights for Tier-1 VO Liaison Meeting

  • Lots of errors on SARA-RAL channel due to missing source files.
  • SL4 decommissioning deadline approaching (August 1).
  • Change approved to roll out second CREAM CE for ALICE; needs to be scheduled.
  • Testing to further investigate CMS issues with WMS bulk job submission.

Detailed Individual Reports

Alastair

  • Working on ATLAS software server on /afs [ongoing]
  • Written script to identify unavailable files when a disk server is taken out of production. [testing]
  • Looking into slow LHCb transfers between SARA and RAL. (fix with James T now)
  • Working to improve pbsjobs database to allow easier monitoring of production work.
  • Working on ATLAS Frontier service, monitoring and backup.

Andrew

  • e-Science "away" day
  • CMS data ops
    • Running data rereco & skimming at FNAL, PIC, KIT
    • Running MC rereco at RAL, CNAF, IN2P3
  • Next week: adjust TFC for cmsTemp & do some testing

Catalin

  • various WMS issues [ongoing]
  • test LFC deployment using quattor [ongoing]
  • LFC talk for NGS
  • Frontier monitoring
  • mysql pbsjobs DB issue

Derek

  • Testbed Strategy [ongoing]
  • E-mailing experiment contacts about SL4 shutdown [done]
  • Setting up NGS UEE on worker nodes
  • Change control for deploying lcgce03 [done]
  • Testing glexec update [ongoing]
  • Configuring pool accounts in quattor [ongoing]
  • Implementing updated change control process on dev helpdesk
  • Added plugin to sync between blog and twitter

Matt

  • Produce FTS training material
  • Talk on ongoing SVN work for OnCall meeting

Richard

  • Planning updates to RAL top-level BDIIs
  • Further work on the "team status page" being developed as an action from team awayday
  • Reviewing G/S process documentation
  • Developed a tool to help with automating the wiki page on grid middleware versions
  • Wrote a gmetric script to monitor the # of entries in RAL BDII servers
  • Writing a Nagios plugin to check the "deltas" in # of entries in RAL BDII servers
  • CASTOR items:
    • Carried out latest phase in pre-prod upgrade
    • Ran 2.1.8 functional tests on latest pre-prod s/w
  • Next Week
    • Finishing off 2.1.7 metrics documentation
    • Run functional tests on pre-prod
    • Run stress tests on pre-prod
  • 2 days A/L

Mayo

  • Implement David Meredith's feedback into Certificate viewer [Done]
  • Integrate certificate viewer module with existing NGS certificate wizard code
  • Write script to control ports on multiple PDUs
  • Create handover documentation for finished projects [ongoing]
  • Enter job plan into ssc

VO Reports

ALICE

  • waiting for CREAM-CE 1.6 deployment at RAL
  • cannot roll out new xrootd version (20100510-1509_dbg) on Castor 2.1.7

ATLAS

CMS

  • Splitting cmsFarmRead into cmsFarmRead & a D1T0 service class called cmsTemp. Everything in /store/unmerged will go into cmsTemp rather than cmsFarmRead, as these are only temporary files that don't need to go to tape.
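
The routing rule described above can be sketched as a small lookup. This is only an illustration of the intended split, not the actual CASTOR or TFC configuration: the service class names and the /store/unmerged path come from this report, while the function name and the prefix constant are hypothetical.

```python
# Illustrative sketch only. cmsTemp, cmsFarmRead and /store/unmerged are taken
# from the report; service_class_for and UNMERGED_PREFIX are hypothetical names.
UNMERGED_PREFIX = "/store/unmerged/"


def service_class_for(lfn: str) -> str:
    """Return the service class a CMS logical file name would be sent to."""
    # Append a trailing slash so that "/store/unmerged" itself matches,
    # while a sibling such as "/store/unmergedfoo" does not.
    path = lfn.rstrip("/") + "/"
    if path.startswith(UNMERGED_PREFIX):
        return "cmsTemp"  # D1T0: temporary files, disk only, no tape copy
    return "cmsFarmRead"  # everything else stays in the existing class


if __name__ == "__main__":
    for name in ("/store/unmerged/run1/file.root", "/store/data/run1/file.root"):
        print(name, "->", service_class_for(name))
```

Matching on the path prefix with a trailing slash keeps unrelated directories that merely share the "unmerged" stem out of cmsTemp.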

LHCb

OnCall/AoD Cover

  • Primary OnCall: Catalin (Mon-Sun)
  • Grid OnCall:
  • AoD: