RAL Tier1 weekly operations Grid 20100705

From GridPP Wiki

Operational Issues

Description | Start | End | Affected VO(s) | Severity | Status
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon. [07-Apr-2010] CMS tests have shown that WMS patches resolve the problem; still waiting for patch to be installed on the production WMSs in Italy

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)

Blocking Issues

Description | Requested Date | Required By Date | Priority | Status
HW needed to test Dataguard technology for LFC/FTS | 19 May 2010 | 15 June 2010 | Medium | [24-05-2010] HW available; needs to be deployed by Fabric and then handed over to Dataservices
#61658: HW request for CMS Squid VOBOX | 30 June 2010 | | Medium | [30-06-2010] Request made

Developments/Plans

Highlights for Tier-1 Ops Meeting

  • CE03 now deployed
  • Ongoing work to finalise closure of the SL4 batch service
  • Working on failover CMS PhEDEx VOBOX
  • Grid Team thin on the ground this week (A/L & WLCG workshop)

Highlights for Tier-1 VO Liaison Meeting

Detailed Individual Reports

Alastair

  • Working on ATLAS software server on /afs [ongoing]
  • Written script to identify unavailable files when a disk server is taken out of production [testing]
  • Looking into slow LHCb transfers between SARA and RAL (fix with James T now)
  • Working to improve pbsjobs database to allow easier monitoring of production work.
  • Working on ATLAS Frontier service, monitoring and backup.

Andrew

  • Adjustments to TFC & testing of new service class (cmsTemp) using a backfill workflow [Done]
  • Put in H/W request for Fabric team for new CMS VOBOX for Squid / PhEDEx failover [Done]
  • Writing call-out documentation for restarting PhEDEx on another VOBOX [Ongoing]
  • Updated FTS services.xml; added new domain to RGMA ACL; updated Maui fairshares [Done]
  • Accounting
    • June accounting [Ongoing]
    • Investigated CESGA/PBS differences due to dates used in queries [Done]
  • CMS data ops
    • Accounting for previous rereco/skims
    • Data rereco at KIT
    • MC rereco at RAL & CNAF

Catalin

  • Test LFC deployment using Quattor [ongoing]
  • LFC talk for NGS [done]
  • Frontier monitoring [ongoing]
  • ALICE CASTOR+xrootd issues [ongoing]

Derek

  • Testing glexec update [ongoing]
  • Setting up NGS UEE on worker nodes
  • Deployed lcgce03 [done]
  • Implementing the updated change control process on dev helpdesk
  • Attending WLCG Workshop Wed-Fri

Matt

Richard

  • Planning updates to RAL top-level BDIIs
  • Further work on the "team status page" being developed as an action from team awayday
  • Reviewing G/S process documentation
  • Developed a tool to help with automating the wiki page on grid middleware versions
  • Writing a Nagios plugin to check the "deltas" in the number of entries in RAL BDII servers
  • CASTOR items:
    • Carried out latest phase in pre-prod upgrade
    • Re-ran 2.1.8 functional tests on latest pre-prod s/w after latest re-config
    • Started running stress tests
  • Next Week
    • Finishing off 2.1.7 metrics documentation
    • Continuing to run stress tests on pre-prod
  • 4.5 days A/L
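
The BDII "deltas" check mentioned above could be sketched roughly as follows. This is only an illustrative sketch, not the plugin being written: the function name check_delta, the percentage thresholds, and the idea of comparing against a single previously stored count are all assumptions. Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios plugin convention. How the entry counts are obtained from the BDII servers is deliberately left out.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_delta(previous, current, warn_pct=10.0, crit_pct=25.0):
    """Compare two entry counts and return (exit_code, message).

    previous / current are entry counts from consecutive checks.
    warn_pct / crit_pct are hypothetical thresholds (assumptions),
    expressed as a percentage change relative to the previous count.
    """
    if previous is None or previous <= 0:
        return UNKNOWN, "UNKNOWN - no previous entry count to compare against"

    delta = current - previous
    pct = abs(delta) * 100.0 / previous
    # Text after "|" is Nagios performance data.
    detail = "entries=%d delta=%+d (%.1f%%) | entries=%d" % (
        current, delta, pct, current)

    if pct >= crit_pct:
        return CRITICAL, "CRITICAL - " + detail
    if pct >= warn_pct:
        return WARNING, "WARNING - " + detail
    return OK, "OK - " + detail
```

A real plugin would print the message and exit with the returned code, for example by calling sys.exit(code) after print(message).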

Mayo

  • Implement David Meredith's feedback into Certificate viewer [Done]
  • Integrate certificate viewer module with existing NGS certificate wizard code [Done]
  • Create Handover Documentation for finished projects [ongoing]
  • Enter job plan into ssc [Done]
  • Create Certificate Query class for David Meredith [Done]

VO Reports

ALICE

  • Waiting for CREAM-CE 1.6 deployment at RAL
  • Cannot roll out new xrootd version (20100510-1509_dbg) on CASTOR 2.1.7

ATLAS

CMS

  • Data loss: 877 files were lost from gdss67

LHCb

OnCall/AoD Cover

  • Primary OnCall:
  • Grid OnCall: Derek
  • AoD: