RAL Tier1 weekly operations Grid 20100906

From GridPP Wiki
Revision as of 15:47, 6 September 2010 by Alastair dewhurst

Operational Issues

Description: Job status monitoring from CREAM CE
Start: 2-Feb-2010
End:
Affected VO(s): CMS
Severity: Medium
Status:
  • [10-Feb-2010] WMS patch available soon; new CREAM CE version available soon
  • [07-Apr-2010] CMS tests have shown that the WMS patches resolve the problem; still waiting for the patch to be installed on the production WMSs in Italy
  • [13-Jul-2010] CNAF WMSs have been updated; testing using backfill is in progress
  • [19-Jul-2010] So far everything looks good

Downtimes

Description: gLite-WMS update + maintenance
Hosts: lcgwms01
Type:
Start: Wed 1 Sep 10:00
End: 8 Sep 11:00
Affected VO(s): LHC

Blocking Issues

Description: HW needed to test Dataguard technology for LFC/FTS
Requested Date: 19 May 2010
Required By Date: 15 June 2010
Priority: Medium
Status: [24-05-2010] HW available; needs to be deployed by Fabric and then handed over to Dataservices

Description: LFC and FTS to be moved into the UPS room
Requested Date: 02 Sep 2010
Required By Date: 15 Sep 2010
Priority: Medium

Developments/Plans

Highlights for Tier-1 Ops Meeting

Highlights for Tier-1 VO Liaison Meeting

Detailed Individual Reports

Alastair

  • Working on the ATLAS software server, testing CVMFS [waiting for the farm to empty before running the test]
  • Writing a script to graph transfer times for FTS transfers (see the sketch after this list)
  • Created a TWiki page for Brian's disk draining/checking scripts.
  • Working on a HammerCloud test of CASTOR 2.1.9 [analysis queue set up]
  • Looking into gdss547 and gdss554 transfer problems to non-UK sites.
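
A minimal sketch of the kind of graphing script mentioned above, assuming FTS transfer durations have already been dumped to a CSV file; the file name, column layout and use of matplotlib are illustrative assumptions, not the actual RAL tooling:

  #!/usr/bin/env python
  # Hypothetical sketch: histogram FTS transfer durations from a CSV dump.
  # Assumed input "fts_transfers.csv" with one transfer per line:
  #   timestamp, source SURL, destination SURL, duration in seconds
  import csv
  import matplotlib.pyplot as plt

  durations = []
  with open("fts_transfers.csv") as f:
      for row in csv.reader(f):
          if len(row) < 4:
              continue                  # skip malformed lines
          try:
              durations.append(float(row[3]))
          except ValueError:
              continue                  # skip the header or bad values

  plt.hist(durations, bins=50)
  plt.xlabel("Transfer time (s)")
  plt.ylabel("Number of transfers")
  plt.title("FTS transfer durations")
  plt.savefig("fts_transfer_times.png")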

Andrew

  • CMS CASTOR 2.1.9 testing
    • Preparing for transfers from RAL to Imperial (1.8 TB of loadtest data transferred from CERN to RAL) for cmsWanOut stress testing [Done]
    • Stress-testing on cmsFarmRead (with & without lazy-download) [Ongoing]
  • Investigating CMS software server file access problems [Done]
  • 2009-2010 VO support survey [Ongoing]
  • August accounting [Done]
  • CMS data ops
    • Running MC redigi/rereco at CNAF [Ongoing]
    • Running data rereco preproduction at KIT, FNAL, RAL [Ongoing]

Catalin

  • Add new front-ends to the non-LHC LFC alias
  • gLite updates on lcgwms01 (LHC WMS) [ongoing]
  • ATLAS Frontier monitoring [ongoing]
  • Test SL5 LFC Quattor profiles [ongoing]
  • Work on improving Ganglia monitoring for Grid Services (see the sketch after this list)
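
A minimal sketch of one way a custom Grid Services check could be fed into Ganglia, using the standard gmetric command-line tool to hand a value to the local gmond; the metric name and the check itself are illustrative assumptions, not the monitoring actually deployed:

  #!/usr/bin/env python
  # Hypothetical sketch: publish a custom metric to Ganglia via gmetric.
  import subprocess

  def count_fts_agent_processes():
      # Illustrative check: count running processes whose command line mentions "fts".
      result = subprocess.run(["pgrep", "-c", "-f", "fts"],
                              capture_output=True, text=True)
      try:
          return int(result.stdout.strip())
      except ValueError:
          return 0

  # gmetric hands a one-shot metric to the local gmond, which Ganglia then graphs.
  subprocess.run(["gmetric",
                  "--name", "fts_agent_processes",   # assumed metric name
                  "--value", str(count_fts_agent_processes()),
                  "--type", "uint32",
                  "--units", "processes"],
                 check=False)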

Derek

  • CREAM CE quattor profile [ongoing]
  • Investigating CREAM CE instability [ongoing]
  • At the GridPP meeting Mon-Thu; on annual leave Friday and the following week

Matt

  • Change Controls for FTS FE updates.
  • Quattorisation of FTS Agents host.

Richard

  • Preparation for next week's roll-out of Quattorised site-level BDIIs (a verification sketch follows this list)
  • Tracking the mystery BDII problem reported by Chris Walker at QMUL
  • Working on the "team status page" being developed as an action from the team away day [ongoing]
  • Reviewing G/S process documentation [ongoing]
  • CASTOR items:
    • Helping Cheney with Quattor profile for "combo" CASTOR headnodes
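
A minimal sketch of how a newly rolled-out site-level BDII could be spot-checked (relating to the BDII items above): query it over LDAP and count the GlueService entries it publishes. The host name is a placeholder and the ldapsearch-based approach is an assumption, not the actual roll-out procedure:

  #!/usr/bin/env python
  # Hypothetical sketch: spot-check what a site-level BDII publishes.
  import subprocess

  BDII_HOST = "site-bdii.example.ac.uk"      # placeholder, not the real RAL host
  SITE_NAME = "RAL-LCG2"                     # assumed GOCDB site name for the LDAP base

  result = subprocess.run(
      ["ldapsearch", "-x", "-LLL",
       "-H", "ldap://%s:2170" % BDII_HOST,   # 2170 is the usual site-BDII port
       "-b", "mds-vo-name=%s,o=grid" % SITE_NAME,
       "(objectClass=GlueService)", "GlueServiceType"],
      capture_output=True, text=True)

  services = [line.split(":", 1)[1].strip()
              for line in result.stdout.splitlines()
              if line.startswith("GlueServiceType:")]

  print("%s publishes %d GlueService entries" % (BDII_HOST, len(services)))
  for s in sorted(set(services)):
      print("  %-20s %d" % (s, services.count(s)))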

VO Reports

ALICE

ATLAS

CMS

  • RAL was in an ERROR state for CMS on 3rd & 5th September due to CE/batch-system problems affecting SAM tests, the JobRobot and production jobs.
  • Deleted ~200 TB of old MC data.
  • Recently around 400-500 CMS jobs per day have been killed for exceeding the 2 GB memory limit. These are WMAgent test jobs with a known bad config file (WMAgent is the replacement for ProdAgent).

LHCb

OnCall/AoD Cover

  • Primary OnCall: Catalin (Mon-Sun)
  • Grid OnCall:
  • AoD: