RAL Tier1 weekly operations Grid 20100913

From GridPP Wiki

Operational Issues

  • Description: Job status monitoring from CREAMCE
  • Start: 2-Feb-2010
  • End: (not given)
  • Affected VO(s): CMS
  • Severity: medium
  • Status:
    • [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon
    • [07-Apr-2010] CMS tests have shown that WMS patches resolve the problem; still waiting for patch to be installed on the production WMSs in Italy
    • [13-Jul-2010] CNAF WMSs have been updated; testing using backfill is in progress
    • [19-Jul-2010] So far everything looks good

Downtimes

  • Description: gLite-WMS update + maintenance
  • Hosts: lcgwms02
  • Type: (not given)
  • Start: Thu 9 Sep 15:00
  • End: Thu 16 Sep 15:00
  • Affected VO(s): LHC

Blocking Issues

  • Description: HW needed to test Dataguard technology for LFC/FTS
  • Requested Date: 19 May 2010
  • Required By Date: 15 June 2010
  • Priority: Medium
  • Status: [24-05-2010] HW available; needs to be deployed by Fabric and then handed over to Dataservices

Developments/Plans

Highlights for Tier-1 Ops Meeting

Highlights for Tier-1 VO Liaison Meeting

Detailed Individual Reports

Alastair

  • Working on ATLAS software server, testing CVMFS
    • 825 test jobs have been run.
    • lcg0805 has been set up for production-style testing; the queue still needs to be added to the ATLAS system.
  • Writing script to graph transfer times for FTS transfers
  • Working on HammerCloud test of CASTOR 2.1.9
    • Analysis queue setup
    • Need to copy DBrelease into pre-prod and replicate
  • A/L Wednesday, Thursday and Friday

Andrew

  • CMS CASTOR 2.1.9 testing
    • Investigated problems with loadtest injection from RAL to Imperial in dev instance [Done]
  • Investigated lazy-download problem with CASTOR 2.1.7 & 2.1.9 [Done]
  • Updated published CPU capacity [Done]
  • Tested the two SL5 CMS Squids by running test rereco jobs [Done]
  • Testing gLite 3.2 FTS test instance using PhEDEx debug instance [Done]
  • I/O testing with CMSSW 3.8 series with new I/O settings [Next week]
  • CMS data ops
    • Running data rereco preproduction at RAL [Ongoing]

Catalin

  • add new frontends to non-LHC LFC alias [done]
  • add new frontends to LHCb LFC alias [done]
  • gLite updates WMS01 LHC [done]
  • gLite updates WMS02 LHC [ongoing]
  • improve WMS monitoring [ongoing]
  • add new frontends to ATLAS LFC alias
  • work on improving ganglia monitoring for Grid Services [ongoing]
  • work on Helpdesk MySQL database migration [ongoing]

Derek

  • Catching up
  • CREAM CE quattor profile [ongoing]
  • Investigating CREAM CE instability [ongoing]
  • Deployed quattorised sudo config
  • Refactored quattorised atlasbackup configuration
  • Intervened on lcgce01 over the weekend (11-12 Sep) to resolve a job submission issue

Matt

  • Capacity Signoff meeting. [New]
  • Further testing of Quattorised FTS FEs. [Ongoing]
  • Quattorisation of MyProxy nodes (write up Change Control). [New]
  • Assisting Richard with Top BDII problems. [Done]
  • Analysis of LHCb job efficiencies during disk server problem period. [Done]
  • Change Controls for FTS FE updates. [Done]
  • Quattorisation of FTS Agents host. [Done]

Richard

  • Some clean-up tasks after last week's upgrade to the RAL top-level BDIIs
  • Working on the "team status page" being developed as an action from team awayday [ongoing]
  • Reviewing G/S process documentation [ongoing]
  • CASTOR items:
    • Helped Cheney with quattor issues building head nodes for facilities instance

VO Reports

ALICE

ATLAS

CMS

  • CMS Daily Metric for RAL was ERROR on 11 Sep due to a worker node with a read-only /pool filesystem causing SAM tests and Job Robot jobs to fail.
  • CMS will start producing the AOD when data taking resumes. This was always in the plan, but will be implemented now. It will result in a modest (~10%) increase in the rate from CERN to Tier-1s.
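
The read-only /pool filesystem on the worker node noted above is the kind of fault a simple node health check can flag before SAM tests or Job Robot jobs start failing. The following is a minimal sketch, not the check actually used at RAL: it assumes a Linux worker node and text in the /proc/mounts layout, and the function name and sample device names are hypothetical.

```python
def read_only_mounts(mounts_text):
    """Return the mount points whose mount options include 'ro'.

    Expects text in the Linux /proc/mounts layout (an assumption here):
    device mountpoint fstype options dump pass
    """
    read_only = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        if "ro" in options:
            read_only.append(mount_point)
    return read_only


# Hypothetical sample: /pool has been remounted read-only.
sample = (
    "/dev/sda1 / ext3 rw,relatime 0 0\n"
    "/dev/sdb1 /pool ext3 ro,relatime 0 0\n"
)
print(read_only_mounts(sample))
```

On a real node the function would be fed the contents of /proc/mounts, and a non-empty result for /pool could be used to take the node out of production.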

LHCb

OnCall/AoD Cover

OnCall Rota

  • Primary OnCall: Catalin (Fri-Sun)
  • Grid OnCall: Derek (Mon-Thu)