RAL Tier1 weekly operations Grid 20101101

From GridPP Wiki

Latest revision as of 11:33, 2 November 2010

Operational Issues

Description | Start | End | Affected VO(s) | Severity | Status
SW RAID problems on lcgwms03 (non-LHC) | Fri 22-Oct-2010 | | non-LHC | | Fabric aware of the problem

Downtimes

Description Hosts Type Start End Affected VO(s)

Blocking Issues

Description Requested Date Required By Date Priority Status

Developments/Plans

Highlights for Tier-1 Ops Meeting

Highlights for Tier-1 VO Liaison Meeting

Detailed Individual Reports

Alastair

  • Working on ATLAS software server, testing CVMFS
    • 825 test jobs have been run.
    • lcg0805 has been set up for production-style testing; the queue still needs to be added to the ATLAS system.
    • Production tasks submitted.
    • ANALY_RAL is now open to normal users and they are successfully using CVMFS.
  • Writing script to graph transfer times for FTS transfers [on hold]
  • Emergency SRM upgrade
  • Emergency ATLAS permission upgrade
  • Preparing for ATLAS UK meeting tomorrow.

Andrew

  • Capacity planning system project [Ongoing]
  • Dealing with APEL problems (both MON box & glite-APEL)
  • Fixed pbslogs2mysql (ignores jobs from job arrays) [Done]
  • Completed gmetric & ganglia scripts for per-site CMS squid access monitoring [Done]
  • Put glite-APEL into production [To do]
  • October accounting [To do]
  • Investigate setting up batch system plugin required for producing XML job information for CMS [To do]
  • CMS data ops
    • Pile-up MC reprocessing at RAL & CNAF [Ongoing]

Catalin

  • deploy lcglb03 (glite3.2 LB) in full (LHC and non-LHC) production
  • work on (x)ROOT(d); deploy test infrastructure [ongoing]
  • drain lcglb01

Derek

  • Updated blparser on lcgbatch01 to fix job state issue on CREAM CEs [done]
  • purged 60,000 jobs from lcgce03 stuck in Running state [done]
  • Deployed new change control process [done]
  • Investigation of secure deployment of ssh keys to hosts [ongoing]
  • Change control for providing an additional CREAM CE for ATLAS
  • Investigating solutions for whole node scheduling

Matt

  • Testing PBS monitoring tools (pbswebmon, JobMon) [Ongoing]
  • Further testing of Quattorised gLite3.2 FTS FEs. [Ongoing]
  • Quattorisation of MyProxy nodes. [Ongoing]
  • Test FTS SRM/GridFTP ratio configuration.
  • Disk Deployment meeting.

Richard

  • Prepping for Wednesday's update to RAL site-level BDIIs
  • Developing a set of Quattor templates for an ARGUS server [Ongoing]
  • Developing a "pseudo-update" to apply a gLite update to BDIIs
  • Wrote a CGI script for logging hardware requests from G/S team in the Fabric queue in RT [Ongoing]
  • Working on the "team status page" being developed as an action from team awayday [Ongoing]
  • Reviewing G/S process documentation [Ongoing]
  • CASTOR items:
    • Running functional tests on Facilities instance
    • Using the grid to run many jobs to stress-test the Facilities instance

VO Reports

ALICE

ATLAS

CMS

  • All CMS Tier-1s have been asked to provide XML job information (produced every 10 mins) to be consumed by central monitoring.
  • Current work at RAL: pile-up MC reprocessing started last week and finished over the weekend.
  • Upcoming T1 reprocessing plans (dates may be subject to change):
    • 2010-11-04 Data rereco
    • 2010-11-13 Pile-up MC redigi/rereco
    • 2010-12-15 Data rereco / MC redigi/rereco
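
As a rough illustration of the XML job information requested above, the sketch below counts batch jobs by VO and state and serialises the counts as XML. The schema that CMS central monitoring expects is not given here, so the element and attribute names (`jobsummary`, `jobs`, `vo`, `state`, `count`), the input record layout, the `site` label and the function name `build_job_summary` are all hypothetical placeholders, not the real format.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from datetime import datetime, timezone


def build_job_summary(jobs, site="RAL", now=None):
    """Return an XML string with job counts grouped by (VO, state).

    ASSUMPTION: element/attribute names and the input layout are
    hypothetical; they do not reflect the real CMS monitoring schema.
    Each job is a dict with at least the keys "vo" and "state".
    """
    now = now or datetime.now(timezone.utc)
    root = ET.Element(
        "jobsummary",
        site=site,
        timestamp=now.strftime("%Y-%m-%dT%H:%M:%SZ"),
    )
    # Count jobs per (VO, state) pair, then emit one element per pair
    # in sorted order so that the output is deterministic.
    counts = Counter((job["vo"], job["state"]) for job in jobs)
    for (vo, state), n in sorted(counts.items()):
        ET.SubElement(root, "jobs", vo=vo, state=state, count=str(n))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        {"vo": "cms", "state": "running"},
        {"vo": "cms", "state": "running"},
        {"vo": "cms", "state": "queued"},
    ]
    print(build_job_summary(sample))
```

A script along these lines could be run from cron every 10 minutes, with the job records taken from whatever the batch system plugin reports; both of those details are assumptions rather than the agreed design.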

LHCb

OnCall/AoD Cover

OnCall Rota

  • Primary OnCall: Catalin (Mon-Sun)
  • Grid OnCall: