RAL Tier1 weekly operations Grid 20100517

Operational Issues

Description | Start | End | Affected VO(s) | Severity | Status
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon. [07-Apr-2010] CMS tests have shown that WMS patches resolve the problem; still waiting for patch to be installed on the production WMSs in Italy

Downtimes

Description Hosts Type Start End Affected VO(s)

Blocking Issues

Description Requested Date Required By Date Priority Status

Developments/Plans

Highlights for Tier-1 Ops Meeting

  • Handover BDII services to Richard
  • Disk deployment meeting on Tuesday
  • Upcoming meetings
    • wLCG T0/T1/T2 workshop (July 7-9, Imperial)
    • EGI Technical Forum (September 14-17, Amsterdam)

Highlights for Tier-1 VO Liaison Meeting

  • Deployed 35 disk servers into production for ATLAS
  • Requested deployment of V08/V09 servers to nonProd to meet 2010 wLCG pledges
  • Testing CREAM CE 1.6 (required by ALICE)
  • Test FTS 2.2.4 upgrade

Detailed Individual Reports

Alastair

  • Working on ATLAS software server upgrade
  • Deploying 35 disk servers into production for ATLAS
  • Working on testing ATLASGROUP disk at RAL.
  • Looking into ATLAS PFC (Pool File Catalogue) problems.

Andrew

  • APR [Ongoing]
  • Tidying up APEL problems (replacing missing data on 19-20th March; fixing SpecInt2000 for April, May)
  • April accounting [Done]
  • Added a new endpoint to FTS (for T2_EE_Estonia) [Done]
  • Installing & setting up PhEDEx on SL5 VOBOX [Ongoing]
    • Writing change-control & new service checklist documents for PhEDEx
  • Migration to use of FTS groups in FTS "cloud" channels [Ongoing]
  • CMS data ops
    • Backfill at RAL & PIC [Ongoing]

Catalin

  • ATLAS Frontier server updates
  • work on CMS PhEDEx Nagios monitoring [ongoing]
  • configure squid on LHCb VOBOX [ongoing]
  • gLite updates on LHCb VOBOX [ongoing]
  • LFC/FTS replication (w/ Carmine) [ongoing]
  • job plans [ongoing]
  • WMS reconfiguration (ops/lcgadmin, fusion) [done]

Derek

  • Intervention on lcgce06 for glexec [Done]
  • Intervention on lcgce07 for glexec
  • Sync of templates with QWG for glite 3.1 and 3.2 [done]
  • Testing CREAM CE 1.6

Matt

  • Job Plans
  • Test FTS 2.2.4 upgrade
  • Handover BDII services to Richard [Done]
  • APRs [Done]
  • Request disk deployments to meet 2010 wLCG pledges [Done]
  • Capacity Planning (meeting with Andrew L) [Done]
  • Site BDII performance problems [Done]
  • Propose to UB a schedule for decommissioning SL4 capacity [Done]

Richard

  • APR [Done] - Job Plan [ Completing / SSC-ing ]
  • Looking at the site-bdii timeout problem
  • Working on a proposal on intra-/inter-team communication to meet an action from the team awayday
  • Reviewing G/S process documentation
  • Further Nagios items from the to-do list (https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/NagiosTasksToDo)
  • CASTOR items:
    • Writing up results from p/p stress tests
    • Preparing ground for using a per-instance nameserver (rather than the central one)

Mayo

  • Implement feedback into TSBN web interface [Done]
  • Set up the scripts that update the TSBN interface to run as scheduled jobs on a Windows machine
  • Certificate viewer for NGS cert wizard first prototype [Done]
  • Implement David Meredith's feedback into Certificate viewer
  • Write a script to turn PDU ports on/off [Done]
  • Write script to control ports on multiple PDUs
  • Create handover documentation for finished projects

VO Reports

ALICE

ATLAS

CMS

  • Moving to multiple primary datasets (PDs) this week because of the recent (and upcoming) increases in luminosity, so each Tier-1 will get at least one PD. The acquisition era is changing from Commissioning10 to Run2010A.

LHCb

OnCall/AoD Cover

  • Primary OnCall:
  • Grid OnCall: Derek
  • AoD: