RAL Tier1 weekly operations Grid 20100614


Operational Issues

Description | Start | End | Affected VO(s) | Severity | Status
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon. [07-Apr-2010] CMS tests have shown that WMS patches resolve the problem; still waiting for patch to be installed on the production WMSs in Italy

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)

Blocking Issues

Description | Requested Date | Required By Date | Priority | Status
HW needed to test Dataguard technology for LFC/FTS | 19 May 2010 | 15 June 2010 | Medium | [24-05-2010] HW available; needs to be deployed by Fabric and then handed over to Dataservices
Firewall change for lcgce03 | 17 May 2010 | 15 June 2010 | Medium | Required to deploy lcgce03 as Production CREAM CE for ALICE. [11/06/10] Request made to networking on 9/6/10

Developments/Plans

Highlights for Tier-1 Ops Meeting

Highlights for Tier-1 VO Liaison Meeting

  • FTS2.2.4 upgrade done; MICE support added.
  • BDII: mitigation in place for the problem seen at the weekend; the problem on Tuesday had a different cause.

Detailed Individual Reports

Alastair

  • Working on ATLAS software server on /afs
  • Writing scripts/checks to allow faster identification of causes of high transfer rate problem seen last week.
  • Group production work at RAL.
  • Working to improve pbsjobs database to allow easier monitoring of production work.

Andrew

  • Putting job plan into Oracle [Ongoing]
  • Accounting
    • May accounting [Done]
    • Wrote script to add CASTOR tape usage to UB schedule MySQL database without requiring tape spreadsheet [Done]
    • Updated published CPU capacities, scaling factors for SL09 WNs, Ganglia capacity scripts, wrote documentation [Done]
  • Added some ops users to lcgvo-02-21 so that SAM tests will work [Done]
  • FTS
    • Updated services.xml due to missing endpoint [Done]
    • Changed STAR-UKISOUTHGRIDRALPP from srmcopy to urlcopy; increased transfer marker timeout [Done]
  • Updated RGMA ACL [Done]
  • CMS data ops
    • Running skims at FNAL [Done]
    • Running Run2010A rereco at RAL, IN2P3, FNAL
  • Attended Oracle finance & OTL training [Done]

Catalin

  • Test LFC deployment using quattor [ongoing]
  • Configure squid on LHCb VOBOX [ongoing]
  • Job plans into Oracle [ongoing]

Derek

  • Testbed Strategy [ongoing]
  • E-mailing experiment contacts about SL4 shutdown
  • Setting up NGS UEE on worker nodes
  • Change control for deploying lcgce03 [ongoing]
  • Testing glexec update
  • Configuring pool accounts in quattor

Matt

  • Produce FTS training material
  • Talk on ongoing SVN work for OnCall meeting
  • Test upgrade path to FTS2.2.4 [Done]
  • Submit Change Control request for FTS2.2.4 upgrade [Done]
  • Construct end-to-end timeline for 08 and 09 disk deployments [Done]

Richard

  • Added extra logic into the CIP->site BDII "bridging" script to check for existence of particular items rather than just non-zero volume of output [Done]
  • Built LCG0630 as a top-level BDII to test quattor configuration of the "cachesize" directive in the glue-slapd.conf
  • Further work on the "team status page" being developed as an action from team awayday
  • Reviewing G/S process documentation
  • Adding a Nagios check to look for the error that gave rise to the weekend's BDII problems
  • CASTOR items:
    • Upgraded central name server in pre-prod
    • Ran functional tests on pre-prod
    • Finishing adding metrics to pre-prod benchmark results wiki page
  • Next Week
    • Complete running of the pre-prod stress tests
    • Take the logic developed for the CIP->site BDII script and create a Nagios check to see how often the condition arises
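
The change to the CIP->site BDII check described above (test for particular items instead of accepting any non-empty output) could be sketched roughly as follows, in the style of a Nagios plugin. This is only an illustrative sketch, not the actual bridging script: the function names and the list of required attributes (REQUIRED_ITEMS) are assumptions, since the report does not say which items are checked. The exit codes follow the usual Nagios plugin convention (0 = OK, 2 = CRITICAL).

```python
# Hypothetical list of items the CIP output must contain; the real
# bridging script's list is not given in this report.
REQUIRED_ITEMS = ("GlueSEUniqueID", "GlueSALocalID")

# Conventional Nagios plugin exit codes.
OK, CRITICAL = 0, 2


def missing_items(ldif_text, required=REQUIRED_ITEMS):
    """Return the required attribute names that never appear in the LDIF text."""
    present = set()
    for line in ldif_text.splitlines():
        # LDIF attribute lines look like "name: value".
        name, sep, _ = line.partition(":")
        if sep:
            present.add(name.strip())
    return [item for item in required if item not in present]


def check(ldif_text, required=REQUIRED_ITEMS):
    """Return (exit_code, message) in the style of a Nagios plugin."""
    missing = missing_items(ldif_text, required)
    if missing:
        return CRITICAL, "CRITICAL: missing " + ", ".join(missing)
    return OK, "OK: all required items present"
```

A wrapper would read the provider output, call check(), print the message and exit with the returned code. Note that output which is non-empty but lacks the required items is reported as CRITICAL, which is the case the older "non-zero volume" test would have let through.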

Mayo

  • Implement David Meredith's feedback into Certificate viewer [Done]
  • Integrate certificate viewer module with existing NGS certificate wizard code
  • Write script to control ports on multiple PDUs
  • Create handover documentation for finished projects [ongoing]
  • Enter job plan into ssc

VO Reports

ALICE

  • Waiting for CREAM-CE 1.6 deployment at RAL
  • Cannot roll out new xrootd version (20100510-1509_dbg) on CASTOR 2.1.7

ATLAS

CMS

  • Very large MC reprocessing will begin soon

LHCb

OnCall/AoD Cover

  • Primary OnCall:
  • Grid OnCall: Derek (Mon-Sun)
  • AoD: