RAL Tier1 weekly operations Grid 20100607

From GridPP Wiki

Operational Issues

Description | Start | End | Affected VO(s) | Severity | Status
LFC/FTS unscheduled downtime | Wed 2 June 14:00 | Wed 2 June 17:30 | All | Critical | Following the Oracle updates to DB systems, one Oracle RAC node became unstable. The patch had to be rolled back and services restarted.
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | Medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon. [07-Apr-2010] CMS tests have shown that WMS patches resolve the problem; still waiting for patch to be installed on the production WMSs in Italy.

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)

Blocking Issues

Description | Requested Date | Required By Date | Priority | Status
HW needed to test Dataguard technology for LFC/FTS | 19 May 2010 | 15 June 2010 | Medium | [24-05-2010] HW available; needs to be deployed by Fabric and then handed over to Dataservices
Firewall change for lcgce03 | 17 May 2010 | 15 June 2010 | Medium | Required to deploy lcgce03 as Production CREAM CE for Alice

Developments/Plans

Highlights for Tier-1 Ops Meeting

  • Construct end-to-end timeline for 08 and 09 disk deployments
  • Site BDII checks to detect null output from CIP (but not resource BDII)

Highlights for Tier-1 VO Liaison Meeting

  • FTS2.2.4 testing
  • Disk deployments from nonProd to Prod for ATLAS and CMS

Detailed Individual Reports

Alastair

  • Away end of last week, working from home beginning of this week.
  • Assisting with debugging new ATLAS release. Ongoing problems with missing libraries.
  • Testing FTS and checksumming at RAL to try to reduce backlog problems.
  • Deploying 22 disk servers into production.

Andrew

  • Putting job plan into Oracle [Ongoing]
  • Working on analysis of VO support survey responses [Ongoing]
  • Completed sorting out files from bad tape CS6000 [Done]
  • May accounting, including more APEL fixing; updates to scripts for T2K; wrote script to automate checking of CESGA; investigating why ATLAS has split into 2 distinct VOs on CESGA webpage
  • RGMA ACL update [Done]
  • Investigated new CMS VOBOX proxy renewal daemon failure; added cron to do additional checks [Done]
  • Added additional checking for PhEDEx for detecting problems staging files [Done]
  • CMS data ops
    • Running skims at FNAL
  • Next week: deploy CMS V09 disk servers to cmsFarmRead

Catalin

  • gLite updates on WMS [done]
  • kernel upgrades [done]
  • configure squid on LHCb VOBOX [ongoing]
  • LFC/FTS replication (w/ Carmine) [ongoing]
  • job plans [ongoing]
  • test LFC deployment using quattor

Derek

  • A/L all week [Done]
  • CPU Capacity publishing update [Done]
  • Testbed Strategy
  • E-mailing experiment contacts about SL4 shutdown
  • Setting up NGS UEE on worker nodes
  • Change control for deploying lcgce03
  • Quattor helpdesk queue [Done]
  • Testing glexec update
  • Attending HEPSYSMAN (Thu-Fri)

Matt

  • Test upgrade path to FTS2.2.4/agree schedule with production team
  • Construct end-to-end timeline for 08 and 09 disk deployments
  • Extra checking of CIP output on site BDIIs

Richard

  • Added extra logic into the CIP->site BDII "bridging" script to check for existence of particular items rather than just non-zero volume of output
  • Built LCG0630 as a top-level BDII to test quattor configuration of the "cachesize" directive in the glue-slapd.conf
  • Further work on the "team status page" being developed as an action from team awayday
  • Reviewing G/S process documentation
  • CASTOR items:
    • Ran pre-prod stress tests
  • Next Week
    • Complete running of the pre-prod stress tests
    • Take the logic developed for the CIP->site BDII script and create a Nagios check to see how often the condition arises
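
The idea behind the CIP->site BDII check above (look for particular items rather than just a non-zero volume of output) can be sketched roughly as follows. This is a minimal illustration, not the actual bridging script: the function names and the list of required items are hypothetical, and the two GLUE attributes are only examples of what such a check might look for. The exit codes follow the usual Nagios plugin convention (0 = OK, 2 = CRITICAL).

```python
import re

# Hypothetical list of items that must appear in the CIP output.
# The real script's list is not given in this report.
REQUIRED_PATTERNS = [
    r"^dn:\s*GlueSEUniqueID=",
    r"^GlueSATotalOnlineSize:\s*\d+",
]


def missing_items(ldif_text, required=REQUIRED_PATTERNS):
    """Return the patterns that match no line of the LDIF text."""
    return [p for p in required if not re.search(p, ldif_text, re.MULTILINE)]


def nagios_status(ldif_text):
    """Map the check result to a Nagios-style (exit_code, message) pair."""
    missing = missing_items(ldif_text)
    if missing:
        return 2, "CRITICAL: missing " + ", ".join(missing)
    return 0, "OK: all required items present"
```

Output that is non-empty but lacks the required items is reported as CRITICAL, which is the case a simple size check would miss.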

Mayo

  • Implement David Meredith's feedback into Certificate viewer [Done]
  • integrate certificate viewer module with existing NGS certificate wizard code
  • Write script to control ports on multiple PDUs
  • Create handover documentation for finished projects [ongoing]
  • Enter job plan into ssc

VO Reports

ALICE

  • waiting for CREAM-CE 1.6 deployment at RAL

ATLAS

CMS

  • Reprocessing requests have started appearing for ICHEP
  • Mostly skimming has been running at T1s over the past week or so

LHCb

OnCall/AoD Cover

  • Primary OnCall: Catalin (Tue-Sun)
  • Grid OnCall: Derek (Mon)
  • AoD: Catalin (Tue)