RAL Tier1 weekly operations Grid 20110314

From GridPP Wiki

Operational Issues

Description | Start | End | Affected VO(s) | Severity | Status
ATLAS Frontier server (lcgce04) affected by DNS changes | Thu 10 Mar 18:15 | Fri 11 Mar 13:00 | ATLAS | Medium | Incorrect DNS change; reverted the next day.
non-LHC WMS (lcgwms03) unavailable | Mon 14 Mar 00:30 | Mon 14 Mar 09:30 | non-LHC | High | Host affected by a high number of I/O operations; reboot needed.

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)
Oracle DB down - installation of more isolating transformers | lcglfc0{669,670,671,672,673,674,675}, lcgvo-s3-03, lcgvo-s3-04, lcgsql-s3-12 | | Tue 15 Mar 07:40 | Tue 15 Mar 11:00 | All

Blocking Issues

Description | Requested Date | Required By Date | Priority | Status

Developments/Plans

Highlights for Tier-1 Ops Meeting

Highlights for Tier-1 VO Liaison Meeting

Detailed Individual Reports

Alastair

  • Working on ATLAS permission change. [On hold]
  • Setting up xrootd for ATLAS at RAL.
    • Talking to ALICE
    • Looking into upgrading the CASTOR client on all WNs.
  • Disk pool merging and DB change.
    • Cleaning up dark data [Ongoing]
    • Writing change control [Done]
    • Moving files! [Done!]
  • Preparing for Beauty 2011 conference.
  • Requested new VO box for ATLAS Frontier.

Andrew

  • CMS squid name changes [Done]
  • Learning about setting up tape families for data & MC; responded to 3 tickets [Done]
  • Attended CMS UK computing meeting at Imperial [Done]
  • PhEDEx Dev instance upgraded to 4_0_0; Prod & Debug to do [Ongoing]
  • Improved APEL Nagios plugin [Done]
  • Deleted CMS dark data, tidying up empty directories [Done]
  • CMS Data Ops
    • MC rereco at FNAL [Ongoing]

Catalin

  • preparation for electrical intervention on Tuesday
  • investigate another problem/crash on lcgwms03
  • involved with CREAM CEs installation and configuration [ongoing]
  • work on quattorised ATLAS Frontier installation [ongoing]
  • tomcat v6.0.32 upgrade on ATLAS Frontier server [done]
  • apply latest errata and kernel [done]
  • assist work on LFC Oracle DB change [done]

Derek

  • Investigating BLParser issues on lcgce09 [ongoing]
  • Publishing whole node queue [done]
  • Errata updates [done]
  • Improving config of small VOs in quattor [ongoing]
  • Metrics report [done]

Matt

  • Deploy testbed LFC and MyProxy. [New]
  • Management of FTS groups. [New]
  • Prep for training course (Mon-Wed next week). [New]
  • Testing Hadoop instance. [Ongoing]
  • Contact NFS users. [Ongoing]

Richard

  • Dealing with fall-out from moving a top BDII into the UPS room. [Ongoing]
  • Trying out new hypervisor (hv-10) to see how much performance has improved; an existing VM has been moved across to it. [Ongoing]
  • Building an ARGUS server using the new QWG templates [Ongoing]
  • CASTOR items:
    • Built cfssh09.gridpp.rl.ac.uk as a StorageD server. [Done]
    • Running some stress tests on preprod instance. [Ongoing]

VO Reports

ALICE

ATLAS

CMS

  • MC reprocessing has started
  • Deleted 63 TB dark data and over 443000 empty directories
  • PhEDEx 4_0_0 released: supports FTS checksumming and Twitter

LHCb

OnCall/AoD Cover

OnCall Rota

  • Primary OnCall:
  • Grid OnCall: Derek
  • AoD: