RAL Tier1 weekly operations Grid 20091207


Summary of Previous Week

Developments

  • Alastair
    • Add check_world_writable.sh to Nagios
    • Make wiki page for Computing requirements
    • Run tests for user analysis at RAL.
  • Andrew
    • Wrote script to generate my metrics from MySQL; script to do disk accounting consistency check with overwatch
    • Applied December fairshares in Maui; started November accounting (waiting for tape info now)
    • Added checksum checking for CMS tape migrations in PhEDEx; updated PhEDEx dev instance; testing with transfers from CERN
    • Corrected check_pbs_efficiencies Nagios plugin
    • Prepared & gave short presentation at FacOps meeting about IO testing
    • Started work on CMSSW TTreeCache patches IO testing
    • CMS computing model spreadsheet
  • Catalin
    • added 2nd ALICE VOBOX into production
    • continued work on MySQL systems audit
    • put Frontier fix in place; not yet confirmed
    • svn'ed the ALICE xrootd installation docs
    • still waiting for feedback from LFC@CERN on recovery and consistency checks
  • Derek
    • Continuing work on quattorising helpdesk frontend
  • Matt
  • Richard
    • Quattor template(s) for a production CIP server
    • Read through JW's Nagios slides as prep. for the upcoming NRPE class
    • Tuned into GDB meeting
    • Added a couple of items to the Fabric team's quattor documentation (https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/QuattorImplementationNotes)
    • CASTOR activities:
      • Added to quattor a set of templates for SLC 4.6 as a "dry run" for the SLC 4.8 version of the templates
      • Worked with CERN folk to try and get a set of quattor templates for SLC 4.8
      • Re-arranged some of the quattor templates for PPS instance to simplify config file handling
      • Built further instances of the server types to check installation process
      • Reported "incomplete build" quattor issue to mailing list and found others are seeing the same problem
  • Mayo
    • Worked on New Metrics system: Took feedback on newly added features and fixed any bugs testing revealed
    • Began work on admin interface for metrics system
    • Had a meeting about extending the new metrics system to include GridPP users
    • Worked on automating tape robot spreadsheet project
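
For reference, a check like the check_world_writable.sh item above could look roughly like the following. This is a minimal sketch in Python, not the actual script: the function names, the decision to skip sticky-bit directories, and the use of CRITICAL for any hit are all assumptions. Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios plugin convention.

```python
import os
import stat

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def find_world_writable(root):
    """Return sorted paths under root whose mode has the other-write bit set."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                mode = os.lstat(path).st_mode
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            if stat.S_ISLNK(mode):
                continue  # symlink permission bits are not meaningful
            # Assumption: sticky-bit directories (e.g. /tmp) are acceptable.
            if stat.S_ISDIR(mode) and mode & stat.S_ISVTX:
                continue
            if mode & stat.S_IWOTH:
                hits.append(path)
    return sorted(hits)


def check(root):
    """Return (exit_code, status_line) in Nagios plugin style."""
    if not os.path.isdir(root):
        return UNKNOWN, "UNKNOWN - %s is not a directory" % root
    hits = find_world_writable(root)
    if hits:
        return CRITICAL, "CRITICAL - %d world-writable path(s), first: %s" % (
            len(hits), hits[0])
    return OK, "OK - no world-writable paths under %s" % root


def main(argv):
    """Print the status line and return the exit code for the given argv."""
    root = argv[1] if len(argv) > 1 else "."
    code, message = check(root)
    print(message)
    return code
```

To run it as a plugin, the script would end with `sys.exit(main(sys.argv))` (after `import sys`), so that Nagios or NRPE sees the status line on stdout and the state in the exit code.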

Operational Issues and Incidents

Description | Start | End | Affected VO(s) | Severity | Status
250 Atlas jobs deleted (GGUS #53813) | Wed 2 Dec 17:40 | Wed 2 Dec 18:00 | Atlas | Medium | Resolved - believed not to be a site issue

Plans for Week(s) Ahead

Plans

  • Alastair
    • Deploy Disk server.
    • Contact Ganga developers about adding better error information to ATLAS jobs for normal users running at RAL.
    • Test poweruser analysis at RAL.
    • Away on Wednesday.
  • Andrew
    • PhEDEx: complete checksum checking tests on dev instance; check if dev instance can be run from another VOBOX easily
    • Complete November accounting
    • Continue CMSSW TTreeCache IO testing
    • Training: Nagios plugins
    • Attend relevant meetings at CMS week
  • Catalin
    • follow up Frontier fix
    • continue working on backup/recovery
    • ready to start deployment of LHCb SL5 VOBOX
    • plans to migrate MySQL server(s) to SL5 64-bit
    • decommission SL4 ALICE VOBOXes
  • Derek
    • Change control process via RT
  • Matt
    • Swap in resilient CIP plugin on site BDIIs
    • Tier-1 Review resilience talk
  • Richard
    • NRPE training
    • CASTOR activities:
      • Complete the "data configurator" tool to handle disk servers as well as other server types
      • Progress the initial setup of databases on a new instance
      • Continue activity on SLC 4.8 templates
  • Mayo
    • Work on metrics system admin interface and documentation
    • Add Sarah Pearce to metrics system for testing with regard to the GridPP extension
    • Continue working on automated spreadsheet project
    • NRPE Nagios plugins training

Resource Requests

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)

Requirements and Blocking Issues

Description | Required By | Priority | Status
LHCb SL5 64bit VOBOX deployment using Quattor | 25 Nov 2009 | Medium | Quattor recipe not yet available (RT#53392)
Hardware for testing LFC/FTS resilience | | High | DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; request for HW made through RT Fabric queue
Hardware for PPS | | High | We have made a commitment to test PPS pre-releases, and have no hardware dedicated for this.
Hardware for Grid Services testbed | | Medium |
Hardware for SL5 64-bit MySQL main server | | Medium | Plan to migrate to SL5 64-bit by mid January

OnCall/AoD Cover

  • Primary OnCall:
  • Grid OnCall: Catalin
  • AoD: