RAL Tier1 weekly operations Grid 20091221

Summary of Previous Week

Developments

  • Alastair
    • Deployed 2 Disk servers.
    • Contacted Panda/Ganga developers to improve error information for ATLAS jobs at RAL.
    • Tested poweruser analysis at RAL, found problem with CERN WMS.
  • Andrew
    • Submitted change request document for FTS channel timeout adjustment; applied change
    • Updated check_pbs_efficiencies.pl Nagios script to allow automatic killing of low-efficiency jobs for selected VOs
    • Resolved failing SRMv2-user CMS SAM test
    • TTreeCache & read-coalescing IO testing on reco & skimming jobs
    • Added Ganglia monitoring of CMS tape migrations (from PhEDEx logs, not CASTOR)
    • Investigated various CMS issues
  • Catalin
    • worked on old/new ALICE VOBOXes
    • no progress on LHCb VOBOX quattorising
    • still waiting for feedback from LFC@CERN on recovery and consistency checks
    • some work on MySQL migration
    • attended various meetings
  • Derek
    • Moved change control system from dev helpdesk to prod helpdesk
    • Produced metrics report
    • Implemented cron jobs to back up lcgcenfs files to CEs
  • Matt
    • Tested new production CIP on test site BDII
    • Tier-1 Review
  • Richard
    • Continued plan for proposed BDII changes during January
    • Wrote a script to dump our DNS domain to simplify "which machine is that" type queries arising from monitoring alerts/emails
    • CASTOR activities:
      • Defined disk and tape servers to use with new pre-prod instance
  • Mayo
    • Created admin UI for metric system and wrote system user documentation
    • Created a user account for Sarah Pearce to enable testing with regard to the possible GridPP extension
    • Attended Cheney's NRPE training
    • Worked on automating tape robot spreadsheet project
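
The low-efficiency job check listed under Andrew's items above can be illustrated with a minimal sketch. This is not the check_pbs_efficiencies.pl script itself (which is a Perl Nagios check and is not reproduced on this page); the `Job` record, the `jobs_to_kill` helper, the 10% threshold and the one-hour minimum age are all hypothetical values chosen for illustration. It assumes efficiency is taken as CPU time divided by wall-clock time.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical job record; a real check would fill this from the batch system.
    job_id: str
    vo: str
    cpu_seconds: float
    wall_seconds: float


def cpu_efficiency(job: Job) -> float:
    """CPU time divided by wall-clock time; 0.0 if no wall time has accrued."""
    if job.wall_seconds <= 0:
        return 0.0
    return job.cpu_seconds / job.wall_seconds


def jobs_to_kill(jobs, selected_vos, threshold=0.1, min_wall_seconds=3600):
    """Return ids of jobs from the selected VOs whose efficiency is below threshold.

    Jobs younger than min_wall_seconds are skipped so that a slow start-up
    phase is not mistaken for a stalled job. Both defaults are assumptions.
    """
    return [
        job.job_id
        for job in jobs
        if job.vo in selected_vos
        and job.wall_seconds >= min_wall_seconds
        and cpu_efficiency(job) < threshold
    ]


if __name__ == "__main__":
    sample = [
        Job("1001", "cms", 300, 7200),    # selected VO, old enough, ~4% efficient
        Job("1002", "cms", 6500, 7200),   # selected VO, ~90% efficient
        Job("1003", "atlas", 100, 7200),  # inefficient, but VO not selected
        Job("1004", "cms", 10, 600),      # too young to judge
    ]
    print(jobs_to_kill(sample, selected_vos={"cms"}))
```

Restricting the check to an explicit set of VOs keeps automatic killing opt-in per VO, which matches the "selected VOs" wording above; the actual thresholds used at RAL are not stated here.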

Operational Issues and Incidents

Description | Start | End | Affected VO(s) | Severity | Status

Plans for Week(s) Ahead

Plans

  • Alastair
    • Try and fix Poweruser issues
    • Look into "slow" FTS rates in UK Cloud.
  • Andrew
    • On A/L next week
  • Catalin
    • continue work on MySQL migration
    • follow up issue with t2k.org 'zero size' LFC entries
    • minor issues with ALICE VOBOXes central monitoring
    • decommission old SL4 ALICE VOBOXes
  • Derek
    • Document process for coping with catastrophic failure of lcgcenfs
    • Document process for breaking helpdesk mail loops
  • Matt
    • Switch Site BDIIs to new CIPs
    • GridPP4 input
    • R-GMA Registry recovery testing
    • Investigate APEL publishing problems (lcgbatch01)
  • Richard
    • Finish off the plan for proposed BDII changes during January
    • Work with MB on getting a DNS zone delegated to Tier1
    • Work with JA/DR on placing a link to "DNS dump" script on Tier1 web page
    • CASTOR activities:
      • Rebuild disk servers to be used in new pre-prod instance
      • Update the software on tape server for new pre-prod instance
      • Continue activity on SLC 4.8 templates
  • Mayo
    • Work on metric system: add a change-password feature for users and report-printing features
    • Work on possible extension of system to include GridPP
    • Continue working on automated spreadsheet project

Resource Requests

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)

Requirements and Blocking Issues

Description | Required By | Priority | Status
LHCb SL5 64bit VOBOX deployment using Quattor | 25 Nov 2009 | Medium | Quattor recipe not yet available (RT#53392)
Hardware for testing LFC/FTS resilience | | High | DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; request for HW made through RT Fabric queue
Hardware for PPS | | High | We have made a commitment to test PPS pre-releases, and have no hardware dedicated for this.
Hardware for Grid Services testbed | | Medium |

OnCall/AoD Cover

  • Primary OnCall:
  • Grid OnCall:
  • AoD: