RAL Tier1 weekly operations Grid 20091123

From GridPP Wiki

Summary of Previous Week

Developments

  • Alastair
    • Away
  • Andrew
    • CMS PhEDEx ganglia monitoring
    • FTS channels: adjustments to STAR-UKILT2BRUNEL, STAR-UKILT2ICHEP, STAR-UKISTHGRIDRALPP for CMS
    • Updated kernel on csflnx414
    • Deleted old CMS files from /store/unmerged
    • Completed automatic generation of UB schedule CPU & disk emails
    • Started work on CMS computing model spreadsheet
    • Training: attending Nagios training session
    • Out sick 1 day
  • Catalin
    • **no** progress on remaining SL5 VOBOXes
    • started work on backup, recovery (machines audit)
    • dealt with FronTier following java update
    • sorted out the WMS ICE issue
  • Derek
    • Metric report
    • Testbed proposal
    • Adding SL53 i386 to quattor for dev helpdesk
  • Matt
    • Kernel updates for FTS/MyProxy
    • Caching CIP provider script (not deployed)
    • Disaster recovery planning
    • Backup/recovery planning
    • Checked batch system for signs of SL4/SL5 crosstalk, and other job allocation problems; appears clean since restart of pbs_server daemon
  • Richard
    • CASTOR activities: Finished the new structure for the family of pre-production Quattor templates
    • Built a 32-bit version of a BDII server and updated the template to place log files etc. in the RAL-preferred location
    • Took 2 RT tickets on BDII server configs
  • Mayo
    • Annual leave Monday and Tuesday
    • Worked on New Metrics system
    • Exported data from new metrics gathering system to enable Derek to produce the monthly report
    • Worked on automating tape robot spreadsheet project

Operational Issues and Incidents

Description | Start | End | Affected VO(s) | Severity | Status
WMS Jobdirs full | Wed 18 Nov | Thu 19 Nov | All | Medium | Resolved
FroNTier crash | Wed 11 Nov | Fri 20 Nov | ATLAS | Low | Resolved

Plans for Week(s) Ahead

Plans

  • Alastair
    • Away
  • Andrew
    • CMS computing model spreadsheet
    • t2k to t2k.org VO name change
  • Catalin
    • start deployment on 2nd ALICE SL5 VOBOX (HW made available on Monday)
    • ready to start deployment on LHCb SL5 VOBOX (waiting for "Quattor ready to go")
    • implement Nagios checks for FronTier
    • continue working on systems audit (backup, recovery)
  • Derek
    • Test SCAS
    • Fix problems with CE information system
    • Working on helpdesk end-to-end restore
  • Matt
  • Richard
    • CASTOR activities: Working with CK and d/b folk to be able to script database setup for new pre-prod instance; also looking at using custom ncm- components for configuration
    • Building and testing a 64-bit version of BDII server
  • Mayo
    • Implement feedback into second version of metrics gathering system in preparation for November Metrics
    • Continue working on automated spreadsheet project
    • Continue working on importing Nagios alarm data into svn

Resource Requests

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)

Requirements and Blocking Issues

Description | Required By | Priority | Status
LHCb SL5 64bit VOBOX deployment using Quattor | 25 Nov 2009 | Medium | HW allocated but Quattor recipe not yet available (RT#53392)
Hardware for testing LFC/FTS resilience | | High | DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; request for HW made through RT Fabric queue
Hardware for PPS | | High | We have made a commitment to test PPS pre-releases, and have no hardware dedicated for this.
Hardware for Grid Services testbed | | Medium |

OnCall/AoD Cover

  • Primary OnCall: Catalin (Mon-Thu)
  • Grid OnCall:
  • AoD: