RAL Tier1 weekly operations Fabric 20101025

From GridPP Wiki

Developments

  • All:
  • Martin:
  • Ian:
    • Worked with Atlas on cvmfs plans
    • Worked on virtualisation
    • Deployed Pakiti on all Quattor-managed systems
  • Tim:
    • Repack now down to the dregs. It will take effort to see what can be recovered.
    • VTL has some duff tapes that need recovering before the VTL can be removed from the DMF system
    • Facilities CASTOR tape system now working
    • New tape pool for Atlas
    • CMS: some funnies with recalls interacting with repack
  • Jonathan:
  • James A:
    • Started blanking and returning evaluation hardware.
    • Worked on mitigation and patching for several CVEs.
    • Developed monitoring of R89's UPS system.
  • James T:
    • Produced summary of disk server HDD humidity tolerances
    • Facilities disk server configuration
    • Planning upgrade of SL4 disk servers to SL5 64-bit
    • Preparation for Gen CASTOR upgrade
  • Cheney:
    • Nagios checks for database
    • Tidy up Quattor
    • Script for show_castor_services job
    • Fix stuck jobs in hinode
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems (R 27).
    • gdss110 re-installed and given back to Tim.
    • gdss380 taken by Streamline for a fix. (Crashed with a single faulty drive.)
    • gdss417 acceptance testing. (Crashed with a single faulty drive.)
    • gdss512: configured RAID array and started acceptance test.
    • gdss280: replaced RAID card with one borrowed from gdss338. (Testing)
    • gdss569 borrowed for testing.
    • gdss463: replaced backplane, but that did not fix the problem. (Reported RAID card)
    • Hardware failure stats/graphs.
    • Jetstor1: replaced drive in port 11.
    • gdss408: replaced memory (borrowed from gdss377). Back in production the same day.
    • Update daily status of Streamline 2009 disk servers testing.
    • Streamline/Areca disk servers crashed due to a single faulty drive. (Ongoing)

Absences

  • Jonathan on partial retirement (not in on Monday and Friday)
  • Cheney on leave Tues/Weds 26th/27th.
  • Cheney: early warning, likely to be off for most of November (dates subject to change)
  • Tim out Nov 1st

Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component | Description | Start | End | Affected VO(s) | Type

Development priorities

  • All:
  • Martin:
  • Ian:
    • Updating errata templates
    • Driving errata updates
    • cvmfs preparation for Atlas user jobs
    • Prepare for HEPiX
  • Tim:
    • Finish repack of CMS tapes
    • Facilities Castor developments
  • Cheney:
    • Scripts for database checks
  • Jonathan:
  • James T:
    • Gen CASTOR upgrade
    • Testing upgrade of SL4 disk servers to SL5 64-bit
    • Acceptance tests on Streamline 09 kit
    • A/L Friday 29th
  • James A:
    • Continue blanking and returning evaluation hardware.
    • Work on internal database developments.
    • Make changes as required to Overwatch to support CASTOR 2.1.9 upgrade.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Update daily status of Streamline 2009 disk servers testing.
    • Continue decommissioning old batch systems (R 27).

Absences

  • Jonathan on partial retirement (not in on Monday and Friday)
  • Cheney on leave Tues/Weds 26th/27th.
  • James T on A/L Friday 29th
  • Cheney: early warning, likely to be off for most of November (dates subject to change)

Fabric On-Call

  • James T Mon-Thur
  • Kashif Hafeez Fri-Sun

Advanced Warning of Requirements and Blocking issues

Services Issues

