RAL Tier1 weekly operations Fabric 20100712

Developments

  • All:
  • Martin:
    • CPU ITT finalisation
    • First look at disk ITT returns
    • Finance / spend planning
    • WLCG workshop
  • Ian:
    • WLCG Collaboration workshop Weds-Friday
    • Began configuring Hyper-V test server
  • Tim:
    • monitor repack progress
    • DMF single copy for cedar
    • DMF delete duplicate copies
  • Jonathan:
    • Administrator on Duty (Wednesday and Thursday)
    • fixed atlasbackup problem on csfnfs58 by killing old processes
    • stopped mysql server and mysqlhotcopy on lcgsql0365 after confirmation that MySQL database is no longer required
    • worked on new Nagios slave server for batch workers
    • issued version 1.0-54 of RPM tier1-sudo-config to correct problem with sudo sub-directory /etc/sudoers.d
    • created 1 AFS userid
    • 1 Nagios configuration update
    • changed TCP tuning parameter on nagios06
  • James A:
    • Provided interim network cabling for Dell services nodes for Hyper-V evaluation.
    • Added some basic Ganglia monitoring for Apache on quattor01.
    • Set up rsync mirroring of Dell OpenManage tools and SL5.5 on install02.
    • Connected some network cabling in LPD room for Cheney.
    • Worked with Alastair to attempt to understand the interaction of Atlas software with the software server and the Frontier squid.
  • James T:
    • WLCG Workshop
    • Fixed a bug in ncm-etcservices
      • Updated the quattor templates to set the RFIO port in /etc/services and tested on ATLAS nonProd disk servers.
    • Helped Kash with gdss67
    • Brainstormed alternative storage ideas.
  • Cheney:
    • reboot of most of the database servers following disk fault
    • set up of spare acsls
    • fix c2probe for sls
    • investigate ssh key changes (it was a server rebuild)
    • investigate ssh intrusions
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems (R 27).
    • gdss67 running 7 days acceptance test.
    • gdss78 running 7 days acceptance test.
    • gdss207 crashed again. (Intervention)
    • gdss474 replaced backplane. Another problem. (Intervention)
    • Hardware failure stats/graphs.
    • gdss231 & 420 low voltage on battery.
    • Streamline/areca disk servers crashed due to single faulty drive. (ongoing)

Absences

  • Jonathan on partial retirement (not in on Monday and Friday)

Operational Issues and Incidents

Index Description Start End Severity Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component Description Start End Affected VO(s) Type

Development priorities

  • All:
  • Martin:
    • Final CPU ITT
    • Begin work on Disk ITT evaluation
    • RHEL5 for DB systems
  • Ian:
    • Work on Virtualisation testbed
    • Investigation of atlas software server
    • Facilities Castor instance
    • Initial Quattor config for gLite 3.2 LFC
  • Tim:
    • get back to facilities castor planning
  • Cheney:
    • quat
  • Jonathan:
    • On leave all week
  • James T:
    • Streamline 09 testing
    • Roll out LHCb WAN tuning
    • Deploy fix for /etc/services RFIO port on SL5 disk servers
    • Security strategy team stuff
  • James A:
    • Start porting Quattor server templates to SL5.5.
    • Start planning for migration of Atlas software server to a new Dell services node.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • gdss231 & 420: batteries received; need 2 more people to do the intervention.
    • gdss474 send new logs to Viglen.
    • gdss380 run 7 days acceptance test.
    • Continue decommissioning old batch systems (R 27).

Absences

  • Jonathan on partial retirement (not in on Monday and Friday)
  • Jonathan on leave on Tuesday - Thursday (so out all week)
  • Martin A/L Friday pm

Fabric On-Call

Advanced Warning of Requirements and Blocking issues

Services Issues

