Difference between revisions of "Site monitoring status"

(38 intermediate revisions by 14 users not shown)

Latest revision as of 10:58, 1 November 2016

This page is intended to gather together the tools that sites are currently using to monitor their local sites. Please fill in the details for your site with the following pieces of information:


1) Current solution(s): What tools are currently used at the site and for what purpose?

2) Future plans: What plans (if any) does your site have for future monitoring?

3) Notes: Any other information you think might be useful.


{|class="wikitable"
|-style="background:#7C8AAF;color:white"
|style="width: 10%"|
Site
|Current solution(s)
|Future plans
|Notes
|-
|RAL Tier-1
|Nagios/Icinga, Ganglia, Cacti for networking, home-grown dashboard (mimic)
|Starting to look at Elasticsearch
|Use the Thruk interface to Nagios, which provides useful additional views.
|-
|UKI-LT2-Brunel
|
|
|
|-
|UKI-LT2-IC-HEP
|Nagios and Cacti
|
|
|-
|UKI-LT2-QMUL
| OpenNMS, APC NetBotz 550
| Improve syslog monitoring
| OpenNMS primarily monitors using SNMP. Dell OpenManage is used on Dell servers to provide extended information and SNMP traps; similar HP tools are also used. OpenNMS monitors syslogs for messages of severity ERROR and higher. Room monitoring uses the APC NetBotz solution. OpenNMS now also works with Grafana to provide additional graphs.
|-
|UKI-LT2-RHUL
|
|
|
|-
|UKI-LT2-UCL-HEP
|
|
|
|-
|UKI-NORTHGRID-LANCS-HEP
| Elasticsearch/Logstash/Kibana deployed widely on local cluster (as opposed to grid farms)

Usual suspects: Icinga, Ganglia, Graphite on both local and grid machines
| Perhaps deploy Logstash on grid farms

Currently deploying Dashing for dashboards
| Logstash is easy to deploy, powerful, and interesting for exploring the rich content

Solid stuff, unlikely to replace these
|-
|UKI-NORTHGRID-LIV-HEP
| Nagios, Ganglia, Cacti
| Replace Nagios with Icinga, upgrade Ganglia, investigate ELK, Graphite
|
|-
|UKI-NORTHGRID-MAN-HEP
|
|
|
|-
|UKI-NORTHGRID-SHEF-HEP
|
|
|
|-
|UKI-SCOTGRID-DURHAM
|
|
|
|-
|UKI-SCOTGRID-ECDF
|Ganglia
|
|Ganglia used for general machine and batch monitoring; bash scripts used for monitoring of running services
|-
|UKI-SCOTGRID-GLASGOW
|Naemon (status and alerting), Ganglia/Graphite (metric & time series graphing), Cacti (network monitoring)
|Dashboards (Dashing, Grafana), reconsidering network monitoring
|We currently use Ganglia for system metrics, Graphite for a higher cluster-level view, and a Nagios plugin to check Graphite thresholds. Deploying Grafana as a unified frontend.
|-
|UKI-SOUTHGRID-BHAM-HEP
|
|
|
|-
|UKI-SOUTHGRID-BRIS
|Older cluster uses Ganglia & Pakiti; newer uses Nagios
|Going to ditch Ganglia & Pakiti for Nagios on older cluster.
|Wish Munin scaled well!
|-
|UKI-SOUTHGRID-CAM-HEP
|Nagios and Ganglia
|
|
|-
|UKI-SOUTHGRID-OX-HEP
|Nagios and Ganglia
|Some testing of ELK
|
|-
|UKI-SOUTHGRID-RALPP
|Nagios, Ganglia, Cacti, Pakiti, Dashing
|ELK, new network monitoring (Observium?)
|
|-
|UKI-SOUTHGRID-SUSX
|
|
|
|}
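
For sites considering a Nagios-style check against Graphite thresholds, as mentioned in the Glasgow notes, the following is a minimal sketch of how such a check might look. It is not the plugin any site actually runs. It assumes Graphite's render endpoint is queried with format=json, that higher metric values are worse, and that the usual Nagios exit codes apply (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The host name, metric name, and thresholds are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def latest_value(series):
    """Return the most recent non-null value from Graphite render JSON, or None."""
    for entry in series:
        # Each datapoint is a [value, timestamp] pair; recent points may be null.
        for value, _timestamp in reversed(entry.get("datapoints", [])):
            if value is not None:
                return value
    return None


def classify(value, warn, crit):
    """Map a metric value to a Nagios state, assuming higher values are worse."""
    if value is None:
        return UNKNOWN, "UNKNOWN - no datapoints returned"
    if value >= crit:
        return CRITICAL, f"CRITICAL - value {value} >= {crit}"
    if value >= warn:
        return WARNING, f"WARNING - value {value} >= {warn}"
    return OK, f"OK - value {value}"


def fetch_series(base_url, target, window="-5min"):
    """Query Graphite's render API for recent datapoints as JSON."""
    query = urlencode({"target": target, "from": window, "format": "json"})
    with urlopen(f"{base_url}/render?{query}", timeout=10) as response:
        return json.load(response)


def check(base_url, target, warn, crit):
    """Run one check; network or parse failures are reported as UNKNOWN."""
    try:
        return classify(latest_value(fetch_series(base_url, target)), warn, crit)
    except (OSError, ValueError) as exc:
        return UNKNOWN, f"UNKNOWN - {exc}"


# Offline demonstration with sample data shaped like a render response.
# The metric name "cluster.load.avg" is hypothetical.
sample = [{"target": "cluster.load.avg",
           "datapoints": [[1.0, 10], [2.5, 20], [None, 30]]}]
state, message = classify(latest_value(sample), warn=8.0, crit=16.0)
print(state, message)
```

A real plugin wrapper would call check() with the site's own Graphite URL (for example a hypothetical http://graphite.example.org), print the message, and exit with the returned state so that Nagios, Icinga or Naemon can pick it up.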