Difference between revisions of "Site monitoring status"

Latest revision as of 10:58, 1 November 2016

This page is intended to gather together the tools that sites are currently using to monitor their local sites. Please fill in the details for your site with the following pieces of information:

1) Current solution(s): What tools are currently used at the site and for what purpose?

2) Future plans: What plans (if any) does your site have for future monitoring?

3) Notes: Any other information you think might be useful.


Site	Current solution(s)	Future plans	Notes
RAL Tier-1	Nagios/Icinga, Ganglia, Cacti for networking, home-grown dashboard (mimic)	Starting to look at Elasticsearch	Use the Thruk interface to Nagios, which provides useful additional views.
UKI-LT2-Brunel
UKI-LT2-IC-HEP	Nagios and Cacti
UKI-LT2-QMUL	OpenNMS, APC NetBotz 550	Improve syslog monitoring	OpenNMS primarily monitors using SNMP. Dell OpenManage is used on Dell servers to provide extended information and SNMP traps; similar HP tools are also used. OpenNMS monitors syslogs for ERROR and higher messages. Room monitoring uses the APC NetBotz solution. OpenNMS also now works with Grafana to provide additional graphs.
UKI-LT2-RHUL
UKI-LT2-UCL-HEP
UKI-NORTHGRID-LANCS-HEP	Elasticsearch/Logstash/Kibana deployed widely on local cluster (as opposed to grid farms). Usual suspects: Icinga, Ganglia, Graphite on both local and grid machines.	Perhaps deploy Logstash on grid farms. Currently deploying Dashing for dashboards.	Logstash is easy to deploy, powerful, and interesting for exploring the rich content. Solid stuff, unlikely to replace these.
UKI-NORTHGRID-LIV-HEP	Nagios, Ganglia, Cacti	Replace Nagios with Icinga, upgrade Ganglia, investigate ELK, Graphite
UKI-NORTHGRID-MAN-HEP
UKI-NORTHGRID-SHEF-HEP
UKI-SCOTGRID-DURHAM
UKI-SCOTGRID-ECDF	Ganglia		Ganglia used for general machine and batch monitoring; bash scripts used for monitoring of running services.
UKI-SCOTGRID-GLASGOW	Naemon (status and alerting), Ganglia/Graphite (metric & time series graphing), Cacti (network monitoring)	Dashboards (Dashing, Grafana), reconsidering network monitoring	We currently use Ganglia for system metrics, Graphite for a higher cluster-level view, and a Nagios plugin to check Graphite thresholds. Deploying Grafana as a unified frontend.
UKI-SOUTHGRID-BHAM-HEP
UKI-SOUTHGRID-BRIS	Older cluster uses Ganglia & Pakiti; newer uses Nagios	Going to ditch Ganglia & Pakiti for Nagios on older cluster.	Wish Munin scaled well!
UKI-SOUTHGRID-CAM-HEP	Nagios and Ganglia
UKI-SOUTHGRID-OX-HEP	Nagios and Ganglia	Some testing of ELK
UKI-SOUTHGRID-RALPP	Nagios, Ganglia, Cacti, Pakiti, Dashing	ELK, new network monitoring (Observium?)
UKI-SOUTHGRID-SUSX
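
The Glasgow entry mentions a Nagios plugin that checks Graphite thresholds. As a rough illustration of that pattern (not the plugin Glasgow actually runs), the sketch below queries Graphite's render API for JSON and maps the latest value onto the standard Nagios plugin exit codes. The function names, the argument order, and the five-minute window are assumptions made for this example.

```python
import json
import urllib.parse
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def latest_value(series):
    """Return the newest non-null value from the first series that has one.

    `series` is the list returned by Graphite's render API with format=json:
    [{"target": "...", "datapoints": [[value, timestamp], ...]}, ...]
    """
    for entry in series:
        for value, _timestamp in reversed(entry.get("datapoints", [])):
            if value is not None:
                return value
    return None


def evaluate(value, warn, crit):
    """Map a metric value onto a Nagios status code and a status line."""
    if value is None:
        return UNKNOWN, "UNKNOWN - no datapoints returned"
    if value >= crit:
        return CRITICAL, f"CRITICAL - value {value} >= {crit}"
    if value >= warn:
        return WARNING, f"WARNING - value {value} >= {warn}"
    return OK, f"OK - value {value}"


def fetch_series(base_url, target, window="-5min"):
    """Query the Graphite render endpoint (the window length is an assumption)."""
    query = urllib.parse.urlencode(
        {"target": target, "from": window, "format": "json"}
    )
    with urllib.request.urlopen(f"{base_url}/render?{query}", timeout=10) as resp:
        return json.load(resp)


def main(argv):
    """Hypothetical usage: check_graphite.py <base_url> <target> <warn> <crit>."""
    if len(argv) != 5:
        print("UNKNOWN - usage: check_graphite.py <base_url> <target> <warn> <crit>")
        return UNKNOWN
    try:
        base_url, target = argv[1], argv[2]
        warn, crit = float(argv[3]), float(argv[4])
        series = fetch_series(base_url, target)
    except Exception as exc:  # network errors, bad JSON, non-numeric thresholds
        print(f"UNKNOWN - check failed: {exc}")
        return UNKNOWN
    status, message = evaluate(latest_value(series), warn, crit)
    print(message)
    return status
```

A wrapper script would finish with `sys.exit(main(sys.argv))` so that Nagios, Icinga or Naemon sees the exit code. Keeping `evaluate` separate from the HTTP call means the threshold logic can be tested without a running Graphite server.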

@@ Line 1: / Line 1: @@
-== Sites batch system status ==
+This page is intended to gather together the tools that sites are currently using to monitor their local sites. Please fill in the details for your site with the following pieces of information:
-This page has been setup to collect information from GridPP sites regarding their batch systems in February 2014. The information will help with wider considerations and strategy. The table seeks the following:
-) Current product (local/shared) - what is the current batch system at the site. Is it locally managed or shared with other groups?
+) Current solution(s): What tools are currently used at the site and for what purpose?
-) Concerns - has your site experienced any problems with the batch system in operation?
+) Future plans: What plans (if any) does your site have for future monitoring?
-) Interest/Investigating/Testing - Does your site already have plans to change and if so to what. If not are you actively investigating or testing any alternatives?
+) Notes: Any other information you think might be useful.
-) CE type(s) - What CE type (gLite, ARC...) do you currently run and do you plan to change this, perhaps in conjunction with a batch system move?
-) Cloud interface(s)? - Does your site offer access to resources in ways other than via a CE?
+{|class="wikitable"
+|+
-) Notes - Any other information you wish to share on this topic.
+|-style="background:#7C8AAF;color:white"
+|style="width: 10%"|
+Site
+|Current solution(s)
+|Future plans
+|Notes
+|-
+|RAL Tier-1
+|Nagios/Icinga, ganglia, cacti for networking, home grown dashboard - mimic
+|Starting to look at elasticsearch
+|Use Thruk interface to Nagios which provides useful additional views.
+|-
+|UKI-LT2-Brunel
+|
+|
+|
+|-
+|UKI-LT2-IC-HEP
+|Nagios and Cacti
+|
+|
-{|class="wikitable"
+|-
-|+
+|UKI-LT2-QMUL
+| OpenNMS, APC NetBotz 550
+| Improve syslog monitoring
+| OpenNMS primarily monitors using SNMP. Dell OpenManage is used on Dell servers to provide extended information and SNMP traps; similar HP tools are also used. OpenNMS monitors syslogs for ERROR and higher messages. Room monitoring uses the APC NetBotz solution. OpenNMS also now works with Grafana to provide additional graphs.
-|-style="background:#7C8AAF;color:white"
+|-
-|Site
+|UKI-LT2-RHUL
-|Current product (local/shared)
+|
-|Concerns and observations
+|
-|Interest/Investigating/Testing
+|
-|CE type(s) & plans at site
-|Cloud interface available/plans
+|-
-|Notes
+|UKI-LT2-UCL-HEP
+|
+|
+|
+|-
+|UKI-NORTHGRID-LANCS-HEP
+| elasticsearch/logstash/kibana deployed widely on local cluster (as opposed to grid farms)
+Usual suspects: icinga, ganglia, graphite on both local+grid machines
+| perhaps deploy logstash on grid farms
+currently deploying Dashing for dashboards
+| Logstash easy to deploy, powerful, and interesting to explore the rich content
+Solid stuff, unlikely to replace these
+|-
+|UKI-NORTHGRID-LIV-HEP
+| Nagios, Ganglia, Cacti
+| Replace Nagios with Icinga, upgrade Ganglia, investigate ELK, Graphite
+|
+|-
+|UKI-NORTHGRID-MAN-HEP
+|
+|
+|
+|-
+|UKI-NORTHGRID-SHEF-HEP
+|
+|
+|
+|-
+|UKI-SCOTGRID-DURHAM
+|
+|
+|
+|-
+|UKI-SCOTGRID-ECDF
+|Ganglia
+|
+|Ganglia used for general machine and batch monitoring, bash scripts used for monitoring of running services
+|-
+|UKI-SCOTGRID-GLASGOW
+|Naemon (status and alerting), Ganglia/Graphite (metric & time series graphing), Cacti (network monitoring)
+|Dashboards (Dashing, Grafana), reconsidering network monitoring
+|We currently use Ganglia for system metrics, Graphite for a higher cluster-level view, and a Nagios plugin to check Graphite thresholds. Deploying Grafana as a unified frontend.
+|-
+|UKI-SOUTHGRID-BHAM-HEP
+|
+|
+|
+|-
+|UKI-SOUTHGRID-BRIS
+|older cluster Ganglia & Pakiti; newer uses Nagios
+|Going to ditch Ganglia & Pakiti for nagios on older cluster.
+|Wish Munin scaled well!
+|-
+|UKI-SOUTHGRID-CAM-HEP
+|Nagios and Ganglia
+|
+|
+|-
+|UKI-SOUTHGRID-OX-HEP
+|Nagios and Ganglia
+|Some testing of ELK
+|
+|-
+|UKI-SOUTHGRID-RALPP
+|Nagios, Ganglia, Cacti, Pakiti, Dashing
+|ELK, new network monitoring (Observium?)
+|
 |-
-|RAL Tier-1
+|UKI-SOUTHGRID-SUSX
-|<span style="color:green">HTCondor (local)</span>
+|
-|<span style="color:green">None</span>
+|
-|<span style="color:green">No reason to change</span>
-|<span style="color:green">ARC & CREAM CEs, but would like to decommission CREAM CEs eventually</span>
-|<span style="color:green"></span>
 |
 |}
