Large environments are one of the major use cases for check_multi, in order to
- reduce the number of Nagios servers
- avoid the complexity of a distributed environment
On the other hand, check_multi has clear disadvantages (''There can be only one'') when
- heterogeneous groups participate in the monitoring
- one Nagios object is needed for every monitoring item (= check_multi child check).
This was the basic motivation for an improved implementation with
- the standard check_multi plugin as an active data collector
- an additional report mode which feeds all child results as passive checks into Nagios
- The communication between the Nagios server and the client happens only once per machine, not once per check.
- The multi check is active and therefore under Nagios scheduling control.
- Since the individual services are passive, they do not put load on the Nagios scheduling queue.
- All services are nevertheless full Nagios services: notification, escalation and reporting work as usual.
- The performance gain is enormous: 25,000 services per server are possible (see the Performance section).
- More complexity compared to an active service implementation (but less than a distributed setup)
- The configuration has to be set up twice: once for the active check_multi check and once for the passive checks.
- check_multi acts as a normal active Nagios check and collects checks from a remote host.
- Each child check has a corresponding passive check in Nagios with the same name.
- check_multi takes each child check's output and return code and feeds them into the corresponding passive Nagios check.
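Under the hood, a passive result is just one ''PROCESS_SERVICE_CHECK_RESULT'' line in the Nagios external command file. A minimal sketch of what gets written for a single child check (the command file path is an assumption, adjust it to your installation):
# submit one passive result for the passive service 'system_disk' on 'host1'
now=$(date +%s)
echo "[$now] PROCESS_SERVICE_CHECK_RESULT;host1;system_disk;0;DISK OK" \
    > /usr/local/nagios/var/rw/nagios.cmd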
There is a design problem when executing multiple remote checks within one collector check and then returning the results to the passive side of Nagios: the transport.
- If you run check_multi on the Nagios server, you need a remote connection for each child check: very expensive.
- If you run check_multi on the remote server, it is hard to reach the Nagios input queue on the passive side.
The solution is: use check_multi twice in a command chain:
- check_multi on the remote hosts gathers data.
- check_multi on the Nagios server feeds passive services. The first check_multi passes its results via XML to the second one.
''Note:'' the whole chain is started on the Nagios server, so in the case of DMZ host monitoring no inbound connections are used.
- SSH
check_by_ssh -H <hostname> -c '/path/to/check_multi -f multi.cmd -r 256' | check_multi -f - -r 8192+8+1
- NRPE
check_nrpe -H <hostname> -c check_multi -a '-f multi.cmd -r 256' | check_multi -f - -r 8192+8+1
- NSCA
check_nrpe -H <hostname> -c check_multi -a '-f multi.cmd -r 4096+8+1'
This method needs a running NSCA daemon on the Nagios server. Since inbound connections are used, this approach is not recommended for DMZ setups.
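For completeness, starting the NSCA daemon on the Nagios server looks roughly like this (binary and config paths are assumptions, adjust them to your installation):
/usr/local/nagios/bin/nsca -c /usr/local/nagios/etc/nsca.cfg --daemon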
check_multi can easily be integrated with Sven Nierlein's mod_gearman.
mod_gearman is a NEB module which runs checks over a gearman scheduling framework and passes the results back to Nagios. This allows thousands of checks to be run quite efficiently within a single Nagios instance.
A specific client 'send_multi' is part of the mod_gearman package and can be used to feed the individual child check results into the gearman queues.
This client is a small C binary and consumes far fewer resources than calling check_multi itself to pass the checks into Nagios.
$ check_multi -f multi.cmd -r 256 | \
send_multi --server=<job server> --encryption=no --host="<hostname>" --service="<service>"
- If you want to use only check_multi and no other workers, you can achieve this with the following NEB module settings:
broker_module=/usr/local/share/nagios/mod_gearman.o \
server=localhost \
encryption=no \
eventhandler=no \
hosts=no \
services=no \
hostgroups=does_not_exist
- Encryption is not necessary if you run both the check_multi checks and the Nagios check_results queue on the same server.
This example installation is part of the sample-config directory in the check_multi package.
Note: this is a setup for a single machine; no remote access is included in the configuration.
For a basic understanding of the principle this does not matter anyway ;-)
Recipe:
- download check_multi (latest SVN) and build it:
./configure; make all
- install the feed_passive example files:
cd sample-config/feed_passive
make install-config
This will add a directory ''/path/to/nagios/etc/check_multi/feed_passive''.
- add the feed_passive subdirectory as cfg_dir to nagios.cfg:
cfg_dir=/usr/local/nagios/etc/check_multi/feed_passive
- reload / restart Nagios: et voilà :-P
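A typical reload, assuming a standard init script and default lock file path (adjust to your distribution):
/etc/init.d/nagios reload
# or, equivalently, send a SIGHUP to the Nagios main process
kill -HUP $(cat /usr/local/nagios/var/nagios.lock)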
- Standard sizing is 10 hosts with 10 feed services and 100 passive services
- If you want to put more load on your system, go to ''<nagiosdir>/etc/check_multi/feed_passive'' and run
perl gencfg <nhosts>
then reload Nagios.
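For example, generating a configuration for 50 hosts could look like this (assuming gencfg takes the desired host count as its single argument, as described above):
cd /usr/local/nagios/etc/check_multi/feed_passive
perl gencfg 50
/etc/init.d/nagios reload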
- mandatory - Perl module XML::Simple. Install XML::Simple on the Nagios server, either from your Linux distribution or directly from CPAN. It is only needed on the receiving side (the Nagios server); the senders (remote clients) do not need XML::Simple.
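A quick way to install and verify it (the package name below is the Debian/Ubuntu one; other distributions name it differently):
apt-get install libxml-simple-perl
# or directly from CPAN:
cpan XML::Simple
# verify the module is available:
perl -MXML::Simple -e 'print "XML::Simple $XML::Simple::VERSION\n"'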
- optional - nagios.cfg settings
I recommend setting some attributes for performance tuning and to avoid unnecessary logging:
| setting | comment |
|---|---|
| child_processes_fork_twice=0 | speeds up Nagios, one fork is enough |
| free_child_process_memory=0 | Linux can free memory much faster than Nagios |
| log_initial_states=0 | otherwise each day's log contains one unnecessary line per service |
| log_passive_checks=0 | saves lots of space in nagios.log |
| use_large_installation_tweaks=1 | another performance boost (e.g. no summary macros) |
None of these attributes is mandatory, but they will speed up your infrastructure in large setups.
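Taken together, the corresponding nagios.cfg lines look like this:
# optional performance tweaks in nagios.cfg
child_processes_fork_twice=0
free_child_process_memory=0
log_initial_states=0
log_passive_checks=0
use_large_installation_tweaks=1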
Just as an example, your mileage may vary ;)
#--- multi.cmd
command [ system_disk ] = check_disk -w 5% -c 2% -p /
command [ system_load ] = check_load -w 10,8,6 -c 20,18,16
command [ system_swap ] = check_swap -w 90 -c 80
command [ system_users ] = check_users -w 5 -c 10
command [ procs_num ] = check_procs
command [ procs_cpu ] = check_procs -w 10 -c 20 --metric=CPU -v
command [ procs_mem ] = check_procs -w 100000 -c 2000000 --metric=RSS -v
command [ procs_zombie ] = check_procs -w 1 -c 2 -s Z
command [ proc_cron ] = check_procs -c 1: -C cron
command [ proc_syslogd ] = check_procs -c 1: -C syslogd
#--- avoid redundant states: the child states reach Nagios via the passive services, so the collector itself stays OK
state [ WARNING ] = IGNORE
state [ CRITICAL ] = IGNORE
state [ UNKNOWN ] = IGNORE
This service runs on the remote host and gathers data:
- check_multi report option ''-r 256+4+1'':
- Mandatory: ''-r 256'' as XML output option
- Recommended: ''-r 4'' for ERROR output
- last but not least: ''-r 1'' for detailed results in the status line
- Example:
define service {
service_description multi_feed
host_name host1
check_command check_multi!-f multi_small.cmd -r 256+4+1 -v
event_handler multi_feed_passive
check_interval 5
use local-service
}
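The ''multi_feed_passive'' event handler is what actually feeds the collector output into the passive services; it ships with the sample config. A rough sketch of such a command, assuming the collector output is passed on via the standard Nagios output macros (an illustration, not the sample config's exact definition):
define command {
    command_name    multi_feed_passive
    # pipe the collector's output (including the XML in the long output)
    # into a second check_multi which feeds the passive services
    command_line    printf '%b\n' "$SERVICEOUTPUT$\n$LONGSERVICEOUTPUT$" | $USER1$/check_multi -f - -r 8192+8+1
}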
- Mandatory: ''passive_checks_enabled 1''
- Mandatory: ''active_checks_enabled 0''
- and the rest: YMMV
- Example:
define service {
service_description $THIS_NAME$
host_name $HOSTNAME$
passive_checks_enabled 1
active_checks_enabled 0
check_command check_dummy!0 "passive check"
use local-service
}
You can easily generate these passive services via check_multi report mode 2048:
check_multi -f multi.cmd -r 2048 -s service_definition_template=/path/to/service_definition.tpl > services_passive.cfg
Hint: create a one-liner which loops over your hosts and generates the service definitions in bulk, as sketched below. Whenever a host is added, rerun your script and reload Nagios to put the new passive services into effect.
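A minimal sketch of such a loop (the ''HOSTNAME'' override via ''-s'' is an assumption; check how your template resolves $HOSTNAME$):
for host in host1 host2 host3; do
    check_multi -f multi.cmd -r 2048 -s HOSTNAME=$host \
        -s service_definition_template=/path/to/service_definition.tpl
done > services_passive.cfg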
- If you see the message ''Passive check result was received for service '%s' on host '%s', but the service could not be found!'', double-check the passive service definitions.
- If you want to test with an XML file, you have to use a pipe, e.g.
cat test.xml | check_multi -f - -r 8192
- A common mistake is to specify the cmd file on the receiver side of the pipe, e.g. ''check_multi ... | check_multi -f - -f xyz.cmd''.
This makes you monitor your Nagios server instead of the remote host. Please recall that the pipe has two parts:
- the sender part on the remote side, where the information is gathered (-> input)
- the local receiver part on the Nagios server, where the information is presented (-> output)
(Hint: ''check_multi -f -'' reads from STDIN.)
- Dual core Athlon X2/64 3600 with 1 GB RAM
- Nagios configuration:
- 1000 hosts
- 1000 active services
- 25000 fed passive services
# sar -u
06:00:00 PM CPU %user %nice %system %iowait %idle
06:00:01 PM all 32.28 0.00 28.37 0.71 38.63
06:10:01 PM all 31.66 0.00 27.86 1.05 39.43
06:20:31 PM all 31.60 0.00 28.06 1.25 39.09
06:30:01 PM all 31.61 0.00 28.40 1.19 38.79
06:40:01 PM all 31.55 0.00 28.39 1.16 38.90
06:50:01 PM all 33.68 0.00 28.54 0.92 36.86
# sar -q
06:00:00 PM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15
06:00:01 PM 3 166 1.99 2.52 3.75
06:10:01 PM 3 158 1.91 2.19 2.95
06:20:31 PM 2 155 1.53 1.93 2.44
06:30:01 PM 2 159 2.22 2.11 2.28
06:40:01 PM 2 155 1.76 1.96 2.10
06:50:01 PM 2 165 1.90 2.14 2.14
Nagios Stats 3.1.2
Copyright (c) 2003-2008 Ethan Galstad (www.nagios.org)
Last Modified: 06-23-2009
License: GPL
CURRENT STATUS DATA
------------------------------------------------------
Status File: /usr/local/nagios/var/status.dat
Status File Age: 0d 0h 0m 5s
Status File Version: 3.1.2
Program Running Time: 0d 1h 40m 5s
Nagios PID: 4112
Used/High/Total Command Buffers: 0 / 0 / 4096
Total Services: 26001
Services Checked: 26001
Services Scheduled: 1001
Services Actively Checked: 1001
Services Passively Checked: 25000
Total Service State Change: 0.000 / 7.760 / 0.327 %
Active Service Latency: 0.000 / 1.054 / 0.160 sec
Active Service Execution Time: 0.300 / 3.266 / 0.917 sec
Active Service State Change: 3.750 / 7.760 / 4.246 %
Active Services Last 1/5/15/60 min: 187 / 988 / 1001 / 1001
Passive Service Latency: 0.000 / 0.000 / 0.000 sec
Passive Service State Change: 0.000 / 7.760 / 0.170 %
Passive Services Last 1/5/15/60 min: 4694 / 24676 / 25000 / 25000
Services Ok/Warn/Unk/Crit: 26001 / 0 / 0 / 0
Services Flapping: 186
Services In Downtime: 0
Total Hosts: 1002
Hosts Checked: 1002
Hosts Scheduled: 1002
Hosts Actively Checked: 1002
Hosts Passively Checked:     0
Total Host State Change: 0.000 / 0.000 / 0.000 %
Active Host Latency: 0.971 / 2.042 / 1.366 sec
Active Host Execution Time: 0.024 / 1.150 / 0.065 sec
Active Host State Change: 0.000 / 0.000 / 0.000 %
Active Hosts Last 1/5/15/60 min: 185 / 935 / 1002 / 1002
Passive Host Latency: 0.000 / 0.000 / 0.000 sec
Passive Host State Change: 0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min: 0 / 0 / 0 / 0
Hosts Up/Down/Unreach: 1002 / 0 / 0
Hosts Flapping: 0
Hosts In Downtime: 0
Active Host Checks Last 1/5/15 min: 220 / 968 / 2906
Scheduled: 220 / 968 / 2906
On-demand: 0 / 0 / 0
Parallel: 220 / 968 / 2906
Serial: 0 / 0 / 0
Cached: 0 / 0 / 0
Passive Host Checks Last 1/5/15 min: 0 / 0 / 0
Active Service Checks Last 1/5/15 min: 200 / 1001 / 3003
Scheduled: 200 / 1001 / 3003
On-demand: 0 / 0 / 0
Cached: 0 / 0 / 0
Passive Service Checks Last 1/5/15 min: 4975 / 25000 / 75000
External Commands Last 1/5/15 min: 0 / 0 / 0