|
1 |
| -# Abstract |
2 |
| -With the emergence of NetSaint/Nagios at the latest, this system and its successors/clones |
3 |
| -have relied on a loose group of programs called "Monitoring Plugins" to do the lower level |
4 |
| -task of actually determining the state of a particular entity or conducting measurements of certain |
5 |
| -values. |
6 |
| - |
7 |
| -This document shall help users and especially developers of those programs as a basis |
8 |
| -on how they should be implemented, how they should work and how they should behave. |
9 |
| -It encourages the standardization of libraries, Monitoring Plugins and Monitoring Systems, |
10 |
| -to reduce the cognitive load on users, administrators and developers, if they work with |
11 |
| -different implementations. |
12 |
| - |
13 |
| -These guidelines aim to be as general as possible and not to assume a special |
14 |
| -implementation detail, e.g. the programming language, the install mechanism or the monitoring |
15 |
| -system which executes the Monitoring Plugin. |
16 |
| - |
17 |
| -# Language |
18 |
| -The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", |
19 |
| -"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and |
20 |
| -"OPTIONAL" in this document are to be interpreted as described in |
21 |
| -BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all |
22 |
| -capitals, as shown here. |
23 |
| - |
24 |
| -# Terminology |
25 |
| - |
26 |
| -## Monitoring Plugin |
27 |
| -Is an executable on a _normal_ computer system (meaning a commonly occurring system with an operating system |
28 |
| -based on Linux, FreeBSD, Windows or something similar). |
29 |
| - |
30 |
| -## Monitoring System |
31 |
| -Is software which, for the scope of this document, executes a *Monitoring Plugin*. |
32 |
| - |
33 |
| - |
34 | 1 | # The Monitoring Plugin Interface
|
35 | 2 |
|
36 |
| -## The basic Monitoring Plugin usage |
37 |
| -A Monitoring System executes a Monitoring Plugin. The Monitoring Plugin MAY accept parameters in |
38 |
| -the form of command line arguments, environment variables or a configuration file (the location of which |
39 |
| -MAY in turn be given on the command line or via environment variable). |
40 |
| -The Monitoring Plugin then proceeds to execute its duty and returns the result to the Monitoring System. |
41 |
| -Part of the process of returning the result is the termination of the execution of the Monitoring Plugin itself. |
42 |
| - |
43 |
| -## Input Parameters for a Monitoring Plugin |
44 |
| -A _Monitoring Plugin_ MUST expect input parameters as arguments during execution, if any are needed/expected at all. It MAY accept these parameters |
45 |
| -given as _environment variables_ and it MAY accept them in a configuration file (with a default path or a path given via arguments or _environment variables_). |
46 |
| - |
47 |
| -In general positional arguments are strongly discouraged. |
48 |
| - |
49 |
| -Some arguments MUST have this predetermined meaning, if they are used: |
50 |
| - |
51 |
| -| Argument (long) | Argument (short version, optional) | Argument value | Meaning | optional | can be given multiple times | |
52 |
| -| --- | --- | --- | --- | --- | --- | |
53 |
| -| --help | -h | None | Triggers the help functionality of the _Monitoring Plugin_, showing the individual parameters and their meaning, examples for usage of the _Monitoring Plugin_ and general remarks about the how and why of the _Monitoring Plugin_. SHOULD override all other options, meaning they are ignored if `--help` is given. The _Monitoring Plugin_ SHOULD exit with state UNKNOWN (3). | no | -- (makes no difference) | |
54 |
| -| --version | -V | None | Shows the version of the _Monitoring Plugin_ to allow users to report errors better and therefore help them and the developers. The _Monitoring Plugin_ SHOULD exit with state UNKNOWN (3). | no | -- (makes no difference) | |
55 |
| -| --timeout | -t | Integer (meaning seconds) or a time duration string | Sets a limit for the time which a _Monitoring Plugin_ is given to execute. This is there to enforce the abortion of the test and improve the reaction time of the _Monitoring System_ (e.g. in bad network conditions it might be more helpful to abort the test prematurely and inform the user about that than to try forever to do something which won't succeed. Or, if soft real time constraints are present, a result might be worthless altogether after some time). A sane default is probably 30 seconds, although this depends heavily on the scenario and should be given a thought during development. If the execution is terminated by this timeout, it should exit with state UNKNOWN (3) and (if possible) give some helpful output about the stage of the execution in which the timeout occurred. | no | no | |
56 |
| -| --hostname | -H | String, meaning either a DNS name, an IPv4 or an IPv6 address of the targeted system | If the _Monitoring Plugin_ targets ONE other system on the network, this option should be used to tell it which one. If the _Monitoring Plugin_ does its test just locally or the logic does not apply to it, this option is, of course, optional. | yes | yes | |
57 |
| -| --verbose | -v | None | Increases the verbosity of the output, thereby breaking the suggested rules about a short and concise output. Can be used to debug the _Monitoring Plugin_ and should there expose internals, intermediate results and so on. | yes | yes | |
58 |
| -| --detail | | None | Increases the level of detail in the output, thereby giving the user more information about the result of the test and helping with determining errors and problems or just satisfy some curiosity. SHOULD NOT be used to debug the _Monitoring Plugin_, so there is no need to expose internals. | yes | yes | |
59 |
| -| --exit-ok | | None | The _Monitoring Plugin_ exits unconditionally with OK (0). Mostly useful for the purpose of packaging and testing plugins, but might be used to always ignore errors (e.g. to just collect data). | yes | no | |
60 |
| - |
61 |
| -### Examples |
62 |
| -For the execution with `--help`: |
63 |
| -``` |
64 |
| -$ my_check_plugin --help |
65 |
| -``` |
66 |
| -the output might look like this: |
67 |
| -``` |
68 |
| -my_check_plugin version 3.1.4 |
69 |
| -Licensed under the AGPLv1. |
70 |
| -Repository: git.example.com/jdoe/my_check_plugin |
71 |
| -
|
72 |
| -This plugin just says hello. It fails if you don't give it a name. |
73 |
| -
|
74 |
| -Usage: |
75 |
| - my_check_plugin --name NAME [--greeting GREETING] |
76 |
| -
|
77 |
| -Options: |
78 |
| - --help |
79 |
| - this help |
80 |
| - --version |
81 |
| - Shows the version of the plugin |
82 |
| - --name NAME |
83 |
| - if given, uses NAME as a name to greet. |
84 |
| - --greeting GREETING |
85 |
| - if given, uses GREETING instead of Hello. |
86 |
| -
|
87 |
| -Examples: |
88 |
| -$ my_check_plugin --name Jane |
89 |
| -Hello Jane |
90 |
| -
|
91 |
| -$ my_check_plugin --greeting Ciao --name Alice |
92 |
| -Ciao Alice |
93 |
| -``` |
94 |
| -This imaginary _Monitoring Plugin_ tries to be really helpful here, |
95 |
| -displays the version, the license and the upstream repository with the help |
96 |
| -(although not necessary), has a short description about the purpose, |
97 |
| -lists the options in an easily readable way and even gives some examples. |
98 |
| - |
99 |
| -For the execution with `--version` |
100 |
| -``` |
101 |
| -$ my_check_plugin --version |
102 |
| -``` |
103 |
| -the output might be a bit shorter: |
104 |
| -``` |
105 |
| -my_check_plugin version 3.1.4 |
106 |
| -``` |
107 |
| -or even: |
108 |
| -``` |
109 |
| -3.1.4 |
110 |
| -``` |
111 |
| -where both show the necessary information. |
112 |
| - |
113 |
| - |
114 |
| -## Output of a Monitoring Plugin |
115 |
| -The output of a Monitoring Plugin consists of two parts on the first level, the *Exit Code* and |
116 |
| -output in textual form on _stdout_. |
117 |
| - |
118 |
| -### Exit Code |
119 |
| -The *Monitoring Plugin* MUST make use of the *Exit Code* as a method to communicate a result to |
120 |
| -the *Monitoring System*. Since the *Exit Code* is more or less standardized over different systems |
121 |
| -as an integer number with a width of or greater than 8bit, the following mapping is used: |
122 |
| - |
123 |
| -| *Exit Code* (numerical) | Meaning (short) | Meaning (extended) | |
124 |
| -| --- | --- | --- | |
125 |
| -| 0 | OK | The execution of the *Monitoring Plugin* proceeded as planned and whatever it tests appeared to function properly and the measured values are within their respective thresholds | |
126 |
| -| 1 | WARNING | The execution of the *Monitoring Plugin* proceeded as planned and whatever it tests appeared to *not* function properly or the measured values are *not* within their respective thresholds. The problem(s) do(es) *not* seem exceptionally grave though and do(es) *not* require immediate attention | |
127 |
| -| 2 | CRITICAL | The execution of the *Monitoring Plugin* proceeded as planned and whatever it tests appeared to *not* function properly or the measured values are *not* within their respective thresholds. The problem(s) *do(es)* seem exceptionally grave though and *do(es)* require immediate attention | |
128 |
| -| 3 | UNKNOWN | The execution of the *Monitoring Plugin* *did not* proceed as planned. The reasons might be manifold, e.g. missing permissions, missing libraries, no available network connection to the destination, etc. In summary: The *Monitoring Plugin* could *not* determine the state of whatever it should have been checking and can therefore make no reliable statement about it. | |
129 |
| -| 4-31 | reserved for future use | | |
130 |
| - |
131 |
| -### Textual Output |
132 |
| -The original purpose of the output on _stdout_ was to provide human readable information for the user of the *Monitoring System*, |
133 |
| -a way for the *Monitoring Plugin* to communicate further details on what happened. |
134 |
| -This purpose still exists, but was expanded with the, so called, *perfdata* (performance data) to allow the machine readable |
135 |
| -communication of measured values for further processing in the *Monitoring System*, e.g. for the creation of diagrams. |
136 |
| - |
137 |
| -Therefore the further explanation is split into *human readable output* and *perfdata*. |
138 |
| - |
139 |
| -#### Human readable output |
140 |
| -This part of the output should give a user information about the state of the test and, in the case of problems, ideally hint at what |
141 |
| -the origin of the problem might be or what the symptoms are. If the test relies on numeric values, these might be displayed to |
142 |
| -give a user more information about the specific problem. |
143 |
| -It might consist of one or more lines of printable symbols. |
144 |
| - |
145 |
| -Examples: |
146 |
| -``` |
147 |
| -Remaining space on filesystem "/" is OK |
148 |
| -
|
149 |
| -Sensor temperature is within thresholds |
150 |
| -
|
151 |
| -Available Memory is too low |
152 |
| -
|
153 |
| -Sensor temperature exceeds thresholds |
154 |
| -``` |
155 |
| -are OK, but |
156 |
| -``` |
157 |
| -Remaining space on filesystem "/" is OK ( 62GiB / 128GiB ) |
158 |
| -
|
159 |
| -Sensor temperature is within thresholds ( 42°C ) |
160 |
| -
|
161 |
| -Available Memory is too low ( 126MiB / 32GiB ) |
162 |
| -
|
163 |
| -Sensor temperature exceeds thresholds ( 78°C > 70°C ) |
164 |
| -``` |
165 |
| -are better. |
166 |
| - |
167 |
| -Although no strict guidelines for creating this part of the output can really be given, a developer should |
168 |
| -keep a potential user in mind. It might, for example, be OK to put the output in a single line if there are |
169 |
| -only one or two items of a similar type (think: multiple file systems, multiple sensors, etc.), |
170 |
| -but not if there are 10 or 100, although this might present a valid use case. |
171 |
| -If several different items exist in the output of the *Monitoring Plugin*, henceforth called *partial results*, |
172 |
| -they probably SHOULD be given their own line in the output. |
173 |
| - |
174 |
| -#### Performance data |
175 |
| -In addition to the human readable part the output can contain machine readable measurement values. These data points |
176 |
| -are separated from the human readable part by the "|" symbol which is in effect until the end of the line. |
177 |
| -The performance data then MUST consist of space separated single values, these MUST have the following format: |
178 |
| - |
179 |
| -`'label'=value[UOM][;warn[;crit[;min[;max]]]]` |
180 |
| - |
181 |
| -with the following definitions: |
182 |
| - |
183 |
| - 1. _label_ must consist of at least one non-space character, but can otherwise contain any printable characters except for the equals sign (`=`) or single quotes (`'`). |
184 |
| - If it contains spaces, it must be surrounded by single quotes |
185 |
| - 2. _value_ is a numerical value, either an integer or a floating point number. Using floating point numbers if the value is really discrete SHOULD be avoided. Also the |
186 |
| - representation of a floating point number SHOULD NOT use the "scientific notation" (e.g. `6.02e23` or `-3e-45`), since some systems might not be able to parse them correctly. |
187 |
| - Also values with a base other than 10 SHOULD be avoided (see below for more information on `Byte` values). |
188 |
| - 3. _UOM_ is the _Unit of measurement_ (e.g. "B" for _Bytes_, "s" for seconds) which gives more context to the _Monitoring System_. The following constraints MUST be applied: |
189 |
| - 1. An _UOM_ of `%` MUST be used for percentage values |
190 |
| - 2. An _UOM_ of `c` MUST be used for continuous counters (commonly used for the sum of bytes transmitted on an interface) |
191 |
| - |
192 |
| - The following recommendations SHOULD be applied: |
193 |
| - 1. The _UOM_ for `Byte` values is `B` and although many systems do understand units like |
194 |
| - `KB`,`KiB`, `MB`, `GB`, `TB` they SHOULD be avoided, at the least to avoid the ugly hassle about |
195 |
| - people misinterpreting the *base10* values as *base2* values and the other way round. |
196 |
| - This is also a prime example where floating point numbers SHOULD NOT be used, since there are |
197 |
| - obviously only integer numbers included. |
198 |
| - 2. The _UOM_ for time is `s`, meaning seconds, SI-Prefixes (e.g. `ms` for milli seconds) are allowed if |
199 |
| - necessary or useful, but be aware, that many systems may not understand `μs` for micro seconds and expect |
200 |
| - `us` instead. |
201 |
| - 3. In general, SI units and SI prefixes SHOULD be used as _UOM_ if applicable, but the _Monitoring System_ |
202 |
| - may not understand them correctly (mostly in uncommon cases). In such cases appropriate workarounds |
203 |
| - MAY be applied on the side of the _Monitoring Plugin_, but it would be nice to make the developer |
204 |
| - of the _Monitoring System_ aware of the problem. |
205 |
| - |
206 |
| - 4. _warn_ and _crit_ are the threshold values for this measurement, which may have been given by the user as input, may be hardcoded in the _Monitoring Plugin_ |
207 |
| - or may be retrieved from a file or a device or somewhere else during the execution of the tests. The unit used MUST be the same as for _value_. |
208 |
| - These values are not simple numbers, but _range_ expressions. |
209 |
| - 5. _min_ and _max_ are the minimal and the maximal value, respectively, that _value_ could possibly take. The unit is the same as for _value_. |
210 |
| - These values can be omitted, if the _value_ is a percentage value, since _min_ and _max_ are always `0` and `100` in this case. |
211 |
| - |
212 |
| -## Range expressions |
| 3 | +This repository aims to provide a project and language independent definition |
| 4 | +of the _Monitoring Plugin_ interface as a point of reference and a starting point |
| 5 | +for future developments. |
213 | 6 |
|
214 |
| -In many cases thresholds for metrics mark a certain range of values where the value is considered to be good or bad depending on whether it is inside or outside. |
215 |
| -While for a significant number of metrics an upper (e.g. load on unixoid systems) or lower (e.g. effective throughput, free |
216 |
| -space in memory or storage) border might suffice, for some it does not, for example a temperature value from a temperature |
217 |
| -sensor should be within a certain range (let's say 10℃ to 45℃). |
| 7 | +It intends to help to write plugins which are compatible with the plethora |
| 8 | +of _Monitoring Systems_ which are similar to Nagios, are easy to use and useful |
| 9 | +for users of those _Monitoring Systems_. |
218 | 10 |
|
219 |
| -Regarding input parameters this might be handled with options like `--critical-upper-temperature` and `--critical-lower-temperature`, |
220 |
| -but this presents a problem with the performance data, if only scalar values could be used. |
221 |
| -To resolve this situation the _Range expression_ format was introduced, with the following definition: |
| 11 | +The document is split into different parts to separate different topics |
| 12 | +(if possible) and make searching and changing parts easier. |
222 | 13 |
|
223 |
| -`[@][start:][end]` |
224 |
| -where: |
225 |
| - 1. `start` <= `end` |
226 |
| - 2. If `start` == 0, then it can be omitted. |
227 |
| - 3. If `end` is omitted, it has the "value" of positive infinity. |
228 |
| - 4. Negative infinity can be specified with `~`. |
229 |
| - 5. If the prefix `@` is NOT given, the value exceeds the threshold if it is OUTSIDE of the range between `start` and `end` (including the endpoints). |
230 |
| - 6. If the prefix `@` IS given, the value exceeds the threshold if it is INSIDE the range between `start` and `end` (including the endpoints). |
231 |
| - 7. Contrary to the short definition above, an empty _Range expression_ is not a valid one, at least either `start` or `end` must be provided. |
| 14 | +[Preface - Introduction to the document](preface.md) |
232 | 15 |
|
233 |
| -### Examples |
| 16 | +## The Monitoring Plugins Interface |
| 17 | + 1. [Basics](monitoring_plugins_interface/01.Basics.md) |
| 18 | + 1. [Input specification](monitoring_plugins_interface/02.Input.md) |
| 19 | + 1. [Output specification](monitoring_plugins_interface/03.Output.md) |
234 | 20 |
|
235 |
| -| Range definition | Exceeds threshold if x...| |
236 |
| -| --- | --- | |
237 |
| -| 10 | < 0 or > 10, (outside the range of {0 .. 10}) | |
238 |
| -| 10: | < 10, (outside {10 .. ∞}) | |
239 |
| -| ~:10 | > 10, (outside the range of {-∞ .. 10}) | |
240 |
| -| 10:20 | < 10 or > 20, (outside the range of {10 .. 20}) | |
241 |
| -| @10:20 | ≥ 10 and ≤ 20, (inside the range of {10 .. 20}) | |
| 21 | +## Additional definitions for relevant parts |
| 22 | + 1. [Range Expressions](definitions/01.range_expressions.md) |
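
The _Range expression_ rules shown in the removed section of the diff above (`[@][start:][end]`) can be sketched in Python as follows. This is only an illustration under stated assumptions, not part of the specification: the function names `parse_range` and `exceeds_threshold` are hypothetical, and the endpoints are treated as belonging to the range, as the examples table suggests.

```python
import math


def parse_range(expr: str) -> tuple[bool, float, float]:
    """Parse '[@][start:][end]' into (inside, start, end).

    Hypothetical helper, not defined by the guidelines themselves.
    """
    inside = expr.startswith("@")
    body = expr[1:] if inside else expr

    # Without a colon the whole body is `end` and `start` is omitted.
    if ":" in body:
        start_text, end_text = body.split(":", 1)
    else:
        start_text, end_text = "", body

    # Rule 7: an empty range expression is not valid.
    if start_text == "" and end_text == "":
        raise ValueError("empty range expression")

    if start_text == "":
        start = 0.0            # rule 2: an omitted start means 0
    elif start_text == "~":
        start = -math.inf      # rule 4: '~' means negative infinity
    else:
        start = float(start_text)

    # Rule 3: an omitted end means positive infinity.
    end = math.inf if end_text == "" else float(end_text)

    # Rule 1: start <= end
    if start > end:
        raise ValueError("start must not be greater than end")
    return inside, start, end


def exceeds_threshold(value: float, expr: str) -> bool:
    """Return True if `value` exceeds the threshold described by `expr`."""
    inside, start, end = parse_range(expr)
    in_range = start <= value <= end
    # Rules 5 and 6: '@' flips the check from "outside" to "inside".
    return in_range if inside else not in_range
```

For instance, `exceeds_threshold(15, "@10:20")` is true because 15 lies inside the range, while `exceeds_threshold(15, "10:20")` is false.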
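
The performance data format from the removed section, `'label'=value[UOM][;warn[;crit[;min[;max]]]]`, can likewise be illustrated with a small formatter. This is a sketch under assumptions: the names `format_number` and `format_perfdata` are hypothetical, floats are limited to six decimal places to avoid scientific notation, and an omitted field that is followed by a given one is emitted as an empty placeholder, which the format definition above does not explicitly cover.

```python
def format_number(value: float) -> str:
    """Render a number without scientific notation (assumed: 6 decimals max)."""
    if isinstance(value, int):
        return str(value)
    return f"{value:.6f}".rstrip("0").rstrip(".")


def format_perfdata(label, value, uom="", warn=None, crit=None,
                    minimum=None, maximum=None) -> str:
    """Build one perfdata item; warn and crit are range expression strings."""
    if not label.strip() or "=" in label or "'" in label:
        raise ValueError("invalid label")
    # Labels containing spaces must be surrounded by single quotes.
    quoted = f"'{label}'" if " " in label else label

    fields = [f"{quoted}={format_number(value)}{uom}"]
    optional = [
        warn,
        crit,
        None if minimum is None else format_number(minimum),
        None if maximum is None else format_number(maximum),
    ]
    # Drop trailing omitted fields; keep inner ones as empty placeholders.
    while optional and optional[-1] is None:
        optional.pop()
    fields.extend("" if item is None else item for item in optional)
    return ";".join(fields)


# Human readable part and perfdata are separated by the "|" symbol.
line = "Sensor temperature is within thresholds ( 42°C ) | " + format_perfdata(
    "temperature", 42, warn="10:45", crit="0:70"
)
```

Several items would be joined with single spaces after the `|` separator, since the perfdata consists of space separated values.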