fix(input.mqtt_consumer): Correct issue where a network blip could result in no messages flowing through the mqtt consumer despite a connection to the topic. #18167
+15
−13


Summary
There is an issue in the MQTT consumer where the client reconnects to the topic after a network blip but may fail to receive messages from it. Using tcpdump, one can see the messages flowing over the network, but the MQTT consumer does not receive or process them. Restarting telegraf is the only way to fix this issue. This issue can also come up if the server hosting the topic is rebooted.
It appears that the MQTT library used "under the hood" does not cope well when an external caller manually disconnects and reconnects. Comments in the library suggest enabling auto-reconnect instead, so the library can reconnect on its own. To accommodate this change, a new handler function was needed for when the MQTT library reconnects. This function resubscribes to the topics of interest on reconnect, as the subscriptions are lost during a reconnection.
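The resubscribe-on-reconnect pattern described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the code in this PR: `subscriber`, `fakeClient`, `dropConnection`, and `resubscribeAll` are hypothetical names invented for the example. Assuming the underlying library is the Eclipse Paho Go client, the real wiring would go through its `SetAutoReconnect(true)` and `SetOnConnectHandler(...)` client options, with the handler body doing what `resubscribeAll` does here.

```go
package main

import (
	"fmt"
	"sort"
)

// subscriber is a hypothetical stand-in for the small part of an MQTT
// client that the handler needs. It is not the real library interface.
type subscriber interface {
	Subscribe(topic string, qos byte) error
}

// fakeClient simulates a client whose subscriptions are forgotten
// whenever the connection drops.
type fakeClient struct {
	active map[string]byte
}

// Subscribe records the topic as active, rejecting empty topic names.
func (c *fakeClient) Subscribe(topic string, qos byte) error {
	if topic == "" {
		return fmt.Errorf("empty topic")
	}
	c.active[topic] = qos
	return nil
}

// dropConnection models a network blip: all subscriptions are lost.
func (c *fakeClient) dropConnection() {
	c.active = map[string]byte{}
}

// resubscribeAll is what an on-connect handler would run after every
// (re)connect: subscribe again to each topic of interest.
func resubscribeAll(c subscriber, topics []string, qos byte) error {
	for _, topic := range topics {
		if err := c.Subscribe(topic, qos); err != nil {
			return fmt.Errorf("subscribing to %q: %w", topic, err)
		}
	}
	return nil
}

// activeTopics returns the currently subscribed topics in sorted order.
func activeTopics(c *fakeClient) []string {
	out := make([]string, 0, len(c.active))
	for topic := range c.active {
		out = append(out, topic)
	}
	sort.Strings(out)
	return out
}

func main() {
	topics := []string{"sensors/temp", "sensors/humidity"}
	c := &fakeClient{active: map[string]byte{}}

	// Initial connect: the handler subscribes to every topic.
	if err := resubscribeAll(c, topics, 1); err != nil {
		fmt.Println("error:", err)
		return
	}

	// Network blip: subscriptions are gone even if the link comes back.
	c.dropConnection()
	fmt.Println("active after drop:", len(c.active))

	// Auto-reconnect fires the same handler again, restoring them.
	if err := resubscribeAll(c, topics, 1); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("active after reconnect:", activeTopics(c))
}
```

The point of the design is that the same handler runs on the first connect and on every later reconnect, so the plugin never has to call disconnect and connect itself.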
Testing
To replicate / test this change, I set up a system whose network connection toggled between connected and disconnected every minute (one minute of connection, one minute of no connection, and so on indefinitely).
telegraf (version: 1.37.0) would stop processing messages from the topic between 3 and 10 minutes into the test.
Checklist
Related issues
resolves #16293, #16035
Misc
This is my first pull request for this repo. Let me know if anything looks off or is not up to standards and I can get it corrected.