Processing of data depends on bottlenecks in the whole workflow. At the same time, prepared events sit in queues, and some reports can produce a huge amount of events. The current ways to rate-limit bot processing are sleeps between iterations and the wait expert, which can hold a message until a given condition is met (e.g. the queue size).
On the other hand, our queueing system is based primarily on Redis and is entirely in-memory, which is great for performance but leads to problematic behaviour when events accumulate: Redis can fill up the memory, causing OOM errors (both simply because of the amount of live events and while performing an RDB backup, when the forked snapshot process temporarily increases memory usage). In addition, an OOM kill during an RDB backup leads to broken RDB files that can also fill up the disk space. Some workarounds are possible, e.g. using KeyDB with on-disk storage (although KeyDB seems to be abandoned).
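A partial mitigation on the Redis side (independent of IntelMQ; the values here are only illustrative) is to cap the broker's memory so that writes fail with an error instead of Redis being OOM-killed:

```
# redis.conf: cap memory usage; with noeviction, writes beyond the
# limit return an error instead of evicting data or risking an OOM kill
maxmemory 4gb
maxmemory-policy noeviction
```

This does not solve the underlying problem, as the failed writes would still need to be handled in the pipeline, but it keeps the broker itself alive.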
The rate-limiting mechanisms currently available in IntelMQ fail to prevent this. They can only hold events, but are unable to prevent generating new ones once a message reaches a bot.
As a real example, ShadowServer has a few informative reports, like Device Identification. They can generate a huge amount of events, and the parser works significantly quicker than saving to the database does. We can use the wait bot to hold a report until the DB bot's queue is empty, but once the report has reached the parser, we cannot stop it from flooding the system.
As a solution, I propose building in optional, simple rate limiting, similar to how the wait bot works: after sending the generated event to the pipeline, the bot's class could check the size of a queue and, if necessary, wait until it is free enough.
I would not like to implement it on a per-bot basis, but at least directly in the ParserBot class, preferably in the generic Bot or the Pipeline (as it is a Redis-related solution, other pipelines may not need it or may require a different solution). This way, the rate limiting could be used in any bot and slow down the production of new events when a designated bottleneck is not keeping up with the work.
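A minimal sketch of the idea above, assuming a hypothetical API (neither `wait_for_queue` nor `DummyPipeline` exists in IntelMQ; `DummyPipeline` merely simulates a consumer draining a Redis list between checks, standing in for an `LLEN` call):

```python
import time


class DummyPipeline:
    """Stand-in for a Redis-backed pipeline; returns a shrinking queue size."""

    def __init__(self, initial_length):
        self._length = initial_length

    def queue_length(self, queue_name):
        # Simulate a consumer draining 5000 events between checks.
        current = self._length
        self._length = max(0, self._length - 5000)
        return current


def wait_for_queue(pipeline, queue_name, size_limit=10000, check_interval=0.01):
    """Block until the destination queue holds fewer than `size_limit` events.

    Returns the number of wait cycles, for illustration only.
    """
    waited = 0
    while pipeline.queue_length(queue_name) >= size_limit:
        time.sleep(check_interval)
        waited += 1
    return waited


pipe = DummyPipeline(initial_length=20000)
print(wait_for_queue(pipe, "sql-output-queue"))  # → 3
```

In a real implementation, the natural place for such a check would be right after the bot sends a message to its destination queue; exposing the size limit and check interval as runtime parameters would keep the feature optional.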
An alternative would be to provide more advanced conditions, like available RAM etc., but that is in my eyes too complicated a solution for a simple problem.