Description
Fluent Bit enters an endless loop on reload under specific conditions. It stops processing logs and no longer handles SIGHUP or SIGTERM signals properly.
Steps to reproduce the problem
- Start loki (on localhost)
- Start fluent-bit with the provided config. Make sure Retry_Limit is set to no_limits, or Hot_Reload.Ensure_Thread_Safety is set to True
./build/bin/fluent-bit -c ../fluent-test/fluent.conf -v -Y -W
- Stop the loki service. Fluent-bit detects that the connection is down and keeps trying to re-establish it.
[2025/08/01 11:13:58] [trace] [upstream] get new connection for localhost:3100, net setup:
net.connect_timeout = 10 seconds
net.source_address = any
net.keepalive = enabled
net.keepalive_idle_timeout = 30 seconds
net.max_worker_connections = 0
[2025/08/01 11:13:58] [debug] [net] socket #63 could not connect to 127.0.0.1:3100
[2025/08/01 11:13:58] [debug] [net] could not connect to localhost:3100
[2025/08/01 11:13:58] [debug] [upstream] connection #-1 failed to localhost:3100
[2025/08/01 11:13:58] [trace] [upstream] destroy connection #-1 to localhost:3100
[2025/08/01 11:13:58] [error] [output:loki:loki.0] no upstream connections available
- Push some messages to logs mentioned in fluent.conf
echo foo >> /home/raskin/kind/fluent-test/fluent/log/foo.log
The input plugin picks up the new lines, collects them into chunks on the filesystem, and tries to push them to the output
[2025/08/01 11:13:58] [debug] [input:tail:tail.0] inode=25839013, /home/raskin/kind/dev/fluent-test/fluent/log/app.log, events: IN_MODIFY
[2025/08/01 11:13:58] [trace] [input chunk] update output instances with new chunk size diff=4096, records=1, input=tail.0
[2025/08/01 11:13:58] [trace] [task 0x7d003006b750] created (id=0)
[2025/08/01 11:13:58] [debug] [task] created task=0x7d003006b750 id=0 OK
[2025/08/01 11:13:58] [debug] [retry] new retry created for task_id=0 attempts=1
[2025/08/01 11:13:58] [ warn] [engine] failed to flush chunk '4077192-1754036038.387954750.flb', retry in 11 seconds: task_id=0, input=tail.0 > output=loki.0 (out_id=0)
- Send SIGHUP to fluent-bit
kill -SIGHUP $(pidof fluent-bit)
After the reload, the chunk flush retry interval changes to 1 second. Fluent-bit keeps trying to deliver the logs.
[2025/08/01 11:14:31] [debug] [retry] re-using retry for task_id=0 attempts=4
[2025/08/01 11:14:31] [ warn] [engine] failed to flush chunk '4077192-1754036038.387954750.flb', retry in 1 seconds: task_id=0, input=tail.0 > output=loki.0 (out_id=0)
[2025/08/01 11:14:32] [ info] [task] tail/tail.0 has 1 pending task(s):
[2025/08/01 11:14:32] [ info] [task] task_id=0 still running on route(s): loki/loki.0
- Start loki:
Fluent-bit re-establishes the loki connection and delivers the chunk.
[2025/08/01 11:14:44] [ info] [engine] flush chunk '4077192-1754036038.387954750.flb' succeeded at retry 16: task_id=0, input=tail.0 > output=loki.0 (out_id=0)
[2025/08/01 11:14:44] [debug] [task] destroy task=0x7d003006b750 (task_id=0)
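The change in retry interval around the SIGHUP (11 seconds before, 1 second after) is easy to miss in a trace-level log. A minimal sketch for pulling the intervals out of a captured log, assuming the log was written to a file; the function name retry_intervals and the file argument are hypothetical, and the pattern is taken from the "failed to flush chunk" lines above:

```shell
# Hypothetical helper: print the retry interval (in seconds) from every
# "retry in N seconds" log line in the given file, one number per line.
retry_intervals() {
    sed -n 's/.*retry in \([0-9][0-9]*\) seconds.*/\1/p' "$1"
}

# Example usage (the log path is an assumption):
#   retry_intervals /tmp/fluent-bit.log
```

A drop from a multi-second interval to 1 right after the SIGHUP timestamp matches the behavior described in the steps above.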
Observe results:
Fluent-bit no longer processes new log input.
If you send SIGHUP again, the following messages appear in the fluent-bit log:
[2025/08/01 11:14:58] [engine] caught signal (SIGHUP)
[2025/08/01 11:14:58] [error] reloading in progress, aborting.
This shows that the reload did not complete properly.
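To check for this state without reading the log by hand, something like the following could be used. It is only a sketch: the function name reload_stuck and the log file argument are hypothetical, and the matched string is copied from the error line above.

```shell
# Hypothetical helper: succeed (exit status 0) if the given log file contains
# the message Fluent Bit prints when a SIGHUP arrives during a reload that
# has not finished.
reload_stuck() {
    grep -qF 'reloading in progress, aborting.' "$1"
}

# Example usage (the log path is an assumption):
#   if reload_stuck /tmp/fluent-bit.log; then echo "reload appears stuck"; fi
```

Note that the message only appears after a second SIGHUP, so the check is meaningful only once that signal has been sent.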
Expected behavior
Fluent-bit reloads and resumes handling input and output. It should keep responding to the SIGHUP and SIGTERM signals.
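The SIGTERM part of this expectation can be checked with a small helper along these lines. This is a sketch under assumptions: the name term_and_wait and the default 10 second timeout are hypothetical, and it relies only on the standard kill utility.

```shell
# Hypothetical helper: send SIGTERM to a PID and poll for up to N seconds.
# Returns 0 if the process is gone (or was already gone), 1 if it is still
# alive when the timeout expires.
term_and_wait() {
    pid=$1
    timeout=${2:-10}
    # If the signal cannot be delivered, treat the process as already gone.
    kill -TERM "$pid" 2>/dev/null || return 0
    i=0
    while [ "$i" -lt "$timeout" ]; do
        # kill -0 sends no signal; it only tests whether the PID exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done
    if kill -0 "$pid" 2>/dev/null; then
        return 1
    fi
    return 0
}

# Example usage:
#   term_and_wait "$(pidof fluent-bit)" 10 || echo "fluent-bit ignored SIGTERM"
```

In the hung state described in this issue, the helper would be expected to return 1, since the process no longer handles SIGTERM properly.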
abnormal_behavior.log
hangs_gdb_output.txt
Your Environment
Operating System and version: any linux, mine: Ubuntu 24.04.2 LTS, kernel 6.8.0-60-generic
Fluent-bit version: any recent stable version.
Run arguments:
./build/bin/fluent-bit -c ../fluent-test/fluent.conf -v -Y -W
Output: loki v3.5.0 (affecting all output plugins supporting retry)
Fluent-bit should be started with Hot_Reload and Ensure_Thread_Safety set to True, or Retry_Limit set to no_limits.
🔽 Fluent-bit config
[Service]
Hot_Reload.Ensure_Thread_Safety true
Shutdown_Grace 5
Log_Level trace
Http_Server true
Parsers_File /home/raskin/kind/fluent-test/fluent/parsers.conf
Parsers_File /home/raskin/kind/fluent-test/fluent/parsers_multiline.conf
storage.path /home/raskin/kind/fluent-test/fluent/buffer/
storage.sync normal
storage.checksum off
storage.backlog.mem_limit 5MB
storage.metrics on
storage.max_chunks_up 5
storage.delete_irrecoverable_chunks on
storage.backlog.flush_on_shutdown Off
[Input]
Name tail
Path /home/raskin/kind/fluent-test/fluent/log/*.log
Read_from_Head true
Refresh_Interval 60
Skip_Long_Lines true
DB /home/raskin/kind/fluent-test/fluent/tail/pos.db
DB.Sync Normal
Mem_Buf_Limit 32MB
Parser cri
Tag kube.*
storage.type filesystem
storage.pause_on_chunks_overlimit off
Log_Level trace
[Output]
Name loki
Log_Level trace
Match_Regex .*
Retry_Limit no_limits
host localhost
port 3100
auto_kubernetes_labels off
tls Off
Additional context
A PR is ready; I plan to fix this myself.