🐛 Endless loop on reload: Fluent Bit stops log processing and handling of SIGHUP/SIGTERM #10670

@CharlieR-o-o-t

Description

Fluent Bit enters an endless loop on reload under specific conditions. It stops processing logs and no longer handles SIGHUP or SIGTERM signals properly.

Steps to reproduce the problem

  1. Start loki (on localhost)
  2. Start fluent-bit with the provided config. Make sure Retry_Limit is set to no_limits, or Hot_Reload.Ensure_Thread_Safety is set to true:
./build/bin/fluent-bit -c ../fluent-test/fluent.conf  -v  -Y -W
  3. Stop the Loki service. Fluent Bit detects that the connection is down and keeps trying to re-establish it:
[2025/08/01 11:13:58] [trace] [upstream] get new connection for localhost:3100, net setup:
net.connect_timeout        = 10 seconds
net.source_address         = any
net.keepalive              = enabled
net.keepalive_idle_timeout = 30 seconds
net.max_worker_connections = 0
[2025/08/01 11:13:58] [debug] [net] socket #63 could not connect to 127.0.0.1:3100
[2025/08/01 11:13:58] [debug] [net] could not connect to localhost:3100
[2025/08/01 11:13:58] [debug] [upstream] connection #-1 failed to localhost:3100
[2025/08/01 11:13:58] [trace] [upstream] destroy connection #-1 to localhost:3100
[2025/08/01 11:13:58] [error] [output:loki:loki.0] no upstream connections available
  4. Append some messages to the log files matched by fluent.conf:
echo foo >> /home/raskin/kind/fluent-test/fluent/log/foo.log

The input plugin picks up the new lines, collects them into chunks on the filesystem, and tries to push them to the output:

[2025/08/01 11:13:58] [debug] [input:tail:tail.0] inode=25839013, /home/raskin/kind/dev/fluent-test/fluent/log/app.log, events: IN_MODIFY
[2025/08/01 11:13:58] [trace] [input chunk] update output instances with new chunk size diff=4096, records=1, input=tail.0
[2025/08/01 11:13:58] [trace] [task 0x7d003006b750] created (id=0)
[2025/08/01 11:13:58] [debug] [task] created task=0x7d003006b750 id=0 OK

[2025/08/01 11:13:58] [debug] [retry] new retry created for task_id=0 attempts=1
[2025/08/01 11:13:58] [ warn] [engine] failed to flush chunk '4077192-1754036038.387954750.flb', retry in 11 seconds: task_id=0, input=tail.0 > output=loki.0 (out_id=0)
  5. Send SIGHUP to fluent-bit:
kill -SIGHUP $(pidof fluent-bit)

After the reload, the chunk flush retry interval changes to 1 second. Fluent Bit keeps trying to deliver the logs:

[2025/08/01 11:14:31] [debug] [retry] re-using retry for task_id=0 attempts=4
[2025/08/01 11:14:31] [ warn] [engine] failed to flush chunk '4077192-1754036038.387954750.flb', retry in 1 seconds: task_id=0, input=tail.0 > output=loki.0 (out_id=0)
[2025/08/01 11:14:32] [ info] [task] tail/tail.0 has 1 pending task(s):
[2025/08/01 11:14:32] [ info] [task]   task_id=0 still running on route(s): loki/loki.0 
  6. Start Loki again. Fluent Bit re-establishes the Loki connection and delivers the chunk:
[2025/08/01 11:14:44] [ info] [engine] flush chunk '4077192-1754036038.387954750.flb' succeeded at retry 16: task_id=0, input=tail.0 > output=loki.0 (out_id=0)
[2025/08/01 11:14:44] [debug] [task] destroy task=0x7d003006b750 (task_id=0)
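
The change in retry interval seen in the steps above (11 seconds before the reload, 1 second after) can be pulled out of a saved log. This is only a sketch: the function name `retry_intervals` and the idea of saving the log to a file are assumptions for illustration, and the pattern relies solely on the `failed to flush chunk ... retry in N seconds` wording shown in the excerpts.

```shell
#!/bin/sh
# retry_intervals LOGFILE
# Hypothetical helper: prints, one per line, the retry delay (in seconds)
# from every "failed to flush chunk ... retry in N seconds" log line.
retry_intervals() {
    sed -n 's/.*failed to flush chunk .*, retry in \([0-9][0-9]*\) seconds.*/\1/p' "$1"
}
```

Running `retry_intervals` against a captured log file lists the delays in order, which makes the switch to 1-second retries after SIGHUP easy to spot.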

Observe results:
Fluent Bit no longer processes new log input.
If you send SIGHUP again, the following messages appear in the Fluent Bit log:

[2025/08/01 11:14:58] [engine] caught signal (SIGHUP)
[2025/08/01 11:14:58] [error] reloading in progress, aborting.

This shows that the reload did not complete properly.
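
The stuck-reload symptom can be checked for mechanically. The sketch below assumes the Fluent Bit output has been captured to a file; the function name `reload_stuck` is hypothetical, and it matches only the literal error text quoted above.

```shell
#!/bin/sh
# reload_stuck LOGFILE
# Hypothetical helper: returns 0 if the log contains the
# "reloading in progress, aborting." error, non-zero otherwise.
reload_stuck() {
    grep -Fq 'reloading in progress, aborting.' "$1"
}
```

For example, `reload_stuck fluent-bit.log && echo "reload never finished"`, where `fluent-bit.log` is a placeholder for wherever the log was saved.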

Expected behavior
Fluent Bit completes the reload, resumes handling input and output, and keeps responding to SIGHUP and SIGTERM.
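
One way to verify the SIGTERM part of this expectation is to send the signal and poll until the process disappears. This is a minimal sketch, not part of Fluent Bit: the function name `terminates_on_sigterm`, the default 10-second timeout, and the return codes are all assumptions chosen for illustration.

```shell
#!/bin/sh
# terminates_on_sigterm PID [TIMEOUT_SECONDS]
# Hypothetical helper: sends SIGTERM, then polls once per second.
# Returns 0 if the process exited within the timeout,
#         1 if it is still alive afterwards (likely hung),
#         2 if the signal could not be sent at all.
# Note: uses the global variables pid, timeout and elapsed.
terminates_on_sigterm() {
    pid=$1
    timeout=${2:-10}
    kill -TERM "$pid" 2>/dev/null || return 2
    elapsed=0
    # kill -0 only tests whether the process can still be signalled.
    while kill -0 "$pid" 2>/dev/null; do
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

With the setup from this report, `terminates_on_sigterm "$(pidof fluent-bit)" 10` would be expected to return 1 while the bug is present, assuming the timeout is longer than the configured Shutdown_Grace of 5 seconds.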

abnormal_behavior.log
hangs_gdb_output.txt

Your Environment
Operating System and version: any Linux; mine is Ubuntu 24.04.2 LTS, kernel 6.8.0-60-generic
Fluent Bit version: any recent stable version.
Run arguments:

./build/bin/fluent-bit -c ../fluent-test/fluent.conf  -v  -Y -W

Output: Loki v3.5.0 (the issue affects all output plugins that support retries)

Fluent Bit should be started with hot reload enabled and with either Hot_Reload.Ensure_Thread_Safety set to true or Retry_Limit set to no_limits.
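
A quick way to tell whether a given classic-format config meets this precondition is to grep for either setting. This is a rough sketch: the function name `config_meets_preconditions` is hypothetical, and it matches only the literal values used in this report (`true` and `no_limits`, case-insensitively), not every spelling Fluent Bit may accept.

```shell
#!/bin/sh
# config_meets_preconditions CONFFILE
# Hypothetical helper: returns 0 if the file sets either
#   Retry_Limit no_limits
# or
#   Hot_Reload.Ensure_Thread_Safety true
# on a line of its own (leading/trailing whitespace allowed).
config_meets_preconditions() {
    grep -Eiq '^[[:space:]]*(Retry_Limit[[:space:]]+no_limits|Hot_Reload\.Ensure_Thread_Safety[[:space:]]+true)[[:space:]]*$' "$1"
}
```

For example, `config_meets_preconditions ../fluent-test/fluent.conf && echo "config can trigger the hang"`.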

🔽 Fluent-bit config

[Service]
    Hot_Reload.Ensure_Thread_Safety true
    Shutdown_Grace 5
    Log_Level trace 
    Http_Server    true
    Parsers_File    /home/raskin/kind/fluent-test/fluent/parsers.conf
    Parsers_File    /home/raskin/kind/fluent-test/fluent/parsers_multiline.conf
    storage.path    /home/raskin/kind/fluent-test/fluent/buffer/
    storage.sync    normal
    storage.checksum    off
    storage.backlog.mem_limit   5MB
    storage.metrics    on
    storage.max_chunks_up   5 
    storage.delete_irrecoverable_chunks    on
    storage.backlog.flush_on_shutdown Off

[Input]
    Name    tail
    Path    /home/raskin/kind/fluent-test/fluent/log/*.log
    Read_from_Head    true
    Refresh_Interval    60
    Skip_Long_Lines    true
    DB    /home/raskin/kind/fluent-test/fluent/tail/pos.db
    DB.Sync    Normal
    Mem_Buf_Limit    32MB
    Parser    cri
    Tag    kube.*
    storage.type    filesystem
    storage.pause_on_chunks_overlimit    off
    Log_Level trace

[Output]
    Name    loki
    Log_Level    trace
    Match_Regex    .*
    Retry_Limit   no_limits 
    host   localhost 
    port    3100
    auto_kubernetes_labels    off
    tls    Off

Additional context
A PR is ready; I plan to fix this myself.
