
ObjectStore failure resulting in rabbit-virus-like behaviour, i.e. storage keeps filling up. #23570

@MNKagiri


Describe the bug

RisingWave keeps writing to the state store even when no queries or ingestion are running. The writes continue until there is no space left on the secondary storage device.

Error message/log

```
:recovery_attempt: risingwave_meta::barrier::context::recovery: post applied reschedule for jobs in offline scaling jobs=[63, 177, 13, 132, 178, 192, 189, 182, 162, 15]
2025-10-23T17:58:28.731216541Z  INFO      rw-standalone-meta failure_recovery{error=gRPC request to stream service failed: Internal error: failed to complete epoch: 6 4294967295 EpochPair { curr: 9437643190108160, prev: 9437643121360896 } Storage error: Hummock error: Other error: failed to sync: ObjectStore failed with IO error: Unexpected (persistent) at Writer::close, context: { timeout: 5 } => io operation timeout reached
```
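One way to confirm the runaway writes is to sample the size of the MinIO data directory while the cluster is idle. This is a diagnostic sketch only; the container name `minio-0` and the path `/data/hummock001` come from the compose file below, so adjust them if your deployment differs.

```shell
# Print the size of the Hummock bucket once a minute; steadily
# growing numbers while the cluster is idle confirm the behaviour.
while true; do
  docker exec minio-0 du -sh /data/hummock001
  sleep 60
done
```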

To Reproduce

1. Create a source, a materialized view, a table, and an external sink (e.g. to ClickHouse).
2. Periodically drop and recreate the sources and materialized views, often with a schema change.
3. Leave the machine idle for a few hours.
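A minimal script in the spirit of the steps above (a sketch, not the exact DDL from my deployment: the topic name, column list, and ClickHouse coordinates are placeholders, and the Kafka broker address assumes the `message_queue` service from the compose file below):

```sql
-- Hypothetical source reading from the Redpanda service in the compose file.
CREATE SOURCE events (id INT, payload VARCHAR, ts TIMESTAMP)
WITH (
    connector = 'kafka',
    topic = 'events',
    properties.bootstrap.server = 'message_queue:29092'
) FORMAT PLAIN ENCODE JSON;

CREATE TABLE dim (id INT PRIMARY KEY, name VARCHAR);

CREATE MATERIALIZED VIEW events_per_id AS
SELECT id, COUNT(*) AS cnt FROM events GROUP BY id;

-- Placeholder ClickHouse coordinates.
CREATE SINK events_sink FROM events_per_id
WITH (
    connector = 'clickhouse',
    type = 'upsert',
    primary_key = 'id',
    clickhouse.url = 'http://clickhouse:8123',
    clickhouse.user = 'default',
    clickhouse.password = '',
    clickhouse.database = 'default',
    clickhouse.table = 'events_per_id'
);

-- Then, periodically:
DROP SINK events_sink;
DROP MATERIALIZED VIEW events_per_id;
DROP SOURCE events;
-- ...and recreate them with a changed schema.
```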

Expected behavior

I did not expect the SSD backing the object store to fill up while the system was idle.

How did you deploy RisingWave?

via Docker Compose. Here it is:


```yaml
x-image: &image
  image: ${RW_IMAGE:-risingwavelabs/risingwave:v2.5.1}
services:
  risingwave-standalone:
    <<: *image
    command: "standalone --meta-opts=\" \
                    --listen-addr 0.0.0.0:5690 \
                    --advertise-addr 0.0.0.0:5690 \
                    --dashboard-host 0.0.0.0:5691 \
                    --prometheus-host 0.0.0.0:1250 \
                    --prometheus-endpoint http://prometheus-0:9500 \
                    --backend sql \
                    --sql-endpoint postgres://postgres:@postgres-0:5432/metadata \
                    --state-store hummock+minio://hummockadmin:hummockadmin@minio-0:9301/hummock001 \
                    --data-directory hummock_001 \
                    --config-path /risingwave.toml\" \
                 --compute-opts=\" \
                    --config-path /risingwave.toml \
                    --listen-addr 0.0.0.0:5688 \
                    --prometheus-listener-addr 0.0.0.0:1250 \
                    --advertise-addr 0.0.0.0:5688 \
                    --async-stack-trace verbose \
                    --parallelism 8 \
                    --total-memory-bytes 6442450944 \
                    --role both \
                    --meta-address http://0.0.0.0:5690 \
                    --memory-manager-target-bytes 6012954215\" \
                 --frontend-opts=\" \
                    --config-path /risingwave.toml \
                    --listen-addr 0.0.0.0:4566 \
                    --advertise-addr 0.0.0.0:4566 \
                    --prometheus-listener-addr 0.0.0.0:1250 \
                    --health-check-listener-addr 0.0.0.0:6786 \
                    --meta-addr http://0.0.0.0:5690 \
                    --frontend-total-memory-bytes=1073741824\" \
                 --compactor-opts=\" \
                    --listen-addr 0.0.0.0:6660 \
                    --prometheus-listener-addr 0.0.0.0:1250 \
                    --advertise-addr 0.0.0.0:6660 \
                    --meta-address http://0.0.0.0:5690 \
                    --compactor-total-memory-bytes=1073741824\""
    expose:
      - "6660"
      - "4566"
      - "5688"
      - "5690"
      - "1250"
      - "5691"
    ports:
      - "4566:4566"
      - "5690:5690"
      - "5691:5691"
      - "1250:1250"
    depends_on:
      - postgres-0
      - minio-0
    volumes:
      - "./risingwave.toml:/risingwave.toml"
    environment:
      RUST_BACKTRACE: "1"
      # If ENABLE_TELEMETRY is not set, telemetry will start by default
      ENABLE_TELEMETRY: ${ENABLE_TELEMETRY:-true}
      RW_TELEMETRY_TYPE: ${RW_TELEMETRY_TYPE:-"docker-compose"}
      RW_SECRET_STORE_PRIVATE_KEY_HEX: ${RW_SECRET_STORE_PRIVATE_KEY_HEX:-0123456789abcdef0123456789abcdef}
      RW_LICENSE_KEY: ${RW_LICENSE_KEY:-}
    container_name: risingwave-standalone
    healthcheck:
      test:
        - CMD-SHELL
        - bash -c 'printf "GET / HTTP/1.1\n\n" > /dev/tcp/127.0.0.1/6660; exit $$?;'
        - bash -c 'printf "GET / HTTP/1.1\n\n" > /dev/tcp/127.0.0.1/5688; exit $$?;'
        - bash -c '> /dev/tcp/127.0.0.1/4566; exit $$?;'
        - bash -c 'printf "GET / HTTP/1.1\n\n" > /dev/tcp/127.0.0.1/5690; exit $$?;'
      interval: 1s
      timeout: 5s
    restart: always
    deploy:
      resources:
        limits:
          memory: 8G
        reservations:
          memory: 8G

  postgres-0:
    image: "postgres:15-alpine"
    environment:
      - POSTGRES_HOST_AUTH_METHOD=trust
      - POSTGRES_USER=postgres
      - POSTGRES_DB=metadata
      - POSTGRES_INITDB_ARGS=--encoding=UTF-8 --lc-collate=C --lc-ctype=C
    expose:
      - "5432"
    ports:
      - "8432:5432"
    volumes:
      - "postgres-0:/var/lib/postgresql/data"
    healthcheck:
      test: [ "CMD-SHELL", "pg_isready -U postgres" ]
      interval: 2s
      timeout: 5s
      retries: 5
    restart: always

  grafana-0:
    image: "grafana/grafana-oss:latest"
    command: [ ]
    expose:
      - "3001"
    ports:
      - "3001:3001"
    depends_on: [ ]
    volumes:
      - "grafana-0:/var/lib/grafana"
      - "./grafana.ini:/etc/grafana/grafana.ini"
      - "./grafana-risedev-datasource.yml:/etc/grafana/provisioning/datasources/grafana-risedev-datasource.yml"
      - "./grafana-risedev-dashboard.yml:/etc/grafana/provisioning/dashboards/grafana-risedev-dashboard.yml"
      - "./dashboards:/dashboards"
    environment: {}
    container_name: grafana-0
    healthcheck:
      test:
        - CMD-SHELL
        - bash -c 'printf "GET / HTTP/1.1\n\n" > /dev/tcp/127.0.0.1/3001; exit $$?;'
      interval: 1s
      timeout: 5s
      retries: 5
    restart: always

  minio-0:
    image: "quay.io/minio/minio:latest"
    command:
      - server
      - "--address"
      - "0.0.0.0:9301"
      - "--console-address"
      - "0.0.0.0:9400"
      - /data
    expose:
      - "9301"
      - "9400"
    ports:
      - "9301:9301"
      - "9400:9400"
    depends_on: [ ]
    volumes:
      - "minio-0:/data"
    entrypoint: "
      /bin/sh -c '

      set -e

      mkdir -p \"/data/hummock001\"

      /usr/bin/docker-entrypoint.sh \"$$0\" \"$$@\"

      '"
    environment:
      MINIO_CI_CD: "1"
      MINIO_PROMETHEUS_AUTH_TYPE: public
      MINIO_PROMETHEUS_URL: "http://prometheus-0:9500"
      MINIO_ROOT_PASSWORD: hummockadmin
      MINIO_ROOT_USER: hummockadmin
      MINIO_DOMAIN: "minio-0"
    container_name: minio-0
    healthcheck:
      test:
        - CMD-SHELL
        - bash -c 'printf \"GET / HTTP/1.1\n\n\" > /dev/tcp/127.0.0.1/9301; exit $$?;'
      interval: 1s
      timeout: 5s
      retries: 5
    restart: always

  prometheus-0:
    image: "prom/prometheus:latest"
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--web.console.libraries=/usr/share/prometheus/console_libraries"
      - "--web.console.templates=/usr/share/prometheus/consoles"
      - "--web.listen-address=0.0.0.0:9500"
      - "--storage.tsdb.retention.time=30d"
    expose:
      - "9500"
    ports:
      - "9500:9500"
    depends_on: [ ]
    volumes:
      - "prometheus-0:/prometheus"
      - "./prometheus.yaml:/etc/prometheus/prometheus.yml"
    environment: {}
    container_name: prometheus-0
    healthcheck:
      test:
        - CMD-SHELL
        - sh -c 'printf "GET /-/healthy HTTP/1.0\n\n" | nc localhost 9500; exit $$?;'
      interval: 1s
      timeout: 5s
      retries: 5
    restart: always

  message_queue:
    image: "redpandadata/redpanda:latest"
    command:
      - redpanda
      - start
      - "--smp"
      - "1"
      - "--reserve-memory"
      - 0M
      - "--memory"
      - 4G
      - "--overprovisioned"
      - "--node-id"
      - "0"
      - "--check=false"
      - "--kafka-addr"
      - "PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092"
      - "--advertise-kafka-addr"
      - "PLAINTEXT://message_queue:29092,OUTSIDE://localhost:9092"
    expose:
      - "29092"
      - "9092"
      - "9644"
    ports:
      - "29092:29092"
      - "9092:9092"
      - "9644:9644"
      - "8081:8081"
    depends_on: [ ]
    volumes:
      - "message_queue:/var/lib/redpanda/data"
    environment: {}
    container_name: message_queue
    healthcheck:
      test: curl -f localhost:9644/v1/status/ready
      interval: 1s
      timeout: 5s
      retries: 5
    restart: always

volumes:
  postgres-0:
    external: false
  grafana-0:
    external: false
  minio-0:
    external: false
  prometheus-0:
    external: false
  message_queue:
    external: false
```

The version of RisingWave

Version 2.5.1

Additional context

No response

Labels: type/bug