Skip to content

Commit dc4518f

Browse files
committed
[FLINK-36439][docs] Documents for Disaggregated State
1 parent aa6a8db commit dc4518f

File tree

5 files changed

+393
-3
lines changed

5 files changed

+393
-3
lines changed

docs/content.zh/docs/dev/datastream/fault-tolerance/state_v2.md

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -35,7 +35,9 @@ to learn about the concepts behind stateful stream processing.
3535
The new state API is designed to be more flexible than the previous API. User can perform
3636
asynchronous state operations, thus making it more powerful and more efficient.
3737
The asynchronous state access is essential for the state backend to be able to handle
38-
large state sizes and to be able to spill to remote file systems when necessary.
38+
large state sizes and to be able to spill to remote file systems when necessary.
39+
This is called the 'disaggregated state management'. For more information about this,
40+
please see [Disaggregated State Management]({{< ref "docs/ops/state/disaggregated_state" >}}).
3941

4042
## Keyed DataStream
4143

Lines changed: 193 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,193 @@
1+
---
2+
title: "Disaggregated State Management"
3+
weight: 20
4+
type: docs
5+
aliases:
6+
- /ops/state/disaggregated_state.html
7+
- /apis/streaming/disaggregated_state.html
8+
---
9+
<!--
10+
Licensed to the Apache Software Foundation (ASF) under one
11+
or more contributor license agreements. See the NOTICE file
12+
distributed with this work for additional information
13+
regarding copyright ownership. The ASF licenses this file
14+
to you under the Apache License, Version 2.0 (the
15+
"License"); you may not use this file except in compliance
16+
with the License. You may obtain a copy of the License at
17+
18+
http://www.apache.org/licenses/LICENSE-2.0
19+
20+
Unless required by applicable law or agreed to in writing,
21+
software distributed under the License is distributed on an
22+
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
23+
KIND, either express or implied. See the License for the
24+
specific language governing permissions and limitations
25+
under the License.
26+
-->
27+
28+
# Disaggregated State Management
29+
30+
## Overview
31+
32+
For the first ten years of Flink, the state management is based on memory or local disk of the TaskManager.
33+
This approach works well for most use cases, but it has some limitations:
34+
* **Local Disk Constraints**: The state size is limited by the memory or disk size of the TaskManager.
35+
* **Spiky Resource Usage**: The local state model triggers periodic CPU and network I/O bursts during checkpointing or SST files compaction.
36+
* **Heavy Recovery**: State needs to be downloaded during recovery. The recovery time is
37+
proportional to the state size, which can be slow for large state sizes.
38+
39+
In Flink 2.0, we introduced the disaggregated state management. This feature allows users to store
40+
the state in external storage systems like S3, HDFS, etc. This is useful when the state size
41+
is extremely large. It could be used to store the state in a more cost-effective way, or to
42+
persist or recovery the state in a more lightweight way. The benefits of disaggregated state management are:
43+
* **Unlimited State Size**: The state size is only limited by the external storage system.
44+
* **Stable Resource Usage**: The state is stored in external storage, thus the checkpoint could be very lightweight.
45+
And the SST files compaction could be done remotely (TODO).
46+
* **Fast Recovery**: No need to download the state during recovery. The recovery time is
47+
independent of the state size.
48+
* **Flexible**: Users can easily choose different external storage systems or I/O performance levels,
49+
or scale the storage based on their requirements without change their hardware.
50+
* **Cost-effective**: External storage are usually cheaper than local disk. Users can flexibly
51+
adjust computing resources and storage resources independently if there is any bottleneck.
52+
53+
The disaggregated state management contains three parts:
54+
* **ForSt State Backend**: A state backend that stores the state in external storage systems. It
55+
can also leverage the local disk for caching and buffering. The asynchronous I/O model is used to
56+
read and write the state. For more details, see [ForSt State Backend]({{< ref "docs/ops/state/state_backends#the-forststatebackend" >}}).
57+
* **New State APIs**: The new state APIs (State V2) are introduced to perform asynchronous state
58+
reads and writes, which is essential for overcoming the high network latency when accessing
59+
the disaggregated state. For more details, see [New State APIs]({{< ref "docs/dev/datastream/fault-tolerance/state_v2" >}}).
60+
* **SQL Support**: Many SQL operators are rewritten to support the disaggregated state management
61+
and asynchronous state access. User can easily enable these by setting the configuration.
62+
63+
{{ <hint info >}}
64+
Disaggregated state and asynchronous state access are encouraged for large state. However, when
65+
the state size is small, the local state management with synchronous state access is a better
66+
choice.
67+
{{ </hint >}}
68+
69+
{{ <hint info >}}
70+
The disaggregated state management is still in experimental state. We are working on improving
71+
the performance and stability of this feature. The APIs and configurations may change in future
72+
release.
73+
{{ </hint >}}
74+
75+
## Quick Start
76+
77+
### For SQL Jobs
78+
79+
To enable the disaggregated state management in SQL jobs, you can set the following configurations:
80+
```yaml
81+
state.backend.type: forst
82+
table.exec.async-state.enabled: true
83+
84+
# enable checkpoints, checkpoint directory is required
85+
execution.checkpointing.incremental: true
86+
execution.checkpointing.dir: s3://your-bucket/flink-checkpoints
87+
88+
# We don't support the mini-batch and two-phase aggregation in asynchronous state access yet.
89+
table.exec.mini-batch.enabled: false
90+
table.optimizer.agg-phase-strategy: ONE_PHASE
91+
```
92+
Thus, you could leverage the disaggregated state management and asynchronous state access in
93+
your SQL jobs. We haven't implemented the full support for the asynchronous state access
94+
in SQL yet. If the SQL operators you are using are not supported, the operator will fall back
95+
to the synchronous state implementation automatically. The performance may not be optimal in
96+
this case. The supported stateful operators are:
97+
- Rank (Top1, Append TopN)
98+
- Row Time Deduplicate
99+
- Aggregate (without distinct)
100+
- Join
101+
- Window Join
102+
- Tumble / Hop / Cumulative Window Aggregate
103+
104+
105+
### For DataStream Jobs
106+
107+
To enable the disaggregated state management in DataStream jobs, firstly you should use
108+
the `ForStStateBackend`. Configure via code in per-job mode:
109+
```java
110+
Configuration config = new Configuration();
111+
config.set(StateBackendOptions.STATE_BACKEND, "forst");
112+
config.set(CheckpointingOptions.CHECKPOINTS_DIRECTORY, "s3://your-bucket/flink-checkpoints");
113+
config.set(CheckpointingOptions.INCREMENTAL_CHECKPOINTS, true);
114+
env.configure(config);
115+
```
116+
Or configure via `config.yaml`:
117+
```yaml
118+
state.backend.type: forst
119+
120+
# enable checkpoints, checkpoint directory is required
121+
execution.checkpointing.incremental: true
122+
execution.checkpointing.dir: s3://your-bucket/flink-checkpoints
123+
```
124+
125+
Then, you should write your datastream jobs with the new state APIs. For more
126+
details, see [State V2]({{< ref "docs/dev/datastream/fault-tolerance/state_v2" >}}).
127+
128+
## Advanced Tuning Options
129+
130+
### Tuning ForSt State Backend
131+
132+
The `ForStStateBackend` has many configurations to tune the performance.
133+
The design of ForSt is very similar to RocksDB, and the configurable options are almost the same,
134+
so you can refer to [large state tuning]({{< ref "docs/ops/state/large_state_tuning#tuning-rocksdb-or-forst" >}})
135+
to tune the ForSt state backend.
136+
137+
Besides that, the following sections introduce some unique configurations for ForSt.
138+
139+
#### ForSt Primary Storage Location
140+
141+
By default, ForSt stores the state in the checkpoint directory. In this case,
142+
ForSt could perform lightweight checkpoints and fast recovery. However, users may
143+
want to store the state in a different location, e.g., a different bucket in S3.
144+
You can set the following configuration to specify the primary storage location:
145+
```yaml
146+
state.backend.forst.primary-dir: s3://your-bucket/forst-state
147+
```
148+
149+
**Note**: If you set this configuration, you may not be able to leverage the lightweight
150+
checkpoint and fast recovery, since the ForSt will perform file copy between the primary
151+
storage location and the checkpoint directory during checkpointing and recovery.
152+
153+
154+
#### ForSt File Cache
155+
156+
ForSt uses the local disk for caching and buffering. The granularity of the cache is whole file.
157+
This is enabled by default, except when the primary storage location is set to local.
158+
There are two capacity limit policies for the cache:
159+
- Size-based: The cache will evict the oldest files when the cache size exceeds the limit.
160+
- Reserved-based: The cache will evict the oldest files when the reserved space on disk
161+
(the disk where cache directory is) is not enough.
162+
Corresponding configurations are:
163+
```yaml
164+
state.backend.forst.cache.size-based-limit: 1GB
165+
state.backend.forst.cache.reserve-size: 10GB
166+
```
167+
Those can take effect together. If so, the cache will evict the oldest files when the cache
168+
size exceeds either the size-based limit or the reserved size limit.
169+
170+
One can also specify the cache directory via:
171+
```yaml
172+
state.backend.forst.cache.dir: /tmp/forst-cache
173+
```
174+
175+
#### ForSt Asynchronous Threads
176+
177+
ForSt uses asynchronous I/O to read and write the state. There are three types of threads:
178+
- Coordinator thread: The thread that coordinates the asynchronous read and write.
179+
- Read thread: The thread that reads the state asynchronously.
180+
- Write thread: The thread that writes the state asynchronously.
181+
182+
The number of asynchronous threads is configurable. Typically, you don't need to adjust these
183+
values since the default values are good enough for most cases.
184+
In case for special needs, you can set the following configuration to specify the number of
185+
asynchronous threads:
186+
- `state.backend.forst.executor.read-io-parallelism`: The number of asynchronous threads for read. Default is 3.
187+
- `state.backend.forst.executor.write-io-parallelism`: The number of asynchronous threads for write. Default is 1.
188+
- `state.backend.forst.executor.inline-write`: Whether to inline the write operation in the coordinator thread.
189+
Default is true. Setting this to false will raise the CPU usage.
190+
- `state.backend.forst.executor.inline-coordinator`: Whether to let task thread be the coordinator thread.
191+
Default is true. Setting this to false will raise the CPU usage.
192+
193+
{{ <top> }}

docs/content/docs/dev/datastream/fault-tolerance/state_v2.md

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -35,7 +35,9 @@ to learn about the concepts behind stateful stream processing.
3535
The new state API is designed to be more flexible than the previous API. User can perform
3636
asynchronous state operations, thus making it more powerful and more efficient.
3737
The asynchronous state access is essential for the state backend to be able to handle
38-
large state sizes and to be able to spill to remote file systems when necessary.
38+
large state sizes and to be able to spill to remote file systems when necessary.
39+
This is called the 'disaggregated state management'. For more information about this,
40+
please see [Disaggregated State Management]({{< ref "docs/ops/state/disaggregated_state" >}}).
3941

4042
## Keyed DataStream
4143

0 commit comments

Comments
 (0)