| 1 | +--- |
| 2 | +title: "Disaggregated State Management" |
| 3 | +weight: 20 |
| 4 | +type: docs |
| 5 | +aliases: |
| 6 | + - /ops/state/disaggregated_state.html |
| 7 | + - /apis/streaming/disaggregated_state.html |
| 8 | +--- |
| 9 | +<!-- |
| 10 | +Licensed to the Apache Software Foundation (ASF) under one |
| 11 | +or more contributor license agreements. See the NOTICE file |
| 12 | +distributed with this work for additional information |
| 13 | +regarding copyright ownership. The ASF licenses this file |
| 14 | +to you under the Apache License, Version 2.0 (the |
| 15 | +"License"); you may not use this file except in compliance |
| 16 | +with the License. You may obtain a copy of the License at |
| 17 | + |
| 18 | + http://www.apache.org/licenses/LICENSE-2.0 |
| 19 | + |
| 20 | +Unless required by applicable law or agreed to in writing, |
| 21 | +software distributed under the License is distributed on an |
| 22 | +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| 23 | +KIND, either express or implied. See the License for the |
| 24 | +specific language governing permissions and limitations |
| 25 | +under the License. |
| 26 | +--> |
| 27 | + |
| 28 | +# Disaggregated State Management |
| 29 | + |
| 30 | +## Overview |
| 31 | + |
| 32 | +For the first ten years of Flink, state management was based on the memory or local disk of the TaskManager. |
| 33 | +This approach works well for most use cases, but it has some limitations: |
| 34 | + * **Local Disk Constraints**: The state size is limited by the memory or disk size of the TaskManager. |
| 35 | + * **Spiky Resource Usage**: The local state model triggers periodic CPU and network I/O bursts during checkpointing or SST file compaction. |
| 36 | + * **Heavy Recovery**: State needs to be downloaded during recovery. The recovery time is |
| 37 | +proportional to the state size, so recovery can be slow when the state is large. |
| 38 | + |
| 39 | +Flink 2.0 introduces disaggregated state management. This feature allows users to store |
| 40 | +the state in external storage systems such as S3 or HDFS, which is useful when the state size |
| 41 | +is extremely large. It can be used to store the state more cost-effectively, or to |
| 42 | +persist and recover the state in a more lightweight way. The benefits of disaggregated state management are: |
| 43 | + * **Unlimited State Size**: The state size is only limited by the external storage system. |
| 44 | + * **Stable Resource Usage**: The state is stored in external storage, so checkpoints can be very lightweight. |
| 45 | +SST file compaction could also be done remotely (TODO). |
| 46 | + * **Fast Recovery**: No need to download the state during recovery. The recovery time is |
| 47 | +independent of the state size. |
| 48 | + * **Flexible**: Users can easily choose different external storage systems or I/O performance levels, |
| 49 | +or scale the storage based on their requirements without changing their hardware. |
| 50 | + * **Cost-effective**: External storage is usually cheaper than local disk. Users can |
| 51 | +adjust computing and storage resources independently if either becomes a bottleneck. |
| 52 | + |
| 53 | +Disaggregated state management consists of three parts: |
| 54 | + * **ForSt State Backend**: A state backend that stores the state in external storage systems. It |
| 55 | +can also leverage the local disk for caching and buffering. The asynchronous I/O model is used to |
| 56 | +read and write the state. For more details, see [ForSt State Backend]({{< ref "docs/ops/state/state_backends#the-forststatebackend" >}}). |
| 57 | + * **New State APIs**: The new state APIs (State V2) are introduced to perform asynchronous state |
| 58 | +reads and writes, which is essential for overcoming the high network latency when accessing |
| 59 | +the disaggregated state. For more details, see [New State APIs]({{< ref "docs/dev/datastream/fault-tolerance/state_v2" >}}). |
| 60 | + * **SQL Support**: Many SQL operators have been rewritten to support disaggregated state management |
| 61 | +and asynchronous state access. Users can enable these through configuration. |
| 62 | + |
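The reason asynchronous access matters for remote state can be illustrated with a minimal sketch. It uses only the JDK's `CompletableFuture` and an in-memory map with simulated latency; it does not use Flink's State V2 API, and the class `AsyncStateSketch` and all of its methods are hypothetical names chosen for this illustration. Each read-modify-write is expressed as a callback chain, so several keys can have I/O in flight at the same time instead of waiting for each round trip in turn.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Illustrative model only; this is not Flink's State V2 API. */
class AsyncStateSketch {
    // Stands in for state kept in remote storage (assumption for this sketch).
    private final Map<String, Long> remote = new ConcurrentHashMap<>();
    // Stands in for a pool of I/O threads.
    private final ExecutorService io = Executors.newFixedThreadPool(4);
    private final long latencyMillis;

    AsyncStateSketch(long latencyMillis) {
        this.latencyMillis = latencyMillis;
    }

    private void simulateLatency() {
        try {
            TimeUnit.MILLISECONDS.sleep(latencyMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Starts a read and returns immediately with a future. */
    CompletableFuture<Long> asyncGet(String key) {
        return CompletableFuture.supplyAsync(() -> {
            simulateLatency();
            return remote.getOrDefault(key, 0L);
        }, io);
    }

    /** Starts a write and returns immediately with a future. */
    CompletableFuture<Void> asyncPut(String key, long value) {
        return CompletableFuture.runAsync(() -> {
            simulateLatency();
            remote.put(key, value);
        }, io);
    }

    /** Read-modify-write as a callback chain; the caller is not blocked. */
    CompletableFuture<Void> increment(String key) {
        return asyncGet(key).thenCompose(count -> asyncPut(key, count + 1));
    }

    /**
     * Starts one increment per key so that their I/O overlaps, then waits for all.
     * Keys must be distinct: this sketch does not order operations on the same key.
     */
    void incrementAll(List<String> keys) {
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        for (String key : keys) {
            pending.add(increment(key));
        }
        CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
    }

    long get(String key) {
        return remote.getOrDefault(key, 0L);
    }

    void shutdown() {
        io.shutdown();
    }

    public static void main(String[] args) {
        AsyncStateSketch state = new AsyncStateSketch(20);
        state.incrementAll(List.of("a", "b", "c"));
        System.out.println("a=" + state.get("a") + " b=" + state.get("b") + " c=" + state.get("c"));
        state.shutdown();
    }
}
```

With synchronous access, three keys would pay the simulated latency six times in sequence (one read and one write each). In the sketch the three chains run concurrently, which is the effect the asynchronous state APIs aim for.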
| 63 | +{{< hint info >}} |
| 64 | +Disaggregated state and asynchronous state access are recommended for large state. However, when |
| 65 | +the state size is small, local state management with synchronous state access is a better |
| 66 | +choice. |
| 67 | +{{< /hint >}} |
| 68 | + |
| 69 | +{{< hint info >}} |
| 70 | +Disaggregated state management is still experimental. We are working on improving |
| 71 | +the performance and stability of this feature. The APIs and configurations may change in future |
| 72 | +releases. |
| 73 | +{{< /hint >}} |
| 74 | + |
| 75 | +## Quick Start |
| 76 | + |
| 77 | +### For SQL Jobs |
| 78 | + |
| 79 | +To enable disaggregated state management in SQL jobs, set the following configuration options: |
| 80 | +```yaml |
| 81 | +state.backend.type: forst |
| 82 | +table.exec.async-state.enabled: true |
| 83 | + |
| 84 | +# enable checkpoints, checkpoint directory is required |
| 85 | +execution.checkpointing.incremental: true |
| 86 | +execution.checkpointing.dir: s3://your-bucket/flink-checkpoints |
| 87 | + |
| 88 | +# Mini-batch and two-phase aggregation are not yet supported with asynchronous state access. |
| 89 | +table.exec.mini-batch.enabled: false |
| 90 | +table.optimizer.agg-phase-strategy: ONE_PHASE |
| 91 | +``` |
| 92 | +With these settings, your SQL jobs use disaggregated state management and asynchronous state |
| 93 | +access. Asynchronous state access is not yet fully supported |
| 94 | +in SQL. If a SQL operator you are using is not supported, it falls back |
| 95 | +to the synchronous state implementation automatically, and performance may not be optimal in |
| 96 | +that case. The supported stateful operators are: |
| 97 | + - Rank (Top1, Append TopN) |
| 98 | + - Row Time Deduplicate |
| 99 | + - Aggregate (without distinct) |
| 100 | + - Join |
| 101 | + - Window Join |
| 102 | + - Tumble / Hop / Cumulative Window Aggregate |
| 103 | + |
| 104 | + |
| 105 | +### For DataStream Jobs |
| 106 | + |
| 107 | +To enable disaggregated state management in DataStream jobs, first use |
| 108 | +the `ForStStateBackend`. You can configure it in code for an individual job: |
| 109 | +```java |
| 110 | +Configuration config = new Configuration(); |
| 111 | +config.set(StateBackendOptions.STATE_BACKEND, "forst"); |
| 112 | +config.set(CheckpointingOptions.CHECKPOINTS_DIRECTORY, "s3://your-bucket/flink-checkpoints"); |
| 113 | +config.set(CheckpointingOptions.INCREMENTAL_CHECKPOINTS, true); |
| 114 | +env.configure(config); |
| 115 | +``` |
| 116 | +Or configure via `config.yaml`: |
| 117 | +```yaml |
| 118 | +state.backend.type: forst |
| 119 | + |
| 120 | +# enable checkpoints, checkpoint directory is required |
| 121 | +execution.checkpointing.incremental: true |
| 122 | +execution.checkpointing.dir: s3://your-bucket/flink-checkpoints |
| 123 | +``` |
| 124 | + |
| 125 | +Then, write your DataStream jobs with the new state APIs. For more |
| 126 | +details, see [State V2]({{< ref "docs/dev/datastream/fault-tolerance/state_v2" >}}). |
| 127 | + |
| 128 | +## Advanced Tuning Options |
| 129 | + |
| 130 | +### Tuning ForSt State Backend |
| 131 | + |
| 132 | +The `ForStStateBackend` has many configuration options for tuning performance. |
| 133 | +The design of ForSt is very similar to RocksDB, and the configurable options are almost the same, |
| 134 | +so you can refer to [large state tuning]({{< ref "docs/ops/state/large_state_tuning#tuning-rocksdb-or-forst" >}}) |
| 135 | +to tune the ForSt state backend. |
| 136 | + |
| 137 | +In addition, the following sections describe configuration options that are unique to ForSt. |
| 138 | + |
| 139 | +#### ForSt Primary Storage Location |
| 140 | + |
| 141 | +By default, ForSt stores the state in the checkpoint directory. In this case, |
| 142 | +ForSt can perform lightweight checkpoints and fast recovery. However, you may |
| 143 | +want to store the state in a different location, e.g., a different bucket in S3. |
| 144 | +Set the following configuration to specify the primary storage location: |
| 145 | +```yaml |
| 146 | +state.backend.forst.primary-dir: s3://your-bucket/forst-state |
| 147 | +``` |
| 148 | + |
| 149 | +**Note**: If you set this configuration, you may not be able to leverage lightweight |
| 150 | +checkpoints and fast recovery, since ForSt will copy files between the primary |
| 151 | +storage location and the checkpoint directory during checkpointing and recovery. |
| 152 | + |
| 153 | + |
| 154 | +#### ForSt File Cache |
| 155 | + |
| 156 | +ForSt uses the local disk for caching and buffering. The cache operates at whole-file granularity. |
| 157 | +It is enabled by default, except when the primary storage location is set to local. |
| 158 | +There are two capacity limit policies for the cache: |
| 159 | + - Size-based: The cache evicts the oldest files when the cache size exceeds the limit. |
| 160 | + - Reserved-based: The cache evicts the oldest files when the free space on the disk |
| 161 | +that holds the cache directory falls below the reserved size. |
| 162 | +The corresponding configuration options are: |
| 163 | +```yaml |
| 164 | +state.backend.forst.cache.size-based-limit: 1GB |
| 165 | +state.backend.forst.cache.reserve-size: 10GB |
| 166 | +``` |
| 167 | +Both policies can be active at the same time. In that case, the cache evicts the oldest files |
| 168 | +whenever either the size-based limit or the reserved-space limit is violated. |
| 169 | + |
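How the two limits interact can be sketched in a few lines of plain Java. This is an illustrative model under stated assumptions, not ForSt's actual cache implementation: the class `FileCacheSketch` and its parameters are hypothetical, "oldest" is modeled as insertion order, and the disk is assumed to hold nothing but the cache, so free space is simply the disk capacity minus the cached bytes.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative model only; not ForSt's actual file cache. */
class FileCacheSketch {
    // Insertion-ordered map from file name to size in bytes; the oldest file comes first.
    private final LinkedHashMap<String, Long> files = new LinkedHashMap<>();
    private final long sizeLimit;    // models state.backend.forst.cache.size-based-limit
    private final long reserveSize;  // models state.backend.forst.cache.reserve-size
    private final long diskCapacity; // hypothetical total disk size (assumption)
    private long cachedBytes = 0;

    FileCacheSketch(long sizeLimit, long reserveSize, long diskCapacity) {
        this.sizeLimit = sizeLimit;
        this.reserveSize = reserveSize;
        this.diskCapacity = diskCapacity;
    }

    /** True if either the size-based or the reserved-space limit is violated. */
    private boolean overLimit() {
        long freeSpace = diskCapacity - cachedBytes;
        return cachedBytes > sizeLimit || freeSpace < reserveSize;
    }

    /** Caches a whole file, then evicts the oldest files until both limits hold. */
    List<String> add(String name, long size) {
        files.put(name, size);
        cachedBytes += size;
        List<String> evicted = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = files.entrySet().iterator();
        while (overLimit() && it.hasNext()) {
            Map.Entry<String, Long> oldest = it.next();
            cachedBytes -= oldest.getValue();
            evicted.add(oldest.getKey());
            it.remove();
        }
        return evicted;
    }

    long cachedBytes() {
        return cachedBytes;
    }

    public static void main(String[] args) {
        // Size limit 100, reserve 50, disk capacity 1000 (arbitrary units).
        FileCacheSketch cache = new FileCacheSketch(100, 50, 1000);
        cache.add("a.sst", 40);
        cache.add("b.sst", 40);
        System.out.println("evicted: " + cache.add("c.sst", 40));
    }
}
```

In the `main` example the third file pushes the cache past the size-based limit, so the oldest file is evicted. With a small disk and a large size limit, the reserved-space check becomes the one that triggers eviction instead.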
| 170 | +You can also specify the cache directory via: |
| 171 | +```yaml |
| 172 | +state.backend.forst.cache.dir: /tmp/forst-cache |
| 173 | +``` |
| 174 | + |
| 175 | +#### ForSt Asynchronous Threads |
| 176 | + |
| 177 | +ForSt uses asynchronous I/O to read and write the state. There are three types of threads: |
| 178 | + - Coordinator thread: The thread that coordinates the asynchronous read and write. |
| 179 | + - Read thread: The thread that reads the state asynchronously. |
| 180 | + - Write thread: The thread that writes the state asynchronously. |
| 181 | + |
| 182 | +The number of asynchronous threads is configurable. Typically, you don't need to adjust these |
| 183 | +values, since the defaults are good enough for most cases. |
| 184 | +If you have special requirements, use the following options to set the number of |
| 185 | +asynchronous threads and where coordination and writes run: |
| 186 | + - `state.backend.forst.executor.read-io-parallelism`: The number of asynchronous threads for read. Default is 3. |
| 187 | + - `state.backend.forst.executor.write-io-parallelism`: The number of asynchronous threads for write. Default is 1. |
| 188 | + - `state.backend.forst.executor.inline-write`: Whether to inline the write operation in the coordinator thread. |
| 189 | +Default is true. Setting this to false will raise the CPU usage. |
| 190 | + - `state.backend.forst.executor.inline-coordinator`: Whether to let task thread be the coordinator thread. |
| 191 | +Default is true. Setting this to false will raise the CPU usage. |
| 192 | + |
| 193 | +{{< top >}} |