
Commit f9a7e68

Update docs to prepare for 0.9 release (#5605)
* Update docs to prepare for 0.9 release
* Fix typos and minor clarifications
1 parent 385c5b5 commit f9a7e68

12 files changed: +131 -59 lines changed

CHANGELOG.md

+71
@@ -22,6 +22,77 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

--->

+# [0.9.0]
+
+### Added
+- Add Ingest V2 (#5600, #5566, #5463, #5375, #5350, #5252, #5202)
+- Add SQS source (#5374, #5335, #5148)
+- Disable control plane check for searcher (#5599, #5360)
+- Partially implement `_elastic/_cluster/health` (#5595)
+- Make Jaeger span attribute-to-tag conversion exhaustive (#5574)
+- Use `content_length_limit` for ES bulk limit (#5573)
+- Limit and monitor warmup memory usage (#5568)
+- Add eviction metrics to caches (#5523)
+- Record object storage request latencies (#5521)
+- Add throttling on the janitor to prevent it from overloading (#5510)
+- Prevent single split searches from different `leaf_search` from interleaving (#5509)
+- Retry on S3 internal error (#5504)
+- Allow specifying OTEL index ID in header (#5503)
+- Add a metric to count storage errors and their error code (#5497)
+- Add support for concatenated fields (#4773, #5369, #5331)
+- Add number of splits per root/leaf search histograms (#5472)
+- Introduce a searcher config option to timeout get requests (#5467)
+- Add fingerprint to task in cluster state (#5464)
+- Enrich root/leaf search spans with number of docs and splits (#5450)
+- Add some additional search metrics (#5447)
+- Improve GC resilience and add metrics (#5420)
+- Enable force shutdown with 2nd Ctrl+C (#5414)
+- Add `request_timeout_secs` config to searcher config (#5402)
+- Memoize S3 client (#5377)
+- Add more env var config for Postgres (#5365)
+- Enable str fast field range queries (#5324)
+- Allow querying non-existing fields (#5308)
+- Support updating doc mapper through API (#5253)
+- Add optional special handling for hex in code tokenizer (#5200)
+- Add a circuit breaker layer (#5134)
+- Various performance optimizations in Tantivy (https://github.com/quickwit-oss/tantivy/blob/main/CHANGELOG.md)
+
+### Changed
+- Parse datetimes and timestamps with leading and/or trailing whitespace (#5544)
+- Restrict maturity period to retention (#5543)
+- Wait for merge at end of local ingest (#5542)
+- Log PostgreSQL metastore errors (#5530)
+- Update Azure multipart policy (#5553)
+- Stop relying on our own version of pulsar-rs (#5487)
+- Handle nested OTLP values in attributes and log bodies (#5485)
+- Improve merge pipeline finalization (#5475)
+- Allow failed splits in root search (#5440)
+- Batch delete from GC (#5404, #5380)
+- Make some S3 errors retryable (#5384)
+- Change default timestamps in OTEL logs (#5366)
+- Only return root spans for Jaeger HTTP API (#5358)
+- Share aggregation limit on node (#5357)
+
+### Fixed
+- Fix existence queries for nested fields (#5581)
+- Fix lenient option with wildcard queries (#5575)
+- Fix incompatible ES Java date format (#5462)
+- Fix bulk API response order (#5434)
+- Fix Pulsar finalize (#5471)
+- Fix Pulsar URI scheme (#5470)
+- Fix Grafana searchers dashboard (#5455)
+- Fix Jaeger HTTP endpoint (#5378)
+- Fix file re-ingestion after EOF (#5330)
+- Fix source path in Lambda distribution (#5327)
+- Fix configuration interpolation (#5403)
+- Fix Jaeger duration parse error (#5518)
+- Fix unit conversion in Jaeger HTTP search endpoint (#5519)
+
+### Removed
+- Remove support for 2-digit years in Java datetime parser (#5596)
+- Remove `DocMapper` trait (#5508)
+
+
# [0.8.1]

### Fixed

config/quickwit.yaml

+1
@@ -130,6 +130,7 @@ indexer:
# ingest_api:
#   max_queue_memory_usage: 2GiB
#   max_queue_disk_usage: 4GiB
+#   content_length_limit: 10MiB
#
# -------------------------------- Searcher settings --------------------------------
# https://quickwit.io/docs/configuration/node-config#searcher-configuration

docs/configuration/node-config.md

+2
@@ -176,13 +176,15 @@ indexer:
| --- | --- | --- |
| `max_queue_memory_usage` | Maximum size in bytes of the in-memory Ingest queue. | `2GiB` |
| `max_queue_disk_usage` | Maximum disk space in bytes taken by the Ingest queue. This must be at least `256M` and at least `max_queue_memory_usage`. | `4GiB` |
+| `content_length_limit` | Maximum uncompressed payload size. Increasing this is discouraged; use a [file source](../ingest-data/sqs-files.md) instead. | `10MiB` |

Example:

```yaml
ingest_api:
  max_queue_memory_usage: 2GiB
  max_queue_disk_usage: 4GiB
+  content_length_limit: 10MiB
```

## Searcher configuration

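The `content_length_limit` above caps the uncompressed payload of a single ingest request. As a minimal sketch of staying under the limit rather than raising it (the index ID `my-index`, the file `docs.ndjson`, and the 8M chunk size are illustrative placeholders, not part of this commit), a large NDJSON file can be split on line boundaries before being posted to the ingest endpoint:

```bash
# Split into chunks of at most ~8 MiB each, keeping every JSON line intact (GNU split).
split -C 8M docs.ndjson chunk_

# Post each chunk separately so no single request exceeds content_length_limit.
for chunk in chunk_*; do
  curl -XPOST "http://localhost:7280/api/v1/my-index/ingest" --data-binary "@${chunk}"
done
```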
docs/deployment/cluster-sizing.md

+2 -3
@@ -41,9 +41,8 @@ To utilize all CPUs on Indexer nodes that have more than 4 cores, your indexing
workload needs to be broken down into multiple indexing pipelines. This can be
achieved by creating multiple indexes or by using a [partitioned data
source](../configuration/source-config.md#number-of-pipelines) such as
-[Kafka](../configuration/source-config.md#kafka-source).
-
-<!-- TODO: change this note when releasing ingest v2 -->
+[Kafka](../configuration/source-config.md#kafka-source) or the [ingest API
+(v2)](../ingest-data/ingest-api.md#ingest-api-versions).

:::

docs/ingest-data/ingest-api.md

+17
@@ -69,3 +69,20 @@ curl -XDELETE 'http://localhost:7280/api/v1/indexes/stackoverflow-schemaless'
```

This concludes the tutorial. You can now move on to the [next tutorial](/docs/ingest-data/kafka.md) to learn how to ingest data from Kafka.
+
+## Ingest API versions
+
+In 0.9, Quickwit introduced a new version of the ingest API that distributes the indexing work across the cluster regardless of which node received the ingest request. This new ingestion service is often referred to as "ingest V2", as opposed to the legacy ingestion (V1). In upcoming versions, the new ingest API will also be able to replicate the write ahead log to achieve higher durability.
+
+By default, both ingestion services are enabled and ingest V2 is used. You can toggle this behavior with the following environment variables:
+
+| Variable | Description | Default value |
+| --- | --- | --- |
+| `QW_ENABLE_INGEST_V2` | Start the V2 ingest service and use it by default. | `true` |
+| `QW_DISABLE_INGEST_V1` | Disable the V1 ingest service. The APIs use V1 only if V2 is disabled; running V1 alongside V2 is necessary to migrate to V2 without losing unindexed records still in the V1 queues. | `false` |
+
+:::note
+
+These variables determine the ingest service used by both the `api/v1/<index-id>/ingest` endpoint and the [bulk API](../reference/es_compatible_api.md#_bulk--batch-ingestion-endpoint).
+
+:::

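For illustration, here is one way these variables could be set when starting a node. This is only a sketch, assuming a local `quickwit` binary and boolean values spelled as in the table above:

```bash
# Default behavior made explicit: V2 handles new ingest requests,
# while V1 stays up to drain records still sitting in its legacy queues.
QW_ENABLE_INGEST_V2=true QW_DISABLE_INGEST_V1=false ./quickwit run

# Once the V1 queues are fully indexed, V1 can be switched off entirely.
QW_DISABLE_INGEST_V1=true ./quickwit run
```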
docs/internals/ingest-v2.md

+19 -5
@@ -1,12 +1,18 @@
# Ingest V2

-Ingest V2 is a new ingestion API that is designed to be more efficient and scalable for thousands of indexes than the previous version. It is currently in beta and is not yet enabled by default.
+Ingest V2 is the latest ingestion API. It is designed to be more efficient than the previous version and to scale to thousands of indexes. It is the default since 0.9.

-## Enabling Ingest V2
+## Architecture

-To enable Ingest V2, you need to set the `QW_ENABLE_INGEST_V2` environment variable to `1` on the indexer, control-plane, and metastore services.
+Just like ingest V1, the new ingest uses [`mrecordlog`](https://github.com/quickwit-oss/mrecordlog) to persist ingested documents that are waiting to be indexed. But unlike V1, which always persists the documents locally on the node that receives them, ingest V2 can dynamically distribute them into WAL units called _shards_. The assigned shard can be local or on another indexer. The control plane is in charge of distributing the shards to balance the indexing work as evenly as possible across all indexer nodes. The progress within each shard is no longer tracked in the index metadata checkpoint but in a dedicated metastore `shards` table.

-You also have to activate the `enable_cooperative_indexing` option in the indexer configuration. The indexer configuration is in the node configuration:
+In the future, the shard-based ingest will also be able to write a replica for each shard, ensuring high durability for the documents that are waiting to be indexed (durability of the indexed documents is guaranteed by the object store).
+
+## Toggling between ingest V1 and V2
+
+The variables driving the ingest configuration are documented [here](../ingest-data/ingest-api.md#ingest-api-versions).
+
+With ingest V2, you can also activate the `enable_cooperative_indexing` option in the indexer configuration. This setting is useful for deployments with a very large number (dozens) of actively written indexes, as it limits the memory consumption of the indexing workbench. The indexer configuration is in the node configuration:

```yaml
version: 0.8
@@ -17,4 +23,12 @@ indexer:

See [full configuration example](https://github.com/quickwit-oss/quickwit/blob/main/config/quickwit.yaml).

-The only way to use the ingest API V2 is to use the [bulk endpoint](../reference/es_compatible_api.md#_bulk--batch-ingestion-endpoint) of the Elasticsearch-compatible API. The native Quickwit API is not yet compatible with the ingest V2 API.
+## Differences between ingest V1 and V2
+
+- V1 uses the `queues/` directory whereas V2 uses the `wal/` directory.
+- Both V1 and V2 are configured with:
+  - `ingest_api.max_queue_memory_usage`
+  - `ingest_api.max_queue_disk_usage`
+- Ingest V2 can also be configured with:
+  - `ingest_api.replication_factor` (not working yet)
+- Ingest V1 always writes to the WAL of the node receiving the request; V2 may forward documents to another node, dynamically assigned by the control plane to distribute the indexing work more evenly.

docs/internals/template-index.md

+2 -4
@@ -19,14 +19,12 @@ curl -XPUT -H 'Content-Type: application/yaml' 'http://localhost:7280/api/v1/tem
curl -O https://quickwit-datasets-public.s3.amazonaws.com/stackoverflow.posts.transformed-10000.json

# Ingest 10k docs into `stackoverflow-foo` index.
-curl -XPOST "http://127.0.0.1:7280/api/v1/stackoverflow-foo/ingest-v2" --data-binary @stackoverflow.posts.transformed-10000.json
+curl -XPOST "http://127.0.0.1:7280/api/v1/stackoverflow-foo/ingest" --data-binary @stackoverflow.posts.transformed-10000.json

# Ingest 10k docs into `stackoverflow-bar` index.
-curl -XPOST "http://127.0.0.1:7280/api/v1/stackoverflow-bar/ingest-v2" --data-binary @stackoverflow.posts.transformed-10000.json
+curl -XPOST "http://127.0.0.1:7280/api/v1/stackoverflow-bar/ingest" --data-binary @stackoverflow.posts.transformed-10000.json

# Delete Stackoverflow template.
curl -XDELETE 'http://localhost:7280/api/v1/templates/stackoverflow'

```bash
-
-

docs/operating/data-directory.md

+7 -4
@@ -22,16 +22,20 @@ qwdata
├── indexing
│   ├── wikipedia%01H13SVKDS03P%_ingest-api-source%RbaOAI
│   └── wikipedia%01H13SVKDS03P%kafka-source%cNqQtI
+├── wal
+│   ├── wal-00000000000000000056
+│   └── wal-00000000000000000057
└── queues
    ├── partition_id
    ├── wal-00000000000000000028
    └── wal-00000000000000000029
```

-### `/queues` directory
+### `/queues` and `/wal` directories

-This directory is created only if the ingest API service is running on your node. It contains write ahead log files of the ingest API to guard against data lost.
-The queue is truncated when Quickwit commits a split (piece of index), which means that the split is stored on the storage and its metadata are in the metastore.
+These directories are created only if the ingest API service is running on your node. They contain write ahead log files of the ingest API to guard against data loss. The `/queues` directory is used by the legacy version of the ingest (sometimes referred to as ingest V1). It is meant to be phased out in upcoming versions of Quickwit. Learn more about ingest API versions [here](../ingest-data/ingest-api.md#ingest-api-versions).
+
+The log file is truncated when Quickwit commits a split (piece of index), which means that the split is stored on the storage and its metadata is in the metastore.

You can configure `max_queue_memory_usage` and `max_queue_disk_usage` in the [node config file](../configuration/node-config.md#ingest-api-configuration) to limit the max disk usage.

@@ -110,4 +114,3 @@ With these assumptions, you have to set `split_store_max_num_splits` to at least

When starting, Quickwit scans all the splits in the cache directory to find out which splits are present locally; this can take a few minutes if you have tens of thousands of splits. On Kubernetes, as your pod can be restarted if it takes too long to start, you may want to clean up the data directory or increase the liveness probe timeout.
Also, please report such behavior on [GitHub](https://github.com/quickwit-oss/quickwit) as we can certainly optimize this start phase.
-

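As a small illustration tied to the `/queues` and `/wal` directories described above (assuming the default `qwdata` data directory), you can check how much disk each write ahead log currently uses; keep in mind that `max_queue_disk_usage` applies to each ingest version separately:

```bash
# Disk usage of the legacy (V1) queues and the new (V2) WAL.
# Either directory may be absent if the corresponding ingest version never ran on this node.
du -sh qwdata/queues qwdata/wal
```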
docs/operating/upgrades.md

+6
@@ -18,3 +18,9 @@ Quickwit 0.7.1 will create the new index `otel-logs-v0_7` which is now used by d

In the traces index `otel-traces-v0_7`, the `service_name` field is now `fast`.
No migration is done if `otel-traces-v0_7` already exists. If you want the `service_name` field to be `fast`, you have to first delete the existing `otel-traces-v0_7` index or create your own index.
+
+## Migration from 0.8 to 0.9
+
+Quickwit 0.9 introduces a new ingestion service to power the ingest and bulk APIs (v2). The new ingest is enabled and used by default, even though the legacy one (v1) remains enabled to finish indexing residual data in the legacy write ahead logs. Note that `ingest_api.max_queue_disk_usage` is enforced on each ingest version separately, which means that the cumulative disk usage might be up to twice this limit.
+
+The control plane should be upgraded first in order to enable the new ingest source (v2) on all existing indexes. Data ingested into previously existing indexes on upgraded indexer nodes will not be picked up by the indexing pipelines until the control plane is upgraded.

docs/reference/cli.md

+2 -2
@@ -353,8 +353,8 @@ quickwit index ingest
| `--index` | ID of the target index |
| `--input-path` | Location of the input file. |
| `--batch-size-limit` | Size limit of each submitted document batch. |
-| `--wait` | Wait for all documents to be committed and available for search before exiting |
-| `--force` | Force a commit after the last document is sent, and wait for all documents to be committed and available for search before exiting |
+| `--wait` | Wait for all documents to be committed and available for search before exiting. Applies only to the last batch, see [#5417](https://github.com/quickwit-oss/quickwit/issues/5417). |
+| `--force` | Force a commit after the last document is sent, and wait for all documents to be committed and available for search before exiting. Applies only to the last batch, see [#5417](https://github.com/quickwit-oss/quickwit/issues/5417). |
| `--commit-timeout` | Timeout for ingest operations that require waiting for the final commit (`--wait` or `--force`). This is different from the `commit_timeout_secs` indexing setting, which sets the maximum time before committing splits after their creation. |

*Examples*

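For illustration only (the index ID `my-index` and the file `docs.ndjson` are placeholders, and this example is not part of the diff), a typical invocation combining these flags looks like:

```bash
# Ingest a local NDJSON file and block until the last batch is committed and searchable.
quickwit index ingest --index my-index --input-path docs.ndjson --wait
```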
quickwit/quickwit-cli/src/index.rs

+2 -2
@@ -141,12 +141,12 @@ pub fn build_index_command() -> Command {
        Arg::new("wait")
            .long("wait")
            .short('w')
-            .help("Wait for all documents to be committed and available for search before exiting")
+            .help("Wait for all documents to be committed and available for search before exiting. Applies only to the last batch, see [#5417](https://github.com/quickwit-oss/quickwit/issues/5417).")
            .action(ArgAction::SetTrue),
        Arg::new("force")
            .long("force")
            .short('f')
-            .help("Force a commit after the last document is sent, and wait for all documents to be committed and available for search before exiting")
+            .help("Force a commit after the last document is sent, and wait for all documents to be committed and available for search before exiting. Applies only to the last batch, see [#5417](https://github.com/quickwit-oss/quickwit/issues/5417).")
            .action(ArgAction::SetTrue)
            .conflicts_with("wait"),
        Arg::new("commit-timeout")

quickwit/quickwit-ingest/DESIGN.md

-39
This file was deleted.
