diff --git a/_platforms/machineLearning.md b/_platforms/machineLearning.md index 78dc4a7be..128c18d44 100644 --- a/_platforms/machineLearning.md +++ b/_platforms/machineLearning.md @@ -12,7 +12,7 @@ feature_area_category_name: MachineLearning feature_area_solution_name: '' has_hero: true -overview_header_text: 'Build flexible, scalable, and future-proof machine learning and artificial intelligence applications' +overview_header_text: 'Build flexible, scalable, and future-ready machine learning and artificial intelligence applications' key_benefits_list: - name: 'Proven in production' @@ -22,19 +22,19 @@ key_benefits_list: - name: 'Open and flexible' description: 'Take advantage of open-source integrations into popular open frameworks and use managed services from major cloud providers.' - name: 'Built for the future' - description: 'Future-proof your AI applications with vector, lexical, and hybrid search, analytics, and observability capabilities, all in one software suite.' + description: 'Prepare your AI applications for future innovations with vector, lexical, and hybrid search, as well as analytics and observability capabilities, all in one software suite.' key_features_list: - name: 'Vector database' description: 'Use low-latency queries to discover assets by degree of similarity through k-nearest neighbors (k-NN) functionality.' - name: 'Neural search' - description: 'Create semantic search applications by running human-language instead of vector-based queries.' + description: 'Improve accuracy and relevancy for human language queries through searches that consider context and relationships.' - name: 'Extensible ML framework' - description: 'Power neural search through integrated models that share a unified API, whether they run on-cluster or externally.' + description: 'Power neural search through OpenSearch’s pre-trained models, upload your own, or connect to externally hosted models.' 
- name: 'Anomaly detection' - description: 'Detect anomalies in your OpenSearch data automatically, in near real time, using the Random Cut Forest (RCF) algorithm.' + description: 'Automatically detect unusual behavior in your data in near real time using the Random Cut Forest (RCF) algorithm.' - name: 'Efficient filtering' - description: 'Intelligently evaluate strategies to optimize between recall and latency.' + description: 'Apply intelligent strategies to optimize recall and latency for vector search.' - name: 'Vector quantization support' description: 'Improve performance and cost by reducing your index size and query latency with minimal impact on recall.' @@ -49,8 +49,9 @@ hero_images: alt: 'OpenSearch platform for search applications hero banner.' --- -The artificial intelligence (AI) revolution has transformed process optimization, analytics, and customer experiences. Now, machine learning (ML) models are powering the next leap forward through vector search. By embedding models that can encode the meaning and context of documents, images, and audio into vectors for similarity-driven searches, this framework unlocks powerful ML and AI tooling and capabilities. +Artificial intelligence (AI) has transformed process optimization, analytics, and customer experiences. Now, machine learning (ML) models are powering the next leap forward through vector search. By embedding models that can encode the meaning and context of documents, images, and audio into vectors for similarity-driven searches, vector search unlocks powerful ML and AI tooling and capabilities. + +OpenSearch brings traditional search, analytics, and vector search together in one solution. By reducing the effort you need to operationalize, manage, and integrate AI-generated assets, OpenSearch’s vector database capabilities accelerate ML and AI application development. 
Built-in performance and scalability allow you to power vector, lexical, and hybrid search and analytics across all your models, vectors, and metadata. Enhance information retrieval and analytics, improve efficiency and stability, and give your generative AI models the resources to deliver more accurate and intelligent responses. -OpenSearch brings traditional search, analytics, and vector search together in one complete solution. By reducing the effort you need to operationalize, manage, and integrate AI-generated assets, OpenSearch’s vector database capabilities accelerate ML and AI application development. Built-in performance and scalability power vector, lexical, and hybrid search and analytics across all your models, vectors, and metadata. Enhance information retrieval and analytics, improve efficiency and stability, and give your generative AI models a greater pool of data using OpenSearch. diff --git a/_posts/2025-02-04-introduce-bitmap-filtering-feature.md b/_posts/2025-02-04-introduce-bitmap-filtering-feature.md new file mode 100644 index 000000000..1d0d6137b --- /dev/null +++ b/_posts/2025-02-04-introduce-bitmap-filtering-feature.md @@ -0,0 +1,144 @@ +--- +layout: post +title: "Efficient large-scale filtering with bitmap filtering in OpenSearch" +authors: + - bowenlan-amzn + - macrakis + - msfroh + - kolchfa +date: 2025-02-25 +categories: + - technical-posts +meta_keywords: bitmap filtering, OpenSearch 2.17, filtering large datasets, RoaringBitmap, OpenSearch queries, e-commerce search, search performance +meta_description: Discover how the bitmap filtering feature in OpenSearch optimizes large-scale filtering operations and improves query performance and efficiency for datasets with thousands to millions of terms +--- + +OpenSearch is a powerful open-source search and analytics engine that enables you to efficiently search and filter large datasets. 
A common search pattern involves filtering documents based on whether a field matches any value in a large set. While the existing `terms` query works well for smaller sets, its performance degrades significantly when handling thousands or millions of terms. + +In OpenSearch 2.17, we introduced _bitmap filtering_ to address this issue, providing a more efficient way to handle large-scale filtering operations. OpenSearch 2.19 further enhances this feature with a new index-based bitmap query that improves performance for smaller queries and optimizes overall efficiency. + +## The challenge of large-scale filtering + +Many applications need to filter documents by checking whether a numeric identifier matches any value in a large set. Consider these examples: + +- An e-commerce platform filtering a product catalog to display only items in a customer's digital library (matching product IDs against a list of thousands of purchased items). +- A bookstore chain searching for all books from a specific store (matching book IDs against a list of thousands of ISBNs). + +Using `terms` queries for large sets of identifiers can cause the following issues: + +- Performance degradation as query size increases +- Scalability challenges because of high memory and CPU consumption +- Network overhead from transmitting extensive filter lists + +These limitations negatively affect both performance and scalability, particularly for large datasets and high-traffic workloads. + +## Bitmap filtering: An optimized approach + +Bitmap filtering improves query performance and scalability when filtering by integer sets, such as product IDs or ISBNs. It uses [RoaringBitmap](https://github.com/RoaringBitmap/RoaringBitmap), an efficient data structure for handling integer sets: + +- RoaringBitmap automatically selects the most efficient internal representation based on data characteristics. It provides excellent compression for sparse integer sets while maintaining fast lookup speeds. 
+- Set operations (for example, intersection or union) can be computed efficiently using the RoaringBitmap library before sending queries to OpenSearch. + +## How OpenSearch implements bitmap filtering + +OpenSearch integrates bitmap filtering seamlessly into its query infrastructure: + +- A new `value_type` parameter in `terms` queries allows you to specify a filter list using a Base64-encoded Roaring bitmap. +- The `terms` lookup has been enhanced so that it fetches values from stored fields instead of from the entire `_source`. + +## Example: Filtering a customer's purchased products + +Suppose you run an e-commerce marketplace with 1 million products and 100,000 customers. Each customer has a digital library containing their purchased products. You maintain a bitmap for each customer representing their product ownership. Using bitmap filtering, you can efficiently retrieve the products owned by a specific customer as follows: + +```json +POST products/_search +{ + "query": { + "terms": { + "product_id": { + "index": "customers", + "id": "customer123", + "path": "customer_filter", + "store": true + }, + "value_type": "bitmap" + } + } +} +``` + +In this example, the bitmap filter is applied to the `product_id` field in the `products` index to retrieve products owned by a specific customer. The bitmap filter data is stored in the `customers` index under the document ID `customer123`, in the field `customer_filter`. This binary field is optimized for fast retrieval and efficient processing. During query execution, the bitmap filter is loaded into memory and applied to the filtering operation. + +In addition to using a `terms` lookup, as shown in the previous example, you can provide the bitmap directly in the query: + +```json +POST products/_search +{ + "query": { + "terms": { + "product_id": [""], + "value_type": "bitmap" + } + } +} +``` + +In this case, the bitmap must be Base64 encoded before being included in the query. 
This approach is useful when you have precomputed bitmaps or need to perform bitmap operations on the client side before querying OpenSearch. + +## Key advantages of bitmap filtering + +Bitmap filtering integrates seamlessly with OpenSearch's existing query infrastructure: + +- The `value_type: "bitmap"` parameter allows you to specify bitmap filters in `terms` queries. +- Enhanced `terms` lookup enables efficient retrieval of stored bitmap filters. +- You can combine bitmap filters with other query types in Boolean queries, making them a flexible tool for large-scale filtering. + +These enhancements enable you to adopt bitmap filtering with minimal changes to your existing OpenSearch queries while gaining significant performance benefits for large-scale filtering operations. + +## Performance benchmarks + +We conducted performance tests on an index containing 100 million documents, comparing different filtering approaches across filter sizes ranging from 100 to 10 million random IDs. + +We compared the following approaches: + +- Query using a list of IDs (standard `terms` query) on a document-values-only field (OpenSearch 2.17) +- Query using a list of IDs (standard `terms` query) on an indexed field (OpenSearch 2.17) +- Query using bitmap filtering on a document-values-only field (OpenSearch 2.17) +- Query using bitmap filtering on an indexed field (OpenSearch 2.19) + +The following figure shows the query time comparison of these approaches. + +![Traditional and bitmap query performance](/assets/media/blog-images/2025-02-04-introduce-bitmap-filtering-feature/query_time_comparison.png){:class="img-centered"} +*Figure 1: Traditional and bitmap filtering query performance* + +For small filter sizes (up to 100,000 IDs), standard and bitmap approaches performed similarly. However, standard methods degraded rapidly for larger filter sizes, while bitmap filtering maintained stable performance even with millions of IDs. 
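
As a rough illustration of how the benchmark observation above might inform client code, the following Python sketch builds either a standard `terms` query or an inline bitmap `terms` query depending on filter size. The function name `build_terms_query`, the `threshold` parameter, and its default of 100,000 IDs are assumptions made for this example and are not part of OpenSearch; the default only mirrors the point at which the two approaches began to diverge in these tests. The `encoded_bitmap` argument is assumed to be a Base64 string that was already produced by serializing a Roaring bitmap with a RoaringBitmap library.

```python
import json

# Assumption for this sketch: at or below this many IDs, send a plain ID list.
DEFAULT_THRESHOLD = 100_000


def build_terms_query(field, ids, encoded_bitmap, threshold=DEFAULT_THRESHOLD):
    """Return a search request body that filters `field` by a set of IDs.

    `ids` is the collection of integer IDs. `encoded_bitmap` is the same set,
    already serialized as a Base64-encoded Roaring bitmap (produced elsewhere).
    """
    if len(ids) <= threshold:
        # Small filter: a standard terms query with the raw ID list.
        terms = {field: list(ids)}
    else:
        # Large filter: pass the encoded bitmap and mark the value type.
        terms = {field: [encoded_bitmap], "value_type": "bitmap"}
    return {"query": {"terms": terms}}


if __name__ == "__main__":
    # Hypothetical placeholder string, not a real serialized bitmap.
    body = build_terms_query("product_id", range(5), "PLACEHOLDER", threshold=2)
    print(json.dumps(body, indent=2))
```

The right cutoff depends on your data and cluster, so treat the threshold as a tunable starting point rather than a recommendation.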
+ +### Optimized bitmap filtering comparison + +OpenSearch 2.19 introduced an index-based bitmap query. When used in a Boolean query with other filters, a bitmap query can now automatically select the most efficient execution strategy (an index query or a document values query) based on the query context and cost estimation. Compared to the original document-value-based bitmap implementation, the new implementation delivers remarkable improvements in the benchmark experiments: + +- **1,000x speed improvement** for smaller filter sizes +- **Consistently low query times** even with millions of IDs +- **Stable performance** across all filter sizes + +The following figure shows the query time comparison of a Boolean query with a 100K ID filter and a bitmap query on a document-values-only field and on an indexed field. + +![Optimized bitmap filtering performance](/assets/media/blog-images/2025-02-04-introduce-bitmap-filtering-feature/query_time_comparison_bitmap_index_docvalues.png){:class="img-centered"} +*Figure 2: Optimized bitmap filtering performance* + +Bitmap filtering is not only faster but also more space efficient. A filter containing 10 million IDs requires only **16 MB** of storage when encoded as a bitmap, compared to **360 MB** as a raw ID list. This compact representation reduces network transfer times, disk I/O, and memory usage. + +The following figure shows the space efficiency comparison of bitmap filtering with document values and with indexed fields. + +![Space efficiency comparison](/assets/media/blog-images/2025-02-04-introduce-bitmap-filtering-feature/data_size_comparison.png){:class="img-centered"} +*Figure 3: Optimized bitmap filtering space efficiency* + + +## Conclusion + +Bitmap filtering in OpenSearch provides an efficient way to filter documents using large sets of numeric identifiers (thousands to millions). 
Integrated into `terms` queries and `terms` lookup, it is especially useful for scenarios such as: + +- Digital content platforms filtering large document collections based on user entitlements. +- E-commerce platforms matching product IDs against customer libraries. + +To determine whether bitmap filtering or standard `terms` queries best suit your needs, see [Performance benchmarks](#performance-benchmarks). If you're using bitmap filtering in large-scale applications, we welcome your feedback on the [OpenSearch forum](https://forum.opensearch.org/) to help shape future improvements. \ No newline at end of file diff --git a/_redesign_use_cases/machineLearning-category.markdown b/_redesign_use_cases/machineLearning-category.markdown index a70668ee7..5814eca3d 100644 --- a/_redesign_use_cases/machineLearning-category.markdown +++ b/_redesign_use_cases/machineLearning-category.markdown @@ -24,23 +24,19 @@ button_stack: Search - Power visual, semantic, and multimodal search using models that work best for your key scenarios. + Power multimodal search and accommodate different data types using the models that work best for your key scenarios. Generative AI agents - Use generative AI to build intelligent agents that deliver better results for chatbots or automated conversation entities. + Use large language models (LLMs) and generative AI to build intelligent agents that deliver better results for chatbots or automated conversation entities. Recommendation engine - Generate product and user embeddings using collaborative filtering techniques. + Generate product and user embeddings through collaborative filtering techniques. User-level content targeting - Personalize web pages by retrieving content ranked by user propensities through embeddings trained on user interactions. - - - Automated pattern matching and de-duplication - Enhance your data quality processes by using similarity search to automate pattern matching and duplication discovery in data. 
+ Personalize web pages by retrieving content ranked by user propensities with embeddings trained on human interactions. Data and ML platforms diff --git a/assets/media/blog-images/2025-02-04-introduce-bitmap-filtering-feature/data_size_comparison.png b/assets/media/blog-images/2025-02-04-introduce-bitmap-filtering-feature/data_size_comparison.png new file mode 100644 index 000000000..895b6807a Binary files /dev/null and b/assets/media/blog-images/2025-02-04-introduce-bitmap-filtering-feature/data_size_comparison.png differ diff --git a/assets/media/blog-images/2025-02-04-introduce-bitmap-filtering-feature/query_time_comparison.png b/assets/media/blog-images/2025-02-04-introduce-bitmap-filtering-feature/query_time_comparison.png new file mode 100644 index 000000000..e105c4855 Binary files /dev/null and b/assets/media/blog-images/2025-02-04-introduce-bitmap-filtering-feature/query_time_comparison.png differ diff --git a/assets/media/blog-images/2025-02-04-introduce-bitmap-filtering-feature/query_time_comparison_bitmap_index_docvalues.png b/assets/media/blog-images/2025-02-04-introduce-bitmap-filtering-feature/query_time_comparison_bitmap_index_docvalues.png new file mode 100644 index 000000000..a40d57f44 Binary files /dev/null and b/assets/media/blog-images/2025-02-04-introduce-bitmap-filtering-feature/query_time_comparison_bitmap_index_docvalues.png differ