Add support for ingesting and returning data in the Apache Arrow format #16664

jayaskren · 2024-11-18T02:34:54Z

Is your feature request related to a problem? Please describe

At work, I am building dashboards that show the work all of our servers are processing at any given time. I am severely limited in how many data points I can add to a given dashboard. I either need to filter the number of servers, limit the time window to days or even hours, or aggregate the data so heavily that we lose all context. I have a prototype in Vega that does what I want, but our dashboard cannot handle that much data, even though Vega itself has no problem with it and the dashboard supports Vega.

Describe the solution you'd like

I want to be able to scale up my visualizations. If OpenSearch offered the ability to return data in the Apache Arrow format, we could handle a lot more data on the frontend via Vega. Other visualization technologies on the frontend could also potentially take advantage of Apache Arrow. Here is a discussion of using Vega with Apache Arrow: https://observablehq.com/@theneuralbit/introduction-to-apache-arrow

While I am at it, it shouldn't be difficult to add the ability to ingest Apache Arrow data as well. I am happy to work on the implementation, especially if someone can point me to where the code would go architecturally and which interfaces it would need to implement.

Related component

Search:Performance

Describe alternatives you've considered

I have built my own custom visualizations, but it would be nice if OpenSearch could handle this out of the box rather than me needing to go to another tool.

Additional context

Although it uses different technology, here is a prototype of the idea that I built several years ago:
https://d2xis0feu0l7hz.cloudfront.net/index.html

It uses D3 on the frontend and my own columnar format as the data format. As a POC, I was able to display a table of 10 million rows of finance data. I also have a chart in which D3 aggregates all 43 million rows of data. For comparison, Excel has a limit of 1 million rows, and Google Docs has a limit of 10 million cells. Apache Arrow should be a good replacement for my columnar format to make the solution more standard.

sandeshkr419 · 2024-11-27T17:16:10Z

[Search Triage] @rishabhmaurya & @msfroh - Can you please take a look?

rishabhmaurya · 2024-11-27T17:27:59Z

@jayaskren we are actively working on integrating the Arrow format and Flight RPC into the search path - #16679

I'm trying to better understand your dashboard use case.

msfroh · 2024-11-27T20:50:44Z

Incidentally, as we think about how to support streaming Arrow to clients, we should think about whether Arrow Flight makes more sense or if we should embrace something like ADBC (or both?)

@jayaskren -- As a potential user of the functionality, do you have an opinion?

Also, if we do want to be able to both ingest and return data in Arrow format, it arguably makes sense to persist it in Arrow as well. I've been toying with the idea of trying to implement a Lucene doc value format based on Arrow. If we have that, then OpenSearch should be able to return Arrow results with even less effort.

jayaskren · 2024-11-28T00:46:13Z

I'm impressed you are working on this. I assumed I would need to do it and create a pull request. As for my use case, I am creating a custom dashboard using Vega. The 10,000 results limit is killing me. The data I want to plot has millions of data points, or potentially tens of millions if we want to show more than a week at a time. For the short term we are just going to build a custom web app with Vega on the front end and static Arrow files on the backend. In the long term I would love it if I could just use OpenSearch with Vega off the shelf.

I used a proprietary format instead of Apache Arrow in the past, but Arrow should do the job just as well. In my experience, using a highly optimized compressed format like Arrow instead of JSON makes everything better. Encoding and decoding are faster, transferring data is faster because it is smaller, memory use is lower, and there is less memory to garbage collect. All of that means you can send more data to the browser. Especially when the cardinality of the data is relatively low, it can be an order of magnitude difference.
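To make the low-cardinality point concrete, here is a minimal sketch in plain Python (standard library only, no Arrow dependency). It compares the size of row-oriented JSON with a dictionary-encoded columnar layout similar in spirit to what columnar formats use. The `host` and `cpu` fields, the 50 server names, and the 16-bit code width are all assumptions chosen for illustration, and the byte counts ignore any metadata or framing a real format would add.

```python
import json
import random
from array import array


def row_json_size(rows):
    """Bytes needed to send the rows as a compact JSON array of objects."""
    return len(json.dumps(rows, separators=(",", ":")).encode("utf-8"))


def dictionary_encode(values):
    """Store each distinct value once and replace every row with a small integer code."""
    dictionary = []
    index = {}
    codes = array("H")  # unsigned 16-bit codes (assumption: at most 65,536 distinct values)
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def columnar_size(rows):
    """Approximate payload bytes for a dictionary-encoded host column plus a float32 cpu column."""
    dictionary, codes = dictionary_encode([r["host"] for r in rows])
    cpu = array("f", (r["cpu"] for r in rows))
    dictionary_bytes = sum(len(h.encode("utf-8")) for h in dictionary)
    return dictionary_bytes + codes.itemsize * len(codes) + cpu.itemsize * len(cpu)


# Hypothetical data set: 100,000 samples spread over 50 servers (low cardinality).
rng = random.Random(0)
rows = [
    {"host": f"server-{rng.randrange(50):03d}", "cpu": round(rng.random() * 100, 1)}
    for _ in range(100_000)
]

print("JSON bytes:    ", row_json_size(rows))
print("columnar bytes:", columnar_size(rows))
```

The saving comes mostly from not repeating the field names and host strings on every row, which is why the gap widens as cardinality drops.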

I can't share our dataset, but I could potentially find some open data and make something similar in OpenSearch and with Apache Arrow to show the difference.

jayaskren · 2024-11-28T00:48:59Z

@msfroh I don't have a strong opinion yet, since my experience is with another comparable format. I'm hoping to put something together that I can share in a couple of weeks. I can dig into the different options while I'm at it.

jayaskren · 2024-11-28T00:51:23Z

I thought I saw recently that Flight was not fully implemented in JavaScript. So that might make it hard to take advantage of on the frontend. I'm traveling for the holidays but could look further when I get home

jayaskren · 2024-11-28T01:03:26Z

In case it is not obvious, being able to get results in the Arrow format but not being able to parse the results in JavaScript is not as useful

jayaskren · 2024-12-01T14:57:37Z

Here is the GitHub issue about Arrow Flight and JavaScript:
apache/arrow#17325

I didn't know about ADBC, but it looks really cool! I have wanted something like this, as I have struggled to get large amounts of data out of Postgres efficiently. It doesn't look like there is a JavaScript client for ADBC either.

My use case for this is to consume Arrow from the browser, so these wouldn't solve my issue. I could see them both being useful on the backend though.

rishabhmaurya · 2024-12-02T19:49:55Z

@jayaskren It would be nice if you could evaluate, in parallel, how we can support the FlightStream APIs you need in JavaScript. Maybe we can start with a minimal wrapper over FlightStream that we expose through the JavaScript OpenSearch client?

jayaskren · 2024-12-27T21:08:24Z

I apologize for not responding sooner. I have been busy with work and then the holidays. I think I jumped the gun. I have been experimenting with Apache Arrow, and it doesn't appear to support basic bin packing encoding as I expected, which is where I have seen a lot of bang for the buck.
So compression is much worse compared to Parquet and ORC. The Parquet format might be a better fit, but it is not currently supported in Vega. Arrow can also handle the Parquet format in theory. I need to experiment on the JavaScript side and see what I can get working with Vega, and that can help guide what would be best for the server to return.
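For readers unfamiliar with the encoding discussed here, the following is a rough sketch that assumes "bin packing" refers to bit-packing, where small integers are stored using only as many bits as the largest value needs (Parquet uses a technique of this kind in its RLE/bit-packing hybrid encoding). The function names are hypothetical and the layout is not the Parquet wire format. It only illustrates why fixed-width 32-bit values take more space.

```python
def bits_needed(max_value):
    """Smallest bit width that can represent every value from 0 to max_value."""
    return max(1, max_value.bit_length())


def bit_pack(values, width):
    """Pack non-negative integers into bytes using `width` bits per value."""
    acc = 0
    for i, value in enumerate(values):
        if value < 0 or value >> width:
            raise ValueError(f"{value} does not fit in {width} bits")
        acc |= value << (i * width)  # place value i at bit offset i * width
    return acc.to_bytes((len(values) * width + 7) // 8, "little")


def bit_unpack(data, width, count):
    """Inverse of bit_pack: recover `count` integers of `width` bits each."""
    acc = int.from_bytes(data, "little")
    mask = (1 << width) - 1
    return [(acc >> (i * width)) & mask for i in range(count)]


# Hypothetical column of dictionary codes for 50 distinct servers.
codes = [i % 50 for i in range(1000)]
width = bits_needed(max(codes))
packed = bit_pack(codes, width)

print("bits per value:", width)
print("packed bytes:  ", len(packed))
print("int32 bytes:   ", 4 * len(codes))
```

This sketch uses a single big integer as the bit buffer for brevity, which is fine for small inputs but would be too slow for millions of values; a real encoder works on fixed-size blocks.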

jayaskren added the enhancement and untriaged labels Nov 18, 2024

github-actions bot added the Search:Performance label Nov 18, 2024

github-project-automation bot added this to Search Project Board Nov 18, 2024

github-project-automation bot moved this to 🆕 New in Search Project Board Nov 18, 2024

sandeshkr419 removed the untriaged label Nov 27, 2024
