Add support for ingesting and returning data in the Apache Arrow format #16664
Comments
[Search Triage] @rishabhmaurya & @msfroh - Can you please take a look?
@jayaskren We are actively working on integrating the Arrow format and Flight RPC into the search path (#16679). I'm trying to better understand your dashboard use case.
Incidentally, as we think about how to support streaming Arrow to clients, we should consider whether Arrow Flight makes more sense, or whether we should embrace something like ADBC (or both?). @jayaskren, as a potential user of the functionality, do you have an opinion? Also, if we want to both ingest and return data in Arrow format, it arguably makes sense to persist it in Arrow as well. I've been toying with the idea of implementing a Lucene doc value format based on Arrow. If we have that, OpenSearch should be able to return Arrow results with even less effort.
I'm impressed you are working on this. I assumed I would need to do it myself and create a pull request. As for my use case, I am creating a custom dashboard using Vega. The 10,000-result limit is killing me. The data I want to plot has millions of data points, or potentially tens of millions if we want to show more than a week at a time. For the short term, we are just going to build a custom web app with Vega on the front end and static Arrow files on the backend. In the long term, I would love to just use OpenSearch with Vega off the shelf. I used a proprietary format instead of Apache Arrow in the past, but Arrow should do the job just as well. In my experience, using a highly optimized compressed format like Arrow instead of JSON makes everything better: encoding and decoding are faster, transfers are faster because the data is smaller, memory use is lower, and there is less memory to garbage collect. All of that means you can send more data to the browser. When the cardinality of the data is relatively low, it can be an order of magnitude difference. I can't share our dataset, but I could potentially find some open data and build something similar in OpenSearch and with Apache Arrow to show the difference.
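To make the size argument above concrete, here is a minimal sketch using only the Python standard library. It does not use Arrow itself; it compares row-oriented JSON with a hand-rolled columnar layout that dictionary-encodes a low-cardinality column, which is the general idea columnar formats rely on. The field names (`server`, `cpu`), the data, and all function names are made up for illustration.

```python
import json
from array import array


def make_rows(n):
    # Hypothetical metrics: one of 4 server names plus a small integer reading.
    servers = ["web-1", "web-2", "db-1", "db-2"]
    return [{"server": servers[i % 4], "cpu": i % 100} for i in range(n)]


def encode_json(rows):
    # Row-oriented baseline: every row repeats the keys and the server name.
    return json.dumps(rows).encode("utf-8")


def encode_columnar(rows):
    # Dictionary-encode the low-cardinality string column: store each
    # distinct value once, then one small integer code per row.
    dictionary = sorted({r["server"] for r in rows})
    code_of = {name: i for i, name in enumerate(dictionary)}
    codes = array("B", (code_of[r["server"]] for r in rows))  # 1 byte per row
    cpu = array("B", (r["cpu"] for r in rows))                # 1 byte per row
    header = json.dumps(dictionary).encode("utf-8")
    return header, codes.tobytes(), cpu.tobytes()


def decode_columnar(header, codes_bytes, cpu_bytes):
    # Rebuild the original rows from the dictionary and the two columns.
    dictionary = json.loads(header)
    codes = array("B")
    codes.frombytes(codes_bytes)
    cpu = array("B")
    cpu.frombytes(cpu_bytes)
    return [{"server": dictionary[c], "cpu": v} for c, v in zip(codes, cpu)]


rows = make_rows(10_000)
as_json = encode_json(rows)
header, codes_bytes, cpu_bytes = encode_columnar(rows)
columnar_size = len(header) + len(codes_bytes) + len(cpu_bytes)
print("JSON bytes:", len(as_json), "columnar bytes:", columnar_size)
```

The exact ratio depends entirely on the data; this toy case is deliberately favorable because both columns fit in one byte per value.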
@msfroh I don't have a strong opinion yet, since my experience is with another comparable format. I'm hoping to put something together that I can share in a couple of weeks. I can dig into the different options while I'm at it.
I thought I saw recently that Flight was not fully implemented in JavaScript, which might make it hard to take advantage of on the frontend. I'm traveling for the holidays but can look further when I get home.
In case it is not obvious: being able to get results in the Arrow format is not very useful if we cannot parse them in JavaScript.
Here is the GitHub issue with Arrow Flight and JavaScript. I didn't know about ADBC, but it looks really cool! I have wanted something like this, as I have struggled to get large amounts of data out of Postgres efficiently. It doesn't look like there is a JavaScript client for ADBC either. My use case is to consume Arrow from the browser, so neither would solve my issue. I could see them both being useful on the backend, though.
@jayaskren It would be nice if you could evaluate, in parallel, how we can support the FlightStream APIs you need in JavaScript. Maybe we can start with a minimal wrapper over FlightStream that we expose to the JavaScript OpenSearch client?
I apologize for not responding sooner. I have been busy with work and then the holidays. I think I jumped the gun. I have been experimenting with Apache Arrow, and it doesn't appear to support basic bin packing encoding as I expected, which is where I have seen a lot of bang for the buck.
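As an illustration only, and assuming that "bin packing" here refers to bit packing (storing each integer in only as many bits as the largest value needs), a minimal standard-library sketch of that kind of encoding might look like the following. The function names are hypothetical, and this is not how Arrow or OpenSearch encode data.

```python
def pack_bits(values, width):
    """Pack non-negative ints into bytes, using `width` bits per value."""
    out = bytearray()
    acc = 0    # bit buffer; the low bits are emitted first
    nbits = 0  # number of valid bits currently held in acc
    for v in values:
        if not 0 <= v < (1 << width):
            raise ValueError(f"{v} does not fit in {width} bits")
        acc |= v << nbits
        nbits += width
        while nbits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        out.append(acc & 0xFF)  # flush the final partial byte
    return bytes(out)


def unpack_bits(data, width, count):
    """Inverse of pack_bits: read `count` values of `width` bits each."""
    mask = (1 << width) - 1
    values = []
    acc = 0
    nbits = 0
    source = iter(data)
    for _ in range(count):
        while nbits < width:
            acc |= next(source) << nbits
            nbits += 8
        values.append(acc & mask)
        acc >>= width
        nbits -= width
    return values


# Values in the range 0..7 need only 3 bits each instead of a full byte or more.
values = [i % 8 for i in range(1000)]
packed = pack_bits(values, 3)
print(len(values), "values packed into", len(packed), "bytes")
```

The reader has to know the bit width and value count up front, so a real format would store both in a header alongside the packed bytes.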
Is your feature request related to a problem? Please describe
At work, I am making dashboards that show the work all of our servers are processing at any given time. I am severely limited in how many data points I can add to a given dashboard. I need to either filter the number of servers, limit the time window to days or even hours, or aggregate the data so heavily that we lose all context. I have a prototype using Vega that does what I want, but our dashboard cannot handle that much data, even though Vega itself doesn't have a problem with it and the dashboard supports Vega.
Describe the solution you'd like
I want to be able to scale up my visualizations. If OpenSearch offered the ability to return data in the Apache Arrow format, we could handle a lot more data on the frontend via Vega. Other visualization technologies on the frontend could also potentially take advantage of Apache Arrow. Here is a discussion of using Vega with Apache Arrow: https://observablehq.com/@theneuralbit/introduction-to-apache-arrow
While we are at it, it shouldn't be difficult to add the ability to ingest Apache Arrow data as well. I am happy to work on the implementation, especially if someone can point me to where the code would go architecturally and what interfaces it would need to implement.
Related component
Search:Performance
Describe alternatives you've considered
I have built my own custom visualizations, but it would be nice if OpenSearch could handle this out of the box rather than me needing to go to another tool.
Additional context
Although it uses different technology, here is a prototype of the idea that I built several years ago:
https://d2xis0feu0l7hz.cloudfront.net/index.html
It uses D3 on the frontend and my own columnar format as the data format. As a POC, I was able to display a table of 10 million rows of finance data. I also have a chart in which D3 aggregates all 43 million rows of data. For comparison, Excel has a limit of 1 million rows, and Google Docs has a limit of 10 million cells. Apache Arrow should be a good replacement for my columnar format to make the solution more standard.