Add support for ingesting and returning data in the Apache Arrow format #16664

Open
jayaskren opened this issue Nov 18, 2024 · 10 comments
Labels: enhancement, Search:Performance

Comments

@jayaskren

Is your feature request related to a problem? Please describe

At work, I am making dashboards that show the work all of our servers are processing at any given time. I am severely limited by how many data points I can add to a given dashboard. I have to filter the number of servers, limit the time window to days or even hours, or aggregate the data so heavily that we lose all context. I have a prototype using Vega that does what I want, but our dashboard cannot handle that much data, even though Vega doesn't have a problem with it and the dashboard supports Vega.

Describe the solution you'd like

I want to be able to scale up my visualizations. If OpenSearch offered the ability to return data in the Apache Arrow format, we could handle a lot more data on the frontend via Vega. Other visualization technologies on the frontend could also potentially take advantage of Apache Arrow. Here is a discussion of using Vega with Apache Arrow: https://observablehq.com/@theneuralbit/introduction-to-apache-arrow

While we are at it, it shouldn't be difficult to add the ability to ingest Apache Arrow data as well. I am happy to work on the implementation, especially if someone can point me architecturally to where the code would go and what interfaces it would need to implement.

Related component

Search:Performance

Describe alternatives you've considered

I have built my own custom visualizations, but it would be nice if OpenSearch could handle this out of the box rather than me needing to go to another tool.

Additional context

Although it uses different technology, here is a prototype of the idea that I built several years ago:
https://d2xis0feu0l7hz.cloudfront.net/index.html

It uses D3 on the frontend and my own columnar format as the data format. As a POC, I was able to display a table of 10 million rows of finance data. I also have a chart in which D3 aggregates all 43 million rows of data. For comparison, Excel has a limit of 1 million rows, and Google Docs has a limit of 10 million cells. Apache Arrow should be a good replacement for my columnar format to make the solution more standard.

@jayaskren added the enhancement and untriaged labels Nov 18, 2024
@sandeshkr419
Contributor

sandeshkr419 commented Nov 27, 2024

[Search Triage] @rishabhmaurya & @msfroh - Can you please take a look?

@rishabhmaurya
Contributor

@jayaskren we are actively working on integrating the Arrow format and Flight RPC into the search path - #16679

I'm trying to better understand your dashboard use case.

@msfroh
Collaborator

msfroh commented Nov 27, 2024

Incidentally, as we think about how to support streaming Arrow to clients, we should think about whether Arrow Flight makes more sense or if we should embrace something like ADBC (or both?)

@jayaskren -- As a potential user of the functionality, do you have an opinion?

Also, if we do want to be able to both ingest and return data in Arrow format, it arguably makes sense to persist it in Arrow as well. I've been toying with the idea of trying to implement a Lucene doc value format based on Arrow. If we have that, then OpenSearch should be able to return Arrow results with even less effort.

@jayaskren
Author

I'm impressed you are working on this. I assumed I would need to do it and create a pull request. As for my use case, I am creating a custom dashboard using Vega. The 10,000 results limit is killing me. The data I want to plot has millions of data points, or potentially tens of millions if we want to show more than a week at a time. For the short term, we are just going to build a custom web app with Vega on the front end and static Arrow files on the backend. In the long term, I would love it if I could just use OpenSearch with Vega off the shelf.

I used a proprietary format instead of Apache Arrow in the past, but Arrow should do the job just as well. In my experience, using a highly optimized compressed format like Arrow instead of JSON makes everything better: encoding and decoding are faster, transferring data is faster because it is smaller, memory use is lower, and there is less memory to garbage collect. All of that means you can send more data to the browser. Especially when the cardinality of the data is relatively low, it can be an order of magnitude difference.
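To illustrate the low-cardinality point, here is a minimal sketch of dictionary encoding, a technique that columnar formats such as Arrow support. It is not Arrow's actual wire format, and the server names, row count, and helper functions (`makeColumn`, `dictionaryEncode`, `dictionaryDecode`) are hypothetical, made up for this example. The size comparison is only a rough one between JSON text length and index bytes plus dictionary text.

```typescript
// Hypothetical data: one column with only a few distinct values (server names).
function makeColumn(rows: number, distinct: string[]): string[] {
  const out: string[] = [];
  for (let i = 0; i < rows; i++) out.push(distinct[i % distinct.length]);
  return out;
}

interface DictionaryColumn {
  dictionary: string[]; // each distinct value is stored once
  indices: Uint8Array;  // one byte per row, pointing into the dictionary
}

// Dictionary-encode a column with at most 256 distinct values.
function dictionaryEncode(values: string[]): DictionaryColumn {
  const lookup = new Map<string, number>();
  const dictionary: string[] = [];
  const indices = new Uint8Array(values.length);
  values.forEach((v, i) => {
    let idx = lookup.get(v);
    if (idx === undefined) {
      if (dictionary.length >= 256) {
        throw new Error("more than 256 distinct values");
      }
      idx = dictionary.length;
      dictionary.push(v);
      lookup.set(v, idx);
    }
    indices[i] = idx;
  });
  return { dictionary, indices };
}

// Rebuild the original column from the dictionary and the indices.
function dictionaryDecode(col: DictionaryColumn): string[] {
  return Array.from(col.indices, (i) => col.dictionary[i]);
}

const servers = makeColumn(100000, ["web-01", "web-02", "db-01", "cache-01"]);
const encoded = dictionaryEncode(servers);

// Rough size comparison: JSON text length vs. index bytes plus dictionary text.
const jsonSize = JSON.stringify(servers).length;
const encodedSize =
  encoded.indices.byteLength + JSON.stringify(encoded.dictionary).length;
console.log({ jsonSize, encodedSize });
```

With only four distinct values, every row costs a single byte instead of a quoted string, which is where most of the saving comes from. Real formats add further layers (wider indices, validity bitmaps, optional buffer compression) that this sketch leaves out.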

I can't share our dataset, but I could potentially find some open data and make something similar in OpenSearch and with Apache Arrow to show the difference.

@jayaskren
Author

@msfroh I don't have a strong opinion yet, since my experience is with another comparable format. I'm hoping to put something together that I can share in a couple of weeks. I can dig into the different options while I'm at it.

@jayaskren
Author

I thought I saw recently that Flight was not fully implemented in JavaScript, which might make it hard to take advantage of on the frontend. I'm traveling for the holidays but could look further when I get home.

@jayaskren
Author

In case it is not obvious, being able to get results in the Arrow format but not being able to parse the results in JavaScript is not as useful

@jayaskren
Author

jayaskren commented Dec 1, 2024

Here is the GitHub issue about Arrow Flight and JavaScript:
apache/arrow#17325

I didn't know about ADBC, but it looks really cool! I have wanted something like this, as I have struggled to get large amounts of data out of Postgres efficiently. It doesn't look like there is a JavaScript client for ADBC either.

My use case for this is to consume Arrow from the browser, so these wouldn't solve my issue. I could see them both being useful on the backend though.

@rishabhmaurya
Contributor

@jayaskren It would be nice if you could evaluate in parallel how we can support the FlightStream APIs you need in JavaScript. Maybe we can start with a minimal wrapper over FlightStream that we can expose to the JavaScript OpenSearch client?

@jayaskren
Author

jayaskren commented Dec 27, 2024

I apologize for not responding sooner. I have been busy with work and then the holidays. I think I jumped the gun. I have been experimenting with Apache Arrow, and it doesn't appear to support basic bin packing encoding as I expected, which is where I have seen a lot of bang for the buck.
So compression is much worse compared to Parquet and ORC. The Parquet format might be a better fit, but it is not currently supported in Vega. Arrow can also handle the Parquet format in theory. I need to experiment on the JavaScript side and see what I can get working with Vega, and that can help guide what would be best for the server to return.
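For readers unfamiliar with the encoding in question, here is a minimal sketch, assuming "bin packing" refers to bit packing of the kind Parquet uses in its RLE/bit-packing hybrid encoding. Arrow stores fixed-width integers at byte-aligned widths (8, 16, 32, or 64 bits), so a column whose values fit in 3 bits still takes at least one byte per value. The functions `bitPack` and `bitUnpack` and the sample values are hypothetical. The bit order is chosen for simplicity and is not claimed to match any particular file format.

```typescript
// Pack non-negative integers into `bitWidth` bits each (1 to 8 bits in this sketch).
function bitPack(values: number[], bitWidth: number): Uint8Array {
  if (bitWidth < 1 || bitWidth > 8) throw new Error("bitWidth must be 1..8");
  const out = new Uint8Array(Math.ceil((values.length * bitWidth) / 8));
  let bitPos = 0; // position of the next free bit in the output
  for (const v of values) {
    if (!Number.isInteger(v) || v < 0 || v >= 1 << bitWidth) {
      throw new Error(`value ${v} does not fit in ${bitWidth} bits`);
    }
    for (let b = 0; b < bitWidth; b++) {
      // Copy bit b of the value to the output, least significant bit first.
      if ((v >> b) & 1) out[bitPos >> 3] |= 1 << (bitPos & 7);
      bitPos++;
    }
  }
  return out;
}

// Reverse of bitPack: read `count` values of `bitWidth` bits each.
function bitUnpack(packed: Uint8Array, bitWidth: number, count: number): number[] {
  const out: number[] = [];
  let bitPos = 0;
  for (let i = 0; i < count; i++) {
    let v = 0;
    for (let b = 0; b < bitWidth; b++) {
      if ((packed[bitPos >> 3] >> (bitPos & 7)) & 1) v |= 1 << b;
      bitPos++;
    }
    out.push(v);
  }
  return out;
}

// Eight values in the range 0..7 need 3 bits each: 24 bits, or 3 bytes.
const statusCodes = [5, 0, 7, 3, 1, 6, 2, 4];
const packed = bitPack(statusCodes, 3);
const restored = bitUnpack(packed, 3, statusCodes.length);
console.log(packed.byteLength, restored);
```

The same eight values held in an `Int32Array` would occupy 32 bytes, which is the kind of gap that matters when the value range is small.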


4 participants