_posts/2025-01-07-arrow-result-transfer.md (+3 -3)
@@ -60,7 +60,7 @@ In a columnar (column-oriented) data format, the values for each column in the d
High-performance analytic databases, data warehouses, query engines, and storage systems have converged on columnar architecture because it speeds up the most common types of analytic queries. Examples of modern columnar query systems include Amazon Redshift, ClickHouse, Databricks Photon Engine, DuckDB, Google BigQuery, Microsoft Azure Synapse Analytics, Snowflake, and Voltron Data Theseus.
- Likewise, many destinations for analytic query results (such as business intelligence tools, data application platforms, dataframe libraries, and machine learning platforms) use columnar architecture. Examples of columnar business intelligence tools include Amazon QuickSight, Domo, Power BI, Qlik Sense, Spotfire, and Tableau. Examples of columnar dataframe libraries include cuDF, pandas, and Polars.
+ Likewise, many destinations for analytic query results (such as business intelligence tools, data application platforms, dataframe libraries, and machine learning platforms) use columnar architecture. Examples of columnar business intelligence tools include Amazon QuickSight, Domo, GoodData, Power BI, Qlik Sense, Spotfire, and Tableau. Examples of columnar dataframe libraries include cuDF, pandas, and Polars.
So it is increasingly common for both the source format and the target format of a query result to be columnar formats. The most efficient way to transfer data between a columnar source and a columnar target is to use a columnar transfer format. This eliminates the need for a time-consuming transpose of the data from columns to rows at the source during the serialization step and another time-consuming transpose of the data from rows to columns at the destination during the deserialization step.
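The transpose-free round trip described in the context paragraph above can be seen in a minimal pyarrow sketch (illustrative code, not part of the post or of this diff): the Arrow IPC stream writer serializes each column's buffers directly, and the reader reconstructs the same columnar layout on the receiving side.

```python
import pyarrow as pa

# A small columnar table: each column's values are stored contiguously.
table = pa.table({"user_id": [1, 2, 3], "score": [0.9, 0.5, 0.7]})

# Serialize to the Arrow IPC stream format. Column buffers are written
# as-is; no transpose of columns into rows takes place.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)

# Deserialize. The reader rebuilds the identical columnar layout,
# again with no row-by-row conversion.
reader = pa.ipc.open_stream(sink.getvalue())
received = reader.read_all()
assert received.equals(table)
```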
@@ -101,9 +101,9 @@ Arrow is a streamable data format. A dataset can be represented in Arrow as a se
Arrow has emerged as a de facto standard format for working with tabular data in memory. The Arrow format is a language-independent open standard. Libraries are available for working with Arrow data in languages including C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust. Applications developed in virtually any mainstream language can add support for sending or receiving data in Arrow format. Data does not need to pass through a specific language runtime, like it must with some database connectivity APIs, including JDBC.
- Arrow’s universality allows it to address a fundamental problem in speeding up real-world data systems: Performance improvements are inherently constrained by a system’s bottlenecks. This problem is known as [Amdahl’s law](https://www.geeksforgeeks.org/computer-organization-amdahls-law-and-its-proof/). In real-world data pipelines, query results often flow through multiple stages, incurring ser/de overheads at each stage. For example, if your data pipeline has five stages and you eliminate ser/de overheads in four of them, your system might be no faster than before because ser/de in the one remaining stage will bottleneck the full pipeline.
+ Arrow’s universality allows it to address a fundamental problem in speeding up real-world data systems: Performance improvements are inherently constrained by a system’s bottlenecks. This problem is known as [Amdahl’s law](https://www.geeksforgeeks.org/computer-organization-amdahls-law-and-its-proof/). In real-world data pipelines, query results often flow through multiple stages, incurring ser/de overheads at each stage. If, for example, your data pipeline has five stages and you eliminate ser/de overheads in four of them, your system might be no faster than before because ser/de in the one remaining stage will bottleneck the full pipeline.
- Arrow’s ability to operate efficiently in virtually any technology stack helps solve this problem. Does your data flow from a Scala-based distributed backend with NVIDIA GPU-accelerated workers to a Jetty-based HTTP server then to a Rails-powered feature engineering app which users interact with through a Node.js-based machine learning framework with a Pyodide-based browser front end? No problem; Arrow libraries are available to eliminate ser/de overheads between all of those components.
+ Arrow’s ability to operate efficiently in virtually any technology stack helps to solve this problem. Does your data flow from a Scala-based distributed backend with NVIDIA GPU-accelerated workers to a Jetty-based HTTP server then to a Rails-powered feature engineering app which users interact with through a Node.js-based machine learning framework with a Pyodide-based browser front end? No problem; Arrow libraries are available to eliminate ser/de overheads between all of those components.
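To make the Amdahl's-law claim in the hunk above concrete, here is a back-of-the-envelope sketch with assumed, illustrative stage costs (the numbers are hypothetical, not from the post): in a streaming pipeline whose throughput is set by its slowest stage, eliminating ser/de in four of five stages leaves throughput unchanged.

```python
# Hypothetical per-batch costs (assumed for illustration): each of five
# stages does 1 s of useful work plus 2 s of ser/de.
work, serde = 1.0, 2.0

before = [work + serde] * 5           # [3, 3, 3, 3, 3]
after = [work] * 4 + [work + serde]   # ser/de eliminated in 4 of 5 stages

# Throughput of a streaming pipeline is limited by its slowest stage.
print(1 / max(before))      # ~0.33 batches/s
print(1 / max(after))       # ~0.33 batches/s -- no faster than before

# Only removing the last stage's ser/de as well lifts the bottleneck.
print(1 / max([work] * 5))  # 1.0 batches/s, a 3x gain
```

A format like Arrow that every stage can consume natively is what makes removing that last bottleneck feasible, which is the point of the paragraph that follows in the diff.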