Skip to content

Commit 9a2ae94

Browse files
committed
Minor tweaks
1 parent e0a88ef commit 9a2ae94

File tree

1 file changed

+3
-3
lines changed

1 file changed

+3
-3
lines changed

Diff for: _posts/2025-01-07-arrow-result-transfer.md

+3-3
Original file line numberDiff line numberDiff line change
@@ -60,7 +60,7 @@ In a columnar (column-oriented) data format, the values for each column in the d
6060

6161
High-performance analytic databases, data warehouses, query engines, and storage systems have converged on columnar architecture because it speeds up the most common types of analytic queries. Examples of modern columnar query systems include Amazon Redshift, ClickHouse, Databricks Photon Engine, DuckDB, Google BigQuery, Microsoft Azure Synapse Analytics, Snowflake, and Voltron Data Theseus.
6262

63-
Likewise, many destinations for analytic query results (such as business intelligence tools, data application platforms, dataframe libraries, and machine learning platforms) use columnar architecture. Examples of columnar business intelligence tools include Amazon QuickSight, Domo, Power BI, Qlik Sense, Spotfire, and Tableau. Examples of columnar dataframe libraries include cuDF, pandas, and Polars.
63+
Likewise, many destinations for analytic query results (such as business intelligence tools, data application platforms, dataframe libraries, and machine learning platforms) use columnar architecture. Examples of columnar business intelligence tools include Amazon QuickSight, Domo, GoodData, Power BI, Qlik Sense, Spotfire, and Tableau. Examples of columnar dataframe libraries include cuDF, pandas, and Polars.
6464

6565
So it is increasingly common for both the source format and the target format of a query result to be columnar formats. The most efficient way to transfer data between a columnar source and a columnar target is to use a columnar transfer format. This eliminates the need for a time-consuming transpose of the data from columns to rows at the source during the serialization step and another time-consuming transpose of the data from rows to columns at the destination during the deserialization step.
6666

@@ -101,9 +101,9 @@ Arrow is a streamable data format. A dataset can be represented in Arrow as a se
101101

102102
Arrow has emerged as a de facto standard format for working with tabular data in memory. The Arrow format is a language-independent open standard. Libraries are available for working with Arrow data in languages including C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust. Applications developed in virtually any mainstream language can add support for sending or receiving data in Arrow format. Data does not need to pass through a specific language runtime, like it must with some database connectivity APIs, including JDBC.
103103

104-
Arrow’s universality allows it to address a fundamental problem in speeding up real-world data systems: Performance improvements are inherently constrained by a system’s bottlenecks. This problem is known as [Amdahl’s law](https://www.geeksforgeeks.org/computer-organization-amdahls-law-and-its-proof/). In real-world data pipelines, query results often flow through multiple stages, incurring ser/de overheads at each stage. For example, if your data pipeline has five stages and you eliminate ser/de overheads in four of them, your system might be no faster than before because ser/de in the one remaining stage will bottleneck the full pipeline.
104+
Arrow’s universality allows it to address a fundamental problem in speeding up real-world data systems: Performance improvements are inherently constrained by a system’s bottlenecks. This problem is known as [Amdahl’s law](https://www.geeksforgeeks.org/computer-organization-amdahls-law-and-its-proof/). In real-world data pipelines, query results often flow through multiple stages, incurring ser/de overheads at each stage. If, for example, your data pipeline has five stages and you eliminate ser/de overheads in four of them, your system might be no faster than before because ser/de in the one remaining stage will bottleneck the full pipeline.
105105

106-
Arrow’s ability to operate efficiently in virtually any technology stack helps solve this problem. Does your data flow from a Scala-based distributed backend with NVIDIA GPU-accelerated workers to a Jetty-based HTTP server then to a Rails-powered feature engineering app which users interact with through a Node.js-based machine learning framework with a Pyodide-based browser front end? No problem; Arrow libraries are available to eliminate ser/de overheads between all of those components.
106+
Arrow’s ability to operate efficiently in virtually any technology stack helps to solve this problem. Does your data flow from a Scala-based distributed backend with NVIDIA GPU-accelerated workers to a Jetty-based HTTP server then to a Rails-powered feature engineering app which users interact with through a Node.js-based machine learning framework with a Pyodide-based browser front end? No problem; Arrow libraries are available to eliminate ser/de overheads between all of those components.
107107

108108
### Conclusion
109109

0 commit comments

Comments
 (0)