Commit a80a788

Enable Dataframe to be converted into views which can be used in register_table (#1016)
* add test_view
* feat: add into_view method to register DataFrame as a view
* add pytableprovider
* feat: add as_table method to PyTableProvider and update into_view to return PyTable
* refactor: simplify as_table method and update documentation for into_view
* test: improve test_register_filtered_dataframe by removing redundant comments and assertions
* test: enhance test_register_filtered_dataframe with additional assertions for DataFrame results
* ruff formatted
* cleanup: remove unused imports from test_view.py
* docs: add example for registering a DataFrame as a view in README.md
* docs: update docstring for into_view method to clarify usage as ViewTable
* chore: add license header to test_view.py
* ruff correction
* refactor: rename into_view method to _into_view
* ruff lint
* refactor: simplify into_view method and update Rust binding convention
* docs: add views section to user guide with example on registering views
* feat: add register_view method to SessionContext for DataFrame registration
* docs: update README and user guide to reflect register_view method for DataFrame registration
* docs: remove some documentation from PyDataFrame
1 parent 69ebf70 commit a80a788

7 files changed: +203 -0 lines changed

README.md

+40
@@ -79,6 +79,46 @@ This produces the following chart:

 ![Chart](examples/chart.png)

+## Registering a DataFrame as a View
+
+You can use SessionContext's `register_view` method to convert a DataFrame into a view and register it with the context.
+
+```python
+from datafusion import SessionContext, col, literal
+
+# Create a DataFusion context
+ctx = SessionContext()
+
+# Create sample data
+data = {"a": [1, 2, 3, 4, 5], "b": [10, 20, 30, 40, 50]}
+
+# Create a DataFrame from the dictionary
+df = ctx.from_pydict(data, "my_table")
+
+# Filter the DataFrame (for example, keep rows where a > 2)
+df_filtered = df.filter(col("a") > literal(2))
+
+# Register the dataframe as a view with the context
+ctx.register_view("view1", df_filtered)
+
+# Now run a SQL query against the registered view
+df_view = ctx.sql("SELECT * FROM view1")
+
+# Collect the results
+results = df_view.collect()
+
+# Convert results to a list of dictionaries for display
+result_dicts = [batch.to_pydict() for batch in results]
+
+print(result_dicts)
+```
+
+This will output:
+
+```python
+[{'a': [3, 4, 5], 'b': [30, 40, 50]}]
+```
+
 ## Configuration

 It is possible to configure runtime (memory and disk settings) and configuration settings when creating a context.

docs/source/user-guide/common-operations/index.rst

+1
@@ -23,6 +23,7 @@ The contents of this section are designed to guide a new user through how to use
 .. toctree::
    :maxdepth: 2

+   views
    basic-info
    select-and-filter
    expressions

docs/source/user-guide/common-operations/views.rst

+58
@@ -0,0 +1,58 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+======================
+Registering Views
+======================
+
+You can use the context's ``register_view`` method to register a DataFrame as a view
+
+.. code-block:: python
+
+    from datafusion import SessionContext, col, literal
+
+    # Create a DataFusion context
+    ctx = SessionContext()
+
+    # Create sample data
+    data = {"a": [1, 2, 3, 4, 5], "b": [10, 20, 30, 40, 50]}
+
+    # Create a DataFrame from the dictionary
+    df = ctx.from_pydict(data, "my_table")
+
+    # Filter the DataFrame (for example, keep rows where a > 2)
+    df_filtered = df.filter(col("a") > literal(2))
+
+    # Register the dataframe as a view with the context
+    ctx.register_view("view1", df_filtered)
+
+    # Now run a SQL query against the registered view
+    df_view = ctx.sql("SELECT * FROM view1")
+
+    # Collect the results
+    results = df_view.collect()
+
+    # Convert results to a list of dictionaries for display
+    result_dicts = [batch.to_pydict() for batch in results]
+
+    print(result_dicts)
+
+This will output:
+
+.. code-block:: python
+
+    [{'a': [3, 4, 5], 'b': [30, 40, 50]}]

python/datafusion/context.py

+12
@@ -707,6 +707,18 @@ def from_polars(self, data: polars.DataFrame, name: str | None = None) -> DataFr
         """
         return DataFrame(self.ctx.from_polars(data, name))

+    # https://github.com/apache/datafusion-python/pull/1016#discussion_r1983239116
+    # is the discussion on how we arrived at adding register_view
+    def register_view(self, name: str, df: DataFrame):
+        """Register a :py:class:`~datafusion.dataframe.DataFrame` as a view.
+
+        Args:
+            name (str): The name to register the view under.
+            df (DataFrame): The DataFrame to be converted into a view and registered.
+        """
+        view = df.into_view()
+        self.ctx.register_table(name, view)
+
     def register_table(self, name: str, table: Table) -> None:
         """Register a :py:class: `~datafusion.catalog.Table` as a table.
712724

python/datafusion/dataframe.py

+4
@@ -124,6 +124,10 @@ def __init__(self, df: DataFrameInternal) -> None:
         """
         self.df = df

+    def into_view(self) -> pa.Table:
+        """Convert DataFrame into a ViewTable that can be used in register_table."""
+        return self.df.into_view()
+
     def __getitem__(self, key: str | List[str]) -> DataFrame:
         """Return a new :py:class`DataFrame` with the specified column or columns.
129133

python/tests/test_view.py

+49
@@ -0,0 +1,49 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+
+from datafusion import SessionContext, col, literal
+
+
+def test_register_filtered_dataframe():
+    ctx = SessionContext()
+
+    data = {"a": [1, 2, 3, 4, 5], "b": [10, 20, 30, 40, 50]}
+
+    df = ctx.from_pydict(data, "my_table")
+
+    df_filtered = df.filter(col("a") > literal(2))
+
+    ctx.register_view("view1", df_filtered)
+
+    df_view = ctx.sql("SELECT * FROM view1")
+
+    filtered_results = df_view.collect()
+
+    result_dicts = [batch.to_pydict() for batch in filtered_results]
+
+    expected_results = [{"a": [3, 4, 5], "b": [30, 40, 50]}]
+
+    assert result_dicts == expected_results
+
+    df_results = df.collect()
+
+    df_result_dicts = [batch.to_pydict() for batch in df_results]
+
+    expected_df_results = [{"a": [1, 2, 3, 4, 5], "b": [10, 20, 30, 40, 50]}]
+
+    assert df_result_dicts == expected_df_results

src/dataframe.rs

+39
@@ -30,6 +30,7 @@ use datafusion::arrow::util::pretty;
 use datafusion::common::UnnestOptions;
 use datafusion::config::{CsvOptions, TableParquetOptions};
 use datafusion::dataframe::{DataFrame, DataFrameWriteOptions};
+use datafusion::datasource::TableProvider;
 use datafusion::execution::SendableRecordBatchStream;
 use datafusion::parquet::basic::{BrotliLevel, Compression, GzipLevel, ZstdLevel};
 use datafusion::prelude::*;
@@ -39,6 +40,7 @@ use pyo3::pybacked::PyBackedStr;
 use pyo3::types::{PyCapsule, PyTuple, PyTupleMethods};
 use tokio::task::JoinHandle;

+use crate::catalog::PyTable;
 use crate::errors::{py_datafusion_err, PyDataFusionError};
 use crate::expr::sort_expr::to_sort_expressions;
 use crate::physical_plan::PyExecutionPlan;
@@ -50,6 +52,25 @@ use crate::{
     expr::{sort_expr::PySortExpr, PyExpr},
 };

+// https://github.com/apache/datafusion-python/pull/1016#discussion_r1983239116
+// - we have not decided on the table_provider approach yet
+// this is an interim implementation
+#[pyclass(name = "TableProvider", module = "datafusion")]
+pub struct PyTableProvider {
+    provider: Arc<dyn TableProvider>,
+}
+
+impl PyTableProvider {
+    pub fn new(provider: Arc<dyn TableProvider>) -> Self {
+        Self { provider }
+    }
+
+    pub fn as_table(&self) -> PyTable {
+        let table_provider: Arc<dyn TableProvider> = self.provider.clone();
+        PyTable::new(table_provider)
+    }
+}
+
 /// A PyDataFrame is a representation of a logical plan and an API to compose statements.
 /// Use it to build a plan and `.collect()` to execute the plan and collect the result.
 /// The actual execution of a plan runs natively on Rust and Arrow on a multi-threaded environment.
@@ -156,6 +177,24 @@ impl PyDataFrame {
         PyArrowType(self.df.schema().into())
     }

+    /// Convert this DataFrame into a Table that can be used in register_table
+    /// By convention, into_... methods consume self and return the new object.
+    /// Disabling the clippy lint, so we can use &self
+    /// because we're working with Python bindings
+    /// where objects are shared
+    /// https://github.com/apache/datafusion-python/pull/1016#discussion_r1983239116
+    /// - we have not decided on the table_provider approach yet
+    #[allow(clippy::wrong_self_convention)]
+    fn into_view(&self) -> PyDataFusionResult<PyTable> {
+        // Call the underlying Rust DataFrame::into_view method.
+        // Note that the Rust method consumes self; here we clone the inner Arc<DataFrame>
+        // so that we don’t invalidate this PyDataFrame.
+        let table_provider = self.df.as_ref().clone().into_view();
+        let table_provider = PyTableProvider::new(table_provider);
+
+        Ok(table_provider.as_table())
+    }
+
     #[pyo3(signature = (*args))]
     fn select_columns(&self, args: Vec<PyBackedStr>) -> PyDataFusionResult<Self> {
         let args = args.iter().map(|s| s.as_ref()).collect::<Vec<&str>>();
