Commit d6ef9bc

Add DataFrame API Documentation for DataFusion Python (#1132)
* feat: add API reference documentation for DataFrame and index
* feat: add tests for validating RST syntax, code blocks, and internal links in DataFrame API documentation
* refactor: remove test script for DataFrame API documentation in RST format
* fix: correct formatting inconsistencies in dataframe.rst
* fix: correct header formatting in functions.rst
* fix: adjust formatting for code block in dataframe.rst
* fix: skip documentation for duplicate modules in autoapi configuration
* fix: add cross reference to io pages
1 parent 24f0b1a commit d6ef9bc

6 files changed: +422 -2 lines changed

docs/source/api/dataframe.rst

Lines changed: 387 additions & 0 deletions
@@ -0,0 +1,387 @@
.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.

=================
DataFrame API
=================

Overview
--------

The ``DataFrame`` class is the core abstraction in DataFusion that represents tabular data and operations
on that data. DataFrames provide a flexible API for transforming data through various operations such as
filtering, projection, aggregation, joining, and more.

A DataFrame represents a logical plan that is lazily evaluated. The actual execution occurs only when
terminal operations like ``collect()``, ``show()``, or ``to_pandas()`` are called.
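
Lazy evaluation can be seen in a small sketch. It assumes an in-memory table built with
``ctx.from_pydict`` and made-up column names ``a`` and ``b``; the transformation calls only
extend the plan, and nothing runs until ``to_pydict()`` is called.

.. code-block:: python

    from datafusion import SessionContext, column, literal

    ctx = SessionContext()

    # Illustrative in-memory data; the column names are assumptions.
    df = ctx.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})

    # These calls only build up the logical plan; no data is processed yet.
    plan = (
        df.filter(column("a") > literal(1))
        .select("a", "b")
        .sort(column("a").sort(ascending=True))
    )

    # Execution happens here, when a terminal operation is invoked.
    result = plan.to_pydict()

The explicit sort is there only to make the row order of ``result`` deterministic.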

Creating DataFrames
-------------------

DataFrames can be created in several ways:

* From SQL queries via a ``SessionContext``:

  .. code-block:: python

      from datafusion import SessionContext

      ctx = SessionContext()
      df = ctx.sql("SELECT * FROM your_table")

* From registered tables:

  .. code-block:: python

      df = ctx.table("your_table")

* From various data sources:

  .. code-block:: python

      # From CSV files (see :ref:`io_csv` for detailed options)
      df = ctx.read_csv("path/to/data.csv")

      # From Parquet files (see :ref:`io_parquet` for detailed options)
      df = ctx.read_parquet("path/to/data.parquet")

      # From JSON files (see :ref:`io_json` for detailed options)
      df = ctx.read_json("path/to/data.json")

      # From Avro files (see :ref:`io_avro` for detailed options)
      df = ctx.read_avro("path/to/data.avro")

      # From Pandas DataFrame
      import pandas as pd
      pandas_df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
      df = ctx.from_pandas(pandas_df)

      # From Arrow data
      import pyarrow as pa
      batch = pa.RecordBatch.from_arrays(
          [pa.array([1, 2, 3]), pa.array([4, 5, 6])],
          names=["a", "b"]
      )
      df = ctx.from_arrow(batch)

For detailed information about reading from different data sources, see the :doc:`I/O Guide <../user-guide/io/index>`.
For custom data sources, see :ref:`io_custom_table_provider`.

Common DataFrame Operations
---------------------------

DataFusion's DataFrame API offers a wide range of operations:

.. code-block:: python

    from datafusion import column, literal
    from datafusion import functions as f

    # Select specific columns
    df = df.select("col1", "col2")

    # Select with expressions
    df = df.select(column("a") + column("b"), column("a") - column("b"))

    # Filter rows
    df = df.filter(column("age") > literal(25))

    # Add computed columns (strings are joined with f.concat; ``+`` is arithmetic)
    df = df.with_column(
        "full_name",
        f.concat(column("first_name"), literal(" "), column("last_name")),
    )

    # Multiple column additions
    df = df.with_columns(
        (column("a") + column("b")).alias("sum"),
        (column("a") * column("b")).alias("product")
    )

    # Sort data
    df = df.sort(column("age").sort(ascending=False))

    # Join DataFrames
    df = df1.join(df2, on="user_id", how="inner")

    # Aggregate data
    df = df.aggregate(
        [],  # Group by columns (empty for global aggregation)
        [f.sum(column("amount")).alias("total_amount")]
    )

    # Limit rows
    df = df.limit(100)

    # Drop columns
    df = df.drop("temporary_column")
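
The aggregation shown above is global (an empty group-by list). A grouped variant might look
like the following sketch, which assumes a small in-memory table created with
``ctx.from_pydict`` and hypothetical ``region`` and ``amount`` columns.

.. code-block:: python

    from datafusion import SessionContext, column
    from datafusion import functions as f

    ctx = SessionContext()

    # Illustrative data; the table contents and column names are assumptions.
    sales = ctx.from_pydict(
        {
            "region": ["east", "west", "east", "west"],
            "amount": [10, 20, 30, 40],
        }
    )

    # Group by region, sum the amounts, and sort so the output order is stable.
    totals = sales.aggregate(
        [column("region")],
        [f.sum(column("amount")).alias("total_amount")],
    ).sort(column("region").sort(ascending=True))

    result = totals.to_pydict()

Passing one or more expressions in the first list produces one output row per distinct group.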

Terminal Operations
-------------------

To materialize the results of your DataFrame operations:

.. code-block:: python

    # Collect all data as PyArrow RecordBatches
    result_batches = df.collect()

    # Convert to various formats
    pandas_df = df.to_pandas()        # Pandas DataFrame
    polars_df = df.to_polars()        # Polars DataFrame
    arrow_table = df.to_arrow_table() # PyArrow Table
    py_dict = df.to_pydict()          # Python dictionary
    py_list = df.to_pylist()          # Python list of dictionaries

    # Display results
    df.show()                         # Print tabular format to console

    # Count rows
    count = df.count()

HTML Rendering in Jupyter
-------------------------

When working in Jupyter notebooks or other environments that support rich HTML display,
DataFusion DataFrames automatically render as nicely formatted HTML tables. This functionality
is provided by the ``_repr_html_`` method, which is automatically called by Jupyter.

Basic HTML Rendering
~~~~~~~~~~~~~~~~~~~~

In a Jupyter environment, simply displaying a DataFrame object will trigger HTML rendering:

.. code-block:: python

    # Will display as HTML table in Jupyter
    df

    # Explicit display also uses HTML rendering
    display(df)
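
The ``_repr_html_`` hook itself is a general Jupyter/IPython convention: any object that defines
the method gets rich HTML output. The sketch below illustrates the protocol with a hypothetical
``TinyTable`` class; it is a stand-in for explanation only and is not how DataFusion implements
its rendering.

.. code-block:: python

    import html


    class TinyTable:
        """Hypothetical stand-in used only to illustrate the ``_repr_html_`` hook."""

        def __init__(self, columns, rows):
            self.columns = columns
            self.rows = rows

        def _repr_html_(self) -> str:
            # Jupyter looks for this method and, if present, renders its return value.
            header = "".join(f"<th>{html.escape(str(c))}</th>" for c in self.columns)
            body = "".join(
                "<tr>"
                + "".join(f"<td>{html.escape(str(v))}</td>" for v in row)
                + "</tr>"
                for row in self.rows
            )
            return f"<table><tr>{header}</tr>{body}</table>"


    table = TinyTable(["a", "b"], [(1, "<x>")])
    markup = table._repr_html_()  # the same string Jupyter would render

Calling ``_repr_html_()`` directly, as in the last line, is also a convenient way to inspect the
markup outside a notebook.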

HTML Rendering Customization
----------------------------

DataFusion provides extensive customization options for HTML table rendering through the
``datafusion.html_formatter`` module.

Configuring the HTML Formatter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can customize how DataFrames are rendered by configuring the formatter:

.. code-block:: python

    from datafusion.html_formatter import configure_formatter

    configure_formatter(
        max_cell_length=30,                # Maximum length of cell content before truncation
        max_width=800,                     # Maximum width of table in pixels
        max_height=400,                    # Maximum height of table in pixels
        max_memory_bytes=2 * 1024 * 1024,  # Maximum memory used for rendering (2MB)
        min_rows_display=10,               # Minimum rows to display
        repr_rows=20,                      # Number of rows to display in representation
        enable_cell_expansion=True,        # Allow cells to be expandable on click
        custom_css=None,                   # Custom CSS to apply
        show_truncation_message=True,      # Show message when data is truncated
        style_provider=None,               # Custom style provider class
        use_shared_styles=True             # Share styles across tables to reduce duplication
    )

Custom Style Providers
~~~~~~~~~~~~~~~~~~~~~~

For advanced styling needs, you can create a custom style provider class:

.. code-block:: python

    from datafusion.html_formatter import configure_formatter

    class CustomStyleProvider:
        def get_cell_style(self) -> str:
            return "background-color: #f5f5f5; color: #333; padding: 8px; border: 1px solid #ddd;"

        def get_header_style(self) -> str:
            return "background-color: #4285f4; color: white; font-weight: bold; padding: 10px;"

    # Apply custom styling
    configure_formatter(style_provider=CustomStyleProvider())

Custom Type Formatters
~~~~~~~~~~~~~~~~~~~~~~

You can register custom formatters for specific data types:

.. code-block:: python

    import datetime

    from datafusion.html_formatter import get_formatter

    formatter = get_formatter()

    # Format integers with color based on value
    def format_int(value):
        return f'<span style="color: {"red" if value > 100 else "blue"}">{value}</span>'

    formatter.register_formatter(int, format_int)

    # Format date values
    def format_date(value):
        return f'<span class="date-value">{value.isoformat()}</span>'

    formatter.register_formatter(datetime.date, format_date)

Custom Cell Builders
~~~~~~~~~~~~~~~~~~~~

For complete control over cell rendering:

.. code-block:: python

    from datafusion.html_formatter import get_formatter

    formatter = get_formatter()

    def custom_cell_builder(value, row, col, table_id):
        try:
            num_value = float(value)
            if num_value > 0:  # Positive values get green
                return f'<td style="background-color: #d9f0d3">{value}</td>'
            if num_value < 0:  # Negative values get red
                return f'<td style="background-color: #f0d3d3">{value}</td>'
        except (ValueError, TypeError):
            pass

        # Default styling for non-numeric or zero values
        return f'<td style="border: 1px solid #ddd">{value}</td>'

    formatter.set_custom_cell_builder(custom_cell_builder)
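
A cell builder returns raw markup, so values containing characters such as ``<`` or ``&`` end up
in the HTML as written. Whether the formatter escapes values before calling a custom builder is
not stated here, so the following sketch escapes defensively with the standard library's
``html.escape``. The function name ``escaping_cell_builder`` is illustrative; the argument list
mirrors the builder signature used in this section.

.. code-block:: python

    import html


    def escaping_cell_builder(value, row, col, table_id):
        """Return a ``<td>`` element with the cell text HTML-escaped."""
        text = html.escape(str(value))
        return f'<td style="border: 1px solid #ddd">{text}</td>'


    # The builder is a plain function, so it can be checked without a DataFrame.
    cell = escaping_cell_builder("<b>bold</b> & more", 0, 0, "table-1")

It would be registered the same way as any other builder, with
``formatter.set_custom_cell_builder(escaping_cell_builder)``.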

Custom Header Builders
~~~~~~~~~~~~~~~~~~~~~~

Similarly, you can customize the rendering of table headers:

.. code-block:: python

    from datafusion.html_formatter import get_formatter

    formatter = get_formatter()

    def custom_header_builder(field):
        tooltip = f"Type: {field.type}"
        return f'<th style="background-color: #333; color: white" title="{tooltip}">{field.name}</th>'

    formatter.set_custom_header_builder(custom_header_builder)

Managing Formatter State
------------------------

The HTML formatter maintains global state that can be managed:

.. code-block:: python

    from datafusion.html_formatter import reset_formatter, reset_styles_loaded_state, get_formatter

    # Reset the formatter to default settings
    reset_formatter()

    # Reset only the styles loaded state (useful when styles were loaded but need reloading)
    reset_styles_loaded_state()

    # Get the current formatter instance to make changes
    formatter = get_formatter()

Advanced Example: Dashboard-Style Formatting
--------------------------------------------

This example shows how to create a dashboard-like styling for your DataFrames:

.. code-block:: python

    from datafusion.html_formatter import configure_formatter, get_formatter

    # Define custom CSS
    custom_css = """
    .datafusion-table {
        font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
        border-collapse: collapse;
        width: 100%;
        box-shadow: 0 2px 3px rgba(0,0,0,0.1);
    }
    .datafusion-table th {
        position: sticky;
        top: 0;
        z-index: 10;
    }
    .datafusion-table tr:hover td {
        background-color: #f1f7fa !important;
    }
    .datafusion-table .numeric-positive {
        color: #0a7c00;
    }
    .datafusion-table .numeric-negative {
        color: #d13438;
    }
    """

    class DashboardStyleProvider:
        def get_cell_style(self) -> str:
            return "padding: 8px 12px; border-bottom: 1px solid #e0e0e0;"

        def get_header_style(self) -> str:
            return ("background-color: #0078d4; color: white; font-weight: 600; "
                    "padding: 12px; text-align: left; border-bottom: 2px solid #005a9e;")

    # Apply configuration
    configure_formatter(
        max_height=500,
        enable_cell_expansion=True,
        custom_css=custom_css,
        style_provider=DashboardStyleProvider(),
        max_cell_length=50
    )

    # Add custom formatters for numbers
    formatter = get_formatter()

    def format_number(value):
        try:
            num = float(value)
            cls = "numeric-positive" if num > 0 else "numeric-negative" if num < 0 else ""
            return f'<span class="{cls}">{value:,}</span>' if cls else f'{value:,}'
        except (ValueError, TypeError):
            return str(value)

    formatter.register_formatter(int, format_number)
    formatter.register_formatter(float, format_number)

Best Practices
--------------

1. **Memory Management**: For large datasets, use ``max_memory_bytes`` to limit memory usage.

2. **Responsive Design**: Set reasonable ``max_width`` and ``max_height`` values to ensure tables display well on different screens.

3. **Style Optimization**: Use ``use_shared_styles=True`` to avoid duplicate style definitions when displaying multiple tables.

4. **Reset When Needed**: Call ``reset_formatter()`` when you want to start fresh with default settings.

5. **Cell Expansion**: Use ``enable_cell_expansion=True`` when cells might contain longer content that users may want to see in full.
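
The practices above can be collected into a single set of keyword arguments. The sketch below
uses only parameter names that appear in the configuration example on this page; the specific
values are illustrative assumptions, not recommendations from the library.

.. code-block:: python

    # Keyword arguments for configure_formatter; the values are illustrative.
    formatter_settings = {
        "max_memory_bytes": 4 * 1024 * 1024,  # 4 MiB rendering budget
        "max_width": 1000,                    # pixels
        "max_height": 300,                    # pixels
        "use_shared_styles": True,            # avoid duplicate style definitions
        "enable_cell_expansion": True,        # let users expand long cells
    }

    # With DataFusion installed, the settings would be applied like this:
    # from datafusion.html_formatter import configure_formatter
    # configure_formatter(**formatter_settings)

Keeping the settings in one dictionary makes it easy to reapply them after a call to
``reset_formatter()``.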

Additional Resources
--------------------

* :doc:`../user-guide/dataframe` - Complete guide to using DataFrames
* :doc:`../user-guide/io/index` - I/O Guide for reading data from various sources
* :doc:`../user-guide/data-sources` - Comprehensive data sources guide
* :ref:`io_csv` - CSV file reading
* :ref:`io_parquet` - Parquet file reading
* :ref:`io_json` - JSON file reading
* :ref:`io_avro` - Avro file reading
* :ref:`io_custom_table_provider` - Custom table providers
* `API Reference <https://arrow.apache.org/datafusion-python/api/index.html>`_ - Full API reference
