Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs(python): Add Polars & LLMs page to the user guide #21160

Closed
wants to merge 23 commits into from
Closed
Show file tree
Hide file tree
Changes from 8 commits
Commits
Show all changes
23 commits
Select commit Hold shift + click to select a range
1828fa6
Add Polars LLMS page
Feb 10, 2025
201c69f
Merge branch 'main' into list-expr-docstring-examples
Feb 10, 2025
a892956
Add to index
Feb 10, 2025
b431a67
Updates
Feb 10, 2025
4354002
Edits
Feb 10, 2025
c2215ec
Formatted file
Feb 10, 2025
8cb3d2b
Add custom API docs
braaannigan Feb 11, 2025
8513e0a
Format file
Feb 11, 2025
be4e181
Fix typo
braaannigan Feb 11, 2025
bf3151e
empty commit
Feb 12, 2025
a25f3af
feat: Improve DataFrame fmt in explain (#21158)
ritchie46 Feb 10, 2025
9c7be88
fix: Fix projection count query optimization (#21162)
ritchie46 Feb 10, 2025
e982c8e
fix: Projection of only row index in new streaming IPC (#21167)
coastalwhite Feb 10, 2025
0ac3e6e
chore: Install seaborn when running remote benchmark (#21168)
coastalwhite Feb 10, 2025
0359406
docs: Improve Arrow key feature description (#21171)
edwinvehmaanpera Feb 10, 2025
e586fc1
chore: Add feature gate to old streaming deprecation warning (#21179)
lukemanley Feb 11, 2025
3b9deb2
feat: Add row index to new streaming multiscan (#21169)
coastalwhite Feb 11, 2025
f987f01
fix: Raise error instead of panicking for unsupported SQL operations …
jqnatividad Feb 11, 2025
6c33e7d
fix: Do not panic in `strptime()` if `format` ends with '%' (#21176)
etiennebacher Feb 11, 2025
c3c4edb
feat: Add SQL support for the `DELETE` statement (#21190)
alexander-beedie Feb 12, 2025
44fa71b
feat: Don't take in rewriting visitor (#21212)
ritchie46 Feb 12, 2025
7b14ada
perf: Add sampling to new-streaming equi join to decide between build…
orlp Feb 12, 2025
2be2036
refactor(rust): Use distributor channel in new-streaming CSV reader a…
orlp Feb 12, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
76 changes: 76 additions & 0 deletions docs/source/user-guide/polars_llms.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,76 @@
# Generating Polars code with LLMs

Large Language Models (LLMs) can sometimes return Pandas code or invalid Polars code in their
output. This guide presents approaches that help LLMs generate valid Polars code more consistently.

These approaches have been developed by the Polars community through testing model responses to
various inputs. If you find additional effective approaches for generating Polars code from LLMs,
please raise an [pull request](https://github.com/pola-rs/polars/pulls).

## System prompt

Many LLMs allow you to provide a system prompt that is included with every individual prompt you
send to the model. In the system prompt, you can specify your preferred defaults, such as "Use
Polars as the default dataframe library". Including such a system prompt typically leads to models
consistently generating Polars code rather than Pandas code.

You can set this system prompt in the settings menu of both web-based LLMs like ChatGPT and
IDE-based LLMs like Cursor. Refer to each application's documentation for specific instructions.

## Enable web search

Some LLMs can search the web to access information beyond their pre-training data. Enabling web
search allows an LLM to reference up-to-date Polars documentation for the current API.

However, web search is not a universal solution. If a model is confident in a result based on its
pre-training data, it may not incorporate web search results in its output.

## Reference documentation

Some IDE-based LLMs can index the Polars API documentation and reference this when generating code.
For example, in Cursor you can add Polars as a custom docs source.

## Provide examples

You can guide LLMs to use correct syntax by including relevant examples in your prompt.

For instance, this basic query:

```python
df = pl.DataFrame({
"id": ["a", "b", "a", "b", "c"],
"score": [1, 2, 1, 3, 3],
"year": [2020, 2020, 2021, 2021, 2021],
})
# Compute average of score by id
```

Often results in outdated `groupby` syntax instead of the correct `group_by`.

However, including a simple example from the Polars `group_by` documentation (preferably with web
search enabled) like this:

```python
df = pl.DataFrame({
"id": ["a", "b", "a", "b", "c"],
"score": [1, 2, 1, 3, 3],
"year": [2020, 2020, 2021, 2021, 2021],
})
# Compute average of score by id
# Examples of Polars code:

# df.group_by("a").agg(pl.col("b").mean())
```

Produces valid outputs more consistently. This approach has been validated across several leading
models.

The combination of web search and examples is more effective than either independently. Model
outputs indicate that when an example contradicts the model's pre-trained expectations, it seems
more likely to trigger a web search for verification.

Additionally, explicit instructions like "use `group_by` instead of `groupby`" can be effective in
guiding the model to use correct syntax.

Common examples such as `df.group_by("a").agg(pl.col("b").mean())` can also be added the system
prompt for more consistency.
1 change: 1 addition & 0 deletions mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -81,6 +81,7 @@ nav:
- user-guide/migration/pandas.md
- user-guide/migration/spark.md
- user-guide/ecosystem.md
- user-guide/polars_llms.md
- Misc:
- user-guide/misc/multiprocessing.md
- user-guide/misc/visualization.md
Expand Down
Loading