Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Defer credential provider resolution to take place at query collection instead of construction #21225

Merged
merged 17 commits into from
Feb 13, 2025

Conversation

nameexhaustion
Copy link
Collaborator

@nameexhaustion nameexhaustion commented Feb 13, 2025

We currently do this during query construction, which can lead to unpredictable errors from environment differences when queries are sent to another machine for execution

Updated verbose output

Setting POLARS_VERBOSE=1, one can now observe the following log output when constructing a LazyFrame that scans from S3:

_init_credential_provider_builder(): credential_provider_init = CredentialProviderBuilder(CredentialProviderAWS @ AutoInitAWS)

During serialization/deserialization, the following log lines can be used to track what is being serialized:

  • During serialization:
[CredentialProviderBuilder]: __getstate__(): state = CredentialProviderAWS @ AutoInitAWS 
  • During deserialization:
[CredentialProviderBuilder]: __setstate__(): self = CredentialProviderBuilder(CredentialProviderAWS @ AutoInitAWS)

Examples of other variants that can be observed:

credential_provider_init = CredentialProviderBuilder(CredentialProviderAWS @ AutoInitAWS)
credential_provider_init = CredentialProviderBuilder(CredentialProviderGCP @ AutoInit)
credential_provider_init = CredentialProviderBuilder(CredentialProviderAzure @ AutoInit)
credential_provider_init = CredentialProviderBuilder(<function my_custom_credential_provider at 0x100dc6340> @ InitializedCredentialProvider)
credential_provider_init = CredentialProviderBuilder(<__main__.MyCustomCredentialProvider object at 0x1014aaff0> @ InitializedCredentialProvider)
credential_provider_init = None

Demo of behavior change

Following examples show an experiment where a query is constructed in an environment without boto3 installed, but is executed on an environment where boto3 is available:

Before

  • Observe we immediately attempt to initialize CredentialProviderAWS, this fails and we set the credential_provider to None. As nothing gets serialized, no credential provider is used during query execution, causing it to fail:
import os

os.environ["POLARS_VERBOSE"] = "1"

import pickle
import subprocess

import polars as pl

subprocess.check_output("pip uninstall -y boto3", shell=True)
q = pl.scan_parquet("s3://.../...")
# Unable to auto-select credential provider: boto3 must be installed to use `CredentialProviderAWS`

dump = pickle.dumps(q)
deserialized = pickle.loads(dump)
subprocess.check_output("pip install boto3", shell=True)
print("shape:", deserialized.collect().shape)
# polars.exceptions.ComputeError: The operation lacked the necessary privileges...

After

  • The query now captures state indicating that a credential_provider should be auto-initialized (CredentialProviderAWS @ AutoInitAWS), and this auto-initialization happens only when the query is collected. In this case the auto-initialization succeeds as boto3 is available at the point of collection, and the query successfully executes:
import os

os.environ["POLARS_VERBOSE"] = "1"

import pickle
import subprocess

import polars as pl

subprocess.check_output("pip uninstall -y boto3", shell=True)
q = pl.scan_parquet("s3://.../...")
# _init_credential_provider_builder(): credential_provider_init = CredentialProviderBuilder(CredentialProviderAWS @ AutoInitAWS)

dump = pickle.dumps(q)
# [CredentialProviderBuilder]: __getstate__(): state = CredentialProviderAWS @ AutoInitAWS

deserialized = pickle.loads(dump)
# [CredentialProviderBuilder]: __setstate__(): self = CredentialProviderBuilder(CredentialProviderAWS @ AutoInitAWS)

subprocess.check_output("pip install boto3", shell=True)
# Installed 1 package in 1ms
#  + boto3==1.36.19

print("shape:", deserialized.collect().shape)
# [CredentialProviderBuilder]: Begin initialize CredentialProviderAWS @ AutoInitAWS
# [CredentialProviderBuilder]: Initialized <polars.io.cloud.credential_provider._providers.CredentialProviderAWS object at 0x1077e35f0> from CredentialProviderAWS @ AutoInitAWS
# ...<other verbose logs>
# shape: (150, 5)

@github-actions github-actions bot added fix Bug fix python Related to Python Polars rust Related to Rust Polars labels Feb 13, 2025
@nameexhaustion nameexhaustion changed the title fix: Ensure deserialized queries perform credential provider auto initialization fix: Defer credential provider resolution to occur during query execution Feb 13, 2025
Copy link

codecov bot commented Feb 13, 2025

Codecov Report

Attention: Patch coverage is 73.65591% with 98 lines in your changes missing coverage. Please review.

Project coverage is 79.77%. Comparing base (9bcff72) to head (e6ee1a4).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
...rs/polars/io/cloud/credential_provider/_builder.py 63.75% 42 Missing and 12 partials ⚠️
crates/polars-io/src/cloud/credential_provider.rs 78.00% 11 Missing ⚠️
.../polars/io/cloud/credential_provider/_providers.py 28.57% 10 Missing ⚠️
py-polars/polars/catalog/unity/client.py 0.00% 9 Missing ⚠️
crates/polars-utils/src/python_function.rs 88.05% 8 Missing ⚠️
py-polars/polars/dataframe/frame.py 86.66% 1 Missing and 1 partial ⚠️
...quet/src/parquet/metadata/column_chunk_metadata.rs 0.00% 1 Missing ⚠️
crates/polars-plan/src/dsl/expr_dyn_fn.rs 50.00% 1 Missing ⚠️
crates/polars-python/src/catalog/unity.rs 0.00% 1 Missing ⚠️
py-polars/polars/io/delta.py 83.33% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #21225      +/-   ##
==========================================
+ Coverage   79.75%   79.77%   +0.01%     
==========================================
  Files        1593     1596       +3     
  Lines      228119   228274     +155     
  Branches     2600     2607       +7     
==========================================
+ Hits       181947   182098     +151     
- Misses      45575    45580       +5     
+ Partials      597      596       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@nameexhaustion nameexhaustion changed the title fix: Defer credential provider resolution to occur during query execution fix: Defer credential provider resolution to take place at query collection instead of construction Feb 13, 2025
where
S: serde::Serializer,
{
PySerializeWrap(self.0.as_ref()).serialize(serializer)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replaced with serde(serialize_with = "PythonObject::serialize_with_pyversion") above

D: Deserializer<'de>,
{
type T = Option<PlCredentialProvider>;
T::deserialize(deserializer).or_else(|_| Ok(Default::default()))
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • This was added by test: Fix tests #20745 - I'm not sure for the reason. But I've removed it as it was causing deserialization errors to be silently ignored.

if config::verbose() {
eprintln!(
"serialize_pyobject_with_cloudpickle_fallback(): \
retrying with cloudpickle due to error: {:?}",
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During testing for the error case, __getstate__() was called twice, but it was not immediately clear that this was due to a cloudpickle serialization fallback. We now have a verbose log indicating this is what is happening.

source: Any,
storage_options: dict[str, Any] | None,
caller_name: str,
) -> CredentialProviderBuilder | None:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of this code is moved, with some adjustments so that it returns a CredentialProviderBuilder instead of directly initializing the credential provider.

@nameexhaustion nameexhaustion marked this pull request as ready for review February 13, 2025 08:45
@nameexhaustion

This comment was marked as outdated.

@nameexhaustion nameexhaustion marked this pull request as draft February 13, 2025 10:53
@nameexhaustion nameexhaustion marked this pull request as ready for review February 13, 2025 11:19
@ritchie46 ritchie46 merged commit 7f22a25 into pola-rs:main Feb 13, 2025
29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix Bug fix python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Run _init_credential_provider before executing deserialized logical plan if credential_provider="auto"
2 participants