fix: Defer credential provider resolution to take place at query collection instead of construction #21225
Codecov Report
@@            Coverage Diff             @@
##             main   #21225      +/-   ##
==========================================
+ Coverage   79.75%   79.77%   +0.01%
==========================================
  Files        1593     1596       +3
  Lines      228119   228274     +155
  Branches     2600     2607       +7
==========================================
+ Hits       181947   182098     +151
- Misses      45575    45580       +5
+ Partials      597      596       -1
where
    S: serde::Serializer,
{
    PySerializeWrap(self.0.as_ref()).serialize(serializer)
Replaced with serde(serialize_with = "PythonObject::serialize_with_pyversion") above.
    D: Deserializer<'de>,
{
    type T = Option<PlCredentialProvider>;
    T::deserialize(deserializer).or_else(|_| Ok(Default::default()))
This was added by "test: Fix tests" (#20745); I'm not sure why. I've removed it because it was causing deserialization errors to be silently ignored.
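To illustrate why the fallback was a problem, here is a minimal Python sketch (not the Rust code from this PR). It uses the standard `pickle` module as a stand-in serializer; the function names `deserialize_lenient` and `deserialize_strict` are hypothetical. The lenient variant mirrors the spirit of `or_else(|_| Ok(Default::default()))`: a corrupted payload becomes indistinguishable from "no credential provider set".

```python
import pickle
from typing import Any, Optional


def deserialize_lenient(data: bytes) -> Optional[Any]:
    """Swallow every error and fall back to None (the removed behavior, in spirit)."""
    try:
        return pickle.loads(data)
    except Exception:
        return None


def deserialize_strict(data: bytes) -> Any:
    """Let deserialization errors propagate to the caller."""
    return pickle.loads(data)


payload = pickle.dumps({"credential_provider": "auto"})
corrupted = payload[:-3]  # truncate the stream to simulate a broken payload

# The lenient version hides the corruption behind a plain None.
lenient_result = deserialize_lenient(corrupted)

# The strict version surfaces the problem where it happens.
try:
    deserialize_strict(corrupted)
    strict_raised = False
except Exception:
    strict_raised = True
```

With the lenient variant, the failure would only show up later as a missing credential provider, far from its actual cause.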
if config::verbose() {
    eprintln!(
        "serialize_pyobject_with_cloudpickle_fallback(): \
        retrying with cloudpickle due to error: {:?}",
During testing for the error case, __getstate__() was called twice, but it was not immediately clear that this was due to a cloudpickle serialization fallback. We now have a verbose log indicating that this is what is happening.
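The double call can be reproduced with a rough Python sketch of the same try-then-fallback pattern. This is an illustration only, not the Rust implementation: `serialize_with_fallback`, `CountingProvider`, and `stand_in_fallback` are hypothetical names, and the fallback is a stand-in for cloudpickle so that the example runs with the standard library alone.

```python
import pickle
import sys
from typing import Any, Callable


def serialize_with_fallback(
    obj: Any, fallback: Callable[[Any], bytes], verbose: bool = False
) -> bytes:
    """Try pickle first; on failure, optionally log and retry with `fallback`."""
    try:
        return pickle.dumps(obj)
    except Exception as err:
        if verbose:
            print(
                "serialize_with_fallback(): retrying with fallback "
                f"due to error: {err!r}",
                file=sys.stderr,
            )
        return fallback(obj)


class CountingProvider:
    """Counts __getstate__ calls; its state holds a lambda pickle cannot handle."""

    def __init__(self) -> None:
        self.getstate_calls = 0
        self.refresh = lambda: "token"

    def __getstate__(self) -> dict:
        self.getstate_calls += 1
        return {"refresh": self.refresh}


def stand_in_fallback(obj: Any) -> bytes:
    """Hypothetical stand-in for cloudpickle: serializes only the state keys."""
    state = obj.__getstate__()
    return pickle.dumps(sorted(state))


provider = CountingProvider()
blob = serialize_with_fallback(provider, stand_in_fallback, verbose=True)
```

The first `__getstate__()` call comes from the failed `pickle.dumps` attempt and the second from the fallback, which is why a log line at the retry point makes the behavior easier to follow.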
    source: Any,
    storage_options: dict[str, Any] | None,
    caller_name: str,
) -> CredentialProviderBuilder | None:
Most of this code is moved, with some adjustments so that it returns a CredentialProviderBuilder instead of directly initializing the credential provider.
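As a rough sketch of the idea (not the actual Polars code), a builder can hold only the recipe for creating a provider, so that nothing is initialized until `build()` is called. The names `DeferredProviderBuilder`, `init_provider_builder`, and `make_aws_provider` below are hypothetical, and the S3 prefix check is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

init_calls: list = []  # records when a provider is actually initialized


def make_aws_provider() -> str:
    """Hypothetical provider factory; a real one would import its dependencies here."""
    init_calls.append("aws")
    return "aws-provider"


@dataclass
class DeferredProviderBuilder:
    """Hypothetical stand-in for a builder: holds a recipe, not a provider."""

    factory: Callable[[], Any]

    def build(self) -> Any:
        return self.factory()


def init_provider_builder(
    credential_provider: Any, source: Any
) -> Optional[DeferredProviderBuilder]:
    """Return a builder (or None) without initializing any provider."""
    if credential_provider is None:
        return None
    if credential_provider == "auto":
        # Assumption for this sketch: only S3 URLs get an auto provider.
        if isinstance(source, str) and source.startswith("s3://"):
            return DeferredProviderBuilder(factory=make_aws_provider)
        return None
    # A user-supplied provider is wrapped unchanged.
    return DeferredProviderBuilder(factory=lambda: credential_provider)


builder = init_provider_builder("auto", "s3://bucket/data.parquet")
calls_after_construction = list(init_calls)  # still empty at this point
provider = builder.build()  # initialization happens only here
```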
_init_credential_provider before executing deserialized logical plan if credential_provider="auto" #21157
We currently resolve the credential provider during query construction, which can lead to unpredictable errors from environment differences when queries are sent to another machine for execution.
Updated verbose output
With POLARS_VERBOSE=1 set, the following log output can now be observed when constructing a LazyFrame that scans from S3:
During serialization/deserialization, the following log lines can be used to track what is being serialized:
Examples of other variants that can be observed:
Demo of behavior change
The following examples show an experiment where a query is constructed in an environment without boto3 installed, but is executed in an environment where boto3 is available:
Before
Query construction attempts to initialize CredentialProviderAWS; this fails and we set the credential_provider to None. As nothing gets serialized, no credential provider is used during query execution, causing it to fail.
After
The credential_provider should be auto-initialized (CredentialProviderAWS @ AutoInitAWS), and this auto-initialization happens only when the query is collected. In this case the auto-initialization succeeds as boto3 is available at the point of collection, and the query successfully executes.
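The experiment can be approximated without Polars or boto3 by a small simulation. Everything here is an assumption made for illustration: the `env` dictionary stands in for the machine's installed packages, and `construct_eager`, `construct_deferred`, and `collect` are hypothetical names rather than Polars APIs.

```python
from typing import Callable, Optional

env = {"boto3_available": False}  # simulated machine state for this demo


def init_aws_provider() -> str:
    """Fails unless the simulated environment has boto3."""
    if not env["boto3_available"]:
        raise ImportError("boto3 is not installed")
    return "aws-provider"


def construct_eager() -> Optional[str]:
    """Before: resolve at construction; on failure the provider becomes None."""
    try:
        return init_aws_provider()
    except ImportError:
        return None


def construct_deferred() -> Callable[[], str]:
    """After: record only that auto-initialization is wanted."""
    return init_aws_provider


def collect(provider: Optional[str]) -> str:
    if provider is None:
        raise PermissionError("no credential provider available")
    return f"query executed with {provider}"


# Construction happens on a machine without boto3.
eager_plan = construct_eager()
deferred_plan = construct_deferred()

# Execution happens on a machine where boto3 is available.
env["boto3_available"] = True

try:
    before = collect(eager_plan)
except PermissionError as err:
    before = f"failed: {err}"

after = collect(deferred_plan())  # provider is resolved at collection time
```

In this simulation the eager plan carries `None` to the execution machine and fails, while the deferred plan resolves its provider only where the dependency exists.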