forked from raystack/dagger
-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable Dagger Parquet Source feature using Ali OSS Service #49
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…t(cosn) storage services Given the configuration provided correctly. Set the below environment variables accordingly to access the files stored in the respective bucket. Ali(oss) - OSS_ACCESS_KEY_ID - OSS_ACCESS_KEY_SECRET Tencent(cos) - COS_SECRET_ID - COS_SECRET_KEY - COS_REGION
If you need to use COS filesystem for the dagger, provide the cos bucket/key configuration in the state.backend.fs.checkpointdir, state.savepoints.dir, high-availability.storageDir to flinkdeployment manifest. If the filesystem protocol begins with cosn for the above configurations, dagger uses the below configurations provided in the flinkdeployment manifest file. fs.cosn.impl: org.apache.hadoop.fs.CosFileSystem fs.AbstractFileSystem.cosn.impl: org.apache.hadoop.fs.CosN fs.cosn.userinfo.secretId: <secretID> fs.cosn.userinfo.secretKey: <secretKey> fs.cosn.bucket.region: <region> fs.cosn.bucket.endpoint_suffix: <tencent-provided-prefix.xyz.com>
Most of the client implementation including GCS, is not serializable, so fixed this issue by making client implementation not part of the serialization, and when the client is passed over wire and the client doesn't exist, it initializes as and when it is required. // In a distributed system, we don't intend the client to be serialized and most of the implementations like // GCP Storage implementation doesn't implement java.io.Serializable interface and you may see the below error // Caused by: org.apache.flink.api.common.InvalidProgramException: com.google.api.services.storage.Storage@1c666a8f // is not serializable. The object probably contains or references non serializable fields. // Caused by: java.io.NotSerializableException: com.google.api.services.storage.Storage
mayankrai09
approved these changes
Jan 23, 2025
Conflicts: dagger-common/build.gradle dagger-functions/src/main/java/com/gotocompany/dagger/functions/udfs/scalar/dart/store/DartDataStoreClientProvider.java
mayankrai09
approved these changes
Jan 23, 2025
rajuGT
pushed a commit
that referenced
this pull request
Jan 23, 2025
-- Enable Dagger Parquet Source feature using Ali OSS Service (#49) * Add gradle tasks to minimal and dependencies to maven local * Add capability to dagger to read python udfs from Ali(oss) and Tencent(cosn) storage services Given the configuration provided correctly. Set the below environment variables accordingly to access the files stored in the respective bucket. Ali(oss) - OSS_ACCESS_KEY_ID - OSS_ACCESS_KEY_SECRET Tencent(cos) - COS_SECRET_ID - COS_SECRET_KEY - COS_REGION * OSS client endpoint should be configurable via ENV variable * COS filesystem high availability support If you need to use COS filesystem for the dagger, provide the cos bucket/key configuration in the state.backend.fs.checkpointdir, state.savepoints.dir, high-availability.storageDir to flinkdeployment manifest. If the filesystem protocol begins with cosn for the above configurations, dagger uses the below configurations provided in the flinkdeployment manifest file. fs.cosn.impl: org.apache.hadoop.fs.CosFileSystem fs.AbstractFileSystem.cosn.impl: org.apache.hadoop.fs.CosN fs.cosn.userinfo.secretId: <secretID> fs.cosn.userinfo.secretKey: <secretKey> fs.cosn.bucket.region: <region> fs.cosn.bucket.endpoint_suffix: <tencent-provided-prefix.xyz.com> * Fix checkstyle and made constants as static variables * Refactor Dart Feature to plug other object storage service providers * test checkstyle fix * Dart Support for OSS Service Provider * fix checkstyle * Dart Support for COS Service Provider * Dart implementation fix - the object storage client aren't serializable Most of the client implementation including GCS, is not serializable, so fixed this issue by making client implementation not part of the serialization, and when the client is passed over wire and the client doesn't exist, it initializes as and when it is required. // In a distributed system, we don't intend the client to be serialized and most of the implementations like // GCP Storage implementation doesn't implement java.io.Serializable interface and you may see the below error // Caused by: org.apache.flink.api.common.InvalidProgramException: com.google.api.services.storage.Storage@1c666a8f // is not serializable. The object probably contains or references non serializable fields. // Caused by: java.io.NotSerializableException: com.google.api.services.storage.Storage * checkstyle fix * Add unit tests for DartDataStoreClientProvider * Enable Dagger Parquet Source feature using Ali OSS Service
rajuGT
added a commit
that referenced
this pull request
Jan 23, 2025
-- Enable Dagger Parquet Source feature using Ali OSS Service (#49) * Add gradle tasks to minimal and dependencies to maven local * Add capability to dagger to read python udfs from Ali(oss) and Tencent(cosn) storage services Given the configuration provided correctly. Set the below environment variables accordingly to access the files stored in the respective bucket. Ali(oss) - OSS_ACCESS_KEY_ID - OSS_ACCESS_KEY_SECRET Tencent(cos) - COS_SECRET_ID - COS_SECRET_KEY - COS_REGION * OSS client endpoint should be configurable via ENV variable * COS filesystem high availability support If you need to use COS filesystem for the dagger, provide the cos bucket/key configuration in the state.backend.fs.checkpointdir, state.savepoints.dir, high-availability.storageDir to flinkdeployment manifest. If the filesystem protocol begins with cosn for the above configurations, dagger uses the below configurations provided in the flinkdeployment manifest file. fs.cosn.impl: org.apache.hadoop.fs.CosFileSystem fs.AbstractFileSystem.cosn.impl: org.apache.hadoop.fs.CosN fs.cosn.userinfo.secretId: <secretID> fs.cosn.userinfo.secretKey: <secretKey> fs.cosn.bucket.region: <region> fs.cosn.bucket.endpoint_suffix: <tencent-provided-prefix.xyz.com> * Fix checkstyle and made constants as static variables * Refactor Dart Feature to plug other object storage service providers * test checkstyle fix * Dart Support for OSS Service Provider * fix checkstyle * Dart Support for COS Service Provider * Dart implementation fix - the object storage client aren't serializable Most of the client implementation including GCS, is not serializable, so fixed this issue by making client implementation not part of the serialization, and when the client is passed over wire and the client doesn't exist, it initializes as and when it is required. // In a distributed system, we don't intend the client to be serialized and most of the implementations like // GCP Storage implementation doesn't implement java.io.Serializable interface and you may see the below error // Caused by: org.apache.flink.api.common.InvalidProgramException: com.google.api.services.storage.Storage@1c666a8f // is not serializable. The object probably contains or references non serializable fields. // Caused by: java.io.NotSerializableException: com.google.api.services.storage.Storage * checkstyle fix * Add unit tests for DartDataStoreClientProvider * Enable Dagger Parquet Source feature using Ali OSS Service Co-authored-by: rajuGT <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
No description provided.