Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Apply obstore as storage backend #3033

Open
wants to merge 24 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
24 commits
Select commit Hold shift + click to select a range
58ba73c
feat: enable obstore for minio in local
machichima Dec 28, 2024
79ea46d
feat: enable obstore write to remote minio s3
machichima Dec 29, 2024
caaa657
feat: use obstore for gcs
machichima Dec 30, 2024
7ba66e2
feat: use obstore for azure blob storage (abfs)
machichima Jan 1, 2025
0ef7c05
fix: wrong file path for get_filesystem_for_path
machichima Jan 4, 2025
17bde4a
build(Dockerfile.dev): add obstore
machichima Jan 4, 2025
353f000
build(pyproject): add obstore
machichima Jan 5, 2025
7f0782a
fix: add storage specific obstore class
machichima Jan 5, 2025
04bdf20
fix: path error for some file source
machichima Jan 5, 2025
0189419
feat: enable setting retries for s3
machichima Jan 5, 2025
deb9f3d
test: modify test for obstore s3
machichima Jan 5, 2025
0ca6dbc
test: use mock patch for obstore test
machichima Jan 8, 2025
42cc75f
fix: use correct anon key for s3 and azure
machichima Jan 8, 2025
a1c99ec
fix: use defined DEFAULT_BLOCK_SIZE
machichima Jan 8, 2025
9c7e8db
build: update obstore to 0.3.0b10
machichima Jan 10, 2025
6fba41b
feat: lru_cache & enable read public bucket
machichima Jan 11, 2025
99f9ede
test: update test
machichima Jan 11, 2025
f317ff7
feat: override storage options for s3 and azxure
machichima Jan 12, 2025
749a7fe
test: update test
machichima Jan 12, 2025
31d8880
fix: update storage config
machichima Jan 15, 2025
9645dc7
fix: update obstore client_options
machichima Jan 23, 2025
5ac0766
refactor: comments + config format
machichima Jan 28, 2025
8a0f262
fix: add bucket to get_fsspec_storage_options
machichima Jan 28, 2025
4009f80
test: update test
machichima Jan 28, 2025
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions Dockerfile.dev
Original file line number Diff line number Diff line change
Expand Up @@ -40,6 +40,7 @@ RUN SETUPTOOLS_SCM_PRETEND_VERSION_FOR_FLYTEKIT=$PSEUDO_VERSION \
-e /flytekit \
-e /flytekit/plugins/flytekit-deck-standard \
-e /flytekit/plugins/flytekit-flyteinteractive \
obstore==0.3.0b9 \
markdown \
pandas \
pillow \
Expand Down
282 changes: 225 additions & 57 deletions flytekit/core/data_persistence.py

Large diffs are not rendered by default.

36 changes: 36 additions & 0 deletions flytekit/core/obstore_filesystem.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
"""
Classes that overrides the AsyncFsspecStore that specify the filesystem specific parameters
"""

from obstore.fsspec import AsyncFsspecStore

DEFAULT_BLOCK_SIZE = 5 * 2**20


class ObstoreS3FileSystem(AsyncFsspecStore):
"""
Add following property used in S3FileSystem
"""

root_marker = ""
blocksize = DEFAULT_BLOCK_SIZE
protocol = ("s3", "s3a")
_extra_tokenize_attributes = ("default_block_size",)


class ObstoreGCSFileSystem(AsyncFsspecStore):
"""
Add following property used in GCSFileSystem
"""

scopes = {"read_only", "read_write", "full_control"}
blocksize = DEFAULT_BLOCK_SIZE
protocol = "gcs", "gs"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing parentheses in protocol tuple definition

The protocol tuple definition is missing parentheses which could lead to incorrect protocol handling. Consider adding parentheses: protocol = ("gcs", "gs")

Code suggestion
Check the AI-generated fix before applying
Suggested change
protocol = "gcs", "gs"
protocol = ("gcs", "gs")

Code Review Run #ab65d8


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ensure protocol is a tuple

The protocol assignment for ObstoreGCSFileSystem should be enclosed in parentheses to ensure it is a tuple.

Code suggestion
Check the AI-generated fix before applying
Suggested change
protocol = "gcs", "gs"
protocol = ("gcs", "gs")

Code Review Run #52212e


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged



class ObstoreAzureBlobFileSystem(AsyncFsspecStore):
"""
Add following property used in AzureBlobFileSystem
"""

protocol = "abfs"
10 changes: 8 additions & 2 deletions flytekit/types/structured/basic_dfs.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@

from flytekit import FlyteContext, lazy_module, logger
from flytekit.configuration import DataConfig
from flytekit.core.data_persistence import get_fsspec_storage_options
from flytekit.core.data_persistence import get_fsspec_storage_options, split_path
from flytekit.models import literals
from flytekit.models.literals import StructuredDatasetMetadata
from flytekit.models.types import StructuredDatasetType
Expand Down Expand Up @@ -37,7 +37,13 @@ def get_pandas_storage_options(
from pandas.io.common import is_fsspec_url

if is_fsspec_url(uri):
return get_fsspec_storage_options(protocol=get_protocol(uri), data_config=data_config, anonymous=anonymous)
bucket, _ = split_path(uri)
return get_fsspec_storage_options(
protocol=get_protocol(uri),
data_config=data_config,
anonymous=anonymous,
bucket=bucket,
)

# Pandas does not allow storage_options for non-fsspec paths e.g. local.
return None
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@
import typing

from flytekit import FlyteContext, lazy_module
from flytekit.core.data_persistence import get_fsspec_storage_options
from flytekit.core.data_persistence import get_fsspec_storage_options, split_path
from flytekit.models import literals
from flytekit.models.literals import StructuredDatasetMetadata
from flytekit.models.types import StructuredDatasetType
Expand Down Expand Up @@ -91,10 +91,11 @@ def decode(
current_task_metadata: StructuredDatasetMetadata,
) -> pl.DataFrame:
uri = flyte_value.uri

bucket, _ = split_path(uri)
kwargs = get_fsspec_storage_options(
protocol=fsspec_utils.get_protocol(uri),
data_config=ctx.file_access.data_config,
bucket=bucket,
)
if current_task_metadata.structured_dataset_type and current_task_metadata.structured_dataset_type.columns:
columns = [c.name for c in current_task_metadata.structured_dataset_type.columns]
Expand Down Expand Up @@ -153,10 +154,11 @@ def decode(
current_task_metadata: StructuredDatasetMetadata,
) -> pl.LazyFrame:
uri = flyte_value.uri

bucket, _ = split_path(uri)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider validating bucket before usage

Consider validating that bucket is not empty before using it as a parameter. The split_path() function can return an empty string for the bucket in some cases.

Code suggestion
Check the AI-generated fix before applying
Suggested change
bucket, _ = split_path(uri)
bucket, _ = split_path(uri)
if not bucket:
raise ValueError(f"Could not determine bucket from URI: {uri}")

Code Review Run #39e27b


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged

kwargs = get_fsspec_storage_options(
protocol=fsspec_utils.get_protocol(uri),
data_config=ctx.file_access.data_config,
bucket=bucket,
Comment on lines +157 to +161
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider validating bucket name before use

Consider validating the bucket value before using it in the storage options. An empty or invalid bucket name could cause issues with storage access.

Code suggestion
Check the AI-generated fix before applying
 -        bucket, _ = split_path(uri)
 -        kwargs = get_fsspec_storage_options(
 -            protocol=fsspec_utils.get_protocol(uri),
 -            data_config=ctx.file_access.data_config,
 -            bucket=bucket,
 +        bucket, _ = split_path(uri)
 +        if not bucket:
 +            raise ValueError(f"Invalid or empty bucket name extracted from uri: {uri}")
 +        kwargs = get_fsspec_storage_options(
 +            protocol=fsspec_utils.get_protocol(uri),
 +            data_config=ctx.file_access.data_config,
 +            bucket=bucket,

Code Review Run #39e27b


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged

)
# use read_parquet instead of scan_parquet for now because scan_parquet currently doesn't work with fsspec:
# https://github.com/pola-rs/polars/issues/16737
Expand Down
1 change: 1 addition & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,7 @@ dependencies = [
"marshmallow-jsonschema>=0.12.0",
"mashumaro>=3.15",
"msgpack>=1.1.0",
"obstore==0.3.0b10",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider version range for obstore

The version pinning of obstore to 0.3.0b10 might limit flexibility. Consider using a version range to allow for updates.

Code suggestion
Check the AI-generated fix before applying
Suggested change
"obstore==0.3.0b10",
"obstore>=0.3.0b10,<0.4.0",

Code Review Run #52212e


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged

"protobuf!=4.25.0",
"pygments",
"python-json-logger>=2.0.0",
Expand Down
Loading