Skip to content

Add fsspec filesystem storage options #9257

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

lucmos
Copy link
Contributor

@lucmos lucmos commented Sep 1, 2021

What does this PR do?

Expose storage_options parameter in get_filesystem to configure the fsspec filesystem.

def get_filesystem(path: Union[str, Path], **storage_options) -> AbstractFileSystem:
    ...

e.g. this would enable the use of a minio instance.

Does your PR introduce any breaking changes? If yes, please list them.

Before submitting

  • Was this discussed via Slack (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Discussion

I think the best way to let the fsspec filesystem be configurable is an additional parameter storage_options, as in the fsspec.filestytem signature:

 fsspec.filesystem(protocol, **storage_options)

Thus the signature of get_filesystem would become:

def get_filesystem(path: Union[str, Path], **storage_options) -> AbstractFileSystem:
    ...

This requires a new storage_options parameter in the Trainer, and the propagation of that parameter to all the get_filesystem calls. I count 33 matches in 14 files for "get_filesystem", thus this MR would be changing many files.

If you agree this is the best way to go I can start doing that, otherwise I am happy to hear alternative solutions!

@justusschock @kaushikb11

@lucmos lucmos marked this pull request as draft September 1, 2021 15:07
@lucmos lucmos force-pushed the feature/fsspec-options branch from f6207ea to 36fe459 Compare September 1, 2021 15:25
@justusschock
Copy link
Member

@lucmos Thanks a lot!
Is this only necessary for checkpoints or for every kind of file io?

If it only involves checkpoints, I'd prefer to have this option as part of the Checkpointing callback.

We are currently reinvestigating all arguments at the trainer's init and therefore I am a bit hesitant to simply add a new one, since it will be hard to remove/deprecate it, once it is there.

@lucmos
Copy link
Contributor Author

lucmos commented Sep 1, 2021

@justusschock Thanks for the reply!

My use case is to configure the Trainer default_root_dir to use an s3 bucket, to store everything there instead of using the local storage. Ideally, I would like to not use local storage for non-temporary files.
It works, but only with "default" s3 buckets.

A self hosted minio instance, for example, requires a custom endpoint that currently cannot be configured.

I see your point about the Trainer (too) many init parameters, however, I feel like these storage_options should be tied together with the default_root_dir since the default root dir could be an s3 bucket (that, without the correct endpoint configured, would fail anywhere).

Maybe, to avoid adding another parameter we could promote the storage_root_dir to an object that handles local paths or remote paths together with their filesystem configuration?

@justusschock
Copy link
Member

not sure what others think about that idea, but we could make the following assumption: If default_root_dir is a string, we assume it to be a local directory, otherwise it can be a dict, providing the other options that are necessary to properly resolve the remote storage.

That being said, I am not sure, whether all implemented loggers do support something like that (the default_root_dir does affect them as well). Most of them handle that internally, so there is nothing we can do in that direction.

@lucmos
Copy link
Contributor Author

lucmos commented Sep 1, 2021

I see, I like this direction

Another similar option would be to move the responsibility of configuring the filesystem from lightning to the user.

default_root_dir could contain both the path (like now) and an instance of AbstractFileSystem from fsspec (a remote already configured by the user)

@justusschock
Copy link
Member

I like that even more. that's also what @tchaton suggested in slack :) we just need to test that this works with the loggers and everything then :)

@lucmos
Copy link
Contributor Author

lucmos commented Sep 1, 2021

Oh nice! I completely missed the message on slack.
Do you think it would be possible to specify a path together with the filesystem?

Ok then, probably tomorrow I will try to look into this. I'll ask here if I have any question 🙂

@Borda Borda changed the title Draft PR: Add fsspec filesystem storage options Add fsspec filesystem storage options Sep 1, 2021
@carmocca carmocca added this to the v1.5 milestone Sep 1, 2021
@carmocca carmocca added design Includes a design discussion feature Is an improvement or enhancement labels Sep 1, 2021
@tchaton
Copy link
Contributor

tchaton commented Sep 2, 2021

Oh nice! I completely missed the message on slack.
Do you think it would be possible to specify a path together with the filesystem?

Ok then, probably tomorrow I will try to look into this. I'll ask here if I have any question 🙂

Hey @lucmos,

We could support something like this. Do you want to contribute this feature ? It might be a bit of work, as we might add this to profiler, logger too and not sure how fs behaves with some path broadcasting we do across processes.

Anyhow, we will help you along the way :)

fs = fsspec.filesystem('ftp', host=host, port=port, username=user, password=pw)
trainer = Trainer(default_root_dir=fs)

Best,
T.C

@lucmos
Copy link
Contributor Author

lucmos commented Sep 2, 2021

Hello @tchaton,

Thank you for your message. I like the idea of letting the user configure the filesystem, however, I think only the fs is not enough as the default_root_dir

E.g., this is how you would access a minio bucket with fsspec:

s3: AbstractFileSystem = fsspec.filesystem(
    "s3",
    key="mykey",
    secret="mysecret",
    client_kwargs={"endpoint_url": "http://127.0.0.1:9000"},
)
s3.touch("/mybucket/my_folder/my_file.json")

Both the filesystem and a path (at least the bucket name) in that file system are needed. I did not find a way to merge these two information (fs and path) into a single Path-like object in fsspec

Do you think we could pass both information to the default_root_dir?

Something like

fs = fsspec.filesystem('ftp', host=host, port=port, username=user, password=pw)
fs_path = 'my/path/in/fs'
trainer = Trainer(default_root_dir={'filesystem': fs, 'path': fs_path})

@lucmos
Copy link
Contributor Author

lucmos commented Sep 2, 2021

I think the easiest and cleaner way to implement this is to introduce a new PathLike class LightningPath, that ties together a path and its fsspec filesystem. Ideally, it should be as much as possible a transparent drop-in replacement:

class LightningPath(PathLike):
    def __init__(self, path: Optional[Union[str, Path]], filesystem: Optional[AbstractFileSystem] = None):
        self.path = Path(path if path is not None else ".")
        self.filesystem: AbstractFileSystem = filesystem if filesystem is not None else LocalFileSystem()

    def __fspath__(self) -> str:
        return f"{self.filesystem.protocol}://{self.path}"

    def __truediv__(self, other: Union[str, Path]) -> "LightningPath":
        return LightningPath(self.path / other, filesystem=self.filesystem)

    def _prefix_path(self, url: Union[str, Path, Iterable[Union[str, Path]]]) -> Union[Path, Iterable[Path]]:
        if isinstance(url, str):
            return self.path / url
        elif isinstance(url, Path):
            return self.path, url
        elif isinstance(url, Iterable):
            return [self.path / x for x in url]
        assert False

    def open(self, path, *args, **kwargs):
        path = self._prefix_path(path)
        return self.filesystem.open(path, *args, **kwargs)

    def write(self, path, *args, **kwargs):
        path = self._prefix_path(path)
        return self.filesystem.write(path, *args, **kwargs)
        
    ... # todo: all other AbastractFileSystem methods delegations

It would:

  • Delegate all methods to the configured filesystem with the correct prefix.
  • Implement a PathLike interface to work with paths (e.g. the truedivide to join paths)

Thus, the default_root_dir parameter would have type default_root_dir: Optional[Union[str, LightningPath]]


I would have liked to subclass Path or AbstractFileSystem but it does not seem feasible


An example of usage would be:

s3: AbstractFileSystem = fsspec.filesystem(
    "s3",
    key="mykey",
    secret="mysecret",
    client_kwargs={"endpoint_url": "http://127.0.0.1:9000"},
)
Trainer(default_root_dir = LightningPath(path='/mybucket/', filesystem=s3))

What do you think about this?

As a side note, do you have any pointer to which tests need to be adapted? I see that many tests use a tmpdir as a default_root_dir.
Should we parametrize the tmpdir fixture to be both a temporary directory and a LightningPath? Or can we default to internally use a LightningPath in the Trainer, and leave the tests as they are now?


Relevant issue from fsspec fsspec/filesystem_spec#395

And possibly relevant "universal path"
https://github.com/Quansight/universal_pathlib

@tchaton
Copy link
Contributor

tchaton commented Sep 2, 2021

Curious to get @ananthsub thoughts on this proposal.

@awaelchli awaelchli modified the milestones: v1.5, v1.6 Nov 1, 2021
@tchaton
Copy link
Contributor

tchaton commented Nov 15, 2021

Hey @lucmos ,

I like the LightningPath proposal.

To answer your question, I would just use the LightningPath internally and add additional tests to make sure it works as expected with remote data. Possibly secret bucket on s3.

@Borda
Copy link
Member

Borda commented Nov 15, 2021

Something like

fs = fsspec.filesystem('ftp', host=host, port=port, username=user, password=pw)
fs_path = 'my/path/in/fs'
trainer = Trainer(default_root_dir={'filesystem': fs, 'path': fs_path})

would it make sense to have this FS setting experiment/run-wide meaning you would set it for all usage in PL

import pytorch_lightning
pytorch_lightning.fs = fsspec.filesystem('ftp', host=host, port=port, username=user, password=pw)

or (not sure how now) some kind of registry?

@carmocca
Copy link
Contributor

carmocca commented Nov 15, 2021

+1 to the LightningPath abstraction idea, but I would consider calling it FsspecPath as nothing is strictly lightning-related.

For this reason, could this be contributed to the fsspec library instead? It could be useful for all fsspec projects, not just Lightning.

would it make sense to have this FS setting experiment/run-wide meaning you would set it for all usage in PL

I would avoid the global variable. This would be very limiting if one needs to interact with multiple filesystems.

@tchaton
Copy link
Contributor

tchaton commented Feb 9, 2022

Hey @lucmos,

Any updates on this PR ?

@carmocca carmocca removed this from the 1.6 milestone Feb 10, 2022
@lucmos
Copy link
Contributor Author

lucmos commented Feb 22, 2022

Hey @lucmos,

Any updates on this PR ?

Hello, sorry for the late reply!

Unfortunately at the moment I do not have any bandwidth to continue this PR. Feel free to take over!

@stale
Copy link

stale bot commented Mar 10, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/generated/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.

@stale stale bot added the won't fix This will not be worked on label Mar 10, 2022
@stale
Copy link

stale bot commented Apr 2, 2022

This pull request is going to be closed. Please feel free to reopen it create a new from the actual master.

@stale stale bot closed this Apr 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
design Includes a design discussion feature Is an improvement or enhancement won't fix This will not be worked on
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants