Add fsspec filesystem storage options #9257
@lucmos Thanks a lot! If it only involves checkpoints, I'd prefer to have this option as part of the Checkpointing callback. We are currently reinvestigating all arguments at the trainer's init and therefore I am a bit hesitant to simply add a new one, since it will be hard to remove/deprecate it, once it is there. |
@justusschock Thanks for the reply! My use case is to configure the Trainer A self-hosted minio instance, for example, requires a custom endpoint that currently cannot be configured. I see your point about the Trainer's (too) many init parameters, however, I feel like these Maybe, to avoid adding another parameter we could promote the |
not sure what others think about that idea, but we could make the following assumption: If That being said, I am not sure, whether all implemented loggers do support something like that (the |
I see, I like this direction Another similar option would be to move the responsibility of configuring the filesystem from lightning to the user.
|
I like that even more. that's also what @tchaton suggested in slack :) we just need to test that this works with the loggers and everything then :) |
Oh nice! I completely missed the message on slack. Ok then, probably tomorrow I will try to look into this. I'll ask here if I have any question 🙂 |
Hey @lucmos, We could support something like this. Do you want to contribute this feature? It might be a bit of work, as we might add this to profiler, logger too and not sure how fs behaves with some path broadcasting we do across processes. Anyhow, we will help you along the way :)

    fs = fsspec.filesystem('ftp', host=host, port=port, username=user, password=pw)
    trainer = Trainer(default_root_dir=fs)

Best, |
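On the question of sending a filesystem object across processes: fsspec filesystem instances are designed to be picklable, so a configured instance can in principle be serialized and rebuilt elsewhere. A minimal sketch, assuming a local filesystem as a stand-in so that no remote server is needed; whether Lightning's own path broadcasting would handle such an object is not something this snippet establishes.

```python
import pickle

import fsspec

# Build a configured filesystem. `auto_mkdir` is a LocalFileSystem option,
# used here only so the snippet runs without any remote server.
fs = fsspec.filesystem("file", auto_mkdir=True)

# Round-trip through pickle, as would happen if the object were sent
# to another process.
restored = pickle.loads(pickle.dumps(fs))

# The rebuilt object has the same class and the same storage options.
same_type = type(restored) is type(fs)
same_options = restored.storage_options == fs.storage_options
```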
Hello @tchaton, Thank you for your message. I like the idea of letting the user configure the filesystem, however, I think only the fs is not enough as the E.g., this is how you would access a minio bucket with fsspec:

    s3: AbstractFileSystem = fsspec.filesystem(
        "s3",
        key="mykey",
        secret="mysecret",
        client_kwargs={"endpoint_url": "http://127.0.0.1:9000"},
    )
    s3.touch("/mybucket/my_folder/my_file.json")

Both the filesystem and a path (at least the bucket name) in that file system are needed. I did not find a way to merge these two pieces of information (fs and path) into a single Path-like object in Do you think we could pass both pieces of information to the Something like

    fs = fsspec.filesystem('ftp', host=host, port=port, username=user, password=pw)
    fs_path = 'my/path/in/fs'
    trainer = Trainer(default_root_dir={'filesystem': fs, 'path': fs_path})

|
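For reference, fsspec itself can produce the (filesystem, path) pair from a single URL plus storage options. A minimal sketch, assuming `fsspec.core.url_to_fs` is available in the installed fsspec version; the `memory://` protocol and the bucket and folder names are placeholders so the snippet runs without a server. For minio one would presumably pass an `s3://` URL together with `key`, `secret`, and `client_kwargs` as in the example above.

```python
from fsspec.core import url_to_fs

# url_to_fs splits a URL into a configured filesystem and the path inside it.
# Extra keyword arguments are forwarded to the filesystem as storage options;
# the in-memory filesystem used here needs none.
fs, root = url_to_fs("memory://mybucket/my_folder")

# Both pieces are needed to address a file: the filesystem performs the I/O,
# the path says where inside that filesystem the file lives.
target = f"{root}/my_file.json"
fs.touch(target)

created = fs.exists(target)
```

This keeps the two pieces separate, which is the same shape as the `{'filesystem': fs, 'path': fs_path}` dictionary suggested above.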
I think the easiest and cleanest way to implement this is to introduce a new PathLike class

    from os import PathLike
    from pathlib import Path
    from typing import Iterable, List, Optional, Union

    from fsspec.implementations.local import LocalFileSystem
    from fsspec.spec import AbstractFileSystem


    class LightningPath(PathLike):
        def __init__(self, path: Optional[Union[str, Path]], filesystem: Optional[AbstractFileSystem] = None):
            self.path = Path(path if path is not None else ".")
            self.filesystem: AbstractFileSystem = filesystem if filesystem is not None else LocalFileSystem()

        def __fspath__(self) -> str:
            # NOTE: `protocol` is a tuple for some filesystems (e.g. s3fs);
            # this sketch assumes it is a single string.
            return f"{self.filesystem.protocol}://{self.path}"

        def __truediv__(self, other: Union[str, Path]) -> "LightningPath":
            return LightningPath(self.path / other, filesystem=self.filesystem)

        def _prefix_path(self, url: Union[str, Path, Iterable[Union[str, Path]]]) -> Union[Path, List[Path]]:
            # Check str/Path first: a str is itself an Iterable.
            if isinstance(url, (str, Path)):
                return self.path / url
            if isinstance(url, Iterable):
                return [self.path / x for x in url]
            raise TypeError(f"Unsupported path type: {type(url)!r}")

        def open(self, path, *args, **kwargs):
            path = self._prefix_path(path)
            return self.filesystem.open(path, *args, **kwargs)

        def write(self, path, *args, **kwargs):
            # NOTE: AbstractFileSystem offers write_bytes/write_text; a plain
            # write() may not exist, so this delegation is illustrative.
            path = self._prefix_path(path)
            return self.filesystem.write(path, *args, **kwargs)

        ...  # todo: delegate all other AbstractFileSystem methods

It would:
Thus, the I would have liked to subclass An example of usage would be:

    s3: AbstractFileSystem = fsspec.filesystem(
        "s3",
        key="mykey",
        secret="mysecret",
        client_kwargs={"endpoint_url": "http://127.0.0.1:9000"},
    )
    Trainer(default_root_dir=LightningPath(path='/mybucket/', filesystem=s3))

What do you think about this? As a side note, do you have any pointers to which tests need to be adapted? I see that many tests use a
Relevant issue from fsspec: fsspec/filesystem_spec#395
And possibly relevant "universal path" |
Curious to get @ananthsub thoughts on this proposal. |
Hey @lucmos , I like the To answer your question, I would just use the |
Would it make sense to have this FS setting experiment/run-wide, meaning you would set it for all usage in PL

    import pytorch_lightning
    pytorch_lightning.fs = fsspec.filesystem('ftp', host=host, port=port, username=user, password=pw)

or (not sure how now) some kind of registry? |
+1 to the For this reason, could this be contributed to the
I would avoid the global variable. This would be very limiting if one needs to interact with multiple filesystems. |
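To make the registry idea concrete without a single module-level filesystem, here is a minimal sketch of a name-keyed registry that can hold several filesystems at once. Everything in it (`register_filesystem`, `get_registered_filesystem`, the names `"checkpoints"` and `"logs"`) is hypothetical and not part of pytorch_lightning or fsspec; plain dictionaries stand in for real filesystem objects so the snippet has no dependencies.

```python
from typing import Any, Callable, Dict

# Hypothetical helpers, not part of pytorch_lightning or fsspec.
_factories: Dict[str, Callable[[], Any]] = {}
_instances: Dict[str, Any] = {}


def register_filesystem(name: str, factory: Callable[[], Any]) -> None:
    """Associate `name` with a zero-argument factory for a filesystem object."""
    _factories[name] = factory
    _instances.pop(name, None)  # drop any instance built by an older factory


def get_registered_filesystem(name: str) -> Any:
    """Build the filesystem on first use and reuse it afterwards."""
    if name not in _instances:
        if name not in _factories:
            raise KeyError(f"no filesystem registered under {name!r}")
        _instances[name] = _factories[name]()
    return _instances[name]


# Stand-in factories; in practice each would call fsspec.filesystem(...)
# with its own storage options (e.g. one per minio endpoint).
register_filesystem("checkpoints", lambda: {"endpoint": "http://127.0.0.1:9000"})
register_filesystem("logs", lambda: {"endpoint": "http://127.0.0.1:9001"})

ckpt_fs = get_registered_filesystem("checkpoints")
logs_fs = get_registered_filesystem("logs")
```

Because entries are looked up by name, two differently configured filesystems can coexist, which is the limitation raised against the single global variable.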
Hey @lucmos, Any updates on this PR ? |
Hello, sorry for the late reply! Unfortunately at the moment I do not have any bandwidth to continue this PR. Feel free to take over! |
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/generated/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions. |
This pull request is going to be closed. Please feel free to reopen it or create a new one from the current master. |
What does this PR do?
Expose a storage_options parameter in get_filesystem to configure the fsspec filesystem. E.g., this would enable the use of a minio instance.
Does your PR introduce any breaking changes? If yes, please list them.
get_filesystem (https://github.com/PyTorchLightning/pytorch-lightning/blob/ccc83e717cd3472e10dfb6002f69054e418e73c6/pytorch_lightning/utilities/cloud_io.py#L41)
Calls to get_filesystem, s.t. they use the correct storage options
Before submitting
PR review
Discussion
I think the best way to let the fsspec filesystem be configurable is an additional parameter storage_options, as in the fsspec.filesystem signature:
Thus the signature of get_filesystem would become:
This requires a new storage_options parameter in the Trainer, and the propagation of that parameter to all the get_filesystem calls. I count 33 matches in 14 files for "get_filesystem", thus this MR would be changing many files.
If you agree this is the best way to go I can start doing that, otherwise I am happy to hear alternative solutions!
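As an illustration of the proposal, a `get_filesystem` that forwards `storage_options` could look roughly like the sketch below. The function body is an assumption written for this example, not the code in `cloud_io.py`; it relies only on `fsspec.filesystem(protocol, **storage_options)`. The local and in-memory filesystems are used so the snippet runs without a minio instance.

```python
from pathlib import Path
from typing import Any, Union

import fsspec
from fsspec.spec import AbstractFileSystem


def get_filesystem(path: Union[str, Path], **storage_options: Any) -> AbstractFileSystem:
    """Return the fsspec filesystem for `path`, configured with `storage_options`."""
    path = str(path)
    if "://" in path:
        # The protocol is the URL scheme, e.g. "s3" in "s3://bucket/key".
        protocol = path.split("://", 1)[0]
        return fsspec.filesystem(protocol, **storage_options)
    # No scheme: fall back to the local filesystem.
    return fsspec.filesystem("file", **storage_options)


# `auto_mkdir` is a LocalFileSystem option; for minio one would instead pass
# key, secret and client_kwargs={"endpoint_url": ...} with an "s3://" path.
local_fs = get_filesystem("/tmp/checkpoints", auto_mkdir=True)
memory_fs = get_filesystem("memory://mybucket/checkpoints")
```

Every existing caller would keep working unchanged, since calling the function without keyword arguments behaves as before; only callers that need a custom endpoint or credentials would pass options.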
@justusschock @kaushikb11