Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dropbox implementation + add existing implementation in the registry #207

Merged
merged 10 commits into from
Jan 7, 2020
Merged

Dropbox implementation + add existing implementation in the registry #207

merged 10 commits into from
Jan 7, 2020

Conversation

MarineChap
Copy link
Collaborator

Dropbox (partial) implementation : Only download 1 file at the time (tested with jpeg image)
issue #200

…file

Dropbox (partial) implementation : Only download 1 file at the time (tested with jpeg image)
@martindurant
Copy link
Member

Thank you for contributing!
Before proceeding, it seems to me that the URL link capability would allow for fetching parts of a file, with RANGE header keys like a HTTPFile. That would be very useful for chunking and parallel access, which we generally care about in fsspec.

I note that there is no file listing here, which does seem fairly easy to do with the dropbox API.

Also, the API allows for uploading in chunks via a session, so it would be nice to implement writing too.

Aside from that, documentation will be important here, such as the token parameter in __init__ - how is htis generated?

@MarineChap
Copy link
Collaborator Author

The token is the dropbox key generated by each user by adding an app in their account. Where I should add this documentation ? In class comment or in an error message in case of fail due to a missing token ?

I had the file listing but for the fetch range, it is really confusing. I don't fully understand how it works ... So, seems difficult to implement.

And finally, indeed, upload seems easy to implement and I already have a draft but I don't see how it is integrated in the filesystem class. I just need to overload the method _upload_chunk ?

@martindurant
Copy link
Member

In class comment or in an error message

In the class docs first, since it should be a required parameter, token=; for the error, you should surface the failure in a way that makes sense. So don't swallow the HTTP exception for a print, but let it bubble up.

the fetch range

From my brief reading, it seems that grabbing the URL and using the existing machinery of fsspec.implementations.http.HTTPFile is best. You would need to create a Session containing the credentials. The file size is already known from the metadata.

I just need to overload the method _upload_chunk

_upload_chunk writes data to an existing upload session and commits it when final is True, _initiate_upload creates such a session.

@MarineChap
Copy link
Collaborator Author

Hello,
I have finally understood how to do the download file with fetch_range. It is working but I am unsure how to proceed. I have copy-past the fetch_range, fetch_all, read method from the http implementation and not import them. I was thinking it is better to keep the implementation independent from each other, seems good for you ?
I have also implement the upload part.

marine chaput added 2 commits November 22, 2019 12:00
@martindurant
Copy link
Member

@TomAugspurger, @rabernat , you may be interested by this, another very nice piece linking to broadly-available public storage.

Copy link
Collaborator

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gave a quick skim, looks nice. Two questions:

  1. What's the policy on whether an implementation belongs in fsspec or not? This discussion likely doesn't belong in this PR, though it would determine how we proceed.
  2. Do we have a strategy for testing this?

)


class AllBytes(object):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most likely should implement this in fsspec.caching (or move the one from fsspec.implementations.http there).


"""

def __init__(self, **storage_options):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit clearer to have token as a required argument I think.

Suggested change
def __init__(self, **storage_options):
def __init__(self, token, *args, **storage_options):

but still pass it through to super().

"""

def __init__(
self, fs, dbx, path, session=None, block_size=None, mode="rb", **kwargs
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend matching the signature of the base class.

Suggested change
self, fs, dbx, path, session=None, block_size=None, mode="rb", **kwargs
self,
fs,
path,
mode="rb",
block_size="default",
autocommit=True,
cache_type="readahead",
cache_options=None,
*,
dbx
**kwargs

That uses a keyword only argument for dbx. Though I wonder: can you just get it from fs instead?

Modify the signature of the class to follow the base class
@MarineChap
Copy link
Collaborator Author

Thank you for your review! I followed your advices. (I didn't know the possibility to directly commit suggestions sorry I will do it next time).
For testing, I have a script which run main features but it is using my dropbox with my personal token. So I don't really know how to test it without using it.

@martindurant
Copy link
Member

Yeah, I'm not sure we have a good testing story we can bring to this one. It might have to be included as "use at your own risk". We have different testing strategies for different implementations:

  • s3fs uses an emulator of the storage service
  • ssh and some others use real servers deployed on docker
  • gcsfs uses an HTTP request recording library (painful to use)
  • gdrive uses travis environment secrets and tests against the real service (this requires PRs are only on in-repo branches, not forks)

@MarineChap
Copy link
Collaborator Author

Do you think it could be merge soon ? Can I do something else to allow this ?
And if it is merged, do you have a schedule for the next release, please ?

I have one of my new feature based on this implementation + intake module and I am waiting to release it. I would appreciate to have just an approximat' deadline. Like this, I can decide, whether or not, I should do a monkey patch in waiting. Thank you !!

Copy link
Member

@martindurant martindurant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a number of comments, most of which are easily addressed.

@TomAugspurger , how comfortable are we with including something alpha-quality and not tested in CI? I suggest we may be happy, so long as it says so in the documentation.

@@ -370,6 +370,16 @@ def __len__(self):
return len(self.cache)


class AllBytes(object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is copied from the http implementation, which seems unnecessary duplication. Indeed, if we can fetch ranges, then this is not required anymore at all

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my comment below about "copy-past".
Also, your colleague advises me to put this class in the caching file. If the http implementation is modified to used it from the catching module, it will remove the redundancy problem at least for this part of the code and it seems also more coherent. But as the pull request is not about the http implementation, I didn't wanted to modify it my-self here.
Tell me what you prefer about this please. :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that this is clearly a caching method and should be included in the caching module, although it is not particularly useful for most files, since it just keeps the whole thing in memory. Moving the def out of the http module now is fine - there are tests for that functionality that would fail if somehow moving the code went wrong.

list_item = self.dbx.files_list_folder_continue(list_item.cursor)
items = list_item.entries + items

if detail:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noting that this does not cache listings for future calls

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I try to look how it is done in other implementations but I don't see any difference... Can you give an hint about what you were expecting here please ?
Thanks :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some implementations populate a dirache attribute, which is a dict of {path: [file, file]} storing the listings that have been seen so far. It has proven tricky to use, but potentially speeds up repeat listings of the same directory. You are fine to proceed without it, but may get users later complaining that calls with glob and walk are slow.

configured for the FileSystem creating this file
"""
for key, value in kwargs.items():
print("{0} = {1}".format(key, value))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be removed


self.path = path
self.dbx = self.fs.dbx
while "//" in path:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while is not needed here, replace does the job in one go, unless you expect "//////"-like strings

kwargs = self.kwargs.copy()
headers = kwargs.pop("headers", {})
headers["Range"] = "bytes=%i-%i" % (start, end - 1)
r = self.session.get(self.url, headers=headers, stream=True, **kwargs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much of this seems to be copied from HTTPFile - which is OK, I guess, but less duplication would be good

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a previous comment, I asked you how to proceed with this "copy-past". I did not know if I should import http implementation to use this method and avoid redundancy or if copy-past was fine as the implementations should stay independent between them.
Do you see another way to go ? and if not, which solution do you prefer? Thanks

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe you could have subclassed HTTPFile successfully to reduce the amount of code here, but without trying it, I cannot be sure - you may have had to override many of the methods anyway. Initially I suspected that you could just use HTTPFile directly, passing the known size in the constructor.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, both methods are working. Except, that by subclassed HTTPFile, a little trick is needed to avoid error raise by the mode "wb"
=> send "rb" in super and set again the mode "wb" after (as well as 4 bool variables)
Like it makes the code little more dirty, I prefer the second solution (using the object HTTPFile object)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, well we don't like dirty code, so no need to change it.

@TomAugspurger
Copy link
Collaborator

Up to you I think, but my preference would be to have this in its own package.

Regradless of the decision here, we should define a policy for that.

…by using an instance of http to read files
@MarineChap
Copy link
Collaborator Author

Hello guys,
What do you think about the last version? Did you decide if you prefer a separate package or include the implementation in yours ?
Thanks for the feedback. :)

@martindurant
Copy link
Member

@TomAugspurger would I believe prefer a separate package, and he is currently the main maintainer here, so I defer to him. That means more work for you, unfortunately: you will need to form your module into a package and release on pypi and ideally conda-forge too. At least, it does mean that you can change the code as fast as you need and release whenever you want. In that case, I would reduce this PR to include the move of the AllBytes class (if desired) and make the registry point to the new location of the implementation. Note that dropboxfs is already taken (that code may also be useful here).

@MarineChap
Copy link
Collaborator Author

Ok. No problem for the additional work, I am always happy to try to do new stuffs!
For what I can see, dropboxfs is now maintained here. And the good point is that it implements some tests!
But in any case, it won't be a problem as I called my module dropboxdrivefs and I didn't find any other occurrences in pypi and conda-forge.
Thanks you for your help !

@MarineChap
Copy link
Collaborator Author

Hello! Holiday are finished and I finally upload my package in pypi (here).
Do you think you will be able to merge this branch any time soon?

Also, I try to do a patch monkey but in fact I have no idea how to do it.
Could you give some hints please?
Thanks
Happy new year ! :)

@martindurant
Copy link
Member

I try to do a patch monkey

Sorry, what are you trying to patch?

@MarineChap
Copy link
Collaborator Author

Sorry, I was not clear. In waiting to you, to eventually merge this branch and release a new version of your package, I was trying to implement a monkey patch to make it already available for my colleagues in our environment. (basically, modify the registry file to add the dropbox implementation class). My manager advises me to it but it is the first time I am trying to do something like that. ^^

@martindurant
Copy link
Member

Oh I see. The following would be enough:

from fsspec.registry import registry
import dropboxfilefs
registry["dropbox"] = dropboxdrivefs.DropboxDriveFileSystem

...but I will merge this now. Releases are pretty frequent, but I can't promise a specific date.

By the way, you may want to raise a new issue in this repo announcing the availability of your repo and pypi package, so that people can comment.

@martindurant martindurant merged commit c325a40 into fsspec:master Jan 7, 2020
@MarineChap
Copy link
Collaborator Author

Nice! Thank you for your help, I learn a lot! It is a good start for the year, my first PR merged ! :)

About the patch monkey, I have already tried that but as I am working directly with intake module (and not fsspec) it does not work for some reasons. But as you merged it, it is enough for me to develop the next step of my project on top of that in waiting for the release.
I will create the issue as you advice it.

@MarineChap MarineChap deleted the dropbox_implementation branch January 7, 2020 14:38
@MarineChap MarineChap restored the dropbox_implementation branch January 7, 2020 14:42
@efiop efiop mentioned this pull request Mar 8, 2021
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants