-
Notifications
You must be signed in to change notification settings - Fork 373
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dropbox implementation + add existing implementation in the registry #207
Dropbox implementation + add existing implementation in the registry #207
Conversation
…file Dropbox (partial) implementation : Only download 1 file at the time (tested with jpeg image)
Thank you for contributing! I note that there is no file listing here, which does seem fairly easy to do with the dropbox API. Also, the API allows for uploading in chunks via a session, so it would be nice to implement writing too. Aside from that, documentation will be important here, such as the |
The token is the dropbox key generated by each user by adding an app in their account. Where I should add this documentation ? In class comment or in an error message in case of fail due to a missing token ? I had the file listing but for the fetch range, it is really confusing. I don't fully understand how it works ... So, seems difficult to implement. And finally, indeed, upload seems easy to implement and I already have a draft but I don't see how it is integrated in the filesystem class. I just need to overload the method _upload_chunk ? |
In the class docs first, since it should be a required parameter,
From my brief reading, it seems that grabbing the URL and using the existing machinery of fsspec.implementations.http.HTTPFile is best. You would need to create a
|
Hello, |
ask for install the request module as well as the dropbox module
@TomAugspurger, @rabernat , you may be interested by this, another very nice piece linking to broadly-available public storage. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gave a quick skim, looks nice. Two questions:
- What's the policy on whether an implementation belongs in fsspec or not? This discussion likely doesn't belong in this PR, though it would determine how we proceed.
- Do we have a strategy for testing this?
) | ||
|
||
|
||
class AllBytes(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most likely should implement this in fsspec.caching
(or move the one from fsspec.implementations.http there).
|
||
""" | ||
|
||
def __init__(self, **storage_options): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit clearer to have token as a required argument I think.
def __init__(self, **storage_options): | |
def __init__(self, token, *args, **storage_options): |
but still pass it through to super()
.
""" | ||
|
||
def __init__( | ||
self, fs, dbx, path, session=None, block_size=None, mode="rb", **kwargs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd recommend matching the signature of the base class.
self, fs, dbx, path, session=None, block_size=None, mode="rb", **kwargs | |
self, | |
fs, | |
path, | |
mode="rb", | |
block_size="default", | |
autocommit=True, | |
cache_type="readahead", | |
cache_options=None, | |
*, | |
dbx | |
**kwargs |
That uses a keyword only argument for dbx. Though I wonder: can you just get it from fs
instead?
Modify the signature of the class to follow the base class
Thank you for your review! I followed your advices. (I didn't know the possibility to directly commit suggestions sorry I will do it next time). |
Yeah, I'm not sure we have a good testing story we can bring to this one. It might have to be included as "use at your own risk". We have different testing strategies for different implementations:
|
Do you think it could be merge soon ? Can I do something else to allow this ? I have one of my new feature based on this implementation + intake module and I am waiting to release it. I would appreciate to have just an approximat' deadline. Like this, I can decide, whether or not, I should do a monkey patch in waiting. Thank you !! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a number of comments, most of which are easily addressed.
@TomAugspurger , how comfortable are we with including something alpha-quality and not tested in CI? I suggest we may be happy, so long as it says so in the documentation.
@@ -370,6 +370,16 @@ def __len__(self): | |||
return len(self.cache) | |||
|
|||
|
|||
class AllBytes(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is copied from the http implementation, which seems unnecessary duplication. Indeed, if we can fetch ranges, then this is not required anymore at all
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see my comment below about "copy-past".
Also, your colleague advises me to put this class in the caching file. If the http implementation is modified to used it from the catching module, it will remove the redundancy problem at least for this part of the code and it seems also more coherent. But as the pull request is not about the http implementation, I didn't wanted to modify it my-self here.
Tell me what you prefer about this please. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that this is clearly a caching method and should be included in the caching module, although it is not particularly useful for most files, since it just keeps the whole thing in memory. Moving the def out of the http module now is fine - there are tests for that functionality that would fail if somehow moving the code went wrong.
list_item = self.dbx.files_list_folder_continue(list_item.cursor) | ||
items = list_item.entries + items | ||
|
||
if detail: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noting that this does not cache listings for future calls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I try to look how it is done in other implementations but I don't see any difference... Can you give an hint about what you were expecting here please ?
Thanks :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some implementations populate a dirache
attribute, which is a dict of {path: [file, file]} storing the listings that have been seen so far. It has proven tricky to use, but potentially speeds up repeat listings of the same directory. You are fine to proceed without it, but may get users later complaining that calls with glob
and walk
are slow.
configured for the FileSystem creating this file | ||
""" | ||
for key, value in kwargs.items(): | ||
print("{0} = {1}".format(key, value)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be removed
|
||
self.path = path | ||
self.dbx = self.fs.dbx | ||
while "//" in path: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while
is not needed here, replace
does the job in one go, unless you expect "//////"-like strings
kwargs = self.kwargs.copy() | ||
headers = kwargs.pop("headers", {}) | ||
headers["Range"] = "bytes=%i-%i" % (start, end - 1) | ||
r = self.session.get(self.url, headers=headers, stream=True, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much of this seems to be copied from HTTPFile - which is OK, I guess, but less duplication would be good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a previous comment, I asked you how to proceed with this "copy-past". I did not know if I should import http implementation to use this method and avoid redundancy or if copy-past was fine as the implementations should stay independent between them.
Do you see another way to go ? and if not, which solution do you prefer? Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe you could have subclassed HTTPFile successfully to reduce the amount of code here, but without trying it, I cannot be sure - you may have had to override many of the methods anyway. Initially I suspected that you could just use HTTPFile directly, passing the known size in the constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, both methods are working. Except, that by subclassed HTTPFile, a little trick is needed to avoid error raise by the mode "wb"
=> send "rb" in super and set again the mode "wb" after (as well as 4 bool variables)
Like it makes the code little more dirty, I prefer the second solution (using the object HTTPFile object)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, well we don't like dirty code, so no need to change it.
Up to you I think, but my preference would be to have this in its own package. Regradless of the decision here, we should define a policy for that. |
…by using an instance of http to read files
Hello guys, |
@TomAugspurger would I believe prefer a separate package, and he is currently the main maintainer here, so I defer to him. That means more work for you, unfortunately: you will need to form your module into a package and release on pypi and ideally conda-forge too. At least, it does mean that you can change the code as fast as you need and release whenever you want. In that case, I would reduce this PR to include the move of the AllBytes class (if desired) and make the registry point to the new location of the implementation. Note that dropboxfs is already taken (that code may also be useful here). |
Ok. No problem for the additional work, I am always happy to try to do new stuffs! |
Hello! Holiday are finished and I finally upload my package in pypi (here). Also, I try to do a patch monkey but in fact I have no idea how to do it. |
Sorry, what are you trying to patch? |
Sorry, I was not clear. In waiting to you, to eventually merge this branch and release a new version of your package, I was trying to implement a monkey patch to make it already available for my colleagues in our environment. (basically, modify the registry file to add the dropbox implementation class). My manager advises me to it but it is the first time I am trying to do something like that. ^^ |
Oh I see. The following would be enough:
...but I will merge this now. Releases are pretty frequent, but I can't promise a specific date. By the way, you may want to raise a new issue in this repo announcing the availability of your repo and pypi package, so that people can comment. |
Nice! Thank you for your help, I learn a lot! It is a good start for the year, my first PR merged ! :) About the patch monkey, I have already tried that but as I am working directly with intake module (and not fsspec) it does not work for some reasons. But as you merged it, it is enough for me to develop the next step of my project on top of that in waiting for the release. |
Dropbox (partial) implementation : Only download 1 file at the time (tested with jpeg image)
issue #200