Dropbox implementation + add existing implementation in the registry #207

MarineChap · 2019-11-21T12:43:19Z

Dropbox (partial) implementation : Only download 1 file at the time (tested with jpeg image)
issue #200

…file Dropbox (partial) implementation : Only download 1 file at the time (tested with jpeg image)

martindurant · 2019-11-21T15:03:40Z

Thank you for contributing!
Before proceeding, it seems to me that the URL link capability would allow for fetching parts of a file, with RANGE header keys like a HTTPFile. That would be very useful for chunking and parallel access, which we generally care about in fsspec.

I note that there is no file listing here, which does seem fairly easy to do with the dropbox API.

Also, the API allows for uploading in chunks via a session, so it would be nice to implement writing too.

Aside from that, documentation will be important here, such as the token parameter in __init__ - how is htis generated?

MarineChap · 2019-11-21T20:03:53Z

The token is the dropbox key generated by each user by adding an app in their account. Where I should add this documentation ? In class comment or in an error message in case of fail due to a missing token ?

I had the file listing but for the fetch range, it is really confusing. I don't fully understand how it works ... So, seems difficult to implement.

And finally, indeed, upload seems easy to implement and I already have a draft but I don't see how it is integrated in the filesystem class. I just need to overload the method _upload_chunk ?

martindurant · 2019-11-21T20:25:08Z

In class comment or in an error message

In the class docs first, since it should be a required parameter, token=; for the error, you should surface the failure in a way that makes sense. So don't swallow the HTTP exception for a print, but let it bubble up.

the fetch range

From my brief reading, it seems that grabbing the URL and using the existing machinery of fsspec.implementations.http.HTTPFile is best. You would need to create a Session containing the credentials. The file size is already known from the metadata.

I just need to overload the method _upload_chunk

_upload_chunk writes data to an existing upload session and commits it when final is True, _initiate_upload creates such a session.

MarineChap · 2019-11-22T10:47:12Z

Hello,
I have finally understood how to do the download file with fetch_range. It is working but I am unsure how to proceed. I have copy-past the fetch_range, fetch_all, read method from the http implementation and not import them. I was thinking it is better to keep the implementation independent from each other, seems good for you ?
I have also implement the upload part.

ask for install the request module as well as the dropbox module

martindurant · 2019-11-22T19:22:21Z

@TomAugspurger, @rabernat , you may be interested by this, another very nice piece linking to broadly-available public storage.

TomAugspurger

Gave a quick skim, looks nice. Two questions:

What's the policy on whether an implementation belongs in fsspec or not? This discussion likely doesn't belong in this PR, though it would determine how we proceed.
Do we have a strategy for testing this?

TomAugspurger · 2019-11-22T21:12:48Z

fsspec/implementations/dropboxdrivefs.py

+        )
+
+
+class AllBytes(object):


Most likely should implement this in fsspec.caching (or move the one from fsspec.implementations.http there).

TomAugspurger · 2019-11-22T21:16:00Z

fsspec/implementations/dropboxdrivefs.py

+
+    """
+
+    def __init__(self, **storage_options):


A bit clearer to have token as a required argument I think.

Suggested change

def __init__(self, **storage_options):

def __init__(self, token, *args, **storage_options):

but still pass it through to super().

TomAugspurger · 2019-11-22T21:17:51Z

fsspec/implementations/dropboxdrivefs.py

+    """
+
+    def __init__(
+        self, fs, dbx, path, session=None, block_size=None, mode="rb", **kwargs


I'd recommend matching the signature of the base class.

Suggested change

self, fs, dbx, path, session=None, block_size=None, mode="rb", **kwargs

self,

fs,

path,

mode="rb",

block_size="default",

autocommit=True,

cache_type="readahead",

cache_options=None,

*,

dbx

**kwargs

That uses a keyword only argument for dbx. Though I wonder: can you just get it from fs instead?

Modify the signature of the class to follow the base class

MarineChap · 2019-11-24T16:29:54Z

Thank you for your review! I followed your advices. (I didn't know the possibility to directly commit suggestions sorry I will do it next time).
For testing, I have a script which run main features but it is using my dropbox with my personal token. So I don't really know how to test it without using it.

martindurant · 2019-11-25T15:59:16Z

Yeah, I'm not sure we have a good testing story we can bring to this one. It might have to be included as "use at your own risk". We have different testing strategies for different implementations:

s3fs uses an emulator of the storage service
ssh and some others use real servers deployed on docker
gcsfs uses an HTTP request recording library (painful to use)
gdrive uses travis environment secrets and tests against the real service (this requires PRs are only on in-repo branches, not forks)

MarineChap · 2019-11-27T14:33:16Z

Do you think it could be merge soon ? Can I do something else to allow this ?
And if it is merged, do you have a schedule for the next release, please ?

I have one of my new feature based on this implementation + intake module and I am waiting to release it. I would appreciate to have just an approximat' deadline. Like this, I can decide, whether or not, I should do a monkey patch in waiting. Thank you !!

martindurant

I added a number of comments, most of which are easily addressed.

@TomAugspurger , how comfortable are we with including something alpha-quality and not tested in CI? I suggest we may be happy, so long as it says so in the documentation.

martindurant · 2019-11-27T14:37:35Z

fsspec/caching.py

@@ -370,6 +370,16 @@ def __len__(self):
        return len(self.cache)


+class AllBytes(object):


This is copied from the http implementation, which seems unnecessary duplication. Indeed, if we can fetch ranges, then this is not required anymore at all

see my comment below about "copy-past".
Also, your colleague advises me to put this class in the caching file. If the http implementation is modified to used it from the catching module, it will remove the redundancy problem at least for this part of the code and it seems also more coherent. But as the pull request is not about the http implementation, I didn't wanted to modify it my-self here.
Tell me what you prefer about this please. :)

Agree that this is clearly a caching method and should be included in the caching module, although it is not particularly useful for most files, since it just keeps the whole thing in memory. Moving the def out of the http module now is fine - there are tests for that functionality that would fail if somehow moving the code went wrong.

fsspec/implementations/dropboxdrivefs.py

martindurant · 2019-11-27T14:42:53Z

fsspec/implementations/dropboxdrivefs.py

+            list_item = self.dbx.files_list_folder_continue(list_item.cursor)
+            items = list_item.entries + items
+
+        if detail:


Noting that this does not cache listings for future calls

I try to look how it is done in other implementations but I don't see any difference... Can you give an hint about what you were expecting here please ?
Thanks :)

Some implementations populate a dirache attribute, which is a dict of {path: [file, file]} storing the listings that have been seen so far. It has proven tricky to use, but potentially speeds up repeat listings of the same directory. You are fine to proceed without it, but may get users later complaining that calls with glob and walk are slow.

martindurant · 2019-11-27T14:43:30Z

fsspec/implementations/dropboxdrivefs.py

+            configured for the FileSystem creating this file
+        """
+        for key, value in kwargs.items():
+            print("{0} = {1}".format(key, value))


This should be removed

martindurant · 2019-11-27T14:43:59Z

fsspec/implementations/dropboxdrivefs.py

+
+        self.path = path
+        self.dbx = self.fs.dbx
+        while "//" in path:


while is not needed here, replace does the job in one go, unless you expect "//////"-like strings

martindurant · 2019-11-27T14:46:14Z

fsspec/implementations/dropboxdrivefs.py

+        kwargs = self.kwargs.copy()
+        headers = kwargs.pop("headers", {})
+        headers["Range"] = "bytes=%i-%i" % (start, end - 1)
+        r = self.session.get(self.url, headers=headers, stream=True, **kwargs)


Much of this seems to be copied from HTTPFile - which is OK, I guess, but less duplication would be good

In a previous comment, I asked you how to proceed with this "copy-past". I did not know if I should import http implementation to use this method and avoid redundancy or if copy-past was fine as the implementations should stay independent between them.
Do you see another way to go ? and if not, which solution do you prefer? Thanks

I believe you could have subclassed HTTPFile successfully to reduce the amount of code here, but without trying it, I cannot be sure - you may have had to override many of the methods anyway. Initially I suspected that you could just use HTTPFile directly, passing the known size in the constructor.

Indeed, both methods are working. Except, that by subclassed HTTPFile, a little trick is needed to avoid error raise by the mode "wb"
=> send "rb" in super and set again the mode "wb" after (as well as 4 bool variables)
Like it makes the code little more dirty, I prefer the second solution (using the object HTTPFile object)

OK, well we don't like dirty code, so no need to change it.

TomAugspurger · 2019-11-27T14:57:10Z

Up to you I think, but my preference would be to have this in its own package.

Regradless of the decision here, we should define a policy for that.

…by using an instance of http to read files

MarineChap · 2019-12-03T07:39:26Z

Hello guys,
What do you think about the last version? Did you decide if you prefer a separate package or include the implementation in yours ?
Thanks for the feedback. :)

martindurant · 2019-12-03T15:29:15Z

@TomAugspurger would I believe prefer a separate package, and he is currently the main maintainer here, so I defer to him. That means more work for you, unfortunately: you will need to form your module into a package and release on pypi and ideally conda-forge too. At least, it does mean that you can change the code as fast as you need and release whenever you want. In that case, I would reduce this PR to include the move of the AllBytes class (if desired) and make the registry point to the new location of the implementation. Note that dropboxfs is already taken (that code may also be useful here).

MarineChap · 2019-12-04T21:55:47Z

Ok. No problem for the additional work, I am always happy to try to do new stuffs!
For what I can see, dropboxfs is now maintained here. And the good point is that it implements some tests!
But in any case, it won't be a problem as I called my module dropboxdrivefs and I didn't find any other occurrences in pypi and conda-forge.
Thanks you for your help !

MarineChap · 2020-01-07T13:10:24Z

Hello! Holiday are finished and I finally upload my package in pypi (here).
Do you think you will be able to merge this branch any time soon?

Also, I try to do a patch monkey but in fact I have no idea how to do it.
Could you give some hints please?
Thanks
Happy new year ! :)

martindurant · 2020-01-07T14:13:24Z

I try to do a patch monkey

Sorry, what are you trying to patch?

MarineChap · 2020-01-07T14:23:27Z

Sorry, I was not clear. In waiting to you, to eventually merge this branch and release a new version of your package, I was trying to implement a monkey patch to make it already available for my colleagues in our environment. (basically, modify the registry file to add the dropbox implementation class). My manager advises me to it but it is the first time I am trying to do something like that. ^^

martindurant · 2020-01-07T14:29:50Z

Oh I see. The following would be enough:

from fsspec.registry import registry
import dropboxfilefs
registry["dropbox"] = dropboxdrivefs.DropboxDriveFileSystem

...but I will merge this now. Releases are pretty frequent, but I can't promise a specific date.

By the way, you may want to raise a new issue in this repo announcing the availability of your repo and pypi package, so that people can comment.

MarineChap · 2020-01-07T14:38:30Z

Nice! Thank you for your help, I learn a lot! It is a good start for the year, my first PR merged ! :)

About the patch monkey, I have already tried that but as I am working directly with intake module (and not fsspec) it does not work for some reasons. But as you merged it, it is enough for me to develop the next step of my project on top of that in waiting for the release.
I will create the issue as you advice it.

Dropbox implementation + add existing implementation in the registry …

ba37f48

…file Dropbox (partial) implementation : Only download 1 file at the time (tested with jpeg image)

dropbox implementation : file listing and info method

1e84ce2

Dropbox implementation: upload file

5f862ff

marine chaput added 2 commits November 22, 2019 12:00

dropbox implementation: dowload with fetching part of the file

aa6517f

Update the error message for dropbox implementation

4b06da3

ask for install the request module as well as the dropbox module

TomAugspurger reviewed Nov 22, 2019

View reviewed changes

Transfer AllBytes in caching + refactor the dropbox implementation

fda2e8d

Modify the signature of the class to follow the base class

martindurant reviewed Nov 27, 2019

View reviewed changes

update http by using allBytes in catching.py + update dropboxdrivefs …

8149569

…by using an instance of http to read files

remove dropboxdrivefs to create a separate package

7f6a462

marine chaput added 2 commits December 4, 2019 22:57

Modify class name and error message for the dropbox implementation

d193f04

fix lint

5577251

martindurant merged commit c325a40 into fsspec:master Jan 7, 2020

MarineChap deleted the dropbox_implementation branch January 7, 2020 14:38

MarineChap restored the dropbox_implementation branch January 7, 2020 14:42

efiop mentioned this pull request Mar 8, 2021

Create Dropbox remote iterative/dvc#4793

Closed

5 tasks

	def __init__(self, **storage_options):
	def __init__(self, token, args, *storage_options):

-        self, fs, dbx, path, session=None, block_size=None, mode="rb", **kwargs
+        self,
+        fs,
+        path,
+        mode="rb",
+        block_size="default",
+        autocommit=True,
+        cache_type="readahead",
+        cache_options=None,
+        *,
+        dbx
+        **kwargs

		@@ -370,6 +370,16 @@ def __len__(self):
		return len(self.cache)


		class AllBytes(object):

Dropbox implementation + add existing implementation in the registry #207

Dropbox implementation + add existing implementation in the registry #207

Conversation

MarineChap commented Nov 21, 2019

martindurant commented Nov 21, 2019

MarineChap commented Nov 21, 2019

martindurant commented Nov 21, 2019

MarineChap commented Nov 22, 2019

martindurant commented Nov 22, 2019

TomAugspurger left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MarineChap commented Nov 24, 2019

martindurant commented Nov 25, 2019

MarineChap commented Nov 27, 2019

martindurant left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TomAugspurger commented Nov 27, 2019

MarineChap commented Dec 3, 2019

martindurant commented Dec 3, 2019

MarineChap commented Dec 4, 2019

MarineChap commented Jan 7, 2020

martindurant commented Jan 7, 2020

MarineChap commented Jan 7, 2020

martindurant commented Jan 7, 2020

MarineChap commented Jan 7, 2020