Create Dropbox remote #4793

gmrukwa · 2020-10-27T07:57:34Z

❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.

Don't have time to play with that for the next few days, but uploading to make it visible already. Will polish that soon and:

create docs update
add tests
discuss the app configuration for the API - you guys need to create your own app on Dropbox for the API usage to keep control over it, probably @efiop would be the proper person

Resolves #1277

gmrukwa · 2020-10-27T07:59:00Z

@efiop, just pinging to let you know the direction this issue is going.

efiop · 2020-10-27T13:46:43Z

Thank you @gmrukwa ! 🙏 At the first glance, this looks good and the direction looks correct too, you can safely continue with the rest of the points. We'll need some time for a proper review though, we currently have a lot on our plate, so thank you in advance for your patience 🙂

gmrukwa · 2020-10-31T01:36:26Z

Okay guys, I've created docs, tests and the last element to do is to provide the details of the Dropbox API connection - this is to be done by you. Not sure how you want to secure the App Key and App Secret, but here is the spot.

You need to go to the Dropbox App Console, create your application (permission and scope details here). Dropbox allows you to apply for production usage as soon as you have 50 users linked to your app, but there's some possibility to have the acceptance in the eager mode.

I didn't go with such a level of complexity as the Google Drive remote is done, as there's no need to give user an option to configure separate app key and secret at this stage since the Dropbox API has no limits specified publicly.

shcheklein · 2020-11-06T00:27:33Z

dvc/tree/dropbox.py

gmrukwa · 2020-11-10T20:55:02Z

There's an issue with dependencies for S3 mocks in Python < 3.8 after rebasing onto upstream changes.

shcheklein · 2020-11-13T08:07:45Z

Regarding the rate limits: https://www.dropbox.com/lp/developers/reference/dbx-performance-guide

It looks like we need to test on parallel upload of many files (that's what DVC does by default anyway). It might be very unstable if we don't handle 429 properly and do not respect Retry-after.

We can do parallel uploads, but it might require changing the remote class interface.

pared · 2020-11-13T10:32:57Z

@gmrukwa could you rebase? Dependencies problems should be fixed on the current master.

shcheklein · 2020-11-14T20:37:04Z

Testing it on a bunch of images. This is extremely slow for me for now- is it related to the rate limits I shared above?

Again, we can compare with rclone, I think.

shcheklein · 2020-11-14T21:41:15Z

Getting an exception now on uploading a dir (it got interrupted in the middle) . Now I can't recover from this.

ERROR: failed to upload '../../../../tmp/cache/f7/a911d2e8b3665d3656107a1e054e8f' to 'dropbox://my-project/f7/a911d2e8b3665d3656107a1e054e8f' - ApiError('00d392019d75463ba17df2870ee166ee', UploadSessionFinishError('lookup_failed', UploadSessionLookupError('incorrect_offset', UploadSessionOffsetError(correct_offset=6007346))))
ERROR: failed to upload '../../../../tmp/cache/05/be03208df9c660e9bc0e4752c221ad.dir' to 'dropbox://my-project/05/be03208df9c660e9bc0e4752c221ad.dir'
ERROR: failed to push data to the cloud - 2 files failed to upload

shcheklein · 2020-11-22T00:02:34Z

@gmrukwa hey! could you please take a look at those issue we are getting?

gmrukwa · 2020-11-22T09:58:02Z

@shcheklein absolutely - I just need to sort few more things out and hopefully today I will be able to work on that.

gmrukwa · 2020-11-28T02:19:53Z

@shcheklein, I do have everything according to the guide, especially that they embedded most of these considerations into the SDK. I've checked and the upload is done in very similar manner in rclone (large file upload, small vs large file upload switch). They just return the error code instead of purposeful ApiError being released here. I've prepared a test for the scenario you described, but it's probably a standard issue with the Dropbox API.

On the other hand, Dropbox has some problems when running as desktop client with tons of small files as well. The issue probably is related to the fact, that they "lock the namespace" on each file upload. There's no file-lock, but directory-lock each time.

For tons of small files the upload speed is at the level of 300-500 KB / s, for few big ones it reaches 12-14 MB / s smoothly. I played with the file sizes, chunk sizes and number of jobs per upload. File size is the most important factor, number of jobs doesn't allow for any significant improvement here, neither chunk size does. Especially that chunk size is supposed to be a multiple of 4 MB. Under certain size it doesn't matter if the file is 5 KB or 100 KB - the locking process on Dropbox side lasts long enough to make it irrelevant.

Could you provide some more details on how you ended up in a non-recoverable state? I cannot reproduce that (my approaches already included as skipped tests).

Remaining elements you mentioned:

Test dvc gc: dvc gc -c -w removes files from remote.
Support multiple remotes- credentials path variable, similar to GDrive - it is supported via DROPBOX_CREDENTIALS_FILE env variable. Tested that with Dropbox Teams account to not to overwrite my existing creds.
Check if we need to do something special in advance for Teams
Rate limits - retry? - check rclone. - As described above, the problem is not with the rate limit, but the directory locking mechanism.
Test import-url, external deps throw NotImplemented
Check exceptions handling- files not found, general errors. Consult with @efiop on what should we unify. At minimum it's probably better to warp all "expected" exceptions into a DVC-specific one. (check rclone)
Can we set remote to avoid local sync via API?
Tests: is there a mock/stub lib?

Teams support looks as follows: you do auth the exact same way you do for individual account and here's where the files are going (I specified path dropbox://dvc/my-test in my remote definition):

shcheklein · 2020-11-28T17:37:35Z

@gmrukwa thanks for the udpate!

Could you provide some more details on how you ended up in a non-recoverable state?

I was interrupting it with Ctrl-C during the upload process of many image files as far as I remember.

For tons of small files the upload speed is at the level of 300-500 KB / s

per file? or overall? have you tried to run rclone - is it the same?

Dropbox suggests to use Batch Upload in that case. I'm curious if rclone utilizes it? How hard would it be to try it? Probably not an easy change.

gmrukwa · 2020-11-28T19:13:14Z

I was interrupting it with Ctrl-C during the upload process of many image files as far as I remember.

The test I prepared uses process killing. Not sure yet, how I could simulate Ctrl-C more precisely here.

per file? or overall? have you tried to run rclone - is it the same?

Unfortunately, it's overall speed. Will test that with rclone as well. But you need to keep in mind that it was tested with 5KB and 100KB files. I remember seeing similar performance of the Dropbox client itself when handling numerous small files. That's not a great news, so I'll test once again with rclone. The only promising option I see (that we discussed already) is extended below.

Dropbox suggests to use Batch Upload in that case. I'm curious if rclone utilizes it? How hard would it be to try it? Probably not an easy change.

From the technical perspective, there are two aspects to consider.

Batch Upload is not actually a batch upload of multiple files, but a batch commit of multiple files after they get uploaded. That could potentially be useful for the case with numerous small files if the issue is with committing the content, not with the upload. Luckily, that seems quite likely.
Current DVC BaseTree API exhibits only methods that upload a single file (there's upload vs _download_file and _download_dir). I am afraid that introducing Batch Upload mechanism here would require breaking changes all over the codebase. That's something I would not be able to handle alone for sure, especially not until the end of the year. Additionally I doubt you want to introduce so much change just for introducing a single remote.

Basically it seems that we need a single place to call this function and this function for all the files and commits together. If you see some other option for that than wide changes in the codebase, please suggest. It's not clear however, how could that work. rclone doesn't use the Batch Upload.

Test import-url, external deps throw NotImplemented

I do have some strange issues with path parsing here, need to dig deeper. I'd decide though, whether we want to keep that feature enabled for Dropbox, as the folders are not uniquely identified by IDs. The file URL doesn't work for multiple users (it may change), unless the shares are exactly the same - it would not guarantee similar reproducibility to HTTP or S3.

Check exceptions handling- files not found, general errors. Consult with @efiop on what should we unify. At minimum it's probably better to warp all "expected" exceptions into a DVC-specific one. (check rclone)

I saw how the rclone does it. Do you mean similar wrapping like in GDrive has it? (Azure doesn't BTW).

Can we set remote to avoid local sync via API?

Do you mean the following scenario?

User works locally and has Dropbox installed.
dvc add dummy.txt
dvc push
Files do not get synced back from Dropbox, but only "smart synced" - user can view them, but no duplicate appears on the PC.

If that's the case, Dropbox API doesn't allow it, as it is not the account-level setting, but a local client setting. There's second option called "selective sync" that could disable default sync of DVC root directory. There is a subtle difference:

with "smart sync" you still see files "appearing" on your PC, although they don't get synced - you can view that something is progressing
with "selective sync" some directories are invisible on your PC - you see no progress, but also do not have overhead of metadata synchronization

You can use none of them, one or both toghether. I think that making DVC targets by default disabled from "selective sync" would make the most sense.

After validation: Unfortunately the SDK allows for configuring selective sync settings only through the Team API.

Tests: is there a mock/stub lib?

I did not find any that we could use here. Most are outdated / incompatible.

shcheklein · 2020-11-28T20:37:20Z

The test I prepared uses process killing. Not sure yet, how I could simulate Ctrl-C more precisely here.

that's the problem. Can't reproduce it now with killing the process :(

Let's wait for it to appear again, I guess as we test more.

I'd decide though, whether we want to keep that feature enabled for Dropbox, as the folders are not uniquely identified by IDs

yep, def not needed as part of this PR. Just check that we handle it gracefully. We can figure out this later.

Unfortunately, it's overall speed.

that's sad :(. Let's see if rclone does a better job here.

I am afraid that introducing Batch Upload mechanism here would require breaking changes all over the codebase.

I hope not. But it'll require some refactoring. I would expect that we will need to override or wrap _process from the remote/base.py. Not sure about the details- first need to take a look at the batch API itself. I also that this function is extremely coupled and does too many things- some split might be required.

On the other hand 300-400 KB/s - would it be even enough for your case? Can we consider this as a reasonable implementation?

Azure doesn't BTW

yep, Azure is very outdated. GDrive is a better example.

You can use none of them, one or both toghether. I think that making DVC targets by default disabled from "selective sync" would make the most sense.

but neither of them could be enabled via API, right? and selective sync is probably a premium feature (which might be fine, since 2GB free account is not very reasonable).

gmrukwa · 2020-11-28T20:48:10Z

that's sad :(. Let's see if rclone does a better job here.

Just tested rclone - it is 1/6 faster, but tested a bit different scenario, when they could actually lock whole directory at once. They scored ~49 secs for 100 files of 100 KB instead of ~63 secs with current implementation of this PR.

On the other hand 300-400 KB/s - would it be even enough for your case? Can we consider this as a reasonable implementation?

I didn't work with numerous small files, I have few bigger files (some of them around 1 GB). That would be perfectly fine, as the upload speed would be approx. 12 MB / s as per my benchmark today.

but neither of them could be enabled via API, right? and selective sync is probably a premium feature (which might be fine, since 2GB free account is not very reasonable).

Selective sync can be pre-enabled by API for Teams plan - that's the only opportunity. Anyone can disable sync of a directory in their client manually though (including free users with 2 GB of space).

I hope not. But it'll require some refactoring. I would expect that we will need to override or wrap _process from the remote/base.py. Not sure about the details- first need to take a look at the batch API itself. I also that this function is extremely coupled and does too many things- some split might be required.

Here the problem is the "Batch API" isn't formed by Dropbox. That's not obvious how one could proceed best, unless a very specific case gets addressed (due to sessions, locks, commits and completely different approach for small file).

@shcheklein, I would need info from you what I need to adjust here to have it merged. I am now wrapping exceptions, but if you see anything else, please let me know.

shcheklein · 2020-11-28T21:26:03Z

Here the problem is the "Batch API" isn't formed by Dropbox.

you mean that you have to start "batch session", which means that you have to save it somewhere potentially, etc, etc? Yes, it might complicate things, but it can be part of the internal Dropbox-specific implementation, no need for a general API here.

we can split files into batches depending on their size (e.g. 1-2GB per batch?). In this case we can potentially get a reasonable performance with multiple jobs (--jobs) and it's fine to fail, it will restart some 1-2GB batches.

Even if it is the case, I guess it won't be part of this PR. It's just good to do a bit of research if we are missing some low hanging fruits.

Selective sync can be pre-enabled by API for Teams plan - that's the only opportunity. Anyone can disable sync of a directory in their client manually though (including free users with 2 GB of space).

okay. Let's recommend in the docs at least that sync should be disabled one way or another to avoid duplicating data and traffic?

Just tested rclone - it is 1/6 faster

:(

Btw, I see that rclone improved performance in some cases by providing a better chunk size, do we do the same? rclone/rclone#103

@shcheklein, I would need info from you what I need to adjust here to have it merged. I am now wrapping exceptions, but if you see anything else, please let me know.

yep, I don't see that much left for this PR. It's a very good and solid start.

One last thing that we'll need to make sure that we set properly are those vars

    DEFAULT_VERIFY = False
    LIST_OBJECT_PAGE_SIZE = 1000
    TRAVERSE_WEIGHT_MULTIPLIER = 5
    TRAVERSE_PREFIX_LEN = 3
    TRAVERSE_THRESHOLD_SIZE = 500000
    CAN_TRAVERSE = True

from the base.py. @pmrowla could you help us with this? what information do you need for this? Dropbox is similar to GDrive - it's API based, APIs are expensive (and potentially limited), so what you be the recommended setup here?

gmrukwa · 2020-11-28T21:59:21Z

you mean that you have to start "batch session", which means that you have to save it somewhere potentially, etc, etc?

The scenario in rclone, on Stack Overflow and everywhere else I saw was as follows:

Check file size vs chunk size.
If file size < chunk size, use client.files_upload and you're done, otherwise proceed with remaining points.
Open session client.files_upload_session_start.
Make cursor for the session dropbox.files.UploadSessionCursor.
Make commit info for the file that gets uploaded dropbox.files.CommitInfo.
Upload chunk-by-chunk client.files_upload_session_append_v2.
Finish session with cursor & commit info client.files_upload_session_finish.

No one ever shows how to use the batch mode. ☝️ There are batch operations in the API for moves, deletions, folder creations, but not upload. What I understood from the docs, client must acquire a lock over the namespace (==directory) before committing the file. This is automated for client.files_upload within the SDK - thus probably cannot support batch mode in this setup (as locks are acquired one-by-one). There exist functions:

Perhaps we could acquire locks in batch mode for a single client and then commit everything at once. Not sure though how to use these locks, as the SDK doesn't accept them while committing. Even not sure if we should, in the case of multiple clients writing different files to the same directory (reasonable case for computations distributed over few machines).

we can split files into batches depending on their size (e.g. 1-2GB per batch?). In this case we can potentially get a reasonable performance with multiple jobs (--jobs) and it's fine to fail, it will restart some 1-2GB batches.

Yeah, I can enable that. Even though rclone claims that it can speed up upload max by 10%. And this makes a difference only for big files - for small ones we're using client.files_upload anyway. The chunk cannot exceed 150 MB and needs to be a multiple of 4 MB as stated in the docs. rclone has default of 48 MB chunk.

okay. Let's recommend in the docs at least that sync should be disabled one way or another to avoid duplicating data and traffic?

On it.

To summarize the next steps:

wrap exceptions
parameterize chunk size
recommend enabling selective sync / smart sync in the docs
adjust the above mentioned variables

gmrukwa · 2021-01-03T13:02:13Z

@shcheklein, I tried to address all the above points and it seems I've finished. Could you take a look now? If we're done, let's merge it - my migration from Azure awaits. 😂

shcheklein · 2021-01-03T17:31:05Z

@gmrukwa thanks a lot 🙏 ! I'll try to take a look asap!

Fixes iterative#1277

gmrukwa · 2021-01-09T11:14:40Z

Rebased to resolve merge conflicts ☝🏻

shcheklein

Please see some requests for refactoring, see PR I made to wire up properly import-ur/get-ulr/external deps and fix certain things

Next steps on my end:

Try to figure out why is it SOOOO slow. It took me ~10minutes to import-url 140Mb/32files. Is it expected?
I got my-project (Ignored Item Conflict) (Ignored Item Conflict) in my Dropbox - what is this?
figure out case sensitivity - /Users/ivan/Projects/test-dropbox/Books/../books/ ... should we use regular path, not lower in walk_files?
review tests and try to run them, wire up import url and other regular tests for remotes
check if APP is setup properly
review docs once again

shcheklein · 2021-01-10T02:52:46Z

dvc/tree/dropbox.py

+    )(func)
+
+
+class DropboxWrapper:


Feel like we can do some meta-programming https://stackoverflow.com/questions/2704434/intercept-method-calls-in-python to wrap all existing methods with decorators?

shcheklein · 2021-01-10T03:45:33Z

dvc/tree/dropbox.py

+            app_key=self.DROPBOX_APP_KEY,
+            app_secret=self.DROPBOX_APP_SECRET,
+        )
+        dbx.check_and_refresh_access_token()


what happens if token expiries right after this call?

(in Google Drive the client itself handles this, if it gets a specific HTTP error code it runs token refresh and retries the call)

Looks like it also should be wrapped, otherwise I'm getting this:

File "/Users/ivan/Projects/dvc/dvc/tree/dropbox.py", line 131, in auth dbx.check_and_refresh_access_token() File "/Users/ivan/Projects/dvc/.env/lib/python3.8/site-packages/dropbox/dropbox_client.py", line 358, in check_and_refresh_access_token self.refresh_access_token(scope=self._scope) File "/Users/ivan/Projects/dvc/.env/lib/python3.8/site-packages/dropbox/dropbox_client.py", line 397, in refresh_access_token raise AuthError(request_id, err) dropbox.exceptions.AuthError: AuthError('f584e934424c45b6a9af769f1b5edc84', AuthError('invalid_access_token', None))

if token is invalid

shcheklein · 2021-01-10T04:38:15Z

dvc/tree/dropbox.py

+        cred_prv = FileCredProvider(self.repo)
+        if cred_prv.can_auth():
+            logger.debug("Logging into Dropbox with credentials file")
+            return cred_prv.auth().client


should we save credentials here in this flow (in case they are refreshed)?

I didn't spot the refresh token to change anytime. Therefore, actually you only need the refresh token to get the access token each time before the API call. The remaining elements could probably be dropped, but I introduced them to keep compatibility if they start to timeout the refresh token.

yep, but it means we are going to keep refreshing them every run? (extra API call, extra time to an operation). Probably not the biggest problem, but I thought it can be a simple fix since you already have save() mechanics.

The access token lives 4 hours. It's not much. We create the client once per dvc command being ran, so the update takes place at most once per CLI run.

shcheklein · 2021-01-10T04:43:23Z

dvc/tree/dropbox.py

+
+class EnvCredProvider(DropboxCredProvider):
+    def can_auth(self):
+        return bool(os.environ.get(REFRESH_TOKEN, False))


I think existence of any of three vars should trigger this flow. Also would be great to detect then that some of them is not defined in throw DvcException.

I'll refer to my comment above. You actually need only the refresh token, as it allows you to get the access token anytime. Refresh token is not ephemeral one-time token just for a single refresh, you can use it as many time as you want.

interesting, thanks!

Thinking from the user perspective I still think it's better try to this type of auth if we detected any of those vars (and fail if they are not enough, e.g. REFRESH_TOKEN is not set). Just to be extra explicit. To avoid situations you set some vars, but it still uses file w/o any explanation. WDYT?

I thought of two approaches:

requiring just the refresh token - that seems reasonable if it's the only must-have element

requiring everything - in the case they will shorten life of refresh token or make it usable just once

It's your decision, I can adjust the code as needed.

So, there are two different things (correct me if I wrong)

Trigger - some factor that defines that we go the path of a specific auth (and don't try others)

Validation - check if set of params is enough to go that path

So, in this specific case (to my mind):

Can be any of the env vars present. It's better from the user perspective, we are extra explicit.

Can check existence of the refresh token. Or also allow both keys to be present (I'm not sure if it makes sense at all)?

Btw, how would the actual workflow with env vars would look like? Where do users get them in the first place?

shcheklein · 2021-01-10T04:44:18Z

dvc/tree/dropbox.py

+    def can_auth(self):
+        if not os.path.exists(self._cred_location):
+            return False
+        with open(self._cred_location) as infile:


I think existence of the file should be enough to trigger it. If content is bad we can rely on the client to fail and ideally handle it gracefully.

True, we could.

shcheklein · 2021-01-11T02:06:31Z

dvc/tree/dropbox.py

+            if ex.error.is_unsupported_file():
+                raise DvcException(
+                    "Path '{}' is not downloadable:\n\n"
+                    "Confirm the file is a casual file, not a Dropbox thing."


hmm, could you clarify, what does Dropbox thing mean?

They do have e.g. Dropbox Paper documents or Dropbox Vault items which are not downloadable this way.

yep, makes sense, Dropbox thing sounds a bit strange though :), how do they call them in docs?

They do not call them in the SDK docs at all (or at least I haven't spotted) 😂 There's just written, that something may not be downloadable.

Look like we call them Non-downloadable files - https://www.dropbox.com/lp/developers/reference/dbx-file-access-guide#files-that-require-special-consideration .

Confirm that it's a regular file, not a non-downloadable Dropbox file

(I don't like it either, but at least it's sounds a bit more formal and can be googled)

shcheklein · 2021-01-11T02:08:19Z

dvc/tree/dropbox.py

+            json.dump(_creds, outfile)
+
+
+def path_info_to_dropbox_path(path_info):


@efiop I would need your help here - what would be the best way to handle this? instead of writing this conversion over and over again in each method below, is there a better way?

btw, may be push it into DropboxWrapper above?

actually, I think that we should ask users to use dropbox:///path notation (as in file:///path) when hostname is not specified. This is a proper way to handle this in this case, since Dropbox doesn't have buckets, servers, etc

I've updated the code in gmrukwa#1 to allow empty hostname so that we don't have to deal with bucket that is not bucket at all (and to deal with cases like import-url dropbox:///file.pdf

still, I wonder what is the best way to handle this leading / here, @efiop ? Dropbox expects them this way? Is there a convenience method in PathInfo to return path with /?

Dropbox generally expects a path that is like /a/path/to/file, which gets rooted at the root of your Dropbox home. I couldn't get it to work simply without such a construct unfortunately. You will rather know some better way.

yep, I did some changes to support URL without hostname (which is not present here). Usually the look like file:///path (mind triple ///).

shcheklein · 2021-01-11T02:09:51Z

dvc/tree/dropbox.py

+            fobj, "read", desc=name, total=file_size, disable=no_progress_bar
+        ) as wrapped:
+            if file_size <= chunk_size:
+                logger.debug("Small file upload")


feels a bit too excessive - make it trace? or remove?

shcheklein · 2021-01-11T02:10:31Z

dvc/tree/dropbox.py

+        chunk_size = self.chunk_size_mb * 1024 * 1024
+        to_path = path_info_to_dropbox_path(to_info)
+        file_size = os.path.getsize(from_file)
+        logger.debug("Uploading " + from_file + " to " + to_path)


should it be part of the parent's call? is it now? better double check to prevent multiple logs happening.

shcheklein · 2021-01-11T03:03:13Z

dvc/tree/dropbox.py

+    def __init__(self, repo, config):
+        super().__init__(repo, config)
+
+        self.path_info = self.PATH_CLS(config.get("url", "dropbox://default"))


should it be None instead? "dropbox://default" looks artificial and might affect existing folder for some users?

Right, it should be None and potentially throw an exception.

shcheklein · 2021-01-12T04:24:26Z

dvc/tree/dropbox.py

+        return self.client.files_download(*args, **kwargs)
+
+
+ACCESS_TOKEN = "DROPBOX_ACCESS_TOKEN"


It looks like rclone uses a different way of authenticating? It gets `

{"access_token":".***","token_type":"bearer","expiry":"0001-01-01T00:00:00Z"}

I don't see a refresh token, and expiry is not set. Do you know something about this?

They may be using legacy auth, which will be dropped soon by Dropbox.

shcheklein · 2021-01-12T05:17:38Z

@gmrukwa the biggest problem for me right now is that download takes a lot of time for a directory of ~142MB which contains files ~1MB to 10MB.

rclone does it in 20s
dvc pull works minutes, sometimes a single file is slower than 20s

Do you have any idea what to try, where to look for the problem?

gmrukwa · 2021-01-22T07:39:13Z

@gmrukwa the biggest problem for me right now is that download takes a lot of time for a directory of ~142MB which contains files ~1MB to 10MB.

rclone does it in 20s
dvc pull works minutes, sometimes a single file is slower than 20s

Do you have any idea what to try, where to look for the problem?

They use download function from SDK as well (here), here we do as well. I do not have any other ideas for that.

shcheklein · 2021-01-22T21:42:21Z

@gmrukwa yep, I have no idea either. Let me try to debug it a bit during this weekend to see if I can find something.

shcheklein · 2021-01-24T05:40:05Z

@gmrukwa improved download tremendously (it's now 25Mb/s vs 150Kb/s on a large file, and downloads a dir with files faster than rclone) here gmrukwa@3ea0908 . It's a simple fix. Haven't had time to check/compare upload yet.

gmrukwa · 2021-01-24T14:22:43Z

@shcheklein, that fix makes perfect sense! I was looking for that option in the Dropbox SDK actually, but it didn't came to my mind that download chunk size is steered on the level of iter_content... 🤯 I'd use the chunk_size setting value here, wouldn't it make sense?

shcheklein · 2021-01-25T18:55:45Z

@gmrukwa

I'd use the chunk_size setting value here, wouldn't it make sense?

good question 🤔 to be honest, not sure about this. My only concern is if they have a bit different limits and one expects a range in KBs, second in MBs to have a good performance. It looks like rclone's setting doesn't affect download?

gmrukwa · 2021-01-26T08:05:38Z

@gmrukwa

I'd use the chunk_size setting value here, wouldn't it make sense?

good question 🤔 to be honest, not sure about this. My only concern is if they have a bit different limits and one expects a range in KBs, second in MBs to have a good performance. It looks like rclone's setting doesn't affect download?

rclone seems to be using different method from the SDK, more like files_download_to_file, but with DVC we want to show the download progress. I've reviewed the code for files_download_to_file and Dropbox internally indeed uses 2**16 chunk size for download. Let's leave as you proposed then.

efiop · 2021-03-08T15:22:40Z

Quick update: we are migrating to fsspec and that will be our plugin mechanism for the future. We won't be able to provide proper support for dropbox ourselves, so the best way to go about it would be for you to implement dropboxfs using fsspec and maintain that implementation (try registering it in fsspec, other people might find it useful too). On our end, we will allow using that as a plugin as seamlessly as possible. Thank you so much for contributing!

There is an existing 3rd party implementation https://github.com/MarineChap/dropboxdrivefs , but I'm not sure how stable it is and whether it supports the same features that this PR supports. There has been some pretty active discussion around it in fsspec/filesystem_spec#207 . We might want to consider either contributing to the existing package or creating a new implementation.

gmrukwa force-pushed the master branch from 823595b to 424bf4b Compare October 30, 2020 20:09

gmrukwa mentioned this pull request Oct 31, 2020

Create Dropbox remote docs iterative/dvc.org#1896

Closed

gmrukwa marked this pull request as ready for review October 31, 2020 01:29

gmrukwa changed the title ~~[WIP] Create Dropbox remote~~ Create Dropbox remote Oct 31, 2020

shcheklein reviewed Nov 6, 2020

View reviewed changes

dvc/tree/dropbox.py Show resolved Hide resolved

shcheklein reviewed Nov 6, 2020

View reviewed changes

dvc/tree/dropbox.py Outdated Show resolved Hide resolved

gmrukwa force-pushed the master branch 2 times, most recently from d53bced to b5a654a Compare November 10, 2020 20:31

gmrukwa force-pushed the master branch from b5a654a to 0a5aee9 Compare November 27, 2020 21:41

gmrukwa force-pushed the master branch 2 times, most recently from 554fd1a to 2b96dfb Compare November 28, 2020 16:04

gmrukwa force-pushed the master branch from 01a29c5 to 8abab42 Compare January 3, 2021 15:44

remote: Create Dropbox remote

81962d8

Fixes iterative#1277

gmrukwa force-pushed the master branch from 8abab42 to 81962d8 Compare January 9, 2021 11:14

shcheklein requested changes Jan 11, 2021

View reviewed changes

shcheklein reviewed Jan 12, 2021

View reviewed changes

efiop closed this Mar 8, 2021

efiop mentioned this pull request Oct 8, 2021

remote: support Dropbox #1277

Closed

		json.dump(_creds, outfile)


		def path_info_to_dropbox_path(path_info):

		return self.client.files_download(args, *kwargs)


		ACCESS_TOKEN = "DROPBOX_ACCESS_TOKEN"

Create Dropbox remote #4793

Create Dropbox remote #4793

Uh oh!

Conversation

gmrukwa commented Oct 27, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gmrukwa commented Oct 27, 2020

Uh oh!

efiop commented Oct 27, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gmrukwa commented Oct 31, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

shcheklein commented Nov 6, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gmrukwa commented Nov 10, 2020

Uh oh!

shcheklein commented Nov 13, 2020

Uh oh!

pared commented Nov 13, 2020

Uh oh!

shcheklein commented Nov 14, 2020

Uh oh!

shcheklein commented Nov 14, 2020

Uh oh!

shcheklein commented Nov 22, 2020

Uh oh!

gmrukwa commented Nov 22, 2020

Uh oh!

gmrukwa commented Nov 28, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

shcheklein commented Nov 28, 2020

Uh oh!

gmrukwa commented Nov 28, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

shcheklein commented Nov 28, 2020

Uh oh!

gmrukwa commented Nov 28, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

shcheklein commented Nov 28, 2020

Uh oh!

gmrukwa commented Nov 28, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gmrukwa commented Jan 3, 2021

Uh oh!

shcheklein commented Jan 3, 2021

Uh oh!

gmrukwa commented Jan 9, 2021

Uh oh!

shcheklein left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gmrukwa commented Oct 27, 2020 •

edited

Loading

efiop commented Oct 27, 2020 •

edited

Loading

gmrukwa commented Oct 31, 2020 •

edited

Loading

shcheklein commented Nov 6, 2020 •

edited

Loading

gmrukwa commented Nov 28, 2020 •

edited

Loading

gmrukwa commented Nov 28, 2020 •

edited

Loading

gmrukwa commented Nov 28, 2020 •

edited

Loading

gmrukwa commented Nov 28, 2020 •

edited

Loading

shcheklein left a comment •

edited

Loading