
Create Dropbox remote #4793


Closed - wants to merge 1 commit

Conversation

gmrukwa commented Oct 27, 2020

I don't have time to play with this for the next few days, but I'm uploading it to make it visible already. I will polish it soon and:

  • create docs update
  • add tests
  • discuss the app configuration for the API - you guys need to create your own app on Dropbox for the API usage to keep control over it, probably @efiop would be the proper person

Resolves #1277

gmrukwa (Author) commented Oct 27, 2020

@efiop, just pinging to let you know the direction this issue is going.

efiop (Contributor) commented Oct 27, 2020

Thank you @gmrukwa! 🙏 At first glance this looks good, and the direction looks correct too, so you can safely continue with the rest of the points. We'll need some time for a proper review though - we currently have a lot on our plate - so thank you in advance for your patience 🙂

gmrukwa marked this pull request as ready for review on October 31, 2020 01:29
gmrukwa (Author) commented Oct 31, 2020

Okay guys, I've created docs and tests; the last remaining element is to provide the details of the Dropbox API connection - this is to be done by you. I'm not sure how you want to secure the App Key and App Secret, but here is the spot.

You need to go to the Dropbox App Console and create your application (permission and scope details here). Dropbox lets you apply for production usage once you have 50 users linked to your app, but there is some possibility of getting accepted earlier.

I didn't go to the same level of complexity as the Google Drive remote, as there's no need to give users an option to configure a separate app key and secret at this stage, since the Dropbox API has no publicly specified limits.

gmrukwa changed the title from "[WIP] Create Dropbox remote" to "Create Dropbox remote" on Oct 31, 2020
shcheklein (Member) commented Nov 6, 2020

Okay, I will keep track of the things to review/improve/discuss here. Let's follow up in comments as we address and resolve items, check the box, and leave a link to the comment.

  • Make sure that the application can be shared. How? Maybe it affects pretty much only the docs and the app type. iterative/dvc.org@9a49327, as discussed in Create Dropbox remote docs dvc.org#1896 (comment)
  • Check that the download API in use doesn't fetch the whole file into memory (we had that with GDrive)
  • Rate limits - retry? - check rclone.
  • Check if we need to do something special in advance for Teams
  • Test dvc gc
  • Test import-url, external deps throw NotImplemented
  • Check exception handling - files not found, general errors. Consult with @efiop on what we should unify. At minimum it's probably better to wrap all "expected" exceptions into a DVC-specific one. (check rclone)
  • Support multiple remotes - credentials path variable, similar to GDrive
  • Can we set the remote to avoid local sync via the API?
  • Tests: is there a mock/stub lib?

gmrukwa force-pushed the master branch 2 times, most recently from d53bced to b5a654a (November 10, 2020 20:31)
gmrukwa (Author) commented Nov 10, 2020

There's an issue with dependencies for S3 mocks in Python < 3.8 after rebasing onto upstream changes.

shcheklein (Member) commented:

Regarding the rate limits: https://www.dropbox.com/lp/developers/reference/dbx-performance-guide

It looks like we need to test parallel upload of many files (that's what DVC does by default anyway). It might be very unstable if we don't handle 429 properly and don't respect Retry-After.

We can do parallel uploads, but it might require changing the remote class interface.
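
For reference, a minimal retry sketch around an arbitrary SDK call - the helper name and retry policy are illustrative, but RateLimitError is what the dropbox package raises on HTTP 429:

import time

from dropbox.exceptions import RateLimitError


def call_with_rate_limit_retries(call, attempts=5):
    # Retry on 429, honoring the server-suggested backoff when the SDK
    # exposes it; fall back to a short fixed sleep otherwise.
    for attempt in range(attempts):
        try:
            return call()
        except RateLimitError as exc:
            if attempt == attempts - 1:
                raise
            time.sleep(getattr(exc, "backoff", None) or 1)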

pared (Contributor) commented Nov 13, 2020

@gmrukwa could you rebase? Dependency problems should be fixed on the current master.

shcheklein (Member) commented:

Testing it on a bunch of images. This is extremely slow for me right now - is it related to the rate limits I shared above?

[screenshot: Screen Shot 2020-11-14 at 12.34.45 PM]

Again, we can compare with rclone, I think.

shcheklein (Member) commented:

Getting an exception now when uploading a dir (it got interrupted in the middle). Now I can't recover from this.

ERROR: failed to upload '../../../../tmp/cache/f7/a911d2e8b3665d3656107a1e054e8f' to 'dropbox://my-project/f7/a911d2e8b3665d3656107a1e054e8f' - ApiError('00d392019d75463ba17df2870ee166ee', UploadSessionFinishError('lookup_failed', UploadSessionLookupError('incorrect_offset', UploadSessionOffsetError(correct_offset=6007346))))
ERROR: failed to upload '../../../../tmp/cache/05/be03208df9c660e9bc0e4752c221ad.dir' to 'dropbox://my-project/05/be03208df9c660e9bc0e4752c221ad.dir'
ERROR: failed to push data to the cloud - 2 files failed to upload

shcheklein (Member) commented:

@gmrukwa hey! Could you please take a look at these issues we are getting?

gmrukwa (Author) commented Nov 22, 2020

@shcheklein absolutely - I just need to sort a few more things out, and hopefully today I will be able to work on that.

gmrukwa (Author) commented Nov 28, 2020

@shcheklein, I do have everything set up according to the guide, especially since they embedded most of these considerations into the SDK. I've checked, and the upload is done in a very similar manner in rclone (large file upload, small vs. large file upload switch). They just return the error code instead of the deliberate ApiError raised here. I've prepared a test for the scenario you described, but it's probably a standard issue with the Dropbox API.

On the other hand, the Dropbox desktop client also has problems with tons of small files. The issue is probably related to the fact that they "lock the namespace" on each file upload - there is no per-file lock, but a per-directory lock each time.

For tons of small files the upload speed is around 300-500 KB/s; for a few big ones it smoothly reaches 12-14 MB/s. I played with file sizes, chunk sizes and the number of jobs per upload. File size is the most important factor; neither the number of jobs nor the chunk size allows for any significant improvement, especially since the chunk size is supposed to be a multiple of 4 MB. Below a certain size it doesn't matter whether the file is 5 KB or 100 KB - the locking process on Dropbox's side lasts long enough to make it irrelevant.

Could you provide some more details on how you ended up in a non-recoverable state? I cannot reproduce it (my attempts are already included as skipped tests).

Remaining elements you mentioned:

  • Test dvc gc: dvc gc -c -w removes files from the remote.
  • Support multiple remotes - credentials path variable, similar to GDrive - supported via the DROPBOX_CREDENTIALS_FILE env variable. Tested with a Dropbox Teams account so as not to overwrite my existing creds.
  • Check if we need to do something special in advance for Teams
  • Rate limits - retry? - check rclone. - As described above, the problem is not the rate limit but the directory locking mechanism.
  • Test import-url, external deps throw NotImplemented
  • Check exception handling - files not found, general errors. Consult with @efiop on what we should unify. At minimum it's probably better to wrap all "expected" exceptions into a DVC-specific one. (check rclone)
  • Can we set the remote to avoid local sync via the API?
  • Tests: is there a mock/stub lib?

Teams support looks as follows: you authenticate the exact same way as for an individual account, and here's where the files go (I specified the path dropbox://dvc/my-test in my remote definition):
[screenshot]

gmrukwa force-pushed the master branch 2 times, most recently from 554fd1a to 2b96dfb (November 28, 2020 16:04)
shcheklein (Member) commented:

@gmrukwa thanks for the update!

Could you provide some more details on how you ended up in a non-recoverable state?

I was interrupting it with Ctrl-C during the upload process of many image files as far as I remember.

For tons of small files the upload speed is at the level of 300-500 KB / s

per file? or overall? have you tried to run rclone - is it the same?

Dropbox suggests using Batch Upload in that case. I'm curious whether rclone utilizes it? How hard would it be to try it? Probably not an easy change.

gmrukwa (Author) commented Nov 28, 2020

I was interrupting it with Ctrl-C during the upload process of many image files as far as I remember.

The test I prepared uses process killing. I'm not sure yet how I could simulate Ctrl-C more precisely here.

per file? or overall? have you tried to run rclone - is it the same?

Unfortunately, it's the overall speed. I will test that with rclone as well. But you need to keep in mind that it was tested with 5 KB and 100 KB files. I remember seeing similar performance from the Dropbox client itself when handling numerous small files. That's not great news, so I'll test once again with rclone. The only promising option I see (that we discussed already) is expanded on below.

Dropbox suggests to use Batch Upload in that case. I'm curious if rclone utilizes it? How hard would it be to try it? Probably not an easy change.

From the technical perspective, there are two aspects to consider.

  1. Batch Upload is not actually a batch upload of multiple files, but a batch commit of multiple files after they get uploaded. That could potentially be useful for the case with numerous small files if the issue is with committing the content, not with the upload. Luckily, that seems quite likely.
  2. The current DVC BaseTree API exposes only methods that upload a single file (there's upload vs _download_file and _download_dir). I am afraid that introducing a Batch Upload mechanism here would require breaking changes all over the codebase. That's something I would not be able to handle alone for sure, especially not before the end of the year. Additionally, I doubt you want to introduce so much change just to add a single remote.

Basically it seems that we need a single place to call this function and this function for all the files and commits together. If you see some option for that other than wide changes in the codebase, please suggest it. It's not clear, however, how that could work. rclone doesn't use the Batch Upload.
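
For reference, a rough sketch of that reading of the docs - each file still goes through its own upload session, and only the final commits are batched. It assumes the SDK's files_upload_session_finish_batch and dropbox.files.UploadSessionFinishArg and has not been validated against this PR:

import dropbox


def finish_sessions_in_batch(dbx, pending):
    # `pending` is a list of (UploadSessionCursor, CommitInfo) pairs whose
    # chunks have already been appended via files_upload_session_append_v2.
    entries = [
        dropbox.files.UploadSessionFinishArg(cursor=cursor, commit=commit)
        for cursor, commit in pending
    ]
    launch = dbx.files_upload_session_finish_batch(entries)
    # The result may be an async job that has to be polled with
    # files_upload_session_finish_batch_check() until it completes.
    return launch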

Test import-url, external deps throw NotImplemented

I do have some strange issues with path parsing here and need to dig deeper. I'd decide, though, whether we want to keep that feature enabled for Dropbox, as folders are not uniquely identified by IDs. The file URL doesn't work across multiple users (it may change) unless the shares are exactly the same, so it would not guarantee reproducibility similar to HTTP or S3.

Check exception handling - files not found, general errors. Consult with @efiop on what we should unify. At minimum it's probably better to wrap all "expected" exceptions into a DVC-specific one. (check rclone)

I saw how rclone does it. Do you mean wrapping similar to what GDrive has? (Azure doesn't, BTW.)

Can we set remote to avoid local sync via API?

Do you mean the following scenario?

  • User works locally and has Dropbox installed.
  • dvc add dummy.txt
  • dvc push
  • Files do not get synced back from Dropbox, but only "smart synced" - user can view them, but no duplicate appears on the PC.

If that's the case, the Dropbox API doesn't allow it, as it is not an account-level setting but a local client setting. There is a second option called "selective sync" that could disable the default sync of the DVC root directory. There is a subtle difference:

  • with "smart sync" you still see files "appearing" on your PC, although they don't get synced - you can see that something is progressing
  • with "selective sync" some directories are invisible on your PC - you see no progress, but you also don't have the overhead of metadata synchronization

You can use neither of them, one, or both together. I think that excluding DVC targets from local sync by default via "selective sync" would make the most sense.

After validation: Unfortunately the SDK allows for configuring selective sync settings only through the Team API.

Tests: is there a mock/stub lib?

I did not find any that we could use here. Most are outdated / incompatible.

shcheklein (Member) commented:

The test I prepared uses process killing. Not sure yet, how I could simulate Ctrl-C more precisely here.

that's the problem. Can't reproduce it now with killing the process :(

Let's wait for it to appear again, I guess as we test more.

I'd decide though, whether we want to keep that feature enabled for Dropbox, as the folders are not uniquely identified by IDs

yep, def not needed as part of this PR. Just check that we handle it gracefully. We can figure out this later.

Unfortunately, it's overall speed.

that's sad :(. Let's see if rclone does a better job here.

I am afraid that introducing Batch Upload mechanism here would require breaking changes all over the codebase.

I hope not, but it'll require some refactoring. I would expect that we will need to override or wrap _process from remote/base.py. Not sure about the details - I first need to take a look at the batch API itself. I also think that this function is extremely coupled and does too many things - some split might be required.

On the other hand, 300-400 KB/s - would it even be enough for your case? Can we consider this a reasonable implementation?

Azure doesn't BTW

yep, Azure is very outdated. GDrive is a better example.

You can use neither of them, one, or both together. I think that excluding DVC targets from local sync by default via "selective sync" would make the most sense.

But neither of them can be enabled via the API, right? And selective sync is probably a premium feature (which might be fine, since a 2 GB free account is not very reasonable anyway).

gmrukwa (Author) commented Nov 28, 2020

that's sad :(. Let's see if rclone does a better job here.

Just tested rclone - it is about 1/6 faster, but in a slightly different scenario, where it could actually lock the whole directory at once. It scored ~49 s for 100 files of 100 KB versus ~63 s with the current implementation in this PR.

On the other hand 300-400 KB/s - would it be even enough for your case? Can we consider this as a reasonable implementation?

I don't work with numerous small files; I have a few bigger files (some of them around 1 GB). That would be perfectly fine, as the upload speed would be approximately 12 MB/s as per my benchmark today.

but neither of them could be enabled via API, right? and selective sync is probably a premium feature (which might be fine, since 2GB free account is not very reasonable).

Selective sync can be pre-enabled via the API only on the Teams plan - that's the only option. Anyone can manually disable sync of a directory in their client, though (including free users with 2 GB of space).

I hope not. But it'll require some refactoring. I would expect that we will need to override or wrap _process from the remote/base.py. Not sure about the details- first need to take a look at the batch API itself. I also that this function is extremely coupled and does too many things- some split might be required.

The problem here is that the "Batch API" isn't fully formed on Dropbox's side. It's not obvious how to best proceed unless a very specific case gets addressed (due to sessions, locks, commits and the completely different approach for small files).

@shcheklein, I would need info from you on what I need to adjust here to get it merged. I am now wrapping exceptions, but if you see anything else, please let me know.

shcheklein (Member) commented:

The problem here is that the "Batch API" isn't fully formed on Dropbox's side.

You mean that you have to start a "batch session", which means that you potentially have to save it somewhere, etc.? Yes, it might complicate things, but it can be part of the internal Dropbox-specific implementation; no need for a general API here.

We can split files into batches depending on their size (e.g. 1-2 GB per batch?). In this case we can potentially get reasonable performance with multiple jobs (--jobs), and it's fine to fail - it would just restart some 1-2 GB batches.

Even if that is the case, I guess it won't be part of this PR. It's just good to do a bit of research to check whether we are missing some low-hanging fruit.

Selective sync can be pre-enabled via the API only on the Teams plan - that's the only option. Anyone can manually disable sync of a directory in their client, though (including free users with 2 GB of space).

okay. Let's recommend in the docs at least that sync should be disabled one way or another to avoid duplicating data and traffic?

Just tested rclone - it is 1/6 faster

:(

Btw, I see that rclone improved performance in some cases by providing a better chunk size, do we do the same? rclone/rclone#103


@shcheklein, I would need info from you what I need to adjust here to have it merged. I am now wrapping exceptions, but if you see anything else, please let me know.

yep, I don't see that much left for this PR. It's a very good and solid start.

One last thing we'll need to make sure of is that these vars are set properly:

    DEFAULT_VERIFY = False
    LIST_OBJECT_PAGE_SIZE = 1000
    TRAVERSE_WEIGHT_MULTIPLIER = 5
    TRAVERSE_PREFIX_LEN = 3
    TRAVERSE_THRESHOLD_SIZE = 500000
    CAN_TRAVERSE = True

from base.py. @pmrowla, could you help us with this? What information do you need? Dropbox is similar to GDrive - it's API-based, API calls are expensive (and potentially limited), so what would be the recommended setup here?

gmrukwa (Author) commented Nov 28, 2020

You mean that you have to start a "batch session", which means that you potentially have to save it somewhere, etc.?

The scenario in rclone, on Stack Overflow and everywhere else I saw was as follows:

  1. Check the file size vs. the chunk size.
  2. If file size < chunk size, use client.files_upload and you're done; otherwise proceed with the remaining steps.
  3. Open a session with client.files_upload_session_start.
  4. Create a cursor for the session (dropbox.files.UploadSessionCursor).
  5. Create commit info for the file being uploaded (dropbox.files.CommitInfo).
  6. Upload chunk by chunk with client.files_upload_session_append_v2.
  7. Finish the session with the cursor and commit info via client.files_upload_session_finish.
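
For reference, a minimal sketch of the steps above using the dropbox SDK (the chunk size and function shape are illustrative):

import os

import dropbox

CHUNK_SIZE = 48 * 1024 * 1024  # illustrative; must be a multiple of 4 MB


def upload(dbx, local_path, dropbox_path):
    file_size = os.path.getsize(local_path)
    with open(local_path, "rb") as fobj:
        if file_size <= CHUNK_SIZE:
            # 1-2: small files go through a single call
            dbx.files_upload(fobj.read(), dropbox_path)
            return
        # 3-5: open a session and prepare the cursor and commit info
        result = dbx.files_upload_session_start(fobj.read(CHUNK_SIZE))
        cursor = dropbox.files.UploadSessionCursor(
            session_id=result.session_id, offset=fobj.tell()
        )
        commit = dropbox.files.CommitInfo(path=dropbox_path)
        # 6: append chunk by chunk, keeping the cursor offset in sync
        while file_size - fobj.tell() > CHUNK_SIZE:
            dbx.files_upload_session_append_v2(fobj.read(CHUNK_SIZE), cursor)
            cursor.offset = fobj.tell()
        # 7: finish the session with the remaining bytes
        dbx.files_upload_session_finish(fobj.read(CHUNK_SIZE), cursor, commit)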

No one ever shows how to use the batch mode. ☝️ There are batch operations in the API for moves, deletions and folder creations, but not for upload. From what I understood from the docs, the client must acquire a lock over the namespace (== directory) before committing the file. This is automated for client.files_upload within the SDK - thus it probably cannot support batch mode in this setup (as locks are acquired one by one). There do exist dedicated locking functions, though.

Perhaps we could acquire locks in batch mode for a single client and then commit everything at once. I'm not sure, though, how to use these locks, as the SDK doesn't accept them while committing. I'm not even sure we should, in the case of multiple clients writing different files to the same directory (a reasonable case for computations distributed over a few machines).

We can split files into batches depending on their size (e.g. 1-2 GB per batch?). In this case we can potentially get reasonable performance with multiple jobs (--jobs), and it's fine to fail - it would just restart some 1-2 GB batches.

Yeah, I can enable that, even though rclone claims it speeds up upload by at most 10%. And this makes a difference only for big files - for small ones we're using client.files_upload anyway. The chunk cannot exceed 150 MB and needs to be a multiple of 4 MB, as stated in the docs. rclone has a default chunk of 48 MB.
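
For the parameterized chunk size, a small illustrative guard that enforces the constraints mentioned above (multiple of 4 MB, at most 150 MB):

def normalized_chunk_size(requested_mb):
    # Clamp to the largest multiple of 4 MB that fits under the 150 MB
    # API limit, and never go below a single 4 MB chunk.
    mb = max(4, min(int(requested_mb), 148))
    mb -= mb % 4
    return mb * 1024 * 1024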

okay. Let's recommend in the docs at least that sync should be disabled one way or another to avoid duplicating data and traffic?

On it.

To summarize the next steps:

  • wrap exceptions
  • parameterize chunk size
  • recommend enabling selective sync / smart sync in the docs
  • adjust the above-mentioned variables

gmrukwa (Author) commented Jan 3, 2021

@shcheklein, I tried to address all the above points and it seems I've finished. Could you take a look now? If we're done, let's merge it - my migration from Azure awaits. πŸ˜‚

shcheklein (Member) commented:

@gmrukwa thanks a lot πŸ™ ! I'll try to take a look asap!

gmrukwa (Author) commented Jan 9, 2021

Rebased to resolve merge conflicts ☝🏻

shcheklein (Member) left a comment:

Please see some requests for refactoring, and see the PR I made to properly wire up import-url/get-url/external deps and fix certain things.

Next steps on my end:

  • Try to figure out why it is SOOOO slow. It took me ~10 minutes to import-url 140 MB / 32 files. Is that expected?
  • I got my-project (Ignored Item Conflict) (Ignored Item Conflict) in my Dropbox - what is this?
  • Figure out case sensitivity - /Users/ivan/Projects/test-dropbox/Books/../books/ ... should we use the regular path, not the lowercased one, in walk_files?
  • Review tests and try to run them, wire up import-url and other regular tests for remotes
  • Check if the app is set up properly
  • Review the docs once again

)(func)


class DropboxWrapper:
Member

Feel like we can do some meta-programming https://stackoverflow.com/questions/2704434/intercept-method-calls-in-python to wrap all existing methods with decorators?
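
A minimal sketch of that interception idea (the class name and the decorator are illustrative, not the PR's code):

class InterceptingWrapper:
    """Wrap a dropbox.Dropbox client and decorate every method lookup."""

    def __init__(self, client, decorator):
        self._client = client
        self._decorator = decorator  # e.g. an error-translating decorator

    def __getattr__(self, name):
        # Called only for attributes not found on the wrapper itself,
        # so every SDK method gets wrapped on access.
        attr = getattr(self._client, name)
        return self._decorator(attr) if callable(attr) else attr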

app_key=self.DROPBOX_APP_KEY,
app_secret=self.DROPBOX_APP_SECRET,
)
dbx.check_and_refresh_access_token()
Member

What happens if the token expires right after this call?

(in Google Drive the client itself handles this: if it gets a specific HTTP error code, it runs a token refresh and retries the call)
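
One illustrative pattern for that edge case (not what the PR currently does): refresh once and retry when the SDK reports an auth failure.

from dropbox.exceptions import AuthError


def call_with_refresh(dbx, method, *args, **kwargs):
    # Hypothetical helper: if the short-lived access token expired between
    # the refresh above and the actual call, refresh once more and retry.
    try:
        return method(*args, **kwargs)
    except AuthError:
        dbx.check_and_refresh_access_token()
        return method(*args, **kwargs)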

Member

Looks like it also should be wrapped, otherwise I'm getting this:

  File "/Users/ivan/Projects/dvc/dvc/tree/dropbox.py", line 131, in auth
    dbx.check_and_refresh_access_token()
  File "/Users/ivan/Projects/dvc/.env/lib/python3.8/site-packages/dropbox/dropbox_client.py", line 358, in check_and_refresh_access_token
    self.refresh_access_token(scope=self._scope)
  File "/Users/ivan/Projects/dvc/.env/lib/python3.8/site-packages/dropbox/dropbox_client.py", line 397, in refresh_access_token
    raise AuthError(request_id, err)
dropbox.exceptions.AuthError: AuthError('f584e934424c45b6a9af769f1b5edc84', AuthError('invalid_access_token', None))

if token is invalid

cred_prv = FileCredProvider(self.repo)
if cred_prv.can_auth():
    logger.debug("Logging into Dropbox with credentials file")
    return cred_prv.auth().client
Member

should we save credentials here in this flow (in case they are refreshed)?

Author

I didn't see the refresh token change at any point. Therefore, you actually only need the refresh token to get the access token each time before the API call. The remaining elements could probably be dropped, but I introduced them to keep compatibility in case they start expiring the refresh token.

Member

Yep, but it means we are going to keep refreshing them on every run? (Extra API call, extra time per operation.) Probably not the biggest problem, but I thought it could be a simple fix since you already have the save() mechanics.

Author

The access token lives for 4 hours, which is not much. We create the client once per dvc command being run, so the update takes place at most once per CLI run.


class EnvCredProvider(DropboxCredProvider):
    def can_auth(self):
        return bool(os.environ.get(REFRESH_TOKEN, False))
Member

I think the existence of any of the three vars should trigger this flow. It would also be great to then detect that some of them are not defined and throw a DvcException.

Author

I'll refer to my comment above. You actually need only the refresh token, as it allows you to get the access token anytime. The refresh token is not an ephemeral one-time token just for a single refresh; you can use it as many times as you want.

Member

interesting, thanks!

Thinking from the user perspective, I still think it's better to try this type of auth if we detect any of those vars (and fail if they are not enough, e.g. REFRESH_TOKEN is not set), just to be extra explicit and to avoid situations where you set some vars but it still uses the file without any explanation. WDYT?

Author

I thought of two approaches:

  • requiring just the refresh token - that seems reasonable if it's the only must-have element
  • requiring everything - in case they shorten the life of the refresh token or make it usable just once

It's your decision, I can adjust the code as needed.

Member

So, there are two different things (correct me if I'm wrong):

  1. Trigger - some factor that defines that we go down the path of a specific auth (and don't try others)
  2. Validation - a check that the set of params is enough to go down that path

So, in this specific case (to my mind):

  1. The trigger can be any of the env vars being present. It's better from the user perspective; we are extra explicit.
  2. Validation can check the existence of the refresh token, or also require both keys to be present (I'm not sure if that makes sense at all)?

Btw, what would the actual workflow with env vars look like? Where do users get them in the first place?
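
To make the trigger-vs-validation split concrete, a sketch of 1 + 2 above - the env-var names other than the PR's REFRESH_TOKEN constant are assumed here, and DropboxCredProvider is the PR's base class:

import os

from dvc.exceptions import DvcException

# Env-var names; values assumed for illustration.
REFRESH_TOKEN = "DROPBOX_REFRESH_TOKEN"
APP_KEY = "DROPBOX_APP_KEY"
APP_SECRET = "DROPBOX_APP_SECRET"

_ENV_VARS = (REFRESH_TOKEN, APP_KEY, APP_SECRET)


class EnvCredProvider(DropboxCredProvider):
    # Trigger: any of the env vars is set. Validation: the refresh token
    # (the only must-have piece) must be present, otherwise fail loudly.
    def can_auth(self):
        return any(os.environ.get(var) for var in _ENV_VARS)

    def auth(self):
        if not os.environ.get(REFRESH_TOKEN):
            raise DvcException(
                f"Dropbox env auth was triggered but {REFRESH_TOKEN} is not set"
            )
        # ... build the Dropbox client from the refresh token, as the PR
        # already does for this provider ...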

def can_auth(self):
    if not os.path.exists(self._cred_location):
        return False
    with open(self._cred_location) as infile:
Member

I think the existence of the file should be enough to trigger it. If the content is bad, we can rely on the client to fail and ideally handle it gracefully.

Author

True, we could.

if ex.error.is_unsupported_file():
    raise DvcException(
        "Path '{}' is not downloadable:\n\n"
        "Confirm the file is a casual file, not a Dropbox thing."
Member

Hmm, could you clarify what Dropbox thing means?

Author

They do have e.g. Dropbox Paper documents or Dropbox Vault items which are not downloadable this way.

Member

Yep, makes sense. Dropbox thing sounds a bit strange though :) - what do they call them in the docs?

Author

They don't name them in the SDK docs at all (or at least I haven't spotted it) 😂 It just says that something may not be downloadable.

Member

Looks like they are called Non-downloadable files - https://www.dropbox.com/lp/developers/reference/dbx-file-access-guide#files-that-require-special-consideration .

Confirm that it's a regular file, not a non-downloadable Dropbox file

(I don't like it either, but at least it sounds a bit more formal and can be googled)

json.dump(_creds, outfile)


def path_info_to_dropbox_path(path_info):
Member

@efiop I would need your help here - what would be the best way to handle this? Instead of writing this conversion over and over again in each method below, is there a better way?

Btw, maybe push it into DropboxWrapper above?

Member

Actually, I think we should ask users to use the dropbox:///path notation (as in file:///path) when no hostname is specified. That is the proper way to handle this case, since Dropbox doesn't have buckets, servers, etc.

Member

I've updated the code in gmrukwa#1 to allow an empty hostname, so that we don't have to deal with a bucket that is not a bucket at all (and to handle cases like import-url dropbox:///file.pdf).

Still, I wonder what the best way is to handle this leading / here, @efiop? Does Dropbox expect paths this way? Is there a convenience method in PathInfo to return the path with a leading /?

Author

Dropbox generally expects a path like /a/path/to/file, rooted at the top of your Dropbox home. Unfortunately, I couldn't get it to work simply without such a construct. You will probably know a better way.

Member

Yep, I made some changes to support a URL without a hostname (which is not present here). Usually they look like file:///path (mind the triple ///).
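
For context, a sketch of what that conversion boils down to, assuming path_info.path holds the path portion of the dropbox:// URL:

def path_info_to_dropbox_path(path_info):
    # Dropbox API paths are rooted at the account's home and must start
    # with a single leading slash ("/a/path/to/file"); the root folder
    # itself is addressed with an empty string.
    path = path_info.path.strip("/")
    return "/" + path if path else ""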

fobj, "read", desc=name, total=file_size, disable=no_progress_bar
) as wrapped:
if file_size <= chunk_size:
logger.debug("Small file upload")
Member

feels a bit too excessive - make it trace? or remove?

chunk_size = self.chunk_size_mb * 1024 * 1024
to_path = path_info_to_dropbox_path(to_info)
file_size = os.path.getsize(from_file)
logger.debug("Uploading " + from_file + " to " + to_path)
Member

Should it be part of the parent's call? Is it now? Better to double-check to prevent duplicate log messages.

def __init__(self, repo, config):
    super().__init__(repo, config)

    self.path_info = self.PATH_CLS(config.get("url", "dropbox://default"))
Member

should it be None instead? "dropbox://default" looks artificial and might affect existing folder for some users?

Author

Right, it should be None and potentially throw an exception.

return self.client.files_download(*args, **kwargs)


ACCESS_TOKEN = "DROPBOX_ACCESS_TOKEN"
Member

It looks like rclone uses a different way of authenticating? It gets:

{"access_token":".***","token_type":"bearer","expiry":"0001-01-01T00:00:00Z"}

I don't see a refresh token, and expiry is not set. Do you know something about this?

Author

They may be using legacy auth, which will be dropped soon by Dropbox.

shcheklein (Member) commented:

@gmrukwa the biggest problem for me right now is that downloading a directory of ~142 MB, which contains files of ~1 MB to 10 MB, takes a lot of time.

rclone does it in 20 s
dvc pull takes minutes; sometimes a single file is slower than 20 s

Do you have any idea what to try, where to look for the problem?

gmrukwa (Author) commented Jan 22, 2021

@gmrukwa the biggest problem for me right now is that downloading a directory of ~142 MB, which contains files of ~1 MB to 10 MB, takes a lot of time.

rclone does it in 20 s
dvc pull takes minutes; sometimes a single file is slower than 20 s

Do you have any idea what to try, where to look for the problem?

They use the download function from the SDK as well (here), and so do we. I don't have any other ideas on that.

shcheklein (Member) commented:

@gmrukwa yep, I have no idea either. Let me try to debug it a bit during this weekend to see if I can find something.

shcheklein (Member) commented:

@gmrukwa I improved the download tremendously (it's now 25 Mb/s vs 150 Kb/s on a large file, and it downloads a dir with many files faster than rclone), here: gmrukwa@3ea0908. It's a simple fix. I haven't had time to check/compare upload yet.

gmrukwa (Author) commented Jan 24, 2021

@shcheklein, that fix makes perfect sense! I was actually looking for that option in the Dropbox SDK, but it didn't occur to me that the download chunk size is controlled at the level of iter_content... 🤯 I'd use the chunk_size setting value here - wouldn't that make sense?

shcheklein (Member) commented:

@gmrukwa

I'd use the chunk_size setting value here, wouldn't it make sense?

Good question 🤔 To be honest, I'm not sure about this. My only concern is that they might have slightly different limits, and one expects a range in KBs while the other expects MBs for good performance. It looks like rclone's setting doesn't affect download?

gmrukwa (Author) commented Jan 26, 2021

@gmrukwa

I'd use the chunk_size setting value here - wouldn't that make sense?

Good question 🤔 To be honest, I'm not sure about this. My only concern is that they might have slightly different limits, and one expects a range in KBs while the other expects MBs for good performance. It looks like rclone's setting doesn't affect download?

rclone seems to be using a different method from the SDK, more like files_download_to_file, but with DVC we want to show the download progress. I've reviewed the code for files_download_to_file, and Dropbox internally indeed uses a 2**16 chunk size for download. Let's leave it as you proposed, then.
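
For reference, a sketch of the streaming download being discussed - files_download returns the metadata plus a requests Response, and the iter_content chunk size is what the fix tunes (the 4 MB value here is illustrative):

def download(dbx, dropbox_path, local_path, chunk_size=4 * 1024 * 1024):
    metadata, response = dbx.files_download(dropbox_path)
    with open(local_path, "wb") as fobj:
        # Stream the body instead of reading it at once; the chunk size
        # passed to iter_content governs throughput here.
        for chunk in response.iter_content(chunk_size=chunk_size):
            fobj.write(chunk)
    return metadata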

efiop (Contributor) commented Mar 8, 2021

Quick update: we are migrating to fsspec and that will be our plugin mechanism for the future. We won't be able to provide proper support for dropbox ourselves, so the best way to go about it would be for you to implement dropboxfs using fsspec and maintain that implementation (try registering it in fsspec, other people might find it useful too). On our end, we will allow using that as a plugin as seamlessly as possible. Thank you so much for contributing!

There is an existing 3rd party implementation https://github.com/MarineChap/dropboxdrivefs , but I'm not sure how stable it is and whether it supports the same features that this PR supports. There has been some pretty active discussion around it in fsspec/filesystem_spec#207 . We might want to consider either contributing to the existing package or creating a new implementation.
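
For anyone picking this up later, the fsspec plugin route looks roughly like this - a bare skeleton, not the dropboxdrivefs package itself:

import fsspec
from fsspec import AbstractFileSystem


class DropboxFileSystem(AbstractFileSystem):
    protocol = "dropbox"

    def ls(self, path, detail=True, **kwargs):
        # A real implementation would wrap the dropbox SDK here (listing,
        # info, _open for reads/writes, etc.).
        raise NotImplementedError


# Registering the implementation lets fsspec-based tools resolve
# dropbox:// URLs to this filesystem.
fsspec.register_implementation("dropbox", DropboxFileSystem, clobber=True)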

efiop closed this Mar 8, 2021
efiop mentioned this pull request Oct 8, 2021
Successfully merging this pull request may close these issues: remote: support Dropbox.

4 participants