Create Dropbox remote #4793
Conversation
@efiop, just pinging to let you know the direction this issue is going.
Thank you @gmrukwa! At first glance this looks good and the direction looks correct too; you can safely continue with the rest of the points. We'll need some time for a proper review though, as we currently have a lot on our plate, so thank you in advance for your patience.
Okay guys, I've created docs and tests, and the last element to do is to provide the details of the Dropbox API connection - this is to be done by you. I'm not sure how you want to secure the App Key and App Secret, but here is the spot. You need to go to the Dropbox App Console and create your application (permission and scope details here). Dropbox allows you to apply for production usage as soon as you have 50 users linked to your app, but there's some possibility of getting accepted earlier. I didn't go to the same level of complexity as the Google Drive remote, as there's no need to give users an option to configure a separate app key and secret at this stage, since the Dropbox API has no publicly specified limits.
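For reference, the spot in question would be just a pair of class-level constants, something like this minimal sketch (the `DropboxTree` class name is an assumption based on the `dvc/tree/dropbox.py` path; the constant names match the diff below, and the values are placeholders):

```python
class DropboxTree:  # hypothetical class name for dvc/tree/dropbox.py
    # Placeholders - the real values come from the Dropbox App Console
    # (https://www.dropbox.com/developers/apps) and would need to be
    # secured by the maintainers before release.
    DROPBOX_APP_KEY = "<app-key>"
    DROPBOX_APP_SECRET = "<app-secret>"
```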
Okay, I will keep track of the things to review/improve/discuss here. Let's follow up in comments as we address and resolve them, and check the box + leave a link to the comment.
Force-pushed from d53bced to b5a654a
There's an issue with dependencies for S3 mocks in Python < 3.8 after rebasing onto upstream changes.
Regarding the rate limits: https://www.dropbox.com/lp/developers/reference/dbx-performance-guide It looks like we need to test parallel upload of many files (that's what DVC does by default anyway). It might be very unstable if we don't handle 429 properly and don't respect Retry-After. We can do parallel uploads, but it might require changing the remote class interface.
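Something like this minimal sketch is the kind of handling meant here, assuming the official `dropbox` SDK (it raises `RateLimitError` for HTTP 429 and exposes the Retry-After hint as `backoff`; the helper name is hypothetical):

```python
import time

from dropbox.exceptions import RateLimitError


def call_with_backoff(func, *args, max_retries=5, **kwargs):
    """Retry a Dropbox SDK call on HTTP 429, honoring Retry-After."""
    for attempt in range(max_retries):
        try:
            return func(*args, **kwargs)
        except RateLimitError as exc:
            # exc.backoff carries the server's Retry-After hint (seconds);
            # fall back to exponential backoff if it is missing.
            time.sleep(exc.backoff or 2 ** attempt)
    return func(*args, **kwargs)  # final attempt, let the error propagate
```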
@gmrukwa could you rebase? Dependency problems should be fixed on the current master.
Getting an exception now on uploading a dir (it got interrupted in the middle). Now I can't recover from this.
@gmrukwa hey! Could you please take a look at those issues we are getting?
@shcheklein absolutely - I just need to sort a few more things out and hopefully today I will be able to work on that.
@shcheklein, I do have everything according to the guide, especially since they embedded most of these considerations into the SDK. I've checked, and the upload is done in a very similar manner in rclone (large file upload, small vs large file upload switch). They just return the error code instead of purposeful

On the other hand, Dropbox has some problems when running as a desktop client with tons of small files as well. The issue is probably related to the fact that they "lock the namespace" on each file upload. There's no file lock, but a directory lock each time. For tons of small files the upload speed is at the level of 300-500 KB/s; for a few big ones it reaches 12-14 MB/s smoothly. I played with the file sizes, chunk sizes, and number of jobs per upload. File size is the most important factor; the number of jobs doesn't allow for any significant improvement here, nor does the chunk size, especially since the chunk size is supposed to be a multiple of 4 MB. Below a certain size it doesn't matter if the file is 5 KB or 100 KB - the locking process on the Dropbox side lasts long enough to make it irrelevant.

Could you provide some more details on how you ended up in a non-recoverable state? I cannot reproduce that (my approaches are already included as skipped tests). Remaining elements you mentioned:
Teams support looks as follows: you do auth the exact same way as for an individual account, and here's where the files go (I specified path
Force-pushed from 554fd1a to 2b96dfb
@gmrukwa thanks for the update!
I was interrupting it with Ctrl-C during the upload process of many image files as far as I remember.
Per file, or overall? Have you tried to run rclone - is it the same? Dropbox suggests to use
The test I prepared uses process killing. Not sure yet how I could simulate Ctrl-C more precisely here.
Unfortunately, it's the overall speed. I will test that with rclone as well, but you need to keep in mind that it was tested with 5 KB and 100 KB files. I remember seeing similar performance from the Dropbox client itself when handling numerous small files. That's not great news, so I'll test once again with rclone. The only promising option I see (that we discussed already) is expanded on below.
From the technical perspective, there are two aspects to consider.
Basically it seems that we need a single place to call this function and this function for all the files and commits together. If you see some other option for that than wide changes in the codebase, please suggest one. It's not clear, however, how that could work.
I do have some strange issues with path parsing here; I need to dig deeper. I'd decide, though, whether we want to keep that feature enabled for Dropbox, as the folders are not uniquely identified by IDs. The file URL doesn't work for multiple users (it may change) unless the shares are exactly the same - it would not guarantee reproducibility similar to HTTP or S3.
I saw how rclone does it. Do you mean wrapping similar to what GDrive has? (Azure doesn't, BTW.)
Do you mean the following scenario?
If that's the case, the Dropbox API doesn't allow it, as it is not an account-level setting but a local client setting. There's a second option called "selective sync" that could disable the default sync of the DVC root directory. There is a subtle difference:
You can use none of them, one, or both together. I think that excluding DVC targets via "selective sync" by default would make the most sense. After validation: unfortunately, the SDK allows configuring selective sync settings only through the Team API.
I did not find any that we could use here. Most are outdated/incompatible.
That's the problem - I can't reproduce it now by killing the process :( Let's wait for it to appear again, I guess, as we test more.
yep, definitely not needed as part of this PR. Just check that we handle it gracefully. We can figure this out later.
that's sad :(. Let's see if rclone does a better job here.
I hope not, but it'll require some refactoring. I would expect that we will need to override or wrap

On the other hand, 300-400 KB/s - would it even be enough for your case? Can we consider this a reasonable implementation?
yep, Azure is very outdated. GDrive is a better example.
but neither of them could be enabled via the API, right? And selective sync is probably a premium feature (which might be fine, since a 2 GB free account is not very reasonable).
Just tested
I didn't work with numerous small files; I have a few bigger files (some of them around 1 GB). That would be perfectly fine, as the upload speed would be approx. 12 MB/s as per my benchmark today.
Selective sync can be pre-enabled via the API only on the Teams plan - that's the only opportunity. Anyone can disable sync of a directory in their client manually, though (including free users with 2 GB of space).
Here the problem is that the "Batch API" isn't fully provided by Dropbox. It's not obvious how one could best proceed unless a very specific case gets addressed (due to sessions, locks, commits, and a completely different approach for small files). @shcheklein, I would need info from you on what I need to adjust here to have it merged. I am now wrapping exceptions, but if you see anything else, please let me know.
You mean that you have to start a "batch session", which means that you have to save it somewhere potentially, etc., etc.? Yes, it might complicate things, but it can be part of the internal Dropbox-specific implementation; no need for a general API here. We can split files into batches depending on their size (e.g. 1-2 GB per batch?). In this case we can potentially get reasonable performance with multiple jobs ( Even if it is the case, I guess it won't be part of this PR. It's just good to do a bit of research in case we are missing some low-hanging fruit. A rough sketch of the batch flow follows.
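An illustrative sketch of how that flow could look with the official SDK - not the PR's implementation (the 4 MB multiple and the close-before-finish step follow the SDK docs; error handling and polling of the returned async job are omitted):

```python
import dropbox
from dropbox.files import CommitInfo, UploadSessionCursor, UploadSessionFinishArg

CHUNK = 4 * 1024 * 1024  # Dropbox expects chunk sizes in multiples of 4 MB


def upload_batch(dbx: dropbox.Dropbox, pairs):
    """Upload (local_path, remote_path) pairs via sessions, commit once."""
    finish_args = []
    for local, remote in pairs:
        with open(local, "rb") as fobj:
            # Start a session with the first chunk...
            session = dbx.files_upload_session_start(fobj.read(CHUNK))
            cursor = UploadSessionCursor(
                session_id=session.session_id, offset=fobj.tell()
            )
            # ...append the rest...
            while True:
                chunk = fobj.read(CHUNK)
                if not chunk:
                    break
                dbx.files_upload_session_append_v2(chunk, cursor)
                cursor.offset = fobj.tell()
            # ...and close the session before the batch finish.
            dbx.files_upload_session_append_v2(b"", cursor, close=True)
        finish_args.append(
            UploadSessionFinishArg(cursor=cursor, commit=CommitInfo(path=remote))
        )
    # One batch commit takes the namespace lock once instead of per file;
    # the returned launch object may need polling via
    # files_upload_session_finish_batch_check().
    return dbx.files_upload_session_finish_batch(finish_args)
```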
okay. Let's at least recommend in the docs that sync should be disabled one way or another, to avoid duplicating data and traffic?
:( Btw, I see that rclone improved performance in some cases by providing a better chunk size - do we do the same? rclone/rclone#103

> @shcheklein, I would need info from you on what I need to adjust here to have it merged. I am now wrapping exceptions, but if you see anything else, please let me know.

yep, I don't see that much left for this PR. It's a very good and solid start. One last thing that we'll need to make sure we set properly are those vars from the
The scenario in
No one ever shows how to use the batch mode. There are batch operations in the API for moves, deletions, and folder creations, but not for upload. From what I understood from the docs, the client must acquire a lock over the namespace (== directory) before committing the file. This is automated for

Perhaps we could acquire locks in batch mode for a single client and then commit everything at once. I'm not sure, though, how to use these locks, as the SDK doesn't accept them while committing. I'm not even sure if we should, in the case of multiple clients writing different files to the same directory (a reasonable case for computations distributed over a few machines).
Yeah, I can enable that. Even though
On it. To summarize the next steps:
@shcheklein, I tried to address all the above points and it seems I've finished. Could you take a look now? If we're done, let's merge it - my migration from Azure awaits.
@gmrukwa thanks a lot! I'll try to take a look ASAP!
Rebased to resolve merge conflicts.
Please see some requests for refactoring; see the PR I made to properly wire up import-url/get-url/external deps and fix certain things.
Next steps on my end:
- Try to figure out why it is SOOOO slow. It took me ~10 minutes to import-url 140 MB / 32 files. Is that expected?
- I got my-project (Ignored Item Conflict) (Ignored Item Conflict) in my Dropbox - what is this?
- Figure out case sensitivity - /Users/ivan/Projects/test-dropbox/Books/../books/ ... should we use the regular path, not lowercased, in walk_files?
- Review tests and try to run them; wire up import-url and other regular tests for remotes.
- Check if the APP is set up properly.
- Review docs once again.
)(func)

class DropboxWrapper:
Feels like we could do some meta-programming (https://stackoverflow.com/questions/2704434/intercept-method-calls-in-python) to wrap all existing methods with decorators?
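Something in the spirit of this sketch, for instance - a `__getattr__`-based proxy (`wrap_errors` below is a stand-in for whatever decorator translates SDK exceptions into `DvcException`):

```python
class DropboxWrapper:
    """Proxy that decorates every method looked up on the raw SDK client."""

    def __init__(self, client, decorator):
        self._client = client
        self._decorator = decorator

    def __getattr__(self, name):
        # Only called for attributes not found on the wrapper itself,
        # so the proxy stays transparent for its own fields.
        attr = getattr(self._client, name)
        if callable(attr):
            return self._decorator(attr)
        return attr


# usage (names are illustrative):
# dbx = DropboxWrapper(dropbox.Dropbox(oauth2_access_token=token), wrap_errors)
```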
app_key=self.DROPBOX_APP_KEY,
app_secret=self.DROPBOX_APP_SECRET,
)
dbx.check_and_refresh_access_token()
What happens if the token expires right after this call? (In Google Drive the client itself handles this: if it gets a specific HTTP error code, it runs a token refresh and retries the call.)
Looks like it should also be wrapped; otherwise I'm getting this:
File "/Users/ivan/Projects/dvc/dvc/tree/dropbox.py", line 131, in auth
dbx.check_and_refresh_access_token()
File "/Users/ivan/Projects/dvc/.env/lib/python3.8/site-packages/dropbox/dropbox_client.py", line 358, in check_and_refresh_access_token
self.refresh_access_token(scope=self._scope)
File "/Users/ivan/Projects/dvc/.env/lib/python3.8/site-packages/dropbox/dropbox_client.py", line 397, in refresh_access_token
raise AuthError(request_id, err)
dropbox.exceptions.AuthError: AuthError('f584e934424c45b6a9af769f1b5edc84', AuthError('invalid_access_token', None))
if token is invalid
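A minimal sketch of the kind of wrapping meant here - translating the SDK's `AuthError` into a user-facing `DvcException` (the helper name is hypothetical):

```python
from dropbox.exceptions import AuthError

from dvc.exceptions import DvcException


def checked_refresh(dbx):
    try:
        dbx.check_and_refresh_access_token()
    except AuthError as exc:
        # Surface a readable error instead of the raw SDK traceback above.
        raise DvcException(
            "Dropbox authentication failed: the stored token is invalid or "
            "revoked. Please re-run the authorization flow."
        ) from exc
```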
cred_prv = FileCredProvider(self.repo)
if cred_prv.can_auth():
    logger.debug("Logging into Dropbox with credentials file")
    return cred_prv.auth().client
should we save credentials here in this flow (in case they are refreshed)?
I didn't spot the refresh token changing at any time. Therefore, you actually only need the refresh token to get the access token each time before the API call. The remaining elements could probably be dropped, but I introduced them to keep compatibility in case they start to time out the refresh token.
yep, but it means we are going to keep refreshing them every run? (Extra API call, extra time per operation.) Probably not the biggest problem, but I thought it could be a simple fix since you already have the save() mechanics.
The access token lives 4 hours, which is not much. We create the client once per dvc command being run, so the update takes place at most once per CLI run.
class EnvCredProvider(DropboxCredProvider):
    def can_auth(self):
        return bool(os.environ.get(REFRESH_TOKEN, False))
I think the existence of any of the three vars should trigger this flow. It would also be great to detect that some of them are not defined and then throw a DvcException.
I'll refer to my comment above. You actually need only the refresh token, as it allows you to get the access token anytime. The refresh token is not an ephemeral one-time token just for a single refresh; you can use it as many times as you want.
Interesting, thanks! Thinking from the user perspective, I still think it's better to try this type of auth if we detect any of those vars (and fail if they are not enough, e.g. REFRESH_TOKEN is not set). Just to be extra explicit, and to avoid situations where you set some vars but it still uses the file without any explanation. WDYT?
I thought of two approaches:
- requiring just the refresh token - that seems reasonable if it's the only must-have element
- requiring everything - in case they shorten the life of the refresh token or make it usable just once
It's your decision; I can adjust the code as needed.
So, there are two different things (correct me if I'm wrong):
- Trigger - some factor that determines that we take the path of a specific auth (and don't try others)
- Validation - a check that the set of params is enough to go down that path
So, in this specific case (to my mind):
- The trigger can be any of the env vars being present. It's better from the user perspective; we are extra explicit.
- Validation can check the existence of the refresh token. Or should we also allow requiring both keys to be present (I'm not sure if that makes sense at all)?
Btw, how would the actual workflow with env vars look? Where do users get them in the first place?
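To make the trigger/validation split concrete, a sketch under the assumption of three env vars (only `DROPBOX_ACCESS_TOKEN` appears in the diff below; the other two names are illustrative):

```python
import os

from dvc.exceptions import DvcException

ACCESS_TOKEN = "DROPBOX_ACCESS_TOKEN"
REFRESH_TOKEN = "DROPBOX_REFRESH_TOKEN"  # assumed name
TOKEN_EXPIRY = "DROPBOX_TOKEN_EXPIRY"  # assumed name
ALL_VARS = (ACCESS_TOKEN, REFRESH_TOKEN, TOKEN_EXPIRY)


def can_auth():
    # Trigger: any of the env vars being present selects this auth flow.
    return any(os.environ.get(var) for var in ALL_VARS)


def validate():
    # Validation: the refresh token is the one must-have element.
    if not os.environ.get(REFRESH_TOKEN):
        raise DvcException(
            "Dropbox env auth was selected because one of {} is set, "
            "but {} is missing.".format(ALL_VARS, REFRESH_TOKEN)
        )
```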
def can_auth(self):
    if not os.path.exists(self._cred_location):
        return False
    with open(self._cred_location) as infile:
I think the existence of the file should be enough to trigger it. If the content is bad, we can rely on the client to fail and ideally handle it gracefully.
True, we could.
if ex.error.is_unsupported_file():
    raise DvcException(
        "Path '{}' is not downloadable:\n\n"
        "Confirm the file is a casual file, not a Dropbox thing."
hmm, could you clarify what Dropbox thing means?
They do have e.g. Dropbox Paper documents or Dropbox Vault items, which are not downloadable this way.
yep, makes sense. Dropbox thing sounds a bit strange though :) - how do they call them in the docs?
They do not name them in the SDK docs at all (or at least I haven't spotted it). It's just written that something may not be downloadable.
Looks like we call them Non-downloadable files - https://www.dropbox.com/lp/developers/reference/dbx-file-access-guide#files-that-require-special-consideration .
Confirm that it's a regular file, not a non-downloadable Dropbox file
(I don't like it either, but at least it sounds a bit more formal and can be googled)
json.dump(_creds, outfile)

def path_info_to_dropbox_path(path_info):
@efiop I would need your help here - what would be the best way to handle this? Instead of writing this conversion over and over again in each method below, is there a better way? Btw, maybe push it into DropboxWrapper above?
Actually, I think that we should ask users to use the dropbox:///path notation (as in file:///path) when the hostname is not specified. This is the proper way to handle this case, since Dropbox doesn't have buckets, servers, etc.
I've updated the code in gmrukwa#1 to allow an empty hostname so that we don't have to deal with a bucket that is not a bucket at all (and to deal with cases like import-url dropbox:///file.pdf). Still, I wonder what is the best way to handle this leading / here, @efiop? Dropbox expects paths this way? Is there a convenience method in PathInfo to return the path with a leading /?
Dropbox generally expects a path like /a/path/to/file, which gets rooted at the root of your Dropbox home. I couldn't get it to work simply without such a construct, unfortunately. You probably know a better way.
yep, I did some changes to support URLs without a hostname (which is not present here). Usually they look like file:///path (mind the triple ///).
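For illustration, the conversion under that convention could be as small as this sketch (the exact PathInfo API may differ; Dropbox addresses the root folder as "" rather than "/"):

```python
def path_info_to_dropbox_path(path_info):
    # Dropbox wants absolute paths like "/a/path/to/file", rooted at
    # the user's Dropbox home; the root folder itself is "".
    path = "/" + path_info.path.lstrip("/")
    return "" if path == "/" else path
```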
fobj, "read", desc=name, total=file_size, disable=no_progress_bar
) as wrapped:
    if file_size <= chunk_size:
        logger.debug("Small file upload")
feels a bit excessive - make it trace? Or remove it?
chunk_size = self.chunk_size_mb * 1024 * 1024
to_path = path_info_to_dropbox_path(to_info)
file_size = os.path.getsize(from_file)
logger.debug("Uploading " + from_file + " to " + to_path)
Should it be part of the parent's call? Is it now? Better to double-check to prevent multiple logs.
def __init__(self, repo, config):
    super().__init__(repo, config)

    self.path_info = self.PATH_CLS(config.get("url", "dropbox://default"))
Should it be None instead? "dropbox://default" looks artificial and might affect an existing folder for some users?
Right, it should be None and potentially throw an exception.
return self.client.files_download(*args, **kwargs)

ACCESS_TOKEN = "DROPBOX_ACCESS_TOKEN"
It looks like rclone uses a different way of authenticating? It gets {"access_token":".***","token_type":"bearer","expiry":"0001-01-01T00:00:00Z"} - I don't see a refresh token, and expiry is not set. Do you know something about this?
They may be using legacy auth, which will be dropped soon by Dropbox.
@gmrukwa the biggest problem for me right now is that download takes a lot of time for a directory of ~142 MB which contains files of ~1 MB to 10 MB. Do you have any idea what to try, or where to look for the problem?
They use
@gmrukwa yep, I have no idea either. Let me try to debug it a bit during this weekend to see if I can find something.
@gmrukwa improved download tremendously (it's now 25 MB/s vs 150 KB/s on a large file, and it downloads a dir with files faster than rclone) here: gmrukwa@3ea0908. It's a simple fix. Haven't had time to check/compare upload yet.
@shcheklein, that fix makes perfect sense! I was actually looking for that option in the Dropbox SDK, but it didn't come to my mind that the download chunk size is steered at the level of
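For readers following along: the gist of the fix (not the exact commit) is that `files_download` returns a `requests` Response, and the chunk size used to drain it governs throughput. A sketch, with an assumed chunk value:

```python
import dropbox

DOWNLOAD_CHUNK = 16 * 1024 * 1024  # assumed value for illustration


def download(dbx: dropbox.Dropbox, dropbox_path: str, local_path: str):
    # files_download returns (metadata, requests.models.Response); reading
    # the body in large chunks is what unlocks the streaming speed.
    _metadata, res = dbx.files_download(dropbox_path)
    with open(local_path, "wb") as fobj:
        for chunk in res.iter_content(chunk_size=DOWNLOAD_CHUNK):
            fobj.write(chunk)
    res.close()
```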
Good question - to be honest, I'm not sure about this. My only concern is whether they have somewhat different limits, and one expects a range in KBs, the second in MBs, to get good performance. It looks like rclone's setting doesn't affect download?
rclone seems to be using a different method from the SDK, more like
Quick update: we are migrating to fsspec, and that will be our plugin mechanism for the future. We won't be able to provide proper support for Dropbox ourselves, so the best way to go about it would be for you to implement a dropbox filesystem using fsspec and maintain that implementation (try registering it in fsspec - other people might find it useful too). On our end, we will allow using it as a plugin as seamlessly as possible. Thank you so much for contributing! There is an existing 3rd-party implementation, https://github.com/MarineChap/dropboxdrivefs, but I'm not sure how stable it is or whether it supports the same features that this PR supports. There has been some pretty active discussion around it in fsspec/filesystem_spec#207. We might want to consider either contributing to the existing package or creating a new implementation.
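For anyone picking this up, registering a third-party filesystem with fsspec is a one-liner; a sketch using the package mentioned above (treat the class path and constructor kwargs as assumptions about that package):

```python
import fsspec

# Map the "dropbox" protocol to the 3rd-party implementation; fsspec
# accepts a dotted import path and loads the class lazily.
fsspec.register_implementation(
    "dropbox", "dropboxdrivefs.DropboxDriveFileSystem", clobber=True
)

fs = fsspec.filesystem("dropbox", token="<oauth-token>")  # hypothetical kwarg
print(fs.ls("/"))
```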
I have followed the Contributing to DVC checklist.
If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Don't have time to play with that for the next few days, but uploading to make it visible already. Will polish that soon and:
Resolves #1277