Skip to content

_ls_from_cache: weird behavior regarding comparisons with trailing slash #554

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
isidentical opened this issue Mar 3, 2021 · 6 comments · Fixed by #557
Closed

_ls_from_cache: weird behavior regarding comparisons with trailing slash #554

isidentical opened this issue Mar 3, 2021 · 6 comments · Fixed by #557

Comments

@isidentical
Copy link
Member

In this example, the path is stripped only for the contain check. When dircache contains the normalized version (a/b) but not a/b/ it will throw a KeyError though it probably shouldn't since this is what that check appears to be stand for. Another case is right below that code that compares the normalized version with the obtained one from the cache, which might not be normalized. So I was wondering whether this is intentional or should we use .rstrip() in both of the cases (self.dircache[path.rstrip('/')], f["name"].rstrip("/") == path.rstrip("/"))
https://github.com/intake/filesystem_spec/blob/6ee47aa6c10868c383be5cb13837f2a26a099f5d/fsspec/spec.py#L342-L343
https://github.com/intake/filesystem_spec/blob/6ee47aa6c10868c383be5cb13837f2a26a099f5d/fsspec/spec.py#L349

@martindurant
Copy link
Member

It would be a good idea to encode this into a test, so that we can not only fix it, but make sure we don't break it again. I would side with: the directory names in dircache should not include the trailing "/".

@isidentical
Copy link
Member Author

I can send a patch for the first normalization and a test though

I would side with: the directory names in dircache should not include the trailing "/".

What do you think about the second normalization? (f["name"].rstrip("/") == path.rstrip("/"))

@hayesgb
Copy link
Contributor

hayesgb commented Mar 8, 2021

@martindurant, @isidentical One of the issues I've encountered with Azure is that, because its a flat filesystem, each of these can be different:

container/prefix1/prefix2  <-- blob
container/prefix1/prefix2/ <-- can be a blob, but different than above, or a BlobPrefix, which is an artificial construct that to mimic folders, but can not be created directly

The distinctions are causing issues with Hive partitioning, as described here. I just merged this fix into master, but realized the fix doesn't (for the moment) align with this guidance, because a folder would, by default, have a trailing slash.

@martindurant
Copy link
Member

gcsfs and s3fs keep getting hit by this kind of thing too.
Let's come up with a set of rules that we can adhere to and test thoroughly, and document for users (who are confused by the different ways to access their data).

@TomAugspurger
Copy link
Collaborator

Let's come up with a set of rules that we can adhere to and test thoroughly, and document for users (who are confused by the different ways to access their data).

@martindurant is there a document for this (or issue for tracking writing this document)?

@martindurant
Copy link
Member

#562 is the discussion where I hope we can come to an agreement. I imagined the outcome to be a set of tests hosted in this repo, but run against all three object stores.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants