[ENH] IO - Change origin attribute when not found on system #6555
Codecov Report
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6555   +/-   ##
=======================================
  Coverage   87.78%   87.78%
=======================================
  Files         321      321
  Lines       69420    69445   +25
=======================================
+ Hits        60938    60962   +24
- Misses       8482     8483    +1
janezd left a comment
In general, I'm not very happy about this guessing, but something has to be done and I have no better idea. It will probably do the job.
# all column paths in lookup dirs
for ld in lookup_dirs:
    if all(os.path.exists(os.path.join(ld, v)) for v in table.get_column(attr)):
Maybe we should skip unknown values here, e.g. by adding if v?
Added. I also added a test case for it.
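For illustration, the suggested filter could look like the sketch below. The helper name all_paths_exist is hypothetical, and it assumes unknown values show up as empty strings or None; the actual representation in the table may differ.

```python
import os
import tempfile


def all_paths_exist(directory, values):
    """Return True if every known value names an existing path in directory.

    Assumption of this sketch: unknown values appear as empty strings or
    None, so the `if v` filter skips them.
    """
    return all(os.path.exists(os.path.join(directory, v)) for v in values if v)


# Usage: a directory holding a.png, checked against two candidate columns.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "a.png"), "w").close()
    # The unknown ("") value is skipped, so only a.png is checked.
    with_unknown = all_paths_exist(tmp, ["a.png", ""])
    # b.png does not exist, so the whole column fails the check.
    with_missing = all_paths_exist(tmp, ["a.png", "b.png"])
```

Note that a column consisting only of unknown values passes this check vacuously, so a caller may want to require at least one known value.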
file_dir = os.path.dirname(file_path)
parent_dir = os.path.dirname(file_dir)
# if file_dir already root file_dir == parent_dir
lookup_dirs = tuple({file_dir, parent_dir})
We probably want to look into file_dir first, and only then into parent_dir? Sets are unordered.
If you want to keep it short, use tuple({file_dir: 0, parent_dir: 0}). :)
You are right. Fixed.
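The ordering point can be checked in isolation. The sketch below uses a hypothetical helper, lookup_dirs_for, and relies on dict keys being unique and keeping insertion order (guaranteed since Python 3.7); dict.fromkeys is used here as an equivalent spelling of the {file_dir: 0, parent_dir: 0} trick.

```python
import os


def lookup_dirs_for(file_path):
    """Return the file's directory, then its parent, without duplicates."""
    file_dir = os.path.dirname(file_path)
    parent_dir = os.path.dirname(file_dir)
    # Dict keys are unique and keep insertion order, so file_dir always
    # comes first and a repeated root directory collapses to one entry.
    return tuple(dict.fromkeys((file_dir, parent_dir)))


nested = lookup_dirs_for("/data/images/table.csv")
at_root = lookup_dirs_for("/table.csv")
```

With a set, the two directories could come out in either order, so the parent directory might be searched before the file's own directory.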
parent_dir = os.path.dirname(file_dir)
# if file_dir already root file_dir == parent_dir
lookup_dirs = tuple({file_dir, parent_dir})
for attr in table.domain:
Why not just table.domain.metas?
Not only because of efficiency; I'm never sure whether a loop over domain includes metas or not. :)
Good idea. Image Analytics also considers only metas: https://github.com/biolab/orange3-imageanalytics/blob/02356b8c14b2ec2d63f3a2f697ce69e47f05c4fa/orangecontrib/imageanalytics/utils/image_utils.py#L15-L33
# if file_dir already root file_dir == parent_dir
lookup_dirs = tuple({file_dir, parent_dir})
for attr in table.domain:
    if "origin" in attr.attributes:
You could add if attr.is_string as a precaution. If someone uses origin for something else, this could prevent some false positives.
I extended this condition to discrete variables as well, since Image Analytics also searches those: https://github.com/biolab/orange3-imageanalytics/blob/02356b8c14b2ec2d63f3a2f697ce69e47f05c4fa/orangecontrib/imageanalytics/utils/image_utils.py#L15-L33
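A minimal sketch of the combined condition follows. Var is a hypothetical stand-in that only mimics the is_string, is_discrete and attributes members mentioned in this thread; it is not the real Orange variable class, and has_path_origin is an illustrative name.

```python
from dataclasses import dataclass, field


@dataclass
class Var:
    """Hypothetical stand-in for a variable; not the real Orange class."""
    name: str
    is_string: bool = False
    is_discrete: bool = False
    attributes: dict = field(default_factory=dict)


def has_path_origin(var):
    """True only for string or discrete variables that carry an origin."""
    return "origin" in var.attributes and (var.is_string or var.is_discrete)


metas = [
    Var("image", is_string=True, attributes={"origin": "/old/images"}),
    Var("label", is_discrete=True, attributes={"origin": "/old/labels"}),
    Var("score", attributes={"origin": "something else"}),  # neither type
    Var("note", is_string=True),                            # no origin
]
selected = [v.name for v in metas if has_path_origin(v)]
```

Here the type check filters out "score", which has an origin attribute but is neither a string nor a discrete variable.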
Tests fail because of the xgboost release. Fixed in #6570.
Issue
Files that Orange loads (with the File widget) may contain paths in one or more columns, for example paths to images for image analysis. One way to store such paths is to keep a common prefix as the origin entry in the variable's attributes and the last part of each path as the column value. The full paths (origin + column value) are absolute, so when the table is transferred to another computer, they may no longer be valid.
Description of changes
If the dataset's author provides the referenced files alongside the CSV file, they can be discovered and the origin can be fixed.
Additionally, I replaced the StringIO reader inputs in the tests with actual files. The code should not be adapted to cases that are only possible in tests.
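The origin repair described above could be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code from this PR: find_new_origin is a hypothetical name, it works on plain strings instead of a table, and it assumes unknown values are empty strings or None.

```python
import os
import tempfile


def find_new_origin(origin, values, file_path):
    """Return a directory under which the column values can be found.

    Keeps `origin` if it exists. Otherwise tries the data file's directory
    first and then its parent. Returns `origin` unchanged if nothing fits.
    """
    if os.path.isdir(origin):
        return origin
    file_dir = os.path.dirname(file_path)
    parent_dir = os.path.dirname(file_dir)
    names = [v for v in values if v]  # skip unknown values
    # dict.fromkeys keeps order and removes the duplicate at the root.
    for lookup_dir in dict.fromkeys((file_dir, parent_dir)):
        if names and all(
            os.path.exists(os.path.join(lookup_dir, v)) for v in names
        ):
            return lookup_dir
    return origin


# Usage: the images sit one level above the CSV file's directory.
with tempfile.TemporaryDirectory() as tmp:
    sub = os.path.join(tmp, "sub")
    os.mkdir(sub)
    open(os.path.join(tmp, "a.png"), "w").close()
    stale = os.path.join(tmp, "missing")  # origin from another computer
    fixed = find_new_origin(stale, ["a.png", ""], os.path.join(sub, "data.csv"))
    fixed_is_parent = fixed == tmp
```

Returning the original origin when no directory matches keeps the table unchanged rather than guessing further.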
Includes