-
Notifications
You must be signed in to change notification settings - Fork 45
Fix issues with tensorflow tf generator support for Object Storage #288
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@johnugeorge , could you take a look and see whether the changes look good to you? |
|
@hariharan-devarajan please review this PR also. |
| try: | ||
| if not use_pattern: | ||
| return tf.io.gfile.listdir(id) | ||
| # parse id to get bucket name and prefix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain the motivation in overriding the function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain the bug that you referred to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because when I run tf.io.gfile.listdir, I encountered an IndexError, saying the position passed into s.substr() is greater than s.size().
I looked into tensorflow code, tf.io.gfile.listdir is defined by list_directory_v2(), which calls GetChildren(). GetChildren() is defined in tensorflow/io and will basically return all the objects in the bucket with the given prefix. This func tries to do s.substr(s.size()+1). There is a detailed code walk in tensorflow/io#2149 that explains the bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@johnugeorge does this look good to you? Are you able to do a quick test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't this break the storage/framework abstraction? Right now, this PR will tie S3 storage with the tf framework.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@johnugeorge yes, but this is just to fix the original implementation, right? The original implementation for tf framework as well. All the create_node, walk_node functions are part of the framework class.
For pytorch, we can implement the same storage functions under pytorch_framework class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@annie-anna , after discussing with Johnu, we cannot accept the change in the framework layer. All the change should be inside the storage layer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhenghh04 @johnugeorge ack, I reverted changes in walk_node.
setup.py
Outdated
| f"hydra-core>={HYDRA_VERSION}", | ||
| "nvidia-dali-cuda120>=1.34.0", | ||
| "tensorflow>=2.11.0", | ||
| "tensorflow==2.13.1", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we set tensorflow>=2.13.1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment below.
|
@annie-anna The github action jobs are failing because of the dependency issue. tensorflow==2.13.1 and torch>=2.2.0 have incompatible dependencies. Could you check whether changing to tensorflow>=2.13.1 (and also releasing the constrain for tensorflow_io) still works for the S3 storage? |
@zhenghh04 Yes, we can set tensorflow to be >= 2.13.1. As long as tensorflow_io is installed the compatible version, it will still work for S3 storage. |
|
@zhenghh04 @johnugeorge Is there any other question or concern regarding the PR? |
|
Thanks very much @annie-anna. |
|
@zhenghh04 Why did the build fail again? |
|
@annie-anna , I incorporated your changes in the most recent PR #294 . Could you try to pull the most recent changes from main? |
Found a few bugs when I run MLPerf storage on Object Storage:
s3://.makedirsinstead ofmkdirto create intermediate dirs.walk_nodeis usingtf.io.gfile.listdir, which has a bug that causes out-of-index for substr.list_object_v2.