Conversation

@annie-anna
Contributor

Found a few bugs when running MLPerf Storage on object storage:

  1. tfrecord index files still point to the local FS path rather than the object path.
    • Update index file paths to include the s3:// prefix.
  2. Creating a node fails when intermediate directories do not exist.
    • Use makedirs instead of mkdir so that intermediate directories are created.
  3. walk_node uses tf.io.gfile.listdir, which has a bug that raises an out-of-range error in substr.
    • Use boto3 list_objects_v2 instead.
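
The first two fixes can be illustrated with a minimal sketch. It uses the local filesystem and `os.makedirs` as a stand-in for the analogous `tf.io.gfile.makedirs` call, and `to_object_path` and `create_node` are hypothetical helpers written for illustration, not the code in this PR.

```python
import os
import tempfile


def to_object_path(bucket: str, key: str) -> str:
    """Hypothetical helper: build an s3:// URI from a bucket name and an object key."""
    return f"s3://{bucket}/{key.lstrip('/')}"


def create_node(path: str) -> None:
    """Create a directory together with any missing parents."""
    # os.mkdir(path) would raise FileNotFoundError when a parent is missing;
    # os.makedirs creates the whole chain, and exist_ok=True makes it idempotent.
    os.makedirs(path, exist_ok=True)


with tempfile.TemporaryDirectory() as root:
    nested = os.path.join(root, "data", "train", "shard0")
    create_node(nested)  # "data" and "train" do not exist yet
    created = os.path.isdir(nested)

# An index entry that references the object path instead of a local path.
index_path = to_object_path("my-bucket", "/data/train/img_0.tfrecord.idx")
```

The bucket name and key are made-up values; the point is only that the index entry carries the `s3://` scheme and that node creation does not depend on the parents already existing.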

@annie-anna changed the title from "Fix Fix issues with tensorflow tf generator support for Object Storage" to "Fix issues with tensorflow tf generator support for Object Storage" on May 6, 2025
@zhenghh04
Member

@johnugeorge, could you take a look and see whether the changes look good to you?

@zhenghh04
Member

@hariharan-devarajan please review this PR also.

try:
    if not use_pattern:
        return tf.io.gfile.listdir(id)
    # parse id to get bucket name and prefix
Contributor

Can you explain the motivation in overriding the function?

Contributor

Can you explain the bug that you referred to?

Contributor Author

When I run tf.io.gfile.listdir, I hit an IndexError saying that the position passed to s.substr() is greater than s.size().
I looked into the TensorFlow code: tf.io.gfile.listdir is implemented by list_directory_v2(), which calls GetChildren(). GetChildren() is defined in tensorflow/io and essentially returns all the objects in the bucket with the given prefix. This function ends up calling s.substr(s.size()+1). There is a detailed code walkthrough in tensorflow/io#2149 that explains the bug.
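
The failure mode described above can be modeled in a few lines of Python. This is a sketch under assumptions, not the tensorflow/io source: `cpp_substr` mimics how C++ `std::string::substr(pos)` throws when `pos > size()`, and the two `child_name_*` helpers are hypothetical. It assumes the child name is obtained by dropping the prefix plus one separator character, so a listed key that equals the prefix itself overruns the string.

```python
from typing import Optional


def cpp_substr(s: str, pos: int) -> str:
    """Mimic std::string::substr(pos): fail when pos is past the end of s."""
    if pos > len(s):
        raise IndexError(f"substr: pos (which is {pos}) > size (which is {len(s)})")
    return s[pos:]


def child_name_unguarded(key: str, prefix: str) -> str:
    """Strip the prefix and one separator without checking the key length."""
    return cpp_substr(key, len(prefix) + 1)


def child_name_guarded(key: str, prefix: str) -> Optional[str]:
    """Return the child name, or None when the key has nothing after the prefix."""
    start = len(prefix) + 1
    name = key[start:] if start <= len(key) else ""
    return name or None


# A regular object under the prefix works in both variants.
regular = child_name_unguarded("data/train/a.tfrecord", "data/train")

# A key equal to the prefix makes pos == size + 1, which is out of range.
try:
    child_name_unguarded("data/train", "data/train")
    raised = False
except IndexError:
    raised = True

skipped = child_name_guarded("data/train", "data/train")
```

The guarded variant simply skips such keys, which is one plausible way to avoid the error; whether the upstream fix takes the same approach is not stated in this thread.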

Member

@johnugeorge does this look good to you? Are you able to do a quick test?

Contributor

Won't this break the storage/framework abstraction? As it stands, this PR ties S3 storage to the tf framework.

Member

@johnugeorge yes, but this is just to fix the original implementation, right? The original implementation lives in the tf framework as well. All the create_node and walk_node functions are part of the framework class.

For pytorch, we can implement the same storage functions under pytorch_framework class.

Member

@annie-anna, after discussing with Johnu, we cannot accept the change in the framework layer. All changes should be inside the storage layer.
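
A rough sketch of what keeping the listing logic in the storage layer could look like follows. The class names `Storage`, `FileStorage`, `S3Storage`, and `Framework` are hypothetical and may not match the project's actual classes. The S3 variant uses the boto3 `list_objects_v2` paginator with a `Delimiter` so only direct children are returned, and a small stub stands in for `boto3.client("s3")` so the example runs offline.

```python
import os
from abc import ABC, abstractmethod
from typing import List


class Storage(ABC):
    """Hypothetical storage-layer interface; the framework only sees this."""

    @abstractmethod
    def walk_node(self, path: str) -> List[str]:
        """Return the names of the direct children of path."""


class FileStorage(Storage):
    def walk_node(self, path: str) -> List[str]:
        return sorted(os.listdir(path))


class S3Storage(Storage):
    def __init__(self, client):
        # In real use this would be boto3.client("s3"); it is injected so the
        # class can be exercised without network access.
        self.client = client

    def walk_node(self, path: str) -> List[str]:
        stripped = path[len("s3://"):] if path.startswith("s3://") else path
        bucket, _, prefix = stripped.partition("/")
        if prefix and not prefix.endswith("/"):
            prefix += "/"
        names = []
        paginator = self.client.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
            for obj in page.get("Contents", []):
                name = obj["Key"][len(prefix):]
                if name:  # skip the marker object whose key equals the prefix
                    names.append(name)
            for common in page.get("CommonPrefixes", []):
                names.append(common["Prefix"][len(prefix):].rstrip("/"))
        return sorted(names)


class Framework:
    """Hypothetical framework class that delegates listing to the storage layer."""

    def __init__(self, storage: Storage):
        self.storage = storage

    def list_files(self, path: str) -> List[str]:
        return self.storage.walk_node(path)


class _StubPaginator:
    def paginate(self, **kwargs):
        # One canned page shaped like a list_objects_v2 response.
        yield {
            "Contents": [{"Key": "data/"}, {"Key": "data/a.tfrecord"}],
            "CommonPrefixes": [{"Prefix": "data/train/"}],
        }


class _StubClient:
    def get_paginator(self, operation_name):
        return _StubPaginator()


listing = Framework(S3Storage(_StubClient())).list_files("s3://my-bucket/data")
```

With this split, a PyTorch framework class could reuse the same `S3Storage` object instead of reimplementing the listing.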

Contributor Author

@zhenghh04 @johnugeorge ack, I reverted changes in walk_node.

setup.py Outdated
      f"hydra-core>={HYDRA_VERSION}",
      "nvidia-dali-cuda120>=1.34.0",
-     "tensorflow>=2.11.0",
+     "tensorflow==2.13.1",
Member

Could we set tensorflow>=2.13.1?

Contributor Author

See comment below.

@zhenghh04
Member

@annie-anna The GitHub Actions jobs are failing because of a dependency issue: tensorflow==2.13.1 and torch>=2.2.0 have incompatible dependencies.

Could you check whether changing to tensorflow>=2.13.1 (and also relaxing the constraint for tensorflow_io) still works for the S3 storage?

@annie-anna
Contributor Author

annie-anna commented May 13, 2025

> @annie-anna The GitHub Actions jobs are failing because of a dependency issue: tensorflow==2.13.1 and torch>=2.2.0 have incompatible dependencies.
>
> Could you check whether changing to tensorflow>=2.13.1 (and also relaxing the constraint for tensorflow_io) still works for the S3 storage?

@zhenghh04 Yes, we can set tensorflow to >=2.13.1. As long as a compatible version of tensorflow_io is installed, it will still work for S3 storage.
One reason I pinned tensorflow to 2.13.1 is that with higher versions I saw the message "pure virtual method called; terminate called without exception". It does not affect functionality, but it creates additional noise. The message comes from the C++ SDK and has been a long-standing problem: tensorflow/io#1912
It is fixed in the latest tensorflow version (2.19); however, there is no compatible tensorflow_io release yet. The latest tensorflow_io is 0.37.1, which maps to tensorflow 2.16, and that version does not have the fix. See the version mapping here: https://pypi.org/project/tensorflow-io/
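
The compatibility constraint described above could be checked with a small helper like the sketch below. `TFIO_TO_TF` and `is_compatible` are hypothetical names; only the 0.37.1 to 2.16 entry is taken from this thread, and any other rows would have to be filled in from the mapping table on the tensorflow-io PyPI page.

```python
from typing import Optional, Tuple


def minor_series(version: str) -> Tuple[int, int]:
    """Reduce a version string such as '2.16.2' to its (major, minor) series."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


# tensorflow_io release -> tensorflow series it was built against.
# Only this one entry is stated in the discussion; the table is deliberately
# incomplete rather than guessed.
TFIO_TO_TF = {"0.37.1": (2, 16)}


def is_compatible(tf_version: str, tfio_version: str) -> Optional[bool]:
    """Return True/False for known tensorflow_io releases, None when unknown."""
    expected = TFIO_TO_TF.get(tfio_version)
    if expected is None:
        return None
    return minor_series(tf_version) == expected


matches = is_compatible("2.16.2", "0.37.1")
too_new = is_compatible("2.19.0", "0.37.1")
unknown = is_compatible("2.13.1", "0.0.0")
```

Returning `None` for releases missing from the table keeps "unknown" distinct from "known to be incompatible", which matters when the constraint is relaxed to `tensorflow>=2.13.1`.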

@annie-anna closed this on May 13, 2025
@annie-anna reopened this on May 13, 2025
@annie-anna
Contributor Author

@zhenghh04 @johnugeorge Is there any other question or concern regarding the PR?

@zhenghh04
Member

Thanks very much @annie-anna.

@annie-anna
Contributor Author

@zhenghh04 Why did the build fail again?

@zhenghh04 mentioned this pull request on May 29, 2025
@zhenghh04
Member

@annie-anna, I incorporated your changes in the most recent PR #294. Could you try pulling the most recent changes from main?

@zhenghh04 closed this on Nov 5, 2025