
Translating the MM TTNN Python benchmark to C++ #18269

Open
mbahnasTT opened this issue Feb 25, 2025 · 8 comments
@mbahnasTT
Contributor

mbahnasTT commented Feb 25, 2025

Describe the bug
Optimizing the Metal TTNN C++ benchmark to match the performance of the Python version.

To Reproduce
Steps to reproduce the behavior:

  1. (https://github.com/tenstorrent/tt-metal/blob/main/tests/tt_metal/tt_metal/perf_microbenchmark/1_compute_mm/test_compute_mm.cpp#L617)

Expected behavior
Based on Yu Gao's assessment, the remaining work is roughly:

  • set up sharded tensor inputs and outputs
  • use the correct kernel files
  • update the runtime args, compile-time args, and circular buffers (CBs)
  • put the enqueue calls into a loop and measure outside the loop
  • enable the trace flow


**Action**

Translate tests/ttnn/unit_tests/benchmarks/test_benchmark.py to C++ by creating a file at tests/ttnn/unit_tests/gtests/benchmarks/test_benchmark.cpp

@smehtaTT

FYI @cmaryanTT @bbradelTT

@bbradelTT
Contributor

bbradelTT commented Feb 26, 2025

I talked to @mbahnasTT

This issue concerns a customer using the TTNN C++ API rather than the Python API, and finding the C++ API somehow slower than what is achieved via Python in https://github.com/tenstorrent/tt-metal/blob/main/tech_reports/GEMM_FLOPS/GEMM_FLOPS.md.

@mbahnasTT will provide me with the customer's code.

There will probably be two follow-up actions after that.

@bbradelTT
Contributor

@mbahnasTT I took a look at the code.

I'd expect them to see performance similar to bfloat16 HiFi2 reading from DRAM if their inputs are large enough: 3072x3072x4096 is 68 TFLOPS.
If their inputs are smaller, the numbers will be lower, with 512x512x512 going down to maybe 18 TFLOPS.

Do you know what performance they are achieving and with which input sizes and data types?

Also, do you know if they want to have sharded L1 processing?

@mbahnasTT
Contributor Author

@bbradelTT I'll ask them in the meeting tomorrow.
I believe they do need sharded L1 processing.

@bbradelTT
Contributor

@mbahnasTT if that is the case, they might want to look at tests/ttnn/unit_tests/gtests/tensor/test_vector_conversion.cpp and ttnn/unit_tests/gtests/test_graph_add.cpp for ideas about how to create L1 sharded tensors. They will need to be careful to set all the parameters properly.

@mbahnasTT
Contributor Author

@bbradelTT I believe it's very hard for them to dig through the code for ideas; we need to provide them with exact benchmark code, as we did with the Python version, covering the exact same combinations. When can this code be ready to hand over to them?

@bbradelTT bbradelTT assigned edwinleeTT and unassigned bbradelTT and cmaryanTT Mar 3, 2025
@bbradelTT
Contributor

@edwinleeTT will translate tests/ttnn/unit_tests/benchmarks/test_benchmark.py to C++. I'll edit the description.

@bbradelTT bbradelTT changed the title Optimizing the MM TTNN C++benchmark Translating the MM TTNN Python benchmark to C++ Mar 3, 2025
@edwinleeTT
Contributor

@mbahnasTT I've started working on the C++ version and have created a draft PR with the progress so far. It doesn't have all the functionality of the Python test yet, but it should be enough to illustrate a matmul.

It can be run by checking out the "elee/ttnn_bench_port" branch, then building the tests with the following:

bash build_metal.sh -e -c --build-tests

Then run the following to run the new test:

./build_Release/test/ttnn/unit_tests_ttnn --gtest_filter="*Matmul2D*"

You should see the following when it finishes:

[ OK ] MatmulTests/Matmul2DHostPerfTestFixture.Matmul2DHostPerfTest/0 (139326 ms)
[----------] 1 test from MatmulTests/Matmul2DHostPerfTestFixture (139326 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (139326 ms total)
[ PASSED ] 1 test.
Device | INFO | Closing user mode device drivers
