
Translating the MM TTNN Python benchmark to C++ #18269

Open
mbahnasTT opened this issue Feb 25, 2025 · 8 comments
@mbahnasTT
Contributor

mbahnasTT commented Feb 25, 2025

Describe the bug
Optimizing the Metal TTNN C++ benchmark to match the performance of the Python version.

To Reproduce
Steps to reproduce the behavior:

  1. (https://github.com/tenstorrent/tt-metal/blob/main/tests/tt_metal/tt_metal/perf_microbenchmark/1_compute_mm/test_compute_mm.cpp#L617)

Expected behavior
Based on Yu Gao's assessment, the remaining work is roughly:

  • set up sharded tensor inputs and outputs
  • use the correct kernel files
  • update the runtime args, compile-time args, and circular buffers (CBs)
  • put the enqueue calls into a loop and measure outside the loop
  • enable the trace flow


**Action**

Translate tests/ttnn/unit_tests/benchmarks/test_benchmark.py to C++ by creating a file at tests/ttnn/unit_tests/gtests/benchmarks/test_benchmark.cpp

@smehtaTT

FYI @cmaryanTT @bbradelTT

@bbradelTT
Contributor

bbradelTT commented Feb 26, 2025

I talked to @mbahnasTT

This issue concerns a customer using the TTNN C++ API rather than the Python API, and finding the C++ API somehow slower than what is achieved via Python in https://github.com/tenstorrent/tt-metal/blob/main/tech_reports/GEMM_FLOPS/GEMM_FLOPS.md.

@mbahnasTT will provide me with the customer's code.

There will probably be two follow-up actions after that.

@bbradelTT
Contributor

@mbahnasTT I took a look at the code.

I'd expect them to see performance similar to bfloat16 HiFi2 reading from DRAM if their inputs are large enough: 3072x3072x4096 is 68 TFLOPS.
If their inputs are smaller, the numbers will be lower, with 512x512x512 going down to maybe 18 TFLOPS.

Do you know what performance they are achieving and with which input sizes and data types?

Also, do you know if they want to have sharded L1 processing?

@mbahnasTT
Contributor Author

@bbradelTT I'll ask them in the meeting tomorrow.
I believe they do need sharded L1 processing.

@bbradelTT
Contributor

@mbahnasTT if that is the case, they might want to look at tests/ttnn/unit_tests/gtests/tensor/test_vector_conversion.cpp and ttnn/unit_tests/gtests/test_graph_add.cpp for ideas about how to create L1 sharded tensors. They will need to be careful to set all the parameters properly.

@mbahnasTT
Contributor Author

@bbradelTT I believe it's very hard for them to dig through the code for ideas; we need to provide them with exact benchmark code, as we did with the Python version, covering the exact same combinations. When can this code be ready to hand over to them?

@bbradelTT bbradelTT assigned edwinleeTT and unassigned bbradelTT and cmaryanTT Mar 3, 2025
@bbradelTT
Contributor

@edwinleeTT will translate tests/ttnn/unit_tests/benchmarks/test_benchmark.py to C++. I'll edit the description.

@bbradelTT bbradelTT changed the title Optimizing the MM TTNN C++benchmark Translating the MM TTNN Python benchmark to C++ Mar 3, 2025
@edwinleeTT
Contributor

@mbahnasTT I've started working on the C++ version and have created a draft PR with the progress so far. It doesn't have all the functionality of the Python test yet, but it should be enough to illustrate a matmul.

It can be run by checking out the "elee/ttnn_bench_port" branch, then building the tests with the following:

bash build_metal.sh -e -c --build-tests

Then run the following to run the new test:

./build_Release/test/ttnn/unit_tests_ttnn --gtest_filter="*Matmul2D*"

You should see the following when it finishes:

[ OK ] MatmulTests/Matmul2DHostPerfTestFixture.Matmul2DHostPerfTest/0 (139326 ms)
[----------] 1 test from MatmulTests/Matmul2DHostPerfTestFixture (139326 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (139326 ms total)
[ PASSED ] 1 test.
Device | INFO | Closing user mode device drivers
