
[ThunderFX report] Adds thunderfx_benchmark_report to do some automatic performance analysis on the FX graph #1773

Merged: 23 commits into main from test_report, Feb 21, 2025

Conversation


@kiya00 kiya00 commented Feb 17, 2025

Before submitting
  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

This PR:

  1. Adds the `thunderfx_benchmark_report` function, which uses the previously implemented reporting APIs to analyze the input callable; see the function comment for more information.
Example outputs (the rtol/atol closeness check used below is sketched after this list):

```
The input callable can be successfully executed by torch.compile.

graph0: Split Information:
The original graph is not split, and is entirely run by Thunder.

graph0_thunder_0 can be successfully executed by Thunder

Benchmark walltime for **graph0_thunder_0 forward** requires investigation: thunder(132.469 us) and torchcompile(28.990 us) is not close (rtol=0.1, atol=0.0)
Benchmark walltime for **graph0_thunder_0 backward** requires investigation: thunder(98.921 us) and torchcompile(34.108 us) is not close (rtol=0.1, atol=0.0)
The scripts are saved: graph0_thunder_0_torchcompile_walltime.py, graph0_thunder_0_thunder_walltime.py

Benchmark kerneltime ran successfully on **graph0_thunder_0 forward**: thunder(955.523 ns), torchcompile(1.002 us)
Benchmark kerneltime ran successfully on **graph0_thunder_0 backward**: thunder(965.957 ns), torchcompile(1.012 us)

Benchmark walltime for **graph0_thunder_0_nvFusion0_forward** requires investigation: nvfuser(22.932 us) and torchcompile(20.361 us) is not close (rtol=0.1, atol=0.0)
The scripts are saved: graph0_thunder_0_nvFusion0_forward_torchcompile_walltime.py, graph0_thunder_0_nvFusion0_forward_nvfuser_walltime.py

Benchmark kerneltime ran successfully on **graph0_thunder_0_nvFusion0_forward**: nvfuser(952.301 ns), torchcompile(998.566 ns)

Benchmark walltime for **graph0_thunder_0_nvFusion0_backward** requires investigation: nvfuser(26.004 us) and torchcompile(21.240 us) is not close (rtol=0.1, atol=0.0)
The scripts are saved: graph0_thunder_0_nvFusion0_backward_torchcompile_walltime.py, graph0_thunder_0_nvFusion0_backward_nvfuser_walltime.py

Benchmark kerneltime ran successfully on **graph0_thunder_0_nvFusion0_backward**: nvfuser(962.118 ns), torchcompile(1.007 us)
```

  2. Adds `TorchInductorSpecification`, which uses only TorchInductor without TorchDynamo; the reason is explained in #1521 (Repro function saved from FX graph is segmented again when passed back to torch.compile).
  3. Removes the previous `write_repro` interface, keeping only the one that takes a `CompileSpecification` and a timer function as inputs.
  4. NOTE: When running a repro/benchmark on the FX graph, the inputs are newly allocated (i.e., they do not reuse the parameters and inputs of the original model), which can lead to OOM.
    To prevent OOM, release the original model and inputs after calling `get_thunder_split_reports`. For more details, see the comment here:
    Note:
    - This function may run out of memory (OOM) as it allocates random tensors when executing
    the graph module in each Report. To prevent OOM issues, users must manually free the
    input model and arguments to free up memory for `get_nvfusion_reports`, `check_timing`,
    and `check_timing_bsym`.
    Here is an example:
    ```python
    split_reports = get_thunder_split_reports(model, x)
    # Frees the parameters and inputs to make room for the reports
    del model
    del x
    # Running the model generates the NVFusion symbol, which requires additional memory for the input.
    # To free up space before generating the NVFusion reports, both the model and input are deleted.
    nvfusion_reports = get_nvfusion_reports(split_reports)
    check_timing(folder_path, split_reports[0], torchcompile, thunderjit_specification, WallTime, "walltime", rtol, atol)
    check_timing_bsym(folder_path, nvfusion_reports[0], bsym_torchcompile, bsym_nvfuser, KernelTime, "kerneltime", rtol, atol)
    ```
  5. Makes some revisions according to the previous review comments.
  6. Adds the WAR for the Triton error:
Repro script:

```python
import torch

def test_graph0():
    class DynamoModule(torch.nn.Module):
        def forward(self, L_x_: torch.Tensor):
            l_x_ = L_x_
            x = l_x_.exp();  l_x_ = None
            return (x,)

    inputs = [
        torch.testing.make_tensor((2, 2), dtype=torch.float32, device='cuda:0', requires_grad=True),
    ]

    model = DynamoModule()
    compiled_model = torch.compile(model)

    result = compiled_model(*inputs)
    output_grads = [torch.ones_like(r) for r in result]
    torch.autograd.backward(result, output_grads)
    print(result)

# the WAR: https://github.com/pytorch/pytorch/issues/124565
# torch.empty(1, device='cuda', requires_grad=True).backward()
test_graph0()
```
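The "requires investigation" lines in the example output flag timing pairs that fail an rtol/atol closeness check. A minimal sketch of that test, assuming the standard `math.isclose`-style tolerance formula (the PR's exact comparison code is not shown in this thread):

```python
def is_close(a: float, b: float, rtol: float = 0.1, atol: float = 0.0) -> bool:
    # Standard tolerance check: |a - b| <= atol + rtol * |b|,
    # with rtol=0.1 and atol=0.0 as in the example output above.
    return abs(a - b) <= atol + rtol * abs(b)

# Walltime: thunder(132.469 us) vs. torchcompile(28.990 us) fails by a wide
# margin, so the benchmark scripts are saved for investigation.
assert not is_close(132.469, 28.990)
# Kerneltime: thunder(955.523 ns = 0.955523 us) vs. torchcompile(1.002 us)
# is within 10%, so the benchmark "ran successfully" and no scripts are saved.
assert is_close(0.955523, 1.002)
```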

@kiya00 kiya00 requested a review from IvanYashchuk February 17, 2025 20:06

@mruberry mruberry left a comment


Exciting stuff! I made a few small suggestions for your review, @kiya00


kiya00 commented Feb 19, 2025

Hi @mruberry, I ran into a problem: the original model runs successfully, but the current report interface OOMs on it (that's a problem; we should be able to run the same data size with the report, right?). The report looks like:

```python
def report(model, inputs, ...):
    # try to run torch.compile
    # loop over the subgraphs
    reports = fx_report(model)
    for sub_report in reports.sub_reports:
        thunder_report = analyse_thunder_splits(sub_report)
        for split_report in thunder_report.sub_reports:
            # This part creates new random input tensors, which causes OOM;
            # the original model and inputs need to be freed to get enough space.
            split_report.run_benchmark(torchcompile)
            split_report.run_benchmark(thunder)
            # compare and save the benchmark script if necessary
```

but it's hard to free the model and inputs inside the function. I'm thinking of providing some utility functions and letting the user free the model and inputs between `get_subgraph_reports` and `performance_analyse`, like:

```python
def get_subgraph_reports(model, inputs, ...):
    report_list = []
    reports = fx_report(model)
    ...
    return report_list

# user code:
model = GPT(config)
inputs = make_tensor(...)
reports = get_subgraph_reports(model, inputs)
del model, inputs
performance_analyse(reports)
```

WDYT?

@mruberry

> (quoting @kiya00's pseudocode and proposal above)

It makes a lot of sense that we want to get rid of the model and the inputs before starting our analysis, since each report can synthetically create the appropriate inputs and be run in isolation. We can also implement this pattern ourselves in the UX we expose to generate the reports.

Why a helper function vs. just documenting that, if OOM is encountered, the practitioner should try freeing other CUDA tensors?
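(For concreteness, the documentation-only alternative would amount to telling users to do something like the following; a minimal sketch using standard PyTorch calls, not code from this PR — `model` and `inputs` are the user's objects from the example above:)

```python
import gc
import torch

# Free the original model and inputs before running the reports, then
# release cached CUDA blocks so the synthetic report inputs can be allocated.
del model, inputs
gc.collect()
torch.cuda.empty_cache()
```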

kiya00 and others added 6 commits February 20, 2025 21:27
- adds store nvfusion input metadata in last_input_meta
- adds example_input_meta in FXGraphReport class
- splits the thunderfx_benchmark_report function and gives example to free model and inputs

kiya00 commented Feb 21, 2025

Hi @mruberry, it's ready for review. I kept `thunderfx_benchmark_report` for ease of use and added comments noting that, if it goes OOM, the user needs to delete the model and inputs manually:

Note:
- This function may run out of memory (OOM) as it allocates random tensors when executing
the graph module in each Report. To prevent OOM issues, users must manually free the
input model and arguments to free up memory for `get_nvfusion_reports`, `check_timing`,
and `check_timing_bsym`.
Here is an example:
```python
split_reports = get_thunder_split_reports(model, x)
# Frees the parameters and inputs to make room for the reports
del model
del x
# Running the model generates the NVFusion symbol, which requires additional memory for the input.
# To free up space before generating the NVFusion reports, both the model and input are deleted.
nvfusion_reports = get_nvfusion_reports(split_reports)
check_timing(folder_path, split_reports[0], torchcompile, thunderjit_specification, WallTime, "walltime", rtol, atol)
check_timing_bsym(folder_path, nvfusion_reports[0], bsym_torchcompile, bsym_nvfuser, KernelTime, "kerneltime", rtol, atol)
```

@kiya00 kiya00 marked this pull request as ready for review February 21, 2025 14:07
@kiya00 kiya00 requested review from lantiga and t-vi as code owners February 21, 2025 14:07

@mruberry mruberry left a comment


Improvements look great! While there will continue to be refinements and features, let's start using this and get customer feedback.

While getting customer feedback, there are still a variety of extensions that could be implemented. What are your thoughts, @kiya00? Would it be interesting to measure memory usage, or to produce statistics about which operators are used and how they're executed, or would you like to start working on creating more isolated reproductions of correctness issues (and then performance issues), or on summarizing thunder split reasons?

I think the most interesting would be the creation of more isolated reproductions, but they're all good directions to go in.

@mruberry mruberry enabled auto-merge (squash) February 21, 2025 16:45
@mruberry mruberry merged commit 7906dd8 into main Feb 21, 2025
52 checks passed
@mruberry mruberry deleted the test_report branch February 21, 2025 16:46