
[ThunderFX report] Adds thunderfx_benchmark_report to do some automatic performance analysis on the FX graph #1773

Merged: 23 commits into main from test_report, Feb 21, 2025

Conversation


@kiya00 kiya00 commented Feb 17, 2025

Before submitting
  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

This PR:

  1. Adds the `thunderfx_benchmark_report` function, which uses the previously implemented reporting APIs to analyze the input callable; see the function comment for more information.
Example outputs (the rtol/atol closeness check used below is sketched after this list):

```
The input callable can be successfully executed by torch.compile.

graph0: Split Information:
The original graph is not split, and is entirely run by Thunder.

graph0_thunder_0 can be successfully executed by Thunder

Benchmark walltime for **graph0_thunder_0 forward** requires investigation: thunder(132.469 us) and torchcompile(28.990 us) is not close (rtol=0.1, atol=0.0)
Benchmark walltime for **graph0_thunder_0 backward** requires investigation: thunder(98.921 us) and torchcompile(34.108 us) is not close (rtol=0.1, atol=0.0)
The scripts are saved: graph0_thunder_0_torchcompile_walltime.py, graph0_thunder_0_thunder_walltime.py

Benchmark kerneltime ran successfully on **graph0_thunder_0 forward**: thunder(955.523 ns), torchcompile(1.002 us)
Benchmark kerneltime ran successfully on **graph0_thunder_0 backward**: thunder(965.957 ns), torchcompile(1.012 us)

Benchmark walltime for **graph0_thunder_0_nvFusion0_forward** requires investigation: nvfuser(22.932 us) and torchcompile(20.361 us) is not close (rtol=0.1, atol=0.0)
The scripts are saved: graph0_thunder_0_nvFusion0_forward_torchcompile_walltime.py, graph0_thunder_0_nvFusion0_forward_nvfuser_walltime.py

Benchmark kerneltime ran successfully on **graph0_thunder_0_nvFusion0_forward**: nvfuser(952.301 ns), torchcompile(998.566 ns)

Benchmark walltime for **graph0_thunder_0_nvFusion0_backward** requires investigation: nvfuser(26.004 us) and torchcompile(21.240 us) is not close (rtol=0.1, atol=0.0)
The scripts are saved: graph0_thunder_0_nvFusion0_backward_torchcompile_walltime.py, graph0_thunder_0_nvFusion0_backward_nvfuser_walltime.py

Benchmark kerneltime ran successfully on **graph0_thunder_0_nvFusion0_backward**: nvfuser(962.118 ns), torchcompile(1.007 us)
```

  2. Adds `TorchInductorSpecification`, which uses only TorchInductor without TorchDynamo; the reason is explained in #1521 (Repro function saved from FX graph is segmented again when passed back to torch.compile).
  3. Removes the previous `write_repro` interface, keeping only the one that takes a `CompileSpecification` and a timer function as inputs.
  4. NOTE: When running a repro/benchmark on the FX graph, the inputs are newly allocated (i.e., they do not reuse the parameters and inputs of the original model), which can lead to OOM.
    To prevent OOM, release the original model and inputs after calling `get_thunder_split_reports`. For more details, see the comment here:
    Note:
    - This function may run out of memory (OOM) as it allocates random tensors when executing
    the graph module in each Report. To prevent OOM issues, users must manually free the
    input model and arguments to free up memory for `get_nvfusion_reports`, `check_timing`,
    and `check_timing_bsym`.
    Here is an example:
    ```python
    split_reports = get_thunder_split_reports(model, x)
    # Frees the parameters and inputs to make room for the reports
    del model
    del x
    # Running the model generates the NVFusion symbol, which requires additional memory for the input.
    # To free up space before generating the NVFusion reports, both the model and input are deleted.
    nvfusion_reports = get_nvfusion_reports(split_reports)
    check_timing(folder_path, split_reports[0], torchcompile, thunderjit_specification, WallTime, "walltime", rtol, atol)
    check_timing_bsym(folder_path, nvfusion_reports[0], bsym_torchcompile, bsym_nvfuser, KernelTime, "kerneltime", rtol, atol)
    ```
  5. Makes some revisions according to the previous review comments.
  6. Adds the WAR for the Triton error:
Repro script:

```python
import torch

def test_graph0():
    class DynamoModule(torch.nn.Module):
        def forward(self, L_x_: torch.Tensor):
            l_x_ = L_x_
            x = l_x_.exp();  l_x_ = None
            return (x,)

    inputs = [
        torch.testing.make_tensor((2, 2), dtype=torch.float32, device='cuda:0', requires_grad=True),
    ]

    model = DynamoModule()
    compiled_model = torch.compile(model)

    result = compiled_model(*inputs)
    output_grads = [torch.ones_like(r) for r in result]
    torch.autograd.backward(result, output_grads)
    print(result)

# the WAR: https://github.com/pytorch/pytorch/issues/124565
# torch.empty(1, device='cuda', requires_grad=True).backward()
test_graph0()
```
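The "requires investigation" lines in the example output flag timing pairs that fail an rtol/atol closeness check. A minimal sketch of that test, assuming the standard `math.isclose`-style tolerance formula (the PR's exact comparison code is not shown in this thread):

```python
def is_close(a: float, b: float, rtol: float = 0.1, atol: float = 0.0) -> bool:
    # Standard tolerance check: |a - b| <= atol + rtol * |b|,
    # with rtol=0.1 and atol=0.0 as in the example output above.
    return abs(a - b) <= atol + rtol * abs(b)

# Walltime: thunder(132.469 us) vs. torchcompile(28.990 us) fails by a wide
# margin, so the benchmark scripts are saved for investigation.
assert not is_close(132.469, 28.990)
# Kerneltime: thunder(955.523 ns = 0.955523 us) vs. torchcompile(1.002 us)
# is within 10%, so the benchmark "ran successfully" and no scripts are saved.
assert is_close(0.955523, 1.002)
```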

@kiya00 kiya00 requested a review from IvanYashchuk February 17, 2025 20:06

@mruberry mruberry left a comment


Exciting stuff! I made a few small suggestions for your review, @kiya00


kiya00 commented Feb 19, 2025

Hi @mruberry, I ran into a problem: the original model runs successfully, but the current report interface OOMs on it (that's a problem; we should be able to run the same data size with the report, right?). The report looks like:

```python
def report(model, inputs, ...):
    # try to run torch.compile
    # loop over the subgraphs
    reports = fx_report(model)
    for sub_report in reports.sub_reports:
        thunder_report = analyse_thunder_splits(sub_report)
        for split_report in thunder_report.sub_reports:
            # This part creates new random input tensors, which causes OOM;
            # the original model and inputs need to be freed to get enough space.
            split_report.run_benchmark(torchcompile)
            split_report.run_benchmark(thunder)
            # compare and save the benchmark script if necessary
```

but it's hard to free the model and inputs inside the function. I'm thinking of providing some utility functions and letting the user free the model and inputs between `get_subgraph_reports` and `performance_analyse`, like:

```python
def get_subgraph_reports(model, inputs, ...):
    report_list = []
    reports = fx_report(model)
    ...
    return report_list

# user code:
model = GPT(config)
inputs = make_tensor(...)
reports = get_subgraph_reports(model, inputs)
del model, inputs
performance_analyse(reports)
```

WDYT?

@mruberry

> (quoting @kiya00's pseudocode and proposal above)

It makes a lot of sense that we want to get rid of the model and the inputs before starting our analysis, since each report can synthetically create the appropriate inputs and be run in isolation. We can also implement this pattern ourselves in the UX we expose to generate the reports.

Why a helper function vs. just documenting that, if OOM is encountered, the practitioner should try freeing other CUDA tensors?
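(For concreteness, the documentation-only alternative would amount to telling users to do something like the following; a minimal sketch using standard PyTorch calls, not code from this PR — `model` and `inputs` are the user's objects from the example above:)

```python
import gc
import torch

# Free the original model and inputs before running the reports, then
# release cached CUDA blocks so the synthetic report inputs can be allocated.
del model, inputs
gc.collect()
torch.cuda.empty_cache()
```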

kiya00 and others added 6 commits February 20, 2025 21:27
- adds store nvfusion input metadata in last_input_meta
- adds example_input_meta in FXGraphReport class
- splits the thunderfx_benchmark_report function and gives example to free model and inputs

kiya00 commented Feb 21, 2025

Hi @mruberry, it's ready for review. I kept `thunderfx_benchmark_report` for ease of use and added comments noting that, if it goes OOM, the user needs to delete the model and inputs manually:

Note:
- This function may run out of memory (OOM) as it allocates random tensors when executing
the graph module in each Report. To prevent OOM issues, users must manually free the
input model and arguments to free up memory for `get_nvfusion_reports`, `check_timing`,
and `check_timing_bsym`.
Here is an example:
```python
split_reports = get_thunder_split_reports(model, x)
# Frees the parameters and inputs to make room for the reports
del model
del x
# Running the model generates the NVFusion symbol, which requires additional memory for the input.
# To free up space before generating the NVFusion reports, both the model and input are deleted.
nvfusion_reports = get_nvfusion_reports(split_reports)
check_timing(folder_path, split_reports[0], torchcompile, thunderjit_specification, WallTime, "walltime", rtol, atol)
check_timing_bsym(folder_path, nvfusion_reports[0], bsym_torchcompile, bsym_nvfuser, KernelTime, "kerneltime", rtol, atol)
```

@kiya00 kiya00 marked this pull request as ready for review February 21, 2025 14:07
@kiya00 kiya00 requested review from lantiga and t-vi as code owners February 21, 2025 14:07

@mruberry mruberry left a comment


Improvements look great! While there will continue to be refinements and features, let's start using this and get customer feedback.

While getting customer feedback, there are still a variety of extensions that could be implemented. What are your thoughts, @kiya00? Would it be interesting to measure memory usage, or to produce statistics about which operators are used and how they're executed, or would you like to start working on creating more isolated reproductions of correctness issues (and then performance issues), or on summarizing thunder split reasons?

I think the most interesting would be the creation of more isolated reproductions, but they're all good directions to go in.

@mruberry mruberry enabled auto-merge (squash) February 21, 2025 16:45
@mruberry mruberry merged commit 7906dd8 into main Feb 21, 2025
52 checks passed
@mruberry mruberry deleted the test_report branch February 21, 2025 16:46