[ThunderFX report] Adds thunderfx_benchmark_report to do some automatic performance analysis on the FX graph #1773
Conversation
Adds WAR (workaround) for Triton error in torch.compile bwd
Exciting stuff! I made a few small suggestions for your review, @kiya00
…r (e.g. needs integer tensor range)
Hi @mruberry, I ran into a problem: the original model runs successfully, but the current report interface runs OOM on the same model (that's a problem; we should be able to run the same data size with the report, right?), because the report looks like this:
but it's hard to free the model and inputs inside the function. I'm thinking of providing some utility function and letting the user free the model and inputs between the steps.
WDYT?
It makes a lot of sense that we want to get rid of the model and the inputs before starting our analysis, since each report can synthetically create the appropriate inputs and be run in isolation. We can also implement this pattern ourselves in the UX we expose to generate the reports. Why a helper function vs. just documenting that, if OOM is encountered, the practitioner should try freeing other CUDA tensors?
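To make the pattern being discussed concrete, here is a minimal sketch of the freeing step, written at the call site rather than inside a helper (deleting inside a function would not drop the caller's references). The build_reports name and the toy model are placeholders, not part of the thunder API; only the gc and torch calls are real.

```python
import gc
import torch
import torch.nn as nn

def build_reports(model, inputs):
    """Placeholder for the report-generation step discussed above."""
    ...

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(4096, 4096).to(device)
inputs = (torch.randn(64, 4096, device=device),)

# Generate the reports while the model and inputs are still alive.
reports = build_reports(model, inputs)

# The reports can synthesize their own example inputs, so the original model
# and inputs are no longer needed; drop the references and release cached
# CUDA memory before the memory-hungry benchmarking/analysis phase.
del model, inputs
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
```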
- Stores nvFusion input metadata in last_input_meta
- Adds example_input_meta to the FXGraphReport class
- Splits the thunderfx_benchmark_report function and gives an example of freeing the model and inputs
Hi @mruberry, it's ready for review. I keep the example here: lightning-thunder/thunder/dynamo/report.py, lines 1055 to 1076 in ea56d4e
Improvements look great! While there will continue to be refinements and features, let's start using this and get customer feedback.
While getting customer feedback, there are still a variety of extensions that could be implemented. What are your thoughts, @kiya00? Would it be interesting to look at measuring memory usage, or maybe producing statistics about which operators are used and how they're executed, or would you like to start working on creating more isolated reproductions of correctness issues (and then performance issues), or on summarizing the Thunder split reasons?
I think the most interesting would be the creation of more isolated reproductions, but they're all good directions to go in.
Before submitting
What does this PR do?
This PR:
- Adds the thunderfx_benchmark_report function, which uses the previously implemented reporting APIs to analyze the input callable; see the function comment for more information (example outputs).
- Adds TorchInductorSpecification, which uses only TorchInductor without TorchDynamo; the reason is mentioned in "Repro function saved from FX graph is segmented again when passed back to torch.compile" #1521.
- To prevent OOM, release the original model and inputs after calling get_thunder_split_reports; for more details, see the comment in lightning-thunder/thunder/dynamo/report.py, lines 1055 to 1076 in ea56d4e (a usage sketch follows this list).
- repro script
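As referenced above, here is a rough sketch of how the split workflow described in this PR might be used. It assumes get_thunder_split_reports is importable from thunder.dynamo.report and takes the model and inputs; the make_model_and_inputs factory and the run_benchmark method are hypothetical names, not the confirmed API.

```python
import gc
import torch
from thunder.dynamo.report import get_thunder_split_reports  # module path per this PR

def run_thunderfx_reports(make_model_and_inputs):
    # Construct the model and inputs inside this function so the only
    # references live here and can really be dropped before benchmarking.
    model, inputs = make_model_and_inputs()

    # 1. Build one report per FX-graph split while the model is still alive
    #    (the signature here is an assumption based on the PR description).
    reports = get_thunder_split_reports(model, inputs)

    # 2. Release the original model and inputs to prevent OOM; each report is
    #    expected to recreate example inputs from its stored input metadata.
    del model, inputs
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

    # 3. Analyze each report in isolation (hypothetical method name).
    for report in reports:
        report.run_benchmark()

    return reports
```

Taking a factory rather than the model itself is one way to guarantee the function holds the only references, which sidesteps the "hard to free inside the function" concern raised in the conversation above.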