forked from pytorch/pytorch
[pull] master from pytorch:master #1
Open: pull wants to merge 4,608 commits into jessejay-ch:master from pytorch:master
### Overview

This PR de-duplicates graph inputs in TorchDynamo, using the `Source` as the unique identifier for each input. This closes #98743 and #98625.

### Details

`VariableBuilder.wrap_tensor()` should return a `VariableTracker` for the passed-in `value: Tensor`. If `value` is duplicated, we should avoid calling `OutputGraph.create_graph_input()` and `OutputGraph.add_grapharg()`.
- Note that `create_graph_input()` and `add_grapharg()` are not 1:1. For a constant source and either `wrap_sym()` or `wrap_unspecialized_primitive()`, TorchDynamo still calls `create_graph_input()` but not `add_grapharg()`.
- Note that `create_graph_input()` should be called before constructing the corresponding `VariableTracker`. TorchDynamo needs the `fx.Proxy` object to pass to `wrap_fx_proxy()`.

In this PR, the `OutputGraph` saves an additional mapping `input_source_to_var` from each graph input's `Source` to its `VariableTracker`, which works because `Source` is now hashable. This mapping should be updated each time `create_graph_input()` is called. However, since we must construct the `VariableTracker` after `create_graph_input()` returns, we must have a separate call to the `OutputGraph` to update the mapping. If anyone has any suggestion on how to coalesce this logic and avoid having to remember to update `input_source_to_var` for each `create_graph_input()`, I would love to hear it.

<details>
<summary>Alternate Approach</summary>

Initially, I tried having TorchDynamo construct a new but equivalent `VariableTracker` for the duplicated tensor. However, I abandoned this approach after hitting an assertion in `wrap_fx_proxy_cls()` because `"example_value"` was already in the proxy node's metadata, since we were reusing the primary tensor's `Proxy` object. Reusing the exact `VariableTracker` also seems less error-prone than constructing a new but identical `VariableTracker`.
</details>

### Testing

#### Global Variable Test

```
import torch

@torch.compile()
def f():
    return x + x

x = torch.randn(3)
f()
```

Before:
```
====== Forward graph 0 ======
<eval_with_key>.6 class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: f32[3], arg1_1: f32[3]):
        # File: /data/users/ezyang/b/pytorch/ff.py:5, code: return x + x
        add: f32[3] = torch.ops.aten.add.Tensor(arg0_1, arg1_1);  arg0_1 = arg1_1 = None
        return (add,)
```

After (only `arg0_1` and no more `arg1_1`):
```
====== Forward graph 0 ======
<eval_with_key>.4 class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: f32[3]):
        # File: dynamo/test_dup_global.py:8, code: return x + x
        add: f32[3] = torch.ops.aten.add.Tensor(arg0_1, arg0_1);  arg0_1 = None
        return (add,)
```

#### FSDP Test

Before, we errored on
```
File "/.../pytorch/torch/_guards.py", line 244, in __post_init__
    assert self.input_source_a != self.input_source_b
```
and now there is no error.

---

The rename from `name_to_input` to `input_name_to_proxy` is not part of the core logic change and is a remnant from initial attempts. I can undo it later if desired, but I also feel that the new name is more informative. It also fixes the type annotation.

Pull Request resolved: #98775 Approved by: https://github.com/ezyang, https://github.com/voznesenskym
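A minimal sketch of the de-duplication check described above. `create_graph_input`, `add_grapharg`, and `input_source_to_var` are names taken from the description; the surrounding class and call shapes are illustrative, not the actual TorchDynamo code.

```python
# Illustrative sketch only -- not the actual TorchDynamo implementation.
class OutputGraphSketch:
    def __init__(self):
        self.input_source_to_var = {}  # Source -> VariableTracker (Source must be hashable)

    def wrap_tensor(self, source, value, create_graph_input, add_grapharg, build_variable):
        existing = self.input_source_to_var.get(source)
        if existing is not None:
            # Duplicated input: reuse the VariableTracker instead of creating
            # a second graph input for the same Source.
            return existing
        proxy = create_graph_input(source, value)   # must happen before the tracker exists
        var = build_variable(proxy, value)          # the VariableTracker needs the fx.Proxy
        self.input_source_to_var[source] = var      # the separate bookkeeping step noted above
        add_grapharg(source, value, proxy)
        return var
```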
Fixes #ISSUE_NUMBER Pull Request resolved: #98458 Approved by: https://github.com/andrewor14
Summary: In some cases, zero_point is returned as an int tensor. We want it to be a long. This fixes a failed assertion in Executorch op_choose_qparams: https://www.internalfb.com/code/fbsource/[4609e7dbbf2e]/fbcode/executorch/kernels/quantized/cpu/op_choose_qparams.cpp?lines=49-52 Test Plan: CI Reviewed By: jerryzh168 Differential Revision: D44764070 Pull Request resolved: #98746 Approved by: https://github.com/jerryzh168
Signed-off-by: Edward Z. Yang <[email protected]> Pull Request resolved: #98500 Approved by: https://github.com/wconstab
Summary: This diff fixes more test failures (T150117218) caused by upgrading the "hypothesis" library to 6.70.1 (D44523679).

# //caffe2/caffe2/python:hypothesis_test

This test generates float numbers and filters out those whose absolute values are less than 1e-2. It is a known issue of the new version of "hypothesis" that it generates zeros or floats with small absolute values too often: HypothesisWorks/hypothesis#3603 I'm circumventing this issue by suppressing the health check `filter_too_much`.

# //caffe2/caffe2/quantization/server:resize_nearest_dnnlowp_op_test

All arithmetic should be done in float32 when calculating the reference, since the network being tested uses float32 everywhere. Mixing float32, float64 or even integers will result in intermediate values in float64. The different precision may cause off-by-1 errors when converting to integer.

Test Plan: Run all the tests in both "dev" and "opt" modes:
```
for mode in dev opt; do
  buck2 test mode/$mode //caffe2/caffe2/python:hypothesis_test -- --run-disabled
  buck2 test mode/$mode //caffe2/caffe2/quantization/server:resize_nearest_dnnlowp_op_test -- --run-disabled
  buck2 test mode/$mode //caffe2/caffe2/fb/layers/tests:tum_history_test -- --run-disabled
  buck2 test mode/$mode //caffe2/caffe2/fb/dper/layer_models/tests:nn_ops_test -- --run-disabled
  buck2 test mode/$mode //caffe2/caffe2/fb/metrics:metrics_test -- --run-disabled
  buck2 test mode/$mode //deeplearning/numeric_suite/toolkit/test:net_transform_test -- --run-disabled
  buck2 test mode/$mode //f3/type_system:tests -- --run-disabled
done
```

**NOTE:** In the first test (`//caffe2/caffe2/python:hypothesis_test`), the two methods `test_constant_fill_from_tensor` and `test_recurrent` would crash. But these crash on hypothesis 5.49.0, too, so I'm leaving them alone.

Differential Revision: D44812706

Pull Request resolved: #98685 Approved by: https://github.com/malfet
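For reference, suppressing that health check in hypothesis looks roughly like the following. This is a generic sketch, not the actual caffe2 test code; the test name and strategy are made up.

```python
from hypothesis import HealthCheck, given, settings, strategies as st

# Suppress the filter_too_much health check so heavy .filter() usage does not
# abort the test when hypothesis keeps generating floats with small magnitude.
@settings(suppress_health_check=[HealthCheck.filter_too_much])
@given(st.floats(allow_nan=False, allow_infinity=False).filter(lambda x: abs(x) >= 1e-2))
def test_abs_at_least_001(x):
    assert abs(x) >= 1e-2
```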
This fixes a few failing cases where we fail to compute stride_hint for an indexing expression with ModularIndexing.

When can size_hint error out? It shouldn't happen when we are getting regular size hints for expressions whose free vars are in ShapeEnv. But this is not the case when we try to recover strides from indexing expressions (which is what stride_hint is for). Suppose you have an indexing expression that looks like
```
289*d0 + ModularIndexing(7399*d1 + d2, 1, 17) + 17*ModularIndexing(7399*d1 + d2, 17, 17) + 46240*ModularIndexing(7399*d1 + d2, 289, 128)
```
and want to understand its stride wrt variable `d1`. Let's ignore for a moment that stride for ModularIndexing is not well defined (it becomes negative around the modulo divisor value); even without that, the way we usually compute stride is to substitute `0` and `1` for `d1` and compute the difference of the indexing expression with those substitutions - this is our stride. But for the expression above, the difference would result in an expression that still has the free variable `d2` that we don't have a substitution for.

The fix this PR makes is to expand stride computation to substitute not only `0` and `1` for the variable we are computing a stride for, but also `0` for the other variables in the indexing expression (`support_vars`).

Note that computing strides in `stride_hints` is a performance optimization that we use to reorder dimensions or make split decisions for split reduction. If it fails, it's not a hard error - we may incorrectly apply reordering, but it won't affect correctness.

Pull Request resolved: #98783 Approved by: https://github.com/ezyang, https://github.com/voznesenskym
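A toy sketch of the substitution idea, using plain sympy symbols and a simple expression in place of Inductor's real ModularIndexing handling (function name and expression are illustrative):

```python
import sympy

def stride_hint_sketch(expr, var, support_vars):
    # Substitute 0 for every other variable so the difference below cannot
    # end up containing stray free symbols such as d2.
    others = {v: 0 for v in support_vars if v != var}
    at_zero = expr.subs({var: 0, **others})
    at_one = expr.subs({var: 1, **others})
    return sympy.simplify(at_one - at_zero)

d0, d1, d2 = sympy.symbols("d0 d1 d2")
expr = 289 * d0 + 17 * d1 + d2
print(stride_hint_sketch(expr, d1, [d0, d1, d2]))  # -> 17
```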
Summary: Modeled off of https://www.internalfb.com/code/fbsource/[5f363eaeab1b5d620b9df83ba0de65adfd96771b]/fbcode/caffe2/torch/fb/trainer/profilers/gpu_mem_signpost.py?lines=106-115 I didn't use the Scuba integration in torch/_inductor/fb/logging.py to avoid having to make a new Scuba table; probably should do this. Test Plan: ``` buck2 test //caffe2/test:test_dynamo ``` Differential Revision: D44850903 Pull Request resolved: #98790 Approved by: https://github.com/desertfire, https://github.com/bertmaher
This diff adds the ability to specify range constraints on dynamic dimensions. (Previously we only supported declaring a dynamic dimension, which gets the default range `[2, sympy.oo]`.) One point worth calling out: our initial design called for compound expressions like `lower <= dynamic_dim(x, d) <= upper`. However this seems difficult to support, because of a combination of desugaring and overloading semantics for such compound expressions in Python. Rather than silently doing the wrong thing, we explicitly error in this case and recommend users to specify multiple constraints, which is supported. Differential Revision: [D44847318](https://our.internmc.facebook.com/intern/diff/D44847318/) Pull Request resolved: #98779 Approved by: https://github.com/ezyang
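Based on the description above, usage would look something like the sketch below. The `dynamic_dim` import location and the `export(..., constraints=...)` signature are assumptions on my part, not taken from the message.

```python
import torch
from torch._dynamo import dynamic_dim, export  # assumed import locations

def f(x):
    return x * 2

x = torch.randn(8, 4)

# Two separate constraints instead of the unsupported compound form
# `2 <= dynamic_dim(x, 0) <= 1024`.
constraints = [
    2 <= dynamic_dim(x, 0),
    dynamic_dim(x, 0) <= 1024,
]
graph_module, guards = export(f, x, constraints=constraints)
```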
And move ROCm distributed job there as it's very flaky in trunk at the moment. Also move ROCm slow job to `slow` workflow as it should be. Pull Request resolved: #98858 Approved by: https://github.com/malfet, https://github.com/ZainRizvi
Fixes a Meta-internal use case. Pull Request resolved: #98809 Approved by: https://github.com/wconstab
Pull Request resolved: #98854 Approved by: https://github.com/albanD
…ter tracing and expansion (#98182)

This PR adds the GraphModuleTransformation class that can be used as the default transformation after `train_step()` is traced and expanded. The current implementation includes:
1. Wrap the input graph module with IterGraphModule. This will enable further graph optimizations, which are all implemented based on IterGraphModule.
2. The ability to lower the graph module to Inductor. To achieve this goal, `lower_to_inductor()` is implemented.

TODO:
1. The `override` and `gm_transformation` arguments have overlapping functions -- `override.transform` can be used to achieve the same function as `gm_transformation`. However, the current semantics of `override` is to override and transform partial graphs while `gm_transformation` is to transform the entire expanded GM. The final UX of `compile()` needs some discussion.
2. The current `lower_to_inductor()` assumes that the entire graph can be lowered to Inductor. This assumption is okay for integration of graph optimizations but is too restrictive for many models. We should upstream `partial_lowering()`.

Differential Revision: [D44616783](https://our.internmc.facebook.com/intern/diff/D44616783/)

Pull Request resolved: #98182 Approved by: https://github.com/mrshenli
Summary: Replace _dynamo.config with an object instead of a module.

Current usage patterns of setting and reading fields on config will work unchanged. Only changes needed going forward:
1. `import torch._dynamo.config` will not work. However, just doing `import torch._dynamo` is sufficient to access dynamo config as `torch._dynamo.config`.
2. Files inside the _dynamo folder need to access config via `from torch._dynamo.config_util import config` instead of `from torch._dynamo import config`, because _dynamo/__init__.py imports some of those files, which would create a circular import.

Fixes #ISSUE_NUMBER

Pull Request resolved: #96455 Approved by: https://github.com/williamwen42
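Concretely, the access patterns that keep working look like this; the specific field names below are just examples of existing dynamo config options, not part of this change.

```python
import torch._dynamo  # note: `import torch._dynamo.config` itself no longer works

# Reading and writing config fields behaves as before.
torch._dynamo.config.verbose = True
print(torch._dynamo.config.cache_size_limit)
```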
Pull Request resolved: #98791 Approved by: https://github.com/kumpera
Fixes lint errors introduced by [#98433](#98779) Pull Request resolved: #98873 Approved by: https://github.com/huydhn, https://github.com/malfet
Disable all flaky dynamic tests from #98626 (comment).

Rerun all test cases and update skip reasons. The cases failing on both static and dynamic shapes are skipped with unittest.skip. If a case fails only on dynamic shapes, it is skipped by skip_dynamic_test. A few are skipped with skip_ort_min_version, since ORT does not support the dynamic fx exporter until its next version.

Pull Request resolved: #98856 Approved by: https://github.com/BowenBao
https://github.com/pytorch/pytorch/blob/2fab2893aa7b255127e0b92560585d9ecf6dc506/torch/_dynamo/variables/builder.py#L759-L760

We already save `source = self.get_source()` at the beginning of `wrap_tensor()`. Since the source should be fixed at `VariableBuilder` construction time, we should be okay to reuse the `source` variable instead of calling `get_source()` every time.

Pull Request resolved: #98845 Approved by: https://github.com/ezyang
Fixes #98829 Pull Request resolved: #98831 Approved by: https://github.com/awgu
…ation debug (#98284) Throughout the compilation, multiple graphs will be generated. This PR adds a utility to dump the resulting graphs to a folder. Differential Revision: [D44661599](https://our.internmc.facebook.com/intern/diff/D44661599/) Pull Request resolved: #98284 Approved by: https://github.com/mrshenli
Summary
* Fixed an issue with `skip`
* Also removed some tests from test_misc.py and moved them to test_decorators.py as test_misc.py is becoming a dumping ground.

~~~
# Code - fn1 was not getting skipped earlier
def fn2(x):
    return x.sin()

@torch._dynamo.skip
def fn1(x):
    x = x.sigmoid()
    return fn2(x.cos())

def fn(x):
    return fn1(x.tan())

# Extracted graph
def forward(self, L_x_ : torch.Tensor):
    l_x_ = L_x_
    tan = l_x_.tan();  l_x_ = None
    return (tan,)

def forward(self, L_x_ : torch.Tensor):
    l_x_ = L_x_
    sin = l_x_.sin();  l_x_ = None
    return (sin,)
~~~

Pull Request resolved: #98862 Approved by: https://github.com/ezyang, https://github.com/jansel
When there are > 15000 polygons trace_plot starts to get really slow. So order the allocations and take the smallest allocations beyond the 15000 limit and put them into a single summarized polygon. A slider allows this limit to be adjusted. Pull Request resolved: #98865 Approved by: https://github.com/yf225
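A rough sketch of that summarization strategy; the data layout, names, and example values are made up for illustration, not taken from the trace_plot code.

```python
def summarize_allocations(allocations, limit=15000):
    """Keep the `limit` largest allocations as-is and fold the remainder
    into a single summarized entry, as described above."""
    ordered = sorted(allocations, key=lambda a: a["size"], reverse=True)
    kept, rest = ordered[:limit], ordered[limit:]
    if rest:
        kept.append({"name": f"{len(rest)} small allocations",
                     "size": sum(a["size"] for a in rest)})
    return kept

# Example: with a limit of 2, the three smallest blocks collapse into one entry.
print(summarize_allocations(
    [{"name": "a", "size": 512}, {"name": "b", "size": 256},
     {"name": "c", "size": 8}, {"name": "d", "size": 4}, {"name": "e", "size": 2}],
    limit=2,
))
```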
Summary: It looks like nccl 2.0+ no longer needs a lock to avoid being called concurrently with cudaFree. Test Plan: sandcastle + OSS CI Differential Revision: D44514446 Pull Request resolved: #97904 Approved by: https://github.com/malfet, https://github.com/kwen2501
…8285) This PR adds the `graph_optimization_pass` decorator, which all graph optimization passes should be wrapped with. This PR also introduces the first graph optimization, `comm_fusion_with_cat`, as the first use case of `graph_optimization_pass`. Differential Revision: [D44661608](https://our.internmc.facebook.com/intern/diff/D44661608/) Pull Request resolved: #98285 Approved by: https://github.com/yifuwang
`schedule_comm_wait` moves the wait_tensor ops as late as possible. Note that this optimization currently does not reorder the computation ops. For a `foreach`-based optimizer, we observe that reordering the computation ops is required to achieve good performance. Differential Revision: [D44761487](https://our.internmc.facebook.com/intern/diff/D44761487/) Pull Request resolved: #98578 Approved by: https://github.com/mrshenli
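As a rough illustration of the idea, here is an FX sketch that pushes wait nodes down to just before their first user. The `is_wait` predicate and the pass name are assumptions; this is not the actual `schedule_comm_wait` implementation.

```python
import torch.fx as fx

def delay_waits(gm: fx.GraphModule, is_wait) -> fx.GraphModule:
    # Move each wait node down so it sits immediately before its first user,
    # letting the communication it waits on overlap with unrelated compute.
    order = {n: i for i, n in enumerate(gm.graph.nodes)}
    for node in list(gm.graph.nodes):
        if node.op == "call_function" and is_wait(node) and node.users:
            first_user = min(node.users, key=order.__getitem__)
            first_user.prepend(node)  # re-insert the wait right before its first use
    gm.graph.lint()
    gm.recompile()
    return gm
```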
This will correctly functionalize the optimizer. Otherwise, there are orphaned copy_ ops. Differential Revision: [D44761512](https://our.internmc.facebook.com/intern/diff/D44761512/) Pull Request resolved: #98579 Approved by: https://github.com/mrshenli
This PR adds the ability to remove unused `copy_` ops (`len(node.users) == 0`) that are generated by tracing the optimizer. Differential Revision: [D44761556](https://our.internmc.facebook.com/intern/diff/D44761556/) Pull Request resolved: #98580 Approved by: https://github.com/mrshenli
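A minimal sketch of such a dead-`copy_` sweep over an FX graph; the function name is illustrative and this is not the actual pass.

```python
import torch
import torch.fx as fx

def remove_unused_copy(gm: fx.GraphModule) -> fx.GraphModule:
    for node in reversed(list(gm.graph.nodes)):
        # copy_ nodes produced by tracing the optimizer that nothing reads can go.
        if node.op == "call_function" and node.target is torch.ops.aten.copy_.default:
            if len(node.users) == 0:
                gm.graph.erase_node(node)
    gm.graph.lint()
    gm.recompile()
    return gm
```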
…ExecutorSymbolDef. (#98811) Summary: Due to an upstream change (llvm/llvm-project@8b1771b), multiple builds fail to build with llvm-17. Added an llvm version check. Test Plan: local testing on a failing build with trunk/llvm-12 Reviewed By: zhuhan0 Differential Revision: D44851324 Pull Request resolved: #98811 Approved by: https://github.com/malfet, https://github.com/bertmaher
Wrapper for users to insert constraints into model code. The constraints will not be maintained in the graph after tracing through make_fx, so retracing with dynamo/make_fx will not work. This will be supported after torch._assert support is implemented. Then we can convert the constrain_range calls to torch._asserts. Pull Request resolved: #98433 Approved by: https://github.com/avikchaudhuri, https://github.com/tugsbayasgalan
As we've deprecated 3.7 support for PyTorch Pull Request resolved: #98886 Approved by: https://github.com/PaliC, https://github.com/seemethere
Pull Request resolved: #99235 Approved by: https://github.com/kshitij12345
The strategy is that we will heap allocate a LargeNegativeIntSymNodeImpl whenever we have a large negative int, so that we can keep the old `is_symbolic` test (now called `is_heap_allocated`) on SymInt. Whenever we need to do something with these ints, though, we convert them back into a plain `int64_t` (and then, e.g., wrap it in whatever user-specified SymNodeImpl they need). We cannot wrap directly in the user-specified SymNodeImpl as we generally do not know what the "tracing context" is from C++. We expect large negative ints to be rare, so we don't apply optimizations like singleton-ifying INT_MIN.

Here's the order to review:
* c10/core/SymInt.h and cpp
* `is_symbolic` renamed to `is_heap_allocated` as I needed to audit all use sites: the old `is_symbolic` test would return true for a large negative int, but it would be wrong to then try to dispatch on the LargeNegativeIntSymNodeImpl, which supports very few operations. In this file, I had to update expect_int.
* If you pass in a large negative integer, we instead heap allocate it in `promote_to_negative`. The function is written in a funny way to keep compact constructor code for SymInt (the heap allocation happens out of line).
* clone is now moved out-of-line.
* New method maybe_as_int which will give you a constant int if possible, either because it's stored inline or in LargeNegativeIntSymNodeImpl. This is the preferred replacement for the previous use of is_symbolic() and then as_int_unchecked().
* Rename toSymNodeImpl to toSymNode, which is more correct (since it returns a SymNode).
* Complete rewrite of `normalize_symints.cpp` to use the new `maybe_as_int`. Cannot easily use the old code structure, so it's now done using a macro and typing out each case manually (it's actually not that bad).
* Reimplementations of all the unary operators by hand to use `maybe_as_int`, relatively simple.
* c10/core/LargeNegativeIntSymNodeImpl.h - Just stores an int64_t value, but it has to be big and negative. Most methods are not implemented, since we will rewrap the large negative int in the real SymNodeImpl subclass before doing operations with it.
* The rest of the files are just rewriting code to use `maybe_as_int`. There is a nontrivial comment in c10/core/SymIntArrayRef.h

Very minor test adjustment in c10/test/core/SymInt_test.cpp. Plan to exercise this properly in the next PR.

Companion XLA PR: pytorch/xla#4882

Signed-off-by: Edward Z. Yang <[email protected]> Pull Request resolved: #99157 Approved by: https://github.com/albanD
Signed-off-by: Edward Z. Yang <[email protected]> Pull Request resolved: #98968 Approved by: https://github.com/xw285cornell
Differential Revision: [D45028686](https://our.internmc.facebook.com/intern/diff/D45028686) Pull Request resolved: #99190 Approved by: https://github.com/yifuwang
This commit uses `aten.arange.default` and `aten.arange.start` to test `aten.sym_numel`. Differential Revision: [D45028715](https://our.internmc.facebook.com/intern/diff/D45028715) Pull Request resolved: #99206 Approved by: https://github.com/yifuwang
Differential Revision: [D45028726](https://our.internmc.facebook.com/intern/diff/D45028726) Pull Request resolved: #99231 Approved by: https://github.com/yifuwang
Differential Revision: [D45028732](https://our.internmc.facebook.com/intern/diff/D45028732) Pull Request resolved: #99232 Approved by: https://github.com/yifuwang
Differential Revision: [D45028740](https://our.internmc.facebook.com/intern/diff/D45028740) Pull Request resolved: #99233 Approved by: https://github.com/yifuwang
Previously, we had a problem when partitioning forward-backward dynamic graphs, which is that we could end up with a backward graph that mentions a symbol in an input tensor (e.g., `f32[s0 + s1]`), but without this symbol being otherwise bound elsewhere. When this happens, we have no way of actually deriving the values of `s0` and `s1`. Our fix for this in #93059 was to just retrace the graph, so that s0 + s1 got allocated a new symbol s2 and everything was happy. However, this strategy had other problems, namely (1) we lost all information from the previous ShapeEnv, including guards and (2) we end up allocating a LOT of fresh new symbols in backwards. With this change, we preserve the same ShapeEnv between forward and backwards. How do we do this? We simply require that every symbol which may be present inside tensors, ALSO be a plain SymInt input to the graph. This invariant is enforced by Dynamo. Once we have done this, we can straightforwardly modify the partitioner to preserve these SymInt as saved for backwards, if they are needed in the backwards graph to preserve the invariant as well. This apparently breaks yolov3, but since everything else is OK I'm merging this as obviously good and investigating later. Signed-off-by: Edward Z. Yang <[email protected]> Pull Request resolved: #99089 Approved by: https://github.com/voznesenskym
**Summary** Enable the `decomposed dequant - pointwise ops - decomposed quant` vectorization code gen inside inductor. Here is the example in the UT and the generated code: Example: * `decomposed dequant - relu - decomposed quant` pattern. * Using `uint8` as the quantized tensor data type. Generated Code: ``` kernel_cpp_0 = async_compile.cpp(''' #include "/tmp/torchinductor_root/hw/chwr6vy6e6sd25sfh42qtywkuf2emodexm2aomp3lbrcxwznfwyi.h" extern "C" void kernel(const unsigned char* in_ptr0, unsigned char* out_ptr0) { #pragma omp parallel num_threads(56) { { #pragma omp for for(long i0=static_cast<long>(0); i0<static_cast<long>(27); i0+=static_cast<long>(1)) { auto tmp0 = at::vec::load_uint8_as_float(in_ptr0 + static_cast<long>(16*i0)); auto tmp1 = (tmp0); auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(100)); auto tmp3 = tmp1 - tmp2; auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(0.01)); auto tmp5 = tmp3 * tmp4; auto tmp6 = at::vec::clamp_min(tmp5, decltype(tmp5)(0)); auto tmp7 = at::vec::Vectorized<float>(static_cast<float>(100.0)); auto tmp8 = tmp6 * tmp7; auto tmp9 = tmp8.round(); auto tmp10 = tmp9 + tmp2; auto tmp11 = at::vec::Vectorized<float>(static_cast<float>(0)); auto tmp12 = at::vec::maximum(tmp10, tmp11); auto tmp13 = at::vec::Vectorized<float>(static_cast<float>(255)); auto tmp14 = at::vec::minimum(tmp12, tmp13); auto tmp15 = (tmp14); tmp15.store_float_as_uint8(out_ptr0 + static_cast<long>(16*i0)); } #pragma omp for simd simdlen(8) for(long i0=static_cast<long>(432); i0<static_cast<long>(441); i0+=static_cast<long>(1)) { auto tmp0 = in_ptr0[static_cast<long>(i0)]; auto tmp1 = static_cast<float>(tmp0); auto tmp2 = static_cast<float>(100); auto tmp3 = tmp1 - tmp2; auto tmp4 = static_cast<float>(0.01); auto tmp5 = tmp3 * tmp4; auto tmp6 = tmp5 * (tmp5>0); auto tmp7 = static_cast<float>(100.0); auto tmp8 = tmp6 * tmp7; auto tmp9 = std::nearbyint(tmp8); auto tmp10 = tmp9 + tmp2; auto tmp11 = static_cast<float>(0); auto tmp12 = (tmp11 != tmp11) ? tmp11 : std::max(tmp10, tmp11); auto tmp13 = static_cast<float>(255); auto tmp14 = (tmp13 != tmp13) ? tmp13 : std::min(tmp12, tmp13); auto tmp15 = static_cast<unsigned char>(tmp14); out_ptr0[static_cast<long>(i0)] = tmp15; } } } } ''') ``` **Test Plan** ``` cd test/inductor && python -m pytest test_cpu_repro.py -k test_decomposed_dequant_relu_quant ``` Pull Request resolved: #98489 Approved by: https://github.com/jgong5, https://github.com/jansel
Pull Request resolved: #99130 Approved by: https://github.com/ngimel
This PR introduces the functionalization of RNG ops. Key points are * Introduces a new `philox_rand` prim operator that accepts seed, offset. * Adds decompositions for random operators that use these philox_rand prims * Adds a PhiloxStateTracker to track the offset for each occurence of rand ops * Changes calling convention of AOT Autograd and adds <fwd_seed, fwd_base_offset> and <bwd_seed, bwd_base_offset> * Monkeypatches set_rng_state and get_rng_state while AOT Autograd tracing to record the rng state behavior * Raises assertion for CPU because CPU does not Philox RNG. Not dealt in this PR * dropout op - offset calculation is different * other distributions like normal, poisson etc * Inductor support * Cudagraph support * Dynamic shape support An example ~~~ class Custom(torch.autograd.Function): @staticmethod def forward(ctx, x): ctx.save_for_backward(x) a = torch.rand_like(x) * x a = torch.rand_like(x) * a return a @staticmethod def backward(ctx, grad_out): x, = ctx.saved_tensors return grad_out * torch.rand_like(grad_out) * torch.cos(x) ====== Forward graph 0 ====== def forward(self, fwd_seed_1: i64[], fwd_base_offset_1: i64[], primals_1: f32[16, 16]): # No stacktrace found for following nodes add: i64[] = torch.ops.aten.add.Tensor(fwd_base_offset_1, 0) philox_rand: f32[16, 16] = torch.ops.prims.philox_rand.default([16, 16], fwd_seed_1, add, [16, 1], device(type='cuda', index=0), torch.float32); add = None mul: f32[16, 16] = torch.ops.aten.mul.Tensor(philox_rand, primals_1); philox_rand = None add_1: i64[] = torch.ops.aten.add.Tensor(fwd_base_offset_1, 4); fwd_base_offset_1 = None philox_rand_1: f32[16, 16] = torch.ops.prims.philox_rand.default([16, 16], fwd_seed_1, add_1, [16, 1], device(type='cuda', index=0), torch.float32); fwd_seed_1 = add_1 = None mul_1: f32[16, 16] = torch.ops.aten.mul.Tensor(philox_rand_1, mul); philox_rand_1 = mul = None return [mul_1, primals_1] ====== Backward graph 0 ====== def forward(self, bwd_seed_1: i64[], bwd_base_offset_1: i64[], primals_1: f32[16, 16], tangents_1: f32[16, 16]): # No stacktrace found for following nodes add_2: i64[] = torch.ops.aten.add.Tensor(bwd_base_offset_1, 0); bwd_base_offset_1 = None philox_rand_2: f32[16, 16] = torch.ops.prims.philox_rand.default([16, 16], bwd_seed_1, add_2, [16, 1], device(type='cuda', index=0), torch.float32); bwd_seed_1 = add_2 = None mul_2: f32[16, 16] = torch.ops.aten.mul.Tensor(tangents_1, philox_rand_2); tangents_1 = philox_rand_2 = None cos: f32[16, 16] = torch.ops.aten.cos.default(primals_1); primals_1 = None mul_3: f32[16, 16] = torch.ops.aten.mul.Tensor(mul_2, cos); mul_2 = cos = None return [mul_3] ~~~ Pull Request resolved: #97377 Approved by: https://github.com/ezyang
* Introduce a frame counter which lets us uniquely identify frames. This makes it easier to tell if you are recompiling the same frame.
* Shorten evaluate_expr to eval for more visual distinctiveness.

Signed-off-by: Edward Z. Yang <[email protected]> Pull Request resolved: #99159 Approved by: https://github.com/Skylion007
Pull Request resolved: #99236 Approved by: https://github.com/kshitij12345
This reverts commit 6a50b83. Reverted #96995 on behalf of https://github.com/izaitsevfb due to Breaks internal tests
…8692) Fixes #ISSUE_NUMBER Add a parameter for pinned memory of storage to support other devices. In the future, other backends will provide their own allocators to create pinned memory. Pull Request resolved: #98692 Approved by: https://github.com/ezyang
make ATen/native/cuda/Embedding.cu data_ptr-correct Test Plan: Rely on CI. Pull Request resolved: #99183 Approved by: https://github.com/ezyang
make untemplated gemm calls data_ptr-correct Test Plan: Rely on CI. Pull Request resolved: #99184 Approved by: https://github.com/ezyang
…re_fx and decomposed convert flow (#98905)" This reverts commit 9e0df23. Reverted #98905 on behalf of https://github.com/izaitsevfb due to Conflicts with D44918496 landed internally, blocks diff train import
Pull Request resolved: #99252 Approved by: https://github.com/ngimel
…te nan (#99249) Pull Request resolved: #99249 Approved by: https://github.com/jansel, https://github.com/malfet
Fixes #99250, unfortunately I haven't figured out how to handle cross-entropy with smooth loss and weights. Pull Request resolved: #99255 Approved by: https://github.com/jansel, https://github.com/malfet
This diff renames quantization spec/config and operator config. It moves these data structures to the base quantizer. The base quantizer API now has get_supported_operators, which returns the list of patterns that a quantizer quantizes.

There are two choices being debated for how to convey to the user what a particular quantizer will quantize.

1. Modules. We just convey what nn.Modules will be quantized. Of course that does not mean that equivalent functional variants won't be quantized; however, for simplicity we just use nn.Module. If certain ops are quantized in a fused manner, that is considered an internal detail.
   Pros:
   - Simple. Only nn Modules are listed.
   - User does not have to see fusion patterns.
   Cons:
   - Perhaps confusing, because it is not clear whether supported = nn.Conv2d also means that the quantizer supports functional.conv2d.
   - Hiding fusion patterns means the user has no say in not fusing. Meaning if conv2d + relu is fused and the user configures to quantize only conv, the quantizer will also quantize the following relu as if conv2d + relu were fused.

2. Patterns. Be explicit about what is supported and enumerate all possible combinations.
   Pros:
   - It is very clear what the quantizer will do. No surprises.
   Cons:
   - It is not simple to parse.
   - It can be argued that fusion is an internal detail of the quantizer. So some quantizer implementations may choose to expose fusion patterns, while others may not and may not even provide any configurability.

One option is to move set_supported_operators/modules out of the base quantizer and let each quantizer define its own way of communicating what is supported. The issue with this is that when we want to "compose" multiple quantizers, there is no way for the user to define the order of composition if the user does not know what a quantizer supports. For example, quantizer A may quantize conv + relu while B quantizes only conv, but B's implementation is fast. In that case you may compose (B, A) such that B quantizes conv and A quantizes relu. Not knowing what A and B support makes such composition harder.

Differential Revision: [D44895547](https://our.internmc.facebook.com/intern/diff/D44895547/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D44895547/)!

Pull Request resolved: #99063 Approved by: https://github.com/jerryzh168
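To make the two options concrete, here is a hypothetical contrast; the class names and return shapes are illustrative only, not the actual quantizer API.

```python
import torch.nn as nn

class ModuleLevelQuantizer:
    # Option 1: report supported nn.Modules only; fusion stays an internal detail.
    def get_supported_operators(self):
        return [nn.Conv2d, nn.Linear]

class PatternLevelQuantizer:
    # Option 2: enumerate explicit patterns, including fused combinations.
    def get_supported_operators(self):
        return [(nn.Conv2d,), (nn.Conv2d, nn.ReLU), (nn.Linear,)]
```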
The entire branch is about to get deleted; it needs only .github and README.
See Commits and Changes for more details.
Created by pull[bot].