[pull] master from pytorch:master #1

awgu and others added 29 commits April 11, 2023 18:07
###  Overview
This PR de-duplicates graph inputs in TorchDynamo, using the `Source` as the unique identifier for each input. This closes #98743 and #98625.

### Details
`VariableBuilder.wrap_tensor()` should return a `VariableTracker` for the passed-in `value: Tensor`. If `value` is duplicated, we should avoid calling `OutputGraph.create_graph_input()` and `OutputGraph.add_grapharg()`.
- Note that `create_graph_input()` and `add_grapharg()` are not 1:1. For a constant source and either `wrap_sym()` or `wrap_unspecialized_primitive()`, TorchDynamo still calls `create_graph_input()` but not `add_grapharg()`.
- Note that `create_graph_input()` should be called before constructing the corresponding `VariableTracker`. TorchDynamo needs the `fx.Proxy` object to pass to `wrap_fx_proxy()`.

In this PR, the `OutputGraph` saves an additional mapping `input_source_to_var` from each graph input's `Source` to its `VariableTracker`, which works because `Source` is now hashable. This mapping should be updated each time `create_graph_input()` is called. However, since we must construct the `VariableTracker` after `create_graph_input()` returns, we must have a separate call to the `OutputGraph` to update the mapping.

If anyone has any suggestion on how to coalesce this logic and avoid having to remember to update `input_source_to_var` for each `create_graph_input()`, I would love to hear it.
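A minimal, self-contained sketch of the de-duplication flow described above (the classes here are simplified stand-ins, not the real TorchDynamo types):

```
# Simplified sketch of de-duplicating graph inputs by Source.
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen -> hashable, mirroring "Source is now hashable"
class Source:
    name: str

class VariableTracker:
    def __init__(self, proxy):
        self.proxy = proxy

class OutputGraph:
    def __init__(self):
        self.input_source_to_var = {}   # Source -> VariableTracker
        self.graph_inputs = []

    def create_graph_input(self, source):
        proxy = f"placeholder_{len(self.graph_inputs)}"  # stand-in for an fx.Proxy
        self.graph_inputs.append(proxy)
        return proxy

def wrap_tensor(output_graph, source):
    # Reuse the existing VariableTracker if this Source was already wrapped.
    if source in output_graph.input_source_to_var:
        return output_graph.input_source_to_var[source]
    proxy = output_graph.create_graph_input(source)
    var = VariableTracker(proxy)                     # constructed after create_graph_input
    output_graph.input_source_to_var[source] = var   # the extra bookkeeping step
    return var

g = OutputGraph()
s = Source("G['x']")
assert wrap_tensor(g, s) is wrap_tensor(g, s)   # duplicated input -> same tracker
assert len(g.graph_inputs) == 1                 # only one placeholder created
```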

<details>
<summary> Alternate Approach</summary>

Initially, I tried having TorchDynamo construct a new but equivalent `VariableTracker` for the duplicated tensor. However, I abandoned this approach after hitting an assertion in `wrap_fx_proxy_cls()`: `"example_value"` was already in the proxy node's metadata because we were reusing the primary tensor's `Proxy` object. Reusing the exact `VariableTracker` also seems less error-prone than constructing a new but identical `VariableTracker`.
</details>

### Testing
#### Global Variable Test
```
import torch
@torch.compile()
def f():
    return x + x
x = torch.randn(3)
f()
```

Before:
```
====== Forward graph 0 ======
 <eval_with_key>.6 class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: f32[3], arg1_1: f32[3]):
        # File: /data/users/ezyang/b/pytorch/ff.py:5, code: return x + x
        add: f32[3] = torch.ops.aten.add.Tensor(arg0_1, arg1_1);  arg0_1 = arg1_1 = None
        return (add,)
```

After (only `arg0_1` and no more `arg1_1`):
```
 ====== Forward graph 0 ======
 <eval_with_key>.4 class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: f32[3]):
        # File: dynamo/test_dup_global.py:8, code: return x + x
        add: f32[3] = torch.ops.aten.add.Tensor(arg0_1, arg0_1);  arg0_1 = None
        return (add,)
```

#### FSDP Test
Before this PR, we errored on
```
File "/.../pytorch/torch/_guards.py", line 244, in __post_init__
    assert self.input_source_a != self.input_source_b
```
and now there is no error.

---
The rename from `name_to_input` to `input_name_to_proxy` is not part of the core logic change and is a remnant from initial attempts. I can undo it later if desired, but I also feel that the new name is more informative. It also fixes the type annotation.

Pull Request resolved: #98775
Approved by: https://github.com/ezyang, https://github.com/voznesenskym
Summary:
In some cases, zero_point is returned as an int tensor. We want it to be a long.

This fixes a failed assertion in Executorch op_choose_qparams:
https://www.internalfb.com/code/fbsource/[4609e7dbbf2e]/fbcode/executorch/kernels/quantized/cpu/op_choose_qparams.cpp?lines=49-52
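A hedged illustration of the kind of fix described (the helper name is made up for this sketch): cast the returned zero_point to `torch.long` (int64) so downstream kernels that assert a Long tensor are satisfied.

```
import torch

def ensure_long_zero_point(zero_point: torch.Tensor) -> torch.Tensor:
    # op_choose_qparams in Executorch asserts a Long zero_point; cast if needed.
    return zero_point if zero_point.dtype == torch.long else zero_point.to(torch.long)

zp = torch.tensor([3], dtype=torch.int32)
assert ensure_long_zero_point(zp).dtype == torch.int64  # torch.long is int64
```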

Test Plan: CI

Reviewed By: jerryzh168

Differential Revision: D44764070

Pull Request resolved: #98746
Approved by: https://github.com/jerryzh168
Summary:
This diff fixes more test failures (T150117218) caused by upgrading the "hypothesis" library to 6.70.1 (D44523679).

# //caffe2/caffe2/python:hypothesis_test
This test generates float numbers and filters out those whose absolute values are less than 1e-2.
It is a known issue of the new version of "hypothesis" that it generates zeros or floats with small absolute values too often:
HypothesisWorks/hypothesis#3603
I'm circumventing this issue by suppressing the health check `filter_too_much`.

# //caffe2/caffe2/quantization/server:resize_nearest_dnnlowp_op_test
All arithmetic should be done in float32 when calculating the reference, since the network being tested uses float32 everywhere.
Mixing float32 with float64 or even integers will result in float64 intermediate values.
The difference in precision may cause off-by-1 errors when converting to integer.
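A hedged sketch of the precision point (helper names are illustrative, not the actual test code): keeping every operand in float32 keeps the reference on the same side of an integer boundary as the float32 operator under test.

```
import numpy as np

def nearest_src_index_f32(dst_index: int, scale: np.float32) -> int:
    # All-float32 arithmetic, matching the float32 network being tested.
    return int(np.float32(dst_index) * np.float32(scale))

def nearest_src_index_mixed(dst_index: int, scale: float) -> int:
    # Mixing in a Python float promotes the product to float64; the slightly
    # different rounding can put the value on the other side of an integer
    # boundary and cause an off-by-1 index after truncation.
    return int(np.float32(dst_index) * scale)
```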

Test Plan:
Run all the tests in both "dev" and "opt" modes:
```
for mode in dev opt; do
  buck2 test mode/$mode //caffe2/caffe2/python:hypothesis_test -- --run-disabled
  buck2 test mode/$mode //caffe2/caffe2/quantization/server:resize_nearest_dnnlowp_op_test -- --run-disabled
  buck2 test mode/$mode //caffe2/caffe2/fb/layers/tests:tum_history_test -- --run-disabled
  buck2 test mode/$mode //caffe2/caffe2/fb/dper/layer_models/tests:nn_ops_test -- --run-disabled
  buck2 test mode/$mode //caffe2/caffe2/fb/metrics:metrics_test -- --run-disabled
  buck2 test mode/$mode //deeplearning/numeric_suite/toolkit/test:net_transform_test -- --run-disabled
  buck2 test mode/$mode //f3/type_system:tests -- --run-disabled
done
```

**NOTE:** In the first test (`//caffe2/caffe2/python:hypothesis_test`), the two methods `test_constant_fill_from_tensor` and `test_recurrent` would crash.
But these crash on hypothesis 5.49.0, too, so I'm leaving them alone.

Differential Revision: D44812706

Pull Request resolved: #98685
Approved by: https://github.com/malfet
This fixes a few failing cases where we fail to compute `stride_hint` for an indexing expression with `ModularIndexing`.

When can size_hint error out? It shouldn't happen when we are getting regular size hints for expressions where free vars are in ShapeEnv. But this is not the case when we try to recover strides from indexing expressions (which is what stride_hint is for). Suppose you have an indexing expression that looks like
```
289*d0 + ModularIndexing(7399*d1 + d2, 1, 17) + 17*ModularIndexing(7399*d1 + d2, 17, 17) + 46240*ModularIndexing(7399*d1 + d2, 289, 128)
```
and want to understand its stride with respect to the variable `d1`. Let's ignore for a moment that the stride of ModularIndexing is not well defined (it becomes negative around the modulo divisor value). Even without that, the way we usually compute a stride is to substitute `0` and `1` for `d1` and take the difference of the indexing expression under those two substitutions - that difference is our stride. But for the expression above, the difference would still contain the free variable `d2`, which we don't have a substitution for.
The fix in this PR expands the stride computation to substitute not only `0` and `1` for the variable we are computing a stride for, but also `0` for all other variables in the indexing expression (`support_vars`).
Note that computing strides in `stride_hints` is a performance optimization that we use to reorder dimensions or make split decisions for split reductions. If it fails, it's not a hard error - we may incorrectly apply reordering, but it won't affect correctness.
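A small sympy sketch of the substitution idea (plain integer symbols stand in for Inductor's index variables; ModularIndexing is an Inductor-specific sympy function and is omitted here):

```
import sympy

d0, d1, d2 = sympy.symbols("d0 d1 d2", integer=True)
index = 289 * d0 + 17 * d1 + d2   # simplified stand-in for the real expression

def stride_hint(expr, var, support_vars):
    # Substitute 0 for every other free variable so the difference is a constant.
    zeros = {v: 0 for v in support_vars if v is not var}
    return sympy.simplify(expr.subs({var: 1, **zeros}) - expr.subs({var: 0, **zeros}))

print(stride_hint(index, d1, [d0, d1, d2]))  # 17
```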

Pull Request resolved: #98783
Approved by: https://github.com/ezyang, https://github.com/voznesenskym
Summary:
Modeled off of https://www.internalfb.com/code/fbsource/[5f363eaeab1b5d620b9df83ba0de65adfd96771b]/fbcode/caffe2/torch/fb/trainer/profilers/gpu_mem_signpost.py?lines=106-115

I didn't use the Scuba integration in torch/_inductor/fb/logging.py to avoid
having to make a new Scuba table; probably should do this.

Test Plan:
```
buck2 test //caffe2/test:test_dynamo
```

Differential Revision: D44850903

Pull Request resolved: #98790
Approved by: https://github.com/desertfire, https://github.com/bertmaher
This diff adds the ability to specify range constraints on dynamic dimensions. (Previously we only supported declaring a dynamic dimension, which gets the default range `[2, sympy.oo]`.)

One point worth calling out: our initial design called for compound expressions like `lower <= dynamic_dim(x, d) <= upper`. However this seems difficult to support, because of a combination of desugaring and overloading semantics for such compound expressions in Python. Rather than silently doing the wrong thing, we explicitly error in this case and recommend users to specify multiple constraints, which is supported.
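A hedged usage sketch under the API described here (treat the exact import path and `export` signature as assumptions about the code at this point in time):

```
import torch
from torch._dynamo import dynamic_dim, export  # import path assumed

def f(x):
    return x * 2

x = torch.randn(8, 4)

# A compound expression such as `2 <= dynamic_dim(x, 0) <= 1024` is rejected;
# specify the lower and upper bounds as two separate constraints instead.
constraints = [
    dynamic_dim(x, 0) >= 2,
    dynamic_dim(x, 0) <= 1024,
]
gm, guards = export(f, x, constraints=constraints)
```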

Differential Revision: [D44847318](https://our.internmc.facebook.com/intern/diff/D44847318/)

Pull Request resolved: #98779
Approved by: https://github.com/ezyang
And move ROCm distributed job there as it's very flaky in trunk at the moment.  Also move ROCm slow job to `slow` workflow as it should be.

Pull Request resolved: #98858
Approved by: https://github.com/malfet, https://github.com/ZainRizvi
Fixes a Meta-internal use case.

Pull Request resolved: #98809
Approved by: https://github.com/wconstab
…ter tracing and expansion (#98182)

This PR adds the GraphModuleTransformation class that can be used as the
default transformation after `train_step()` is traced and expanded. The
current implementation includes:
1. Wrap the input graph module with IterGraphModule. This will enable further graph optimizations, which are all implemented on top of IterGraphModule.
2. Ability to lower the graph module to the Inductor. To achieve this goal, `lower_to_inductor()` is implemented.

TODO:
1. `override` and `gm_transformation` have overlapping functionality -- `override.transform` can be used to achieve the same effect as `gm_transformation`. However, the current semantics of `override` is to override and transform partial graphs, while `gm_transformation` transforms the entire expanded GM. The final UX of `compile()` needs some discussion.

2. The current `lower_to_inductor()` assumes that the entire graph can be lowered to Inductor. This assumption is okay for integration of graph optimizations but is too restrictive for many models. We should upstream `partial_lowering()`.

Differential Revision: [D44616783](https://our.internmc.facebook.com/intern/diff/D44616783/)
Pull Request resolved: #98182
Approved by: https://github.com/mrshenli
Summary:
    Replace _dynamo.config with an object instead of module

    Current usage patterns of setting and reading fields on config will work
    unchanged.

    Only changes needed going forward:
    1. import torch._dynamo.config will not work. However, just doing
       import torch._dynamo is sufficient to access dynamo config
       as torch._dynamo.config.

    2. Files inside the _dynamo folder need to access config via
       from torch._dynamo.config_util import config instead of
       from torch._dynamo import config, because _dynamo/__init__.py
       imports some of those files and a direct import would be circular.
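A short sketch of the two access patterns described above (the attribute name is chosen for illustration):

```
# Outside the torch/_dynamo package: importing the package is enough.
import torch._dynamo
torch._dynamo.config.verbose = True   # reads/writes on config work unchanged

# `import torch._dynamo.config` no longer works, since `config` is now an
# object rather than a module.

# Inside the torch/_dynamo package itself, per the note above:
#   from torch._dynamo.config_util import config
```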


Pull Request resolved: #96455
Approved by: https://github.com/williamwen42
Fixes lint errors introduced by [#98433](#98779)

Pull Request resolved: #98873
Approved by: https://github.com/huydhn, https://github.com/malfet
Disable all flaky dynamic tests
From #98626 (comment)

Rerun all test cases and update skip reasons. The cases failing on both static and dynamic shapes are skipped with unittest.skip. If a case fails only with dynamic shapes, it is skipped by skip_dynamic_test. A few are skipped with skip_ort_min_version, since ORT does not support the dynamic fx exporter until the next version.
Pull Request resolved: #98856
Approved by: https://github.com/BowenBao
https://github.com/pytorch/pytorch/blob/2fab2893aa7b255127e0b92560585d9ecf6dc506/torch/_dynamo/variables/builder.py#L759-L760
We already save `source = self.get_source()` at the beginning of `wrap_tensor()`. Since the source should be fixed at `VariableBuilder` construction time, we should be able to reuse the `source` variable instead of calling `get_source()` every time.

Pull Request resolved: #98845
Approved by: https://github.com/ezyang
…ation debug (#98284)

Throughout the compilation, multiple graphs are generated. This PR adds a utility to dump the resulting graphs to a folder.

Differential Revision: [D44661599](https://our.internmc.facebook.com/intern/diff/D44661599/)
Pull Request resolved: #98284
Approved by: https://github.com/mrshenli
Summary
* Fixed an issue with `skip`
* Also removed some tests from test_misc.py and moved them to test_decorators.py as test_misc.py is becoming a dumping ground.

~~~

# Code - fn1 was not getting skipped earlier
def fn2(x):
    return x.sin()

@torch._dynamo.skip
def fn1(x):
    x = x.sigmoid()
    return fn2(x.cos())

def fn(x):
    return fn1(x.tan())

# Extracted graph
def forward(self, L_x_ : torch.Tensor):
    l_x_ = L_x_
    tan = l_x_.tan();  l_x_ = None
    return (tan,)

def forward(self, L_x_ : torch.Tensor):
    l_x_ = L_x_
    sin = l_x_.sin();  l_x_ = None
    return (sin,)
~~~

Pull Request resolved: #98862
Approved by: https://github.com/ezyang, https://github.com/jansel
When there are > 15,000 polygons, trace_plot starts to get really slow.
So we order the allocations, take the smallest allocations beyond the 15,000
limit, and collapse them into a single summarized polygon.
A slider allows this limit to be adjusted.
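A rough Python sketch of the summarization step (field names are illustrative):

```
def summarize_allocations(allocations, limit=15000):
    # Keep the `limit` largest allocations and collapse the rest into one
    # summarized entry so the plot stays responsive.
    ordered = sorted(allocations, key=lambda a: a["size"], reverse=True)
    kept, rest = ordered[:limit], ordered[limit:]
    if rest:
        kept.append({
            "name": f"{len(rest)} smallest allocations (summarized)",
            "size": sum(a["size"] for a in rest),
        })
    return kept
```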
Pull Request resolved: #98865
Approved by: https://github.com/yf225
Summary: It looks like NCCL 2.0+ no longer needs a lock to avoid being called concurrently with cudaFree.

Test Plan: sandcastle + OSS CI

Differential Revision: D44514446

Pull Request resolved: #97904
Approved by: https://github.com/malfet, https://github.com/kwen2501
…8285)

This PR adds the `graph_optimization_pass` decorator, which should wrap all graph optimization passes. It also introduces `comm_fusion_with_cat` as the first graph optimization and the first use case of `graph_optimization_pass`.

Differential Revision: [D44661608](https://our.internmc.facebook.com/intern/diff/D44661608/)
Pull Request resolved: #98285
Approved by: https://github.com/yifuwang
`schedule_comm_wait` moves the wait_tensor ops as late as possible. Note that this optimization currently does not reorder the computation ops. For a `foreach`-based optimizer, we observe that reordering the computation ops is required to achieve good performance.
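A minimal FX sketch of the idea (the predicate and pass structure are illustrative, not the actual implementation): move each wait-like node down so it sits immediately before its first user, leaving the computation ops in place.

```
import torch.fx as fx

def schedule_comm_wait(gm: fx.GraphModule, is_wait) -> None:
    # `is_wait` decides which nodes are wait_tensor-like; node order is
    # computed once up front for simplicity.
    order = {node: i for i, node in enumerate(gm.graph.nodes)}
    for node in list(gm.graph.nodes):
        if is_wait(node) and node.users:
            first_user = min(node.users, key=order.__getitem__)
            first_user.prepend(node)  # re-insert `node` just before its first user
    gm.graph.lint()
    gm.recompile()
```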

Differential Revision: [D44761487](https://our.internmc.facebook.com/intern/diff/D44761487/)
Pull Request resolved: #98578
Approved by: https://github.com/mrshenli
This will correctly functionalize the optimizer. Otherwise, there are orphaned `copy_` ops.

Differential Revision: [D44761512](https://our.internmc.facebook.com/intern/diff/D44761512/)
Pull Request resolved: #98579
Approved by: https://github.com/mrshenli
This PR adds the ability to remove unused `copy_` ops (`len(node.users) == 0`) that are generated by tracing the optimizer.
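A hedged FX sketch of the cleanup (not the actual pass):

```
import torch
import torch.fx as fx

def remove_unused_copy(gm: fx.GraphModule) -> None:
    # Erase aten.copy_ nodes left behind by tracing the optimizer once nothing
    # uses their result.
    for node in list(gm.graph.nodes):
        if (node.op == "call_function"
                and node.target is torch.ops.aten.copy_.default
                and len(node.users) == 0):
            gm.graph.erase_node(node)
    gm.recompile()
```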

Differential Revision: [D44761556](https://our.internmc.facebook.com/intern/diff/D44761556/)
Pull Request resolved: #98580
Approved by: https://github.com/mrshenli
…ExecutorSymbolDef. (#98811)

Summary:
Due to an upstream change, multiple builds fail to build with llvm-17.
llvm/llvm-project@8b1771b
Added an LLVM version check.

Test Plan: local testing on failing build with trunk/llvm-12

Reviewed By: zhuhan0

Differential Revision: D44851324

Pull Request resolved: #98811
Approved by: https://github.com/malfet, https://github.com/bertmaher
Wrapper for users to insert constraints into model code.

The constraints will not be maintained in the graph after tracing through make_fx, so retracing with dynamo/make_fx will not work. This will be supported once torch._assert support is implemented; then we can convert the constrain_range calls to torch._asserts.

Pull Request resolved: #98433
Approved by: https://github.com/avikchaudhuri, https://github.com/tugsbayasgalan
As we've deprecated Python 3.7 support for PyTorch

Pull Request resolved: #98886
Approved by: https://github.com/PaliC, https://github.com/seemethere
qqaatw and others added 23 commits April 15, 2023 22:19
The strategy is that we will heap-allocate a LargeNegativeIntSymNodeImpl whenever we have a large negative int, so that we can keep the old `is_symbolic` test (now called `is_heap_allocated`) on SymInt. Whenever we need to do something with these ints, though, we convert them back into a plain `int64_t` (and then, e.g., wrap it in whatever user-specified SymNodeImpl they need). We cannot wrap directly in the user-specified SymNodeImpl as we generally do not know what the "tracing context" is from C++. We expect large negative ints to be rare, so we don't apply optimizations like singleton-ifying INT_MIN.  Here's the order to review:

* c10/core/SymInt.h and cpp
  * `is_symbolic` renamed to `is_heap_allocated` as I needed to audit all use sites: the old `is_symbolic` test would return true for large negative int, but it would be wrong to then try to dispatch on the LargeNegativeIntSymNodeImpl which supports very few operations. In this file, I had to update expect_int,
  * If you pass in a large negative integer, we instead heap allocate it in `promote_to_negative`. The function is written in a funny way to keep compact constructor code for SymInt (the heap allocation happens out of line)
  * clone is now moved out-of-line
  * New method maybe_as_int which will give you a constant int if it is possible, either because it's stored inline or in LargeNegativeIntSymNodeImpl. This is the preferred replacement for previous use of is_symbolic() and then as_int_unchecked().
  * Rename toSymNodeImpl to toSymNode, which is more correct (since it returns a SymNode)
  * Complete rewrite of `normalize_symints.cpp` to use the new `maybe_as_int`. Cannot easily use the old code structure, so it's now done with a macro, typing out each case manually (it's actually not that bad).
  * Reimplementations of all the unary operators by hand to use `maybe_as_int`, relatively simple.
* c10/core/LargeNegativeIntSymNodeImpl.h - Just stores a int64_t value, but it has to be big and negative. Most methods are not implemented, since we will rewrap the large negative int in the real SymNodeImpl subclass before doing operations with it
* The rest of the files are just rewriting code to use `maybe_as_int`. There is a nontrivial comment in c10/core/SymIntArrayRef.h

Very minor test adjustment in c10/test/core/SymInt_test.cpp . Plan to exercise this properly in next PR.

Companion XLA PR: pytorch/xla#4882

Signed-off-by: Edward Z. Yang <[email protected]>
Pull Request resolved: #99157
Approved by: https://github.com/albanD
Signed-off-by: Edward Z. Yang <[email protected]>
Pull Request resolved: #98968
Approved by: https://github.com/xw285cornell
This commit uses `aten.arange.default` and `aten.arange.start` to
test `aten.sym_numel`.


Differential Revision: [D45028715](https://our.internmc.facebook.com/intern/diff/D45028715)
Pull Request resolved: #99206
Approved by: https://github.com/yifuwang
Previously, we had a problem when partitioning forward-backward dynamic graphs, which is that we could end up with a backward graph that mentions a symbol in an input tensor (e.g., `f32[s0 + s1]`), but without this symbol being otherwise bound elsewhere. When this happens, we have no way of actually deriving the values of `s0` and `s1`. Our fix for this in #93059 was to just retrace the graph, so that s0 + s1 got allocated a new symbol s2 and everything was happy. However, this strategy had other problems, namely (1) we lost all information from the previous ShapeEnv, including guards and (2) we end up allocating a LOT of fresh new symbols in backwards.

With this change, we preserve the same ShapeEnv between forward and backwards. How do we do this? We simply require that every symbol which may be present inside tensors, ALSO be a plain SymInt input to the graph. This invariant is enforced by Dynamo. Once we have done this, we can straightforwardly modify the partitioner to preserve these SymInt as saved for backwards, if they are needed in the backwards graph to preserve the invariant as well.

This apparently breaks yolov3, but since everything else is OK I'm merging this as obviously good and investigating later.

Signed-off-by: Edward Z. Yang <[email protected]>

Pull Request resolved: #99089
Approved by: https://github.com/voznesenskym
**Summary**
Enable the `decomposed dequant - pointwise ops - decomposed quant` vectorization code gen inside inductor.
Here is the example in the UT and the generated code:

Example:
* `decomposed dequant - relu - decomposed quant` pattern.
* Using `uint8` as the quantized tensor data type.

Generated Code:
```
kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_root/hw/chwr6vy6e6sd25sfh42qtywkuf2emodexm2aomp3lbrcxwznfwyi.h"
extern "C" void kernel(const unsigned char* in_ptr0,
                       unsigned char* out_ptr0)
{
    #pragma omp parallel num_threads(56)
    {
        {
            #pragma omp for
            for(long i0=static_cast<long>(0); i0<static_cast<long>(27); i0+=static_cast<long>(1))
            {
                auto tmp0 = at::vec::load_uint8_as_float(in_ptr0 + static_cast<long>(16*i0));
                auto tmp1 = (tmp0);
                auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(100));
                auto tmp3 = tmp1 - tmp2;
                auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(0.01));
                auto tmp5 = tmp3 * tmp4;
                auto tmp6 = at::vec::clamp_min(tmp5, decltype(tmp5)(0));
                auto tmp7 = at::vec::Vectorized<float>(static_cast<float>(100.0));
                auto tmp8 = tmp6 * tmp7;
                auto tmp9 = tmp8.round();
                auto tmp10 = tmp9 + tmp2;
                auto tmp11 = at::vec::Vectorized<float>(static_cast<float>(0));
                auto tmp12 = at::vec::maximum(tmp10, tmp11);
                auto tmp13 = at::vec::Vectorized<float>(static_cast<float>(255));
                auto tmp14 = at::vec::minimum(tmp12, tmp13);
                auto tmp15 = (tmp14);
                tmp15.store_float_as_uint8(out_ptr0 + static_cast<long>(16*i0));
            }
            #pragma omp for simd simdlen(8)
            for(long i0=static_cast<long>(432); i0<static_cast<long>(441); i0+=static_cast<long>(1))
            {
                auto tmp0 = in_ptr0[static_cast<long>(i0)];
                auto tmp1 = static_cast<float>(tmp0);
                auto tmp2 = static_cast<float>(100);
                auto tmp3 = tmp1 - tmp2;
                auto tmp4 = static_cast<float>(0.01);
                auto tmp5 = tmp3 * tmp4;
                auto tmp6 = tmp5 * (tmp5>0);
                auto tmp7 = static_cast<float>(100.0);
                auto tmp8 = tmp6 * tmp7;
                auto tmp9 = std::nearbyint(tmp8);
                auto tmp10 = tmp9 + tmp2;
                auto tmp11 = static_cast<float>(0);
                auto tmp12 = (tmp11 != tmp11) ? tmp11 : std::max(tmp10, tmp11);
                auto tmp13 = static_cast<float>(255);
                auto tmp14 = (tmp13 != tmp13) ? tmp13 : std::min(tmp12, tmp13);
                auto tmp15 = static_cast<unsigned char>(tmp14);
                out_ptr0[static_cast<long>(i0)] = tmp15;
            }
        }
    }
}
''')
```

**Test Plan**
```
cd test/inductor &&  python -m pytest test_cpu_repro.py -k test_decomposed_dequant_relu_quant
```

Pull Request resolved: #98489
Approved by: https://github.com/jgong5, https://github.com/jansel
This PR introduces the functionalization of RNG ops. Key points are

* Introduces a new `philox_rand` prim operator that accepts seed, offset.
* Adds decompositions for random operators that use these philox_rand prims
* Adds a PhiloxStateTracker to track the offset for each occurrence of rand ops
* Changes calling convention of AOT Autograd and adds <fwd_seed, fwd_base_offset> and <bwd_seed, bwd_base_offset>
* Monkeypatches set_rng_state and get_rng_state while AOT Autograd tracing to record the rng state behavior
* Raises an assertion for CPU because CPU does not use Philox RNG.

Not dealt with in this PR
* dropout op - offset calculation is different
* other distributions like normal, poisson etc
* Inductor support
* Cudagraph support
* Dynamic shape support

An example
~~~

class Custom(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        a = torch.rand_like(x) * x
        a = torch.rand_like(x) * a
        return a

    @staticmethod
    def backward(ctx, grad_out):
        x, = ctx.saved_tensors
        return grad_out * torch.rand_like(grad_out) * torch.cos(x)

====== Forward graph 0 ======
def forward(self, fwd_seed_1: i64[], fwd_base_offset_1: i64[], primals_1: f32[16, 16]):
    # No stacktrace found for following nodes
    add: i64[] = torch.ops.aten.add.Tensor(fwd_base_offset_1, 0)
    philox_rand: f32[16, 16] = torch.ops.prims.philox_rand.default([16, 16], fwd_seed_1, add, [16, 1], device(type='cuda', index=0), torch.float32);  add = None
    mul: f32[16, 16] = torch.ops.aten.mul.Tensor(philox_rand, primals_1);  philox_rand = None
    add_1: i64[] = torch.ops.aten.add.Tensor(fwd_base_offset_1, 4);  fwd_base_offset_1 = None
    philox_rand_1: f32[16, 16] = torch.ops.prims.philox_rand.default([16, 16], fwd_seed_1, add_1, [16, 1], device(type='cuda', index=0), torch.float32);  fwd_seed_1 = add_1 = None
    mul_1: f32[16, 16] = torch.ops.aten.mul.Tensor(philox_rand_1, mul);  philox_rand_1 = mul = None
    return [mul_1, primals_1]

====== Backward graph 0 ======
def forward(self, bwd_seed_1: i64[], bwd_base_offset_1: i64[], primals_1: f32[16, 16], tangents_1: f32[16, 16]):
    # No stacktrace found for following nodes
    add_2: i64[] = torch.ops.aten.add.Tensor(bwd_base_offset_1, 0);  bwd_base_offset_1 = None
    philox_rand_2: f32[16, 16] = torch.ops.prims.philox_rand.default([16, 16], bwd_seed_1, add_2, [16, 1], device(type='cuda', index=0), torch.float32);  bwd_seed_1 = add_2 = None
    mul_2: f32[16, 16] = torch.ops.aten.mul.Tensor(tangents_1, philox_rand_2);  tangents_1 = philox_rand_2 = None
    cos: f32[16, 16] = torch.ops.aten.cos.default(primals_1);  primals_1 = None
    mul_3: f32[16, 16] = torch.ops.aten.mul.Tensor(mul_2, cos);  mul_2 = cos = None
    return [mul_3]

~~~

Pull Request resolved: #97377
Approved by: https://github.com/ezyang
* Introduce a frame counter which lets us uniquely identify frames.
  This makes it easier to tell if you are recompiling the same frame
* Shorten evaluate_expr to eval for more visual distinctiveness

Signed-off-by: Edward Z. Yang <[email protected]>

Pull Request resolved: #99159
Approved by: https://github.com/Skylion007
This reverts commit 6a50b83.

Reverted #96995 on behalf of https://github.com/izaitsevfb due to Breaks internal tests
…8692)

Fixes #ISSUE_NUMBER

Add parameter for pin memory of storage to support other devices.
In the future, other backends will provide their own allocators to create pin memory.
Pull Request resolved: #98692
Approved by: https://github.com/ezyang
make ATen/native/cuda/Embedding.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: #99183
Approved by: https://github.com/ezyang
make untemplated gemm calls data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: #99184
Approved by: https://github.com/ezyang
…re_fx and decomposed convert flow (#98905)"

This reverts commit 9e0df23.

Reverted #98905 on behalf of https://github.com/izaitsevfb due to Conflicts with D44918496 landed internally, blocks diff train import
Fixes #99250, unfortunately I haven't figured out how to handle cross-entropy with smooth loss and weights.

Pull Request resolved: #99255
Approved by: https://github.com/jansel, https://github.com/malfet
This diff renames quantization spec/config and operator config. It moves these
data structures to the base quantizer.
The base quantizer API now has get_supported_operators, which returns the list of
patterns that a quantizer quantizes.
There are two choices being debated for how to convey to the user what a particular
quantizer will quantize.

1. Modules. We just convey which nn.Modules will be quantized. Of course that
does not mean that equivalent functional variants won't be quantized; however,
for simplicity we just use nn.Module. If certain ops are quantized in a fused
manner, then that is considered an internal detail. Pros and cons of this
approach:
Pros:
  - Simple. Only nn.Modules are listed.
  - The user does not have to see fusion patterns.
Cons:
  - Perhaps confusing, because it is not clear whether supported = nn.Conv2d also
    means that the quantizer supports functional.conv2d.
  - Hiding the fusion pattern means the user has no say in not fusing. If
    conv2d + relu is fused and the user configures only conv to be quantized, the
    quantizer will also quantize the following relu as if conv2d + relu were fused.

2. Patterns. Be explicit about what is supported and enumerate all possible
combinations.
Pros:
  - It is very clear what the quantizer will do; no surprises.
Cons:
  - It is not simple to parse.
  - It can be argued that fusion is an internal detail of the quantizer, so some
    quantizer implementations may choose to expose fusion patterns, while others
    may not and may not even provide any configurability.

One option is to move set_supported_operators/modules out of the base quantizer and
let each quantizer define its own way of communicating what is supported. The issue
with this is that when we want to "compose" multiple quantizers, there is no way
for the user to define the order of composition if the user does not know what a
quantizer supports. For example, quantizer A may quantize conv + relu while B
quantizes only conv, but B's implementation is fast. In that case you may compose (B, A)
so that B quantizes conv and A quantizes relu. Not knowing what A
and B support makes such composition harder.
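A rough sketch of why this matters for composition (every name here besides get_supported_operators is hypothetical; this is not a proposed API):

```
class ComposedQuantizer:
    # Hypothetical composition: earlier quantizers claim the patterns they
    # support; later quantizers only annotate what is still unclaimed.
    def __init__(self, quantizers):
        self.quantizers = quantizers  # order matters: first claim wins

    def annotate(self, model):
        claimed = set()
        for quantizer in self.quantizers:
            for pattern in quantizer.get_supported_operators():
                if pattern not in claimed:
                    quantizer.annotate_pattern(model, pattern)  # hypothetical hook
                    claimed.add(pattern)
        return model

# compose(B, A): B (fast, conv-only) claims conv first; A then handles relu.
```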

Differential Revision: [D44895547](https://our.internmc.facebook.com/intern/diff/D44895547/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D44895547/)!
Pull Request resolved: #99063
Approved by: https://github.com/jerryzh168
malfet added 3 commits April 16, 2023 18:13
Entire branch is about to get deleted; it needs only .github and README