Description
In the current implementation of the MXLinear layer:
```python
def forward(self, x):
    x_mx = MXTensor.to_mx(x, self.elem_dtype, self.block_size)
    w_mx = MXTensor.to_mx(self.weight, self.elem_dtype, self.block_size)
    y = F.linear(x_mx, w_mx, self.bias)
    y = NoopFwToMXBw.apply(y, self.elem_dtype, self.block_size)
    return y
```
there is only a single MX quantization step in the backward pass: the output gradient is quantized (in NoopFwToMXBw).
However, following the MX microscaling paper, there should be 4 quantization steps: two for the output gradient (along two different axes), one for the activation, and one for the weights (both along different axes than the forward quantizations).
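
For concreteness, here is the standard backward math of a linear layer with the reduction axis of each matmul spelled out (shapes and variable names are illustrative, not taken from the repo); the reduction axis is what dictates the quantization axis of each of the 4 steps:

```python
import torch

N, K, M = 8, 32, 16                       # batch, in_features, out_features
x = torch.randn(N, K)                     # activation
w = torch.randn(M, K)                     # weight
grad_output = torch.randn(N, M)           # incoming output gradient

# grad_input = grad_output @ w reduces over M (out_features):
# -> quantize grad_output along its last dim and w along dim 0.
grad_input = grad_output @ w              # (N, K)

# grad_weight = grad_output.t() @ x reduces over N (batch):
# -> quantize grad_output along dim 0 and x along dim 0,
#    i.e. along different axes than the forward quantizations (which reduce over K).
grad_weight = grad_output.t() @ x         # (M, K)
```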

Why does it matter: even though not officially confirmed by hardware vendors, it is clear that MX matmuls can only be fully optimized if the quantization axis corresponds to the reduction axis for both operands. Hence, running the MX backward pass on next-gen hardware will require the 4 quantization steps presented above. Changing the quantization axis results in a different quantization error, meaning that the current implementation potentially does not give a full picture of what MX training will look like on real hardware.
Potential fix: I believe we need a full implementation of the forward + backward pass of a blockwise_quantize_linear function, manually handling the backward-pass quantization steps.
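
A rough sketch of what such a custom autograd function could look like (this is only an illustration, not the repo's API: quantize_mx is a hypothetical axis-aware helper that would wrap MXTensor.to_mx plus dequantization, and the 2D-input, saved-tensor, and bias handling are simplifying assumptions):

```python
import torch
import torch.nn.functional as F


def quantize_mx(t: torch.Tensor, axis: int) -> torch.Tensor:
    """Hypothetical stand-in for axis-aware MX fake quantization
    (e.g. MXTensor.to_mx along `axis`, then dequantize). Identity here."""
    return t


class BlockwiseQuantizeLinear(torch.autograd.Function):
    """Sketch of blockwise_quantize_linear with the 4 backward quantizations.

    Assumes 2D input x of shape (N, K) and weight of shape (M, K).
    """

    @staticmethod
    def forward(ctx, x, weight, bias):
        # Forward matmul reduces over K (in_features):
        # quantize x and weight along their last dim.
        x_mx = quantize_mx(x, axis=-1)
        w_mx = quantize_mx(weight, axis=-1)
        ctx.save_for_backward(x, weight)
        ctx.has_bias = bias is not None
        return F.linear(x_mx, w_mx, bias)

    @staticmethod
    def backward(ctx, grad_output):
        x, weight = ctx.saved_tensors

        # grad_input = grad_output @ weight reduces over M (out_features):
        # quantize grad_output along its last dim and weight along dim 0.
        go_mx_for_input = quantize_mx(grad_output, axis=-1)
        w_mx_bw = quantize_mx(weight, axis=0)
        grad_input = go_mx_for_input @ w_mx_bw

        # grad_weight = grad_output.t() @ x reduces over N (batch):
        # quantize grad_output and x along dim 0 (different axes than forward).
        go_mx_for_weight = quantize_mx(grad_output, axis=0)
        x_mx_bw = quantize_mx(x, axis=0)
        grad_weight = go_mx_for_weight.t() @ x_mx_bw

        grad_bias = grad_output.sum(dim=0) if ctx.has_bias else None
        return grad_input, grad_weight, grad_bias
```

MXLinear.forward could then call BlockwiseQuantizeLinear.apply(x, self.weight, self.bias) instead of quantizing only in the forward direction, so that the emulated numerics match what a hardware MX matmul would see in both passes.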