
Conversation


@prashanth058 commented Nov 19, 2025

Issue:
The LoRA-wrapped RowParallelLinear added the bias as a separate bfloat16 operation instead of fusing it into the GEMM kernel the way the unwrapped layer does. This caused precision loss: the fused kernel can accumulate in higher precision (FP32) before converting to bfloat16, whereas a separate bfloat16 addition incurs an extra rounding step. The discrepancy appeared even with zero LoRA weights when comparing LoRA-wrapped outputs against merged-weight results.
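
A minimal, self-contained illustration of the double-rounding effect described above (plain PyTorch on random data, not vLLM's actual kernel path): rounding the matmul result to bfloat16 and then adding a bfloat16 bias generally differs from accumulating everything in FP32 and rounding once.

```python
import torch

torch.manual_seed(0)
x = torch.randn(8, 256, dtype=torch.float32)
w = torch.randn(512, 256, dtype=torch.float32)
b = torch.randn(512, dtype=torch.float32)

# "Fused" path: accumulate matmul and bias in FP32, round once to bfloat16.
fused = (x @ w.T + b).to(torch.bfloat16)

# "Separate" path: round the matmul to bfloat16, then add a bfloat16 bias.
separate = (x @ w.T).to(torch.bfloat16) + b.to(torch.bfloat16)

print((fused != separate).sum().item(), "of", fused.numel(), "elements differ")
```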

Fix:
Pass the bias to apply() only on TP rank 0 and only when skip_bias_add is False, so the quantization method can fuse the bias addition into the GEMM kernel. This matches the unwrapped layer's behavior and eliminates the precision discrepancy.
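
A hedged sketch of the fix pattern, not the actual diff: the class, the stubbed `apply()` (standing in for the quant method's fused matmul-plus-bias), and attribute names such as `tp_rank` and `skip_bias_add` are illustrative stand-ins based on the description above.

```python
import torch
import torch.nn.functional as F


class RowParallelLinearStub:
    """Illustrative stand-in for the unwrapped base layer (not vLLM's class)."""

    def __init__(self, weight, bias, tp_rank=0, skip_bias_add=False):
        self.weight = weight
        self.bias = bias
        self.tp_rank = tp_rank
        self.skip_bias_add = skip_bias_add

    def apply(self, x, bias=None):
        # Stands in for quant_method.apply(): the bias is folded into the
        # matmul, so it participates in the same higher-precision accumulation.
        return F.linear(x, self.weight, bias)


def lora_row_parallel_forward(base, x):
    # Fix: hand the bias to the fused GEMM only on TP rank 0 and only when the
    # layer does not defer the bias add, matching the unwrapped layer.
    bias = base.bias if (base.tp_rank == 0 and not base.skip_bias_add) else None
    output = base.apply(x, bias=bias)
    # Old behavior (extra bfloat16 rounding): output = base.apply(x); output += base.bias
    return output
```

With zero LoRA weights, this path should now match the unwrapped layer's output, which is what the zero-weight comparison in the issue checks.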

Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request addresses a precision loss issue in LoRA-wrapped RowParallelLinear by fusing the bias addition into the GEMM kernel, which aligns its behavior with the non-LoRA equivalent layer. The changes correctly pass the bias to the apply method only on rank 0 to prevent redundant additions in tensor-parallel setups, and the refactoring of the bias handling logic improves code clarity. The fix appears correct and well-implemented. I have no major concerns with this change.

@jeejeelee
Collaborator

Overall LGTM, could you please address the CI failure first?
