
Attention sinks are not applied correctly in integrations.flex_attention #41026

@jonny-so

Description

The score_mod function passed to flex_attention should operate on the pre-softmax attention scores, but the snippet below appears to be applying the attention biases (s_aux) and computing the post-softmax scores.

if s_aux is not None:
    logits_max = torch.max(score, dim=-1, keepdim=True).values
    sinks = torch.exp(s_aux - logits_max)
    unnormalized_scores = torch.exp(score - logits_max)
    normalizer = unnormalized_scores.sum(dim=-1, keepdim=True) + sinks
    score = unnormalized_scores / normalizer
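
For reference, flex_attention calls score_mod elementwise on a single pre-softmax score, so there is no row to reduce over inside it. A minimal sketch of the intended shape of a score_mod (illustrative only; the bias tensor and its shape are placeholders):

import torch

# Placeholder bias, indexed per (head, query position, key position).
bias = torch.randn(8, 128, 128)

def bias_score_mod(score, b, h, q_idx, kv_idx):
    # `score` is a single pre-softmax logit for (batch b, head h, query q_idx,
    # key kv_idx); score_mod never sees a whole row of scores.
    return score + bias[h, q_idx, kv_idx]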

I don't think it is possible to apply (gpt-oss-style) attention sinks using score_mod alone, but you can do it by passing return_lse=True to flex_attention and renormalising the output using the extra return value. If someone can point me to where unit tests for this code should live, I'm happy to PR a fix.
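
As a rough sketch of what I mean (not a tested fix; flex_attention_with_sinks is a made-up name, and it assumes s_aux holds per-head sink logits and that the returned lse is the natural-log logsumexp with shape (batch, num_heads, q_len)):

import torch
from torch.nn.attention.flex_attention import flex_attention

def flex_attention_with_sinks(query, key, value, s_aux, **kwargs):
    # query/key/value: (batch, num_heads, seq_len, head_dim); s_aux: (num_heads,)
    out, lse = flex_attention(query, key, value, return_lse=True, **kwargs)
    # The kernel normalises each row of scores by exp(lse). With a sink, the
    # normaliser should be exp(lse) + exp(s_aux), so rescale the output by
    # exp(lse) / (exp(lse) + exp(s_aux)) = sigmoid(lse - s_aux).
    sink_scale = torch.sigmoid(lse - s_aux[None, :, None]).unsqueeze(-1)
    return out * sink_scale.to(out.dtype)

Any score_mod / block_mask kwargs pass straight through, since the sink only changes the softmax normaliser.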
