System Info
Transformers version - 4.51.3
TEI - Latest main branch
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20             Driver Version: 570.133.20     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:36:00.0 Off |                    0 |
| N/A   35C    P0             79W / 350W  |  20049MiB /  46068MiB  |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
Description
I'm quite new to this field, so apologies in advance if I ask basic questions or make incorrect assumptions.
I have a ModernBERT reranker that I initially ran with the Transformers library. I saw that TEI would perform inference faster, so I migrated it over. What I observed were different outputs. Initially I thought this was fine, but the differences are quite large, so I stepped through the different stages, from tokenisation to the pooling strategy and the activation function. I found two areas where the two libraries differ that produce a significant difference between their outputs:
- Tokenisation - ModernBERT on TEI does not add additional padding (maybe this is fine?)
- Pooling - TEI uses CLS pooling by default, whereas Transformers uses the value in config.json, and it just so happens that Toma's reranker (the one I want to use) is configured for mean pooling (see the sketch below).
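To illustrate the pooling difference, here is a minimal sketch of the two strategies applied to a [batch, seq_len, hidden] tensor of last hidden states (the function and argument names are mine, not TEI or Transformers internals):

import torch

def cls_pool(hidden_states, attention_mask):
    # CLS pooling: keep only the first ([CLS]) token's embedding
    return hidden_states[:, 0]

def mean_pool(hidden_states, attention_mask):
    # Mean pooling: average over real tokens, masking out padding positions
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
    return (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

If I read the model files correctly, the relevant key in this model's config.json is "classifier_pooling": "mean", which Transformers respects but TEI appears not to read.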
I tested both libraries on the same machine, with TEI built from Dockerfile-cuda. One small note: I'm running TEI in float16 (I didn't have enough memory to run in float32), but I don't think this alone would drastically change the results (see the sanity check below).
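As a rough sanity check of that assumption, one could score the same pairs in both dtypes with plain Transformers. This is only a sketch (the sigmoid mirrors the reranker's single-logit activation; I haven't verified that TEI does exactly this internally):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "tomaarsen/reranker-ModernBERT-large-gooaq-bce"
tokenizer = AutoTokenizer.from_pretrained(name)
query = "How do I sell my shirt?"
texts = [
    "You can sell your shirt by going to the sell page and clicking the sell button.",
    "Ketchup is a condiment that is made from tomatoes.",
]

def score(dtype):
    # Load the model in the requested dtype and score the query/text pairs
    model = AutoModelForSequenceClassification.from_pretrained(
        name, torch_dtype=dtype
    ).to("cuda").eval()
    batch = tokenizer(
        [query] * len(texts), texts,
        padding=True, truncation=True, return_tensors="pt",
    ).to("cuda")
    with torch.no_grad():
        logits = model(**batch).logits.squeeze(-1)
    return torch.sigmoid(logits.float())

print(score(torch.float32))
print(score(torch.float16))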
I made a simple high-level example:
query = "How do I sell my shirt?"
texts = [
"You can sell your shirt by going to the sell page and clicking the sell button.",
"Ketchup is a condiment that is made from tomatoes.",
"You can sell your apple in the store.",
"How you can sell your clothes online."
]
pairs = [[query, text] for text in texts]
# Transformers:
[{'index': 0, 'score': 0.9871538},
{'index': 3, 'score': 0.60760754},
{'index': 2, 'score': 0.0041681635},
{'index': 1, 'score': 2.5670563e-05}]
# TEI:
[{'index': 0, 'score': 0.9993761},
{'index': 3, 'score': 0.29017562},
{'index': 2, 'score': 0.0047737225},
{'index': 1, 'score': 1.2219076e-05}]
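For reference, here is roughly how each set of scores was obtained. The CrossEncoder side follows the Sentence Transformers API; the TEI side assumes the server's /rerank route, and the host/port reflect my local setup:

import requests
from sentence_transformers import CrossEncoder

# Transformers side: CrossEncoder applies sigmoid to the single logit by default
model = CrossEncoder("tomaarsen/reranker-ModernBERT-large-gooaq-bce")
transformers_scores = model.predict(pairs)

# TEI side: the /rerank endpoint takes the query and candidate texts directly
response = requests.post(
    "http://localhost:8080/rerank",
    json={"query": query, "texts": texts},
)
tei_scores = response.json()  # list of {"index": ..., "score": ...}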
Model
The model that I'm running: https://huggingface.co/tomaarsen/reranker-ModernBERT-large-gooaq-bce
{
"model_id": "../models/toma-reranker",
"model_sha": null,
"model_dtype": "float16",
"model_type": {
"reranker": {
"id2label": {
"0": "LABEL_0"
},
"label2id": {
"LABEL_0": 0
}
}
},
"max_concurrent_requests": 512,
"max_input_length": 8192,
"max_batch_tokens": 16384,
"max_batch_requests": null,
"max_client_batch_size": 32,
"auto_truncate": false,
"tokenization_workers": 32,
"version": "1.7.0",
"sha": null,
"docker_label": null
}
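The block above is the response from TEI's /info endpoint, fetched with something like this (host and port are my local setup):

import requests
print(requests.get("http://localhost:8080/info").json())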
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
Steps to reproduce:
Transformers version - 4.51.3
TEI - Latest main branch
Python - 3.11.7
- Download the ModernBERT reranker model
- Run a CrossEncoder model from Transformers with the above model (float32)
- Build and run Dockerfile-cuda from TEI (no flash attention, so we only run ModernBert); an example launch command is sketched below
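For completeness, this is roughly how I launch the container; the image tag, volume path, and port mapping are specific to my build and machine:

docker run --gpus all -p 8080:80 -v $PWD/models:/data tei-cuda-local \
    --model-id /data/toma-reranker --dtype float16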
- Execute both on the example:
query = "How do I sell my shirt?"
texts = [
"You can sell your shirt by going to the sell page and clicking the sell button.",
"Ketchup is a condiment that is made from tomatoes.",
"You can sell your apple in the store.",
"How you can sell your clothes online."
]
pairs = [[query, text] for text in texts]
- Sort the Transformers results so we can compare with TEI (TEI already returns results sorted by score):
scores = model.predict(pairs)  # CrossEncoder scores, one per pair
transformers_sorted = sorted(
    ({"index": i, "score": float(score)} for i, score in enumerate(scores)),
    key=lambda x: x["score"],
    reverse=True,
)
- Observe results:
Transformers:
[{'index': 0, 'score': 0.9871538},
{'index': 3, 'score': 0.60760754},
{'index': 2, 'score': 0.0041681635},
{'index': 1, 'score': 2.5670563e-05}]
TEI:
[{'index': 0, 'score': 0.9993761},
{'index': 3, 'score': 0.29017562},
{'index': 2, 'score': 0.0047737225},
{'index': 1, 'score': 1.2219076e-05}]
Expected behavior
The results from Transformers and TEI should be similar.