System Info
Transformers version - 4.51.3
TEI - Latest main branch
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20             Driver Version: 570.133.20     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:36:00.0 Off |                    0 |
| N/A   35C    P0             79W / 350W  |  20049MiB /  46068MiB  |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
Description
I'm quite new to this field, so apologies in advance if I ask basic questions or make incorrect assumptions.
I have a ModernBERT reranker that I initially ran with the Transformers library. I saw that TEI would perform inference faster, so I migrated it over. What I observed were different outputs. Initially I thought this was fine, but the differences are quite large, so I stepped through the different stages, from tokenisation to the pooling strategy and the activation function. I found two areas where the two libraries differ that produce a significant difference between their outputs:
- Tokenisation - ModernBERT on TEI does not add additional padding (maybe this is fine?)
- Pooling - TEI uses CLS pooling by default, whereas Transformers uses the value in config.json, and it just so happens that Toma's reranker (the one I want to use) is configured for mean pooling (see the sketch below).
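To illustrate the pooling difference, here is a minimal sketch of the two strategies applied to a [batch, seq_len, hidden] tensor of last hidden states (the function and argument names are mine, not TEI or Transformers internals):

import torch

def cls_pool(hidden_states, attention_mask):
    # CLS pooling: keep only the first ([CLS]) token's embedding
    return hidden_states[:, 0]

def mean_pool(hidden_states, attention_mask):
    # Mean pooling: average over real tokens, masking out padding positions
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
    return (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

If I read the model files correctly, the relevant key in this model's config.json is "classifier_pooling": "mean", which Transformers respects but TEI appears not to read.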
I tested both libraries on the same machine, with TEI built from Dockerfile-cuda. One small note: I'm running TEI in float16 (I didn't have enough memory to run in float32), but I don't think this alone would drastically change the results (see the sanity check below).
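As a rough sanity check of that assumption, one could score the same pairs in both dtypes with plain Transformers. This is only a sketch (the sigmoid mirrors the reranker's single-logit activation; I haven't verified that TEI does exactly this internally):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "tomaarsen/reranker-ModernBERT-large-gooaq-bce"
tokenizer = AutoTokenizer.from_pretrained(name)
query = "How do I sell my shirt?"
texts = [
    "You can sell your shirt by going to the sell page and clicking the sell button.",
    "Ketchup is a condiment that is made from tomatoes.",
]

def score(dtype):
    # Load the model in the requested dtype and score the query/text pairs
    model = AutoModelForSequenceClassification.from_pretrained(
        name, torch_dtype=dtype
    ).to("cuda").eval()
    batch = tokenizer(
        [query] * len(texts), texts,
        padding=True, truncation=True, return_tensors="pt",
    ).to("cuda")
    with torch.no_grad():
        logits = model(**batch).logits.squeeze(-1)
    return torch.sigmoid(logits.float())

print(score(torch.float32))
print(score(torch.float16))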
I made a simple high-level example:
query = "How do I sell my shirt?"
texts = [
"You can sell your shirt by going to the sell page and clicking the sell button.",
"Ketchup is a condiment that is made from tomatoes.",
"You can sell your apple in the store.",
"How you can sell your clothes online."
]
pairs = [[query, text] for text in texts]
# Transformers:
[{'index': 0, 'score': 0.9871538},
{'index': 3, 'score': 0.60760754},
{'index': 2, 'score': 0.0041681635},
{'index': 1, 'score': 2.5670563e-05}]
# TEI:
[{'index': 0, 'score': 0.9993761},
{'index': 3, 'score': 0.29017562},
{'index': 2, 'score': 0.0047737225},
{'index': 1, 'score': 1.2219076e-05}]
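For reference, here is roughly how each set of scores was obtained. The CrossEncoder side follows the Sentence Transformers API; the TEI side assumes the server's /rerank route, and the host/port reflect my local setup:

import requests
from sentence_transformers import CrossEncoder

# Transformers side: CrossEncoder applies sigmoid to the single logit by default
model = CrossEncoder("tomaarsen/reranker-ModernBERT-large-gooaq-bce")
transformers_scores = model.predict(pairs)

# TEI side: the /rerank endpoint takes the query and candidate texts directly
response = requests.post(
    "http://localhost:8080/rerank",
    json={"query": query, "texts": texts},
)
tei_scores = response.json()  # list of {"index": ..., "score": ...}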
Model
The model that I'm running: https://huggingface.co/tomaarsen/reranker-ModernBERT-large-gooaq-bce
{
"model_id": "../models/toma-reranker",
"model_sha": null,
"model_dtype": "float16",
"model_type": {
"reranker": {
"id2label": {
"0": "LABEL_0"
},
"label2id": {
"LABEL_0": 0
}
}
},
"max_concurrent_requests": 512,
"max_input_length": 8192,
"max_batch_tokens": 16384,
"max_batch_requests": null,
"max_client_batch_size": 32,
"auto_truncate": false,
"tokenization_workers": 32,
"version": "1.7.0",
"sha": null,
"docker_label": null
}
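The block above is the response from TEI's /info endpoint, fetched with something like this (host and port are my local setup):

import requests
print(requests.get("http://localhost:8080/info").json())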
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
Steps to reproduce:
Transformers version - 4.51.3
TEI - Latest main branch
Python - 3.11.7
- Download the ModernBERT reranker model
- Run a CrossEncoder model from Transformers with the above model (float32)
- Build and run Dockerfile-cuda from TEI (no flash attention, so we only run ModernBert); an example launch command is sketched below
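For completeness, this is roughly how I launch the container; the image tag, volume path, and port mapping are specific to my build and machine:

docker run --gpus all -p 8080:80 -v $PWD/models:/data tei-cuda-local \
    --model-id /data/toma-reranker --dtype float16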
- Execute both on the example:
query = "How do I sell my shirt?"
texts = [
"You can sell your shirt by going to the sell page and clicking the sell button.",
"Ketchup is a condiment that is made from tomatoes.",
"You can sell your apple in the store.",
"How you can sell your clothes online."
]
pairs = [[query, text] for text in texts]
- Sort the Transformers results so we can compare with TEI (TEI already returns results sorted by score):
scores = model.predict(pairs)  # CrossEncoder scores, one per pair
transformers_sorted = sorted(
    ({"index": i, "score": float(score)} for i, score in enumerate(scores)),
    key=lambda x: x["score"],
    reverse=True,
)
- Observe results:
Transformers:
[{'index': 0, 'score': 0.9871538},
{'index': 3, 'score': 0.60760754},
{'index': 2, 'score': 0.0041681635},
{'index': 1, 'score': 2.5670563e-05}]
TEI:
[{'index': 0, 'score': 0.9993761},
{'index': 3, 'score': 0.29017562},
{'index': 2, 'score': 0.0047737225},
{'index': 1, 'score': 1.2219076e-05}]
Expected behavior
The results from Transformers and TEI should be similar.