ModernBert reranker - Different results vs transformers library #615

Closed
@kwnath

System Info

Transformers version - 4.51.3
TEI - Latest main branch

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20             Driver Version: 570.133.20     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:36:00.0 Off |                    0 |
| N/A   35C    P0             79W /  350W |   20049MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Description

I'm very green to this field so apologies in advance if I ask some stupid questions or make some assumptions that are incorrect.


I have a ModernBERT reranker that I initially ran with the Transformers library. I saw that TEI would perform inference faster, so I migrated it over, but I observed different outputs. At first I assumed this was fine, but the differences are quite large, so I stepped through the different stages, from tokenisation to the pooling strategy and the activation function. I found two areas where the two libraries differ that produce a significant difference between their outputs:

  1. Tokenisation - ModernBERT on TEI does not add additional padding (maybe this is fine?)
  2. The pooling strategy used by default is CLS, whereas Transformers uses the value in config.json, and it just so happens that Tom Aarsen's reranker (the one I want to use) is configured for mean pooling
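
To make the second point concrete, here is a minimal sketch (with made-up hidden states, not the real model's) of how CLS pooling and mean pooling diverge for the same sequence:

```python
import numpy as np

# Toy hidden states for one sequence: 4 tokens x 3 dims.
# Token 0 stands in for [CLS]; the last token is padding.
hidden = np.array([
    [1.0, 0.0, 0.0],  # [CLS]
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
    [9.0, 9.0, 9.0],  # padding (must be ignored by mean pooling)
])
mask = np.array([1.0, 1.0, 1.0, 0.0])  # attention mask

# CLS pooling: the classification head sees only the first token's vector.
cls_pooled = hidden[0]

# Mean pooling: average over the non-padded tokens only.
mean_pooled = (hidden * mask[:, None]).sum(axis=0) / mask.sum()

print(cls_pooled)   # [1. 0. 0.]
print(mean_pooled)  # [0.333..., 0.666..., 1.0]
```

The two pooled vectors feed the same classification head, so a CLS-pooled run and a mean-pooled run of the same checkpoint will generally produce different logits.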

I tested both libraries on the same machine, with TEI built from Dockerfile-cuda. One small note: I'm running TEI in float16 (I didn't have enough memory to run with float32), but I don't think this would drastically change the results.
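
As a sanity check on that assumption, float16 rounding of a score in this range is on the order of 1e-4, far smaller than the score gaps shown below:

```python
import numpy as np

# One of the float32 scores from the comparison below.
x32 = np.float32(0.60760754)
x16 = np.float16(x32)  # what float16 storage would round it to

# The rounding error is bounded by half a ulp (~2.4e-4 near 0.6).
print(float(x16), abs(float(x16) - float(x32)))
```

So float16 precision alone cannot account for a difference like 0.6076 vs 0.2902.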

I made a simple high level example:

query = "How do I sell my shirt?"
texts = [
    "You can sell your shirt by going to the sell page and clicking the sell button.",
    "Ketchup is a condiment that is made from tomatoes.",
    "You can sell your apple in the store.",
    "How you can sell your clothes online."
]
pairs = [[query, text] for text in texts]

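For reference, the TEI side was queried through its /rerank route; a minimal sketch of the request (the localhost URL is an assumption for a local deployment):

```python
import json

# Assumed local TEI deployment; adjust host/port to your setup.
TEI_URL = "http://localhost:8080/rerank"

query = "How do I sell my shirt?"
texts = [
    "You can sell your shirt by going to the sell page and clicking the sell button.",
    "Ketchup is a condiment that is made from tomatoes.",
    "You can sell your apple in the store.",
    "How you can sell your clothes online.",
]

# raw_scores=False asks TEI to apply the sigmoid server-side.
payload = {"query": query, "texts": texts, "raw_scores": False}

# The actual call (not executed here) would be e.g.:
# import requests
# results = requests.post(TEI_URL, json=payload).json()
print(json.dumps(payload)[:60])
```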

# Transformers: 
[{'index': 0, 'score': 0.9871538},
 {'index': 3, 'score': 0.60760754},
 {'index': 2, 'score': 0.0041681635},
 {'index': 1, 'score': 2.5670563e-05}]

# TEI:
[{'index': 0, 'score': 0.9993761},
 {'index': 3, 'score': 0.29017562},
 {'index': 2, 'score': 0.0047737225},
 {'index': 1, 'score': 1.2219076e-05}]

Model

The model that I'm running: https://huggingface.co/tomaarsen/reranker-ModernBERT-large-gooaq-bce

{
  "model_id": "../models/toma-reranker",
  "model_sha": null,
  "model_dtype": "float16",
  "model_type": {
    "reranker": {
      "id2label": {
        "0": "LABEL_0"
      },
      "label2id": {
        "LABEL_0": 0
      }
    }
  },
  "max_concurrent_requests": 512,
  "max_input_length": 8192,
  "max_batch_tokens": 16384,
  "max_batch_requests": null,
  "max_client_batch_size": 32,
  "auto_truncate": false,
  "tokenization_workers": 32,
  "version": "1.7.0",
  "sha": null,
  "docker_label": null
}

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Steps to reproduce:

Transformers version - 4.51.3
TEI - Latest main branch
Python - 3.11.7

  1. Download the ModernBERT reranker model
  2. Run the CrossEncoder model from Transformers with the above model (float32)
  3. Build and run Dockerfile-cuda from TEI (no flash attention, so only the ModernBert path runs)
  4. Execute both on the example:
query = "How do I sell my shirt?"
texts = [
    "You can sell your shirt by going to the sell page and clicking the sell button.",
    "Ketchup is a condiment that is made from tomatoes.",
    "You can sell your apple in the store.",
    "How you can sell your clothes online."
]
pairs = [[query, text] for text in texts]
  5. Sort the Transformers results so we can compare with TEI:
# `response` holds the per-text scores returned by the model
transformers_sorted = sorted(
    ({"index": i, "score": score} for i, score in enumerate(response)),
    key=lambda x: x["score"],
    reverse=True
)
  6. Observe the results:
Transformers: 
[{'index': 0, 'score': 0.9871538},
 {'index': 3, 'score': 0.60760754},
 {'index': 2, 'score': 0.0041681635},
 {'index': 1, 'score': 2.5670563e-05}]
TEI:
[{'index': 0, 'score': 0.9993761},
 {'index': 3, 'score': 0.29017562},
 {'index': 2, 'score': 0.0047737225},
 {'index': 1, 'score': 1.2219076e-05}]
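
A quick check over the numbers above (copied verbatim) shows the ranking agrees while the absolute scores do not:

```python
# Score lists copied from the report above.
transformers_results = [
    {"index": 0, "score": 0.9871538},
    {"index": 3, "score": 0.60760754},
    {"index": 2, "score": 0.0041681635},
    {"index": 1, "score": 2.5670563e-05},
]
tei_results = [
    {"index": 0, "score": 0.9993761},
    {"index": 3, "score": 0.29017562},
    {"index": 2, "score": 0.0047737225},
    {"index": 1, "score": 1.2219076e-05},
]

# The ordering agrees...
same_order = ([r["index"] for r in transformers_results]
              == [r["index"] for r in tei_results])

# ...but the absolute scores diverge, most of all for index 3.
tei_by_index = {r["index"]: r["score"] for r in tei_results}
max_gap = max(abs(r["score"] - tei_by_index[r["index"]])
              for r in transformers_results)

print(same_order, max_gap)  # True, ~0.317
```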

Expected behavior

The results from Transformers and TEI should be similar.
