GPQA Accuracy Mismatch on Completions vs Responses API (GPT-4o August 2024) #89

@ziwenseal

Description

When reproducing the reported results on GPT-4o, I noticed that the Responses API and the Completions API produce accuracies that differ by more than the expected error bars, especially on gpt-4o-2024-08-06.

My results on GPT-4o:

  • Completions API: 53.0% (match)
  • Responses API: 49.2% (no match)
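
For context, a back-of-the-envelope significance check (a sketch that assumes GPQA Diamond's 198 questions and naively treats the 10 repeats as 1980 independent samples per run, which overstates the effective sample size somewhat) suggests the 3.8-point gap is outside ordinary sampling noise:

```python
import math

def two_prop_z(p1: float, p2: float, n1: int, n2: int) -> float:
    """Two-proportion z-statistic with a pooled standard error."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# 198 GPQA Diamond questions x 10 repeats = 1980 graded samples per run
# (an assumption; repeats of the same question are correlated, so the
# real error bars are somewhat wider than this estimate).
z = two_prop_z(0.530, 0.492, 1980, 1980)
print(f"z = {z:.2f}")  # roughly 2.4, beyond the 1.96 cutoff for p < 0.05
```

So even under this rough model the difference looks statistically significant rather than run-to-run variance.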

I changed only one line of code, swapping `ChatCompletionSampler` for `ResponsesSampler`:

```python
"gpt-4o-2024-08-06": ChatCompletionSampler(
```

Command: `python -m simple-evals.simple_evals --model gpt-4o-2024-08-06 --eval gpqa --n-repeats 10`
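For reproducibility, here is a sketch of the change in the model-to-sampler map in `simple_evals.py` (the sampler class names come from the repo, but the exact constructor arguments here are illustrative and may differ from the actual code):

```python
from sampler.chat_completion_sampler import ChatCompletionSampler
from sampler.responses_sampler import ResponsesSampler

models = {
    # original entry -- scores 53.0% on GPQA in my runs (matches reported)
    "gpt-4o-2024-08-06": ChatCompletionSampler(model="gpt-4o-2024-08-06"),
    # changed entry -- scores 49.2% (does not match):
    # "gpt-4o-2024-08-06": ResponsesSampler(model="gpt-4o-2024-08-06"),
}
```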
