
Evaluation Errors: Output Parsing and Timeout Issues for Context/Faithfulness Metrics #2044

@WangAo-0

Description

[x] I checked the documentation and related resources and couldn't find an answer to my question.

Your Question
I'm encountering issues when evaluating my RAG system with Ragas. The answer_correctness and answer_similarity metrics compute correctly, but other metrics such as context_recall, faithfulness, and context_precision consistently come back as nan. The evaluation also logs frequent RagasOutputParserException and TimeoutError errors.

I suspect this is related to the choice of LLM (a local model served via Ollama, qwen2.5:14b, initialized in init_ragas_ollama_components) or to how Ragas handles output parsing for this specific model.
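
For reference, init_ragas_ollama_components wraps the Ollama chat model and embeddings for Ragas roughly as follows (a simplified sketch, not the exact code; the args attribute names ollama_base_url and embed_model are placeholders):

    from langchain_ollama import ChatOllama, OllamaEmbeddings
    from ragas.llms import LangchainLLMWrapper
    from ragas.embeddings import LangchainEmbeddingsWrapper

    def init_ragas_ollama_components(args):
        # Chat model served locally by Ollama (qwen2.5:14b in my runs)
        llm = ChatOllama(model="qwen2.5:14b", base_url=args.ollama_base_url, temperature=0)
        # Embedding model used by the embedding-based metrics
        embeddings = OllamaEmbeddings(model=args.embed_model, base_url=args.ollama_base_url)
        # Wrap both so Ragas metrics can call them
        return LangchainEmbeddingsWrapper(embeddings), LangchainLLMWrapper(llm)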

Code Snippet

    import json
    from pathlib import Path

    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import (
        answer_correctness,
        answer_similarity,
        context_recall,
        faithfulness,
        context_precision,
    )

    with open(Path(args.base_dir) / "retrival_results.json", encoding="utf-8") as f:
        retrieval_results = json.load(f)

    embed_wrapper, llm_wrapper = init_ragas_ollama_components(args)
    metrics = [answer_correctness, answer_similarity, context_recall, faithfulness, context_precision]

    # Point every metric at the local LLM and embedding wrappers
    for metric in metrics:
        if hasattr(metric, "llm"):
            metric.llm = llm_wrapper
        if hasattr(metric, "embeddings"):
            metric.embeddings = embed_wrapper

    modes = ['naive', 'local', 'global'] if args.mode == 'all' else [args.mode]
    all_results = {}

    for mode in modes:
        samples = retrieval_results.get(mode, [])
        questions = []
        ground_truths = []
        predictions = []
        contexts = []

        for sample in samples:
            questions.append(sample.get("question", ""))
            ground_truths.append(sample.get("ground_truth", ""))
            predictions.append(sample.get("prediction", ""))
            contexts.append([sample.get("retrieval_context", "")])

        dataset = Dataset.from_dict({
            "question": questions,
            "ground_truth": ground_truths,
            "answer": predictions,
            "contexts": contexts
        })
        results = evaluate(dataset, metrics=metrics)
        all_results[mode] = results
Error Messages:

Evaluating: 10%|██████████████ | 5/50 [01:41<20:45, 27.69s/it]ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt context_recall_classification_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[12]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 18%|█████████████████████████▏ | 9/50 [02:48<10:41, 15.66s/it]ERROR:ragas.executor:Exception raised in Job[2]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[3]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[4]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[5]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[7]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[8]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[9]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[13]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[14]: TimeoutError()
Evaluating: 38%|████████████████████████████████████████████████████▊ | 19/50 [03:01<01:37, 3.13s/it]ERROR:ragas.executor:Exception raised in Job[17]: TimeoutError()
Evaluating: 42%|██████████████████████████████████████████████████████████▍ | 21/50 [03:02<01:14, 2.56s/it]ERROR:ragas.executor:Exception raised in Job[18]: TimeoutError()
Evaluating: 46%|███████████████████████████████████████████████████████████████▉ | 23/50 [03:03<00:57, 2.12s/it]ERROR:ragas.executor:Exception raised in Job[19]: TimeoutError()
Evaluating: 48%|██████████████████████████████████████████████████████████████████▋ | 24/50 [03:04<00:49, 1.92s/it]ERROR:ragas.executor:Exception raised in Job[20]: TimeoutError()
Evaluating: 50%|█████████████████████████████████████████████████████████████████████▌ | 25/50 [04:41<07:17, 17.52s/it]ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt context_recall_classification_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[32]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 54%|███████████████████████████████████████████████████████████████████████████ | 27/50 [05:24<06:47, 17.74s/it]ERROR:ragas.executor:Exception raised in Job[22]: TimeoutError()
Evaluating: 56%|█████████████████████████████████████████████████████████████████████████████▊ | 28/50 [05:33<05:47, 15.77s/it]ERROR:ragas.executor:Exception raised in Job[23]: TimeoutError()
Evaluating: 58%|████████████████████████████████████████████████████████████████████████████████▌ | 29/50 [05:46<05:15, 15.04s/it]ERROR:ragas.executor:Exception raised in Job[24]: TimeoutError()
Evaluating: 60%|███████████████████████████████████████████████████████████████████████████████████▍ | 30/50 [05:48<03:52, 11.63s/it]ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt context_precision_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[34]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 64%|████████████████████████████████████████████████████████████████████████████████████████▉ | 32/50 [05:55<02:15, 7.53s/it]ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt context_precision_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[29]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 66%|███████████████████████████████████████████████████████████████████████████████████████████▋ | 33/50 [05:58<01:46, 6.27s/it]ERROR:ragas.executor:Exception raised in Job[25]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[27]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[28]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[30]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[33]: TimeoutError()
Evaluating: 68%|██████████████████████████████████████████████████████████████████████████████████████████████▌ | 34/50 [06:00<01:16, 4.78s/it]ERROR:ragas.executor:Exception raised in Job[35]: TimeoutError()
Evaluating: 78%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 39/50 [06:01<00:19, 1.79s/it]ERROR:ragas.executor:Exception raised in Job[37]: TimeoutError()
Evaluating: 80%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 40/50 [06:02<00:16, 1.60s/it]ERROR:ragas.executor:Exception raised in Job[38]: TimeoutError()
Evaluating: 82%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 41/50 [06:03<00:13, 1.51s/it]ERROR:ragas.executor:Exception raised in Job[39]: TimeoutError()
Evaluating: 86%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 43/50 [06:16<00:27, 3.92s/it]ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt context_precision_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[49]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 88%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 44/50 [07:52<02:40, 26.72s/it]ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt n_l_i_statement_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[43]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 90%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 45/50 [08:17<02:11, 26.36s/it]ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt context_recall_classification_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[47]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 94%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 47/50 [08:24<00:46, 15.39s/it]ERROR:ragas.executor:Exception raised in Job[42]: TimeoutError()
Evaluating: 96%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 48/50 [08:24<00:22, 11.13s/it]ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt n_l_i_statement_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[48]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 98%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 49/50 [08:31<00:09, 9.88s/it]ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt context_precision_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[44]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [08:31<00:00, 10.23s/it]
{'answer_correctness': 0.7917, 'semantic_similarity': 0.7538, 'context_recall': nan, 'faithfulness': nan, 'context_precision': nan}
INFO:main:=== Evaluation finished ===

Additional context:

  • I am using a local LLM served via Ollama (initialized in init_ragas_ollama_components above).
  • The errors suggest the LLM might not be returning output in the expected format for Ragas' parsers, or it might be timing out during the evaluation process for certain metrics.
  • answer_correctness and answer_similarity rely mostly on embeddings and simpler LLM calls, while context_recall, faithfulness, and context_precision require the LLM to perform more complex reasoning and to generate structured output (e.g. JSON). This difference in required LLM capability might explain why some metrics work and others fail.
  • Could there be specific prompt or parsing issues with certain local LLMs? Or are there parameters I need to adjust for timeouts or parser robustness? (See the sketch after this list for the kind of adjustment I have in mind.)
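
This is roughly the adjustment I have been considering; a minimal sketch based on my reading of RunConfig, assuming evaluate accepts a run_config and that these field names (timeout, max_retries, max_wait, max_workers) are current. The specific values are guesses for a single local Ollama instance:

    from ragas import evaluate
    from ragas.run_config import RunConfig

    # Give the local model more time per call and limit concurrency so a
    # single Ollama instance is not overloaded into TimeoutErrors.
    run_config = RunConfig(
        timeout=300,     # seconds allowed per LLM call
        max_retries=10,  # retries for transient/parsing failures
        max_wait=60,     # maximum backoff between retries
        max_workers=2,   # concurrent requests against the local model
    )

    results = evaluate(dataset, metrics=metrics, run_config=run_config)

If this is not the intended knob, a pointer to the right configuration would help.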
