fix flaky integration tests #875

@mfaizanse

Description

The following integration tests are flaky (a sketch of the failing metric setup follows the list):

  • tests/integration/agents/kyma/test_kyma_agent_tool_accuracy.py::test_kyma_agent_namespace_scoped[test_case2] - AssertionError: Should not call kyma doc search tool as Function is successfully deployed: Tool call accuracy (0.00) is below the threshold of 0.5

  • tests/integration/agents/kyma/test_kyma_agent_tool_accuracy.py::test_kyma_agent_namespace_scoped[test_case1] - AssertionError: Should not call kyma doc search tool as Function has JavaScript issue: Tool call accuracy (0.00) is below the threshold of 0.5

  • tests/integration/agents/test_gatekeeper_feedback.py::test_invoke_feedback_node[user is still confused-messages11-True] - AssertionError: Expected True but got False for test case: user is still confused

  • tests/integration/agents/test_gatekeeper_node.py::test_invoke_gatekeeper_node[Query about self capabilities.-messages35-I can help with various queries related to Kyma and Kubernetes, such as troubleshooting issues, understanding concepts, or deployment processes. For example, if you encounter an error while deploying a service in Kyma, I can guide you through the troubleshooting steps. If you have a specific problem or scenario in mind, feel free to share it, and I'll do my best to assist you!-False] - AssertionError: Metrics: Semantic Similarity [GEval] (score: 0.4, threshold: 0.7, strict: False, error: None, reason: The actual output conveys a similar overall purpose of assisting with Kyma and Kubernetes queries, including troubleshooting, but it introduces additional information about the assistant's identity, security features, and escalation protocols not present in the expected output. The core semantic intent aligns, but the lack of identical phrasing and added content reduces the score.) failed.

  • tests/integration/agents/k8s/test_k8s_agent.py::test_tool_calls[state0-k8s_query_tool] - AssertionError: Expected at least one tool call

  • tests/integration/agents/supervisor/test_finalizer.py::test_generate_final_response[case 3: Finalizer do not answer the question itself-messages2-\nI'm sorry, but the agents did not provide any information regarding what Kubernetes (k8s) is.\nTherefore, I cannot offer any specific information on this topic.\nIf you have any other questions or need assistance with a different subject, feel free to ask!\n] - AssertionError: Metrics: Semantic Similarity [GEval] (score: 0.2, threshold: 0.7, strict: False, error: None, reason: The actual output diverges significantly from the expected output by providing additional information about Joule and the Kyma dashboard, which is not present in the expected response. The expected output clearly states no information is available about Kubernetes (k8s), while the actual output attempts to explain related functionality, thus failing to maintain semantic similarity. There is no code block in either output, so no penalty there, but the main idea differs, leading to a low score.) failed.

  • tests/integration/agents/supervisor/test_finalizer.py::test_generate_final_response[case 4: Finalizer do not answer the question itself-messages3-The agents did not provide specific information regarding\n the types of Kubernetes services or the definition of a Kyma function.\n Therefore, I cannot offer any specific information on this topic.\n If you have any other questions or need assistance with a different subject, feel free to ask!\n] - AssertionError: Metrics: Semantic Similarity [GEval] (score: 0.6, threshold: 0.7, strict: False, error: None, reason: The actual output and expected output convey the same core meaning that no specific information was provided about Kubernetes services or Kyma functions. However, the actual output omits the invitation to ask other questions and replaces the more personal tone with a suggestion to consult external documentation. There are no code blocks to compare, so no penalties there. The difference in phrasing and omission of the offer for further assistance results in a moderate penalty.) failed.
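For context, most of these failures are a Semantic Similarity [GEval] score or a tool call accuracy score landing below its threshold. Below is a minimal sketch of how such a GEval check is typically wired up with deepeval; the criteria string, threshold, and test inputs are illustrative assumptions, not this repository's actual test code.

```python
# Minimal sketch of a deepeval GEval semantic-similarity check.
# ASSUMPTION: the criteria text, threshold, and test inputs below are
# illustrative; they do not mirror this repository's actual tests.
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

semantic_similarity = GEval(
    name="Semantic Similarity",
    criteria=(
        "Determine whether the actual output conveys the same meaning "
        "as the expected output."
    ),
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.7,  # scores below this fail, as in the reports above
)


def test_semantic_similarity_example():
    test_case = LLMTestCase(
        input="What is a Kyma function?",
        actual_output="A Kyma function is a serverless workload...",
        expected_output="A Kyma function is a piece of serverless code...",
    )
    # Raises AssertionError when the GEval score is below the threshold.
    assert_test(test_case, [semantic_similarity])
```

Because the score comes from an LLM judge, small wording differences in the agent's answer can swing it across the threshold, which is one likely source of the flakiness.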

Task(s):

  • Stabilize the above-mentioned integration tests, for example by adjusting the evaluation criteria (a possible stopgap is sketched after this list).
  • Prefer adjusting the agent prompts where needed, instead of changing the tests.
  • Open a follow-up ticket for any test that needs code refactoring.
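While the evaluation criteria and prompts are being tuned, one common stopgap is to rerun known-flaky tests a few times before reporting failure. The sketch below uses the pytest-rerunfailures plugin; whether that plugin is, or should become, a dependency of this repository is an assumption.

```python
# Stopgap sketch: rerun a known-flaky test before failing the build.
# ASSUMPTION: pytest-rerunfailures is installed
# (pip install pytest-rerunfailures); it is not confirmed to be part
# of this repository's tooling.
import pytest


@pytest.mark.flaky(reruns=3, reruns_delay=2)
def test_kyma_agent_namespace_scoped_example():
    # Placeholder standing in for one of the flaky tests listed above.
    ...
```

Reruns mask nondeterminism rather than remove it, so this fits only as a temporary measure alongside the prompt and criteria work above.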
