Conformance Test for AutoML and Training Working Group

Andrey Velichkevich (@andreyvelich), Johnu George (@johnugeorge), 2022-11-21. Original Google Doc.

Motivation

The Kubeflow community needs to design a conformance program so that distributions can become Certified Kubeflow. Recently, the Kubeflow Pipelines Working Group (WG) implemented the first version of its conformance tests. We should design a similar program for the AutoML and Training WG.

This document is based on the original proposal for the Kubeflow Pipelines conformance program.

Objective

The conformance program for the AutoML and Training WG should follow the same goals as the Pipelines program:

The tests should be fully automated and executable by anyone who has public access to the Kubeflow repository.
The test results should be easy to verify by the Kubeflow Conformance Committee.
The tests should not depend on a specific cloud provider (e.g. AWS or GCP).
The tests should cover the basic functionality of Katib and the Training Operator. They will not cover all features.
The tests are expected to evolve in future versions.
The tests should have a short, well-documented list of set-up requirements.
The tests should install and complete in a relatively short period of time on the suggested minimum infrastructure (e.g. 3 nodes, 24 vCPU, 64 GB RAM, 500 GB Disk).

Kubeflow Conformance

Initially, Kubeflow conformance will include only the CRD-based tests. API-based and UI-based tests may be added in the future. Kubeflow conformance consists of three categories of tests:

CRD-based tests

Most Katib and Training Operator functionality is based on Kubernetes CRDs.

This document will define a design for CRD-based tests for Katib and the Training Operator.

API-based tests

Currently, neither Katib nor the Training Operator has an API server that receives requests from users. However, Katib has the DB Manager component, which is responsible for writing and reading ML training metrics.

In future versions, we should design a conformance program for the Katib API-based tests.

UI-based tests

UI tests are valuable but complex to design, document, and execute. In future versions, we should design a conformance program for the Katib UI-based tests.

Design for the CRD-based tests

The design is similar to the KFP conformance program for the API-based tests.

For Katib, tests will be based on the run-e2e-experiment.go script that we run for our e2e tests.

This script will be converted to use the Katib SDK. Tracking issue: kubeflow#2024.

For the Training Operator, tests will be based on the SDK e2e test.
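
To illustrate what a CRD-based check amounts to, the sketch below polls a custom resource until its `status.conditions` list reports a terminal condition. It is a minimal, hypothetical example: `terminal_condition`, `wait_for_terminal_condition`, and the `get_resource` callable are names invented here, and the real tests are expected to use the Katib and Training Operator SDKs instead. It assumes that the resource follows the condition convention used by Katib Experiments and Training Operator jobs, where a condition carries a `type` such as `Succeeded` or `Failed` and a `status` of `"True"`.

```python
import time
from typing import Callable, Optional


def terminal_condition(resource: dict) -> Optional[str]:
    """Return "Succeeded" or "Failed" if that condition is set to "True", else None."""
    # Assumption: conditions live under status.conditions, as in Katib and Training Operator CRs.
    conditions = (resource.get("status") or {}).get("conditions") or []
    for condition in conditions:
        if condition.get("type") in ("Succeeded", "Failed") and condition.get("status") == "True":
            return condition["type"]
    return None


def wait_for_terminal_condition(
    get_resource: Callable[[], dict],
    timeout_seconds: float = 600.0,
    poll_interval_seconds: float = 5.0,
) -> str:
    """Poll get_resource() until the resource succeeds or fails, or the timeout expires.

    get_resource is a hypothetical hook: any callable that returns the current
    custom resource as a dict (for example, a wrapper around an SDK "get" call).
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        result = terminal_condition(get_resource())
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("resource did not reach Succeeded or Failed in time")
        time.sleep(poll_interval_seconds)
```

Passing the getter in as a callable keeps the polling logic independent of any particular client, so the same loop could serve both the Katib and the Training Operator tests.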

Test Workflow

All tests will run in the kf-conformance namespace inside a separate container. This helps to avoid environment variance and improves fault tolerance. A driver is required to trigger the deployment and download the results.

We are going to use a unified Makefile for all Kubeflow conformance tests. Distributions (the driver on the diagram) need to run the following Makefile commands:

# None of these targets produces a file, so declare them phony.
.PHONY: run setup run-katib run-training-operator report cleanup

# Run the conformance program.
run: setup run-katib run-training-operator

# Set up the Kubernetes resources (Kubeflow Profile, RBAC) that the tests need.
# Create a temporary folder for the conformance report.
setup:
	kubectl apply -f ./setup.yaml
	mkdir -p /tmp/kf-conformance

# Create the deployments that run the e2e tests for Katib and the Training Operator.
run-katib:
	kubectl apply -f ./katib-conformance.yaml

run-training-operator:
	kubectl apply -f ./training-operator-conformance.yaml

# Download the test deployment results to create a PR for the Kubeflow Conformance Committee.
report:
	./report-conformance.sh

# Clean up the created resources and directories.
# The test deployments are deleted before setup.yaml, so that removing the
# shared set-up resources first cannot make the later deletes fail.
cleanup:
	kubectl delete -f ./katib-conformance.yaml
	kubectl delete -f ./training-operator-conformance.yaml
	kubectl delete -f ./setup.yaml
	rm -rf /tmp/kf-conformance

The Katib and Training Operator conformance deployments will have the appropriate RBAC permissions to create, read, and delete Katib Experiments and Training Operator jobs in the kf-conformance namespace.
The distribution should have internet access to download the training datasets (e.g. MNIST) while running the tests.

When the job is finished, the script generates output.

For a Katib Experiment, the output should be as follows:

Test 1 - passed.
Experiment name: random-search
Experiment status: Experiment has succeeded because max trial count has reached

For the Training Operator, the output should be as follows:

Test 1 - passed.
TFJob name: tfjob-mnist
TFJob status: TFJob tfjob-mnist is successfully completed.
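
As a rough illustration of how such a report could be produced and checked, the sketch below formats one test result in the layout shown above and parses a report back to confirm that every test passed. The `ConformanceResult` type and the `format_result` and `all_tests_passed` functions are hypothetical names used for this example only. The `failed` wording is also an assumption, since this proposal shows only passing output.

```python
import re
from dataclasses import dataclass


@dataclass
class ConformanceResult:
    index: int    # test number, starting at 1
    passed: bool
    kind: str     # resource kind, e.g. "Experiment" or "TFJob"
    name: str     # resource name, e.g. "random-search"
    status: str   # human-readable status message of the resource


def format_result(result: ConformanceResult) -> str:
    """Render one result in the three-line layout used by the report."""
    outcome = "passed" if result.passed else "failed"  # "failed" is an assumed wording
    return "\n".join([
        f"Test {result.index} - {outcome}.",
        f"{result.kind} name: {result.name}",
        f"{result.kind} status: {result.status}",
    ])


# Matches header lines such as "Test 1 - passed."
_HEADER = re.compile(r"^Test (\d+) - (passed|failed)\.$")


def all_tests_passed(report: str) -> bool:
    """Return True only if the report has at least one test and none failed."""
    outcomes = []
    for line in report.splitlines():
        match = _HEADER.match(line.strip())
        if match:
            outcomes.append(match.group(2))
    return bool(outcomes) and all(outcome == "passed" for outcome in outcomes)
```

An empty report counts as a failure here, which avoids treating a test run that produced no output as conformant.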

The above report can be downloaded from the test deployment by running make report.
When all reports have been collected, the distribution creates a PR to publish the reports and to update the Kubeflow documentation on conformant Kubeflow distributions. The Kubeflow Conformance Committee will verify the PR and designate the distribution as Certified Kubeflow.
