Andrey Velichkevich (@andreyvelich), Johnu George (@johnugeorge), 2022-11-21. Original Google Doc.
The Kubeflow community needs to design a conformance program so that distributions can become Certified Kubeflow. Recently, the Kubeflow Pipelines Working Group (WG) implemented the first version of its conformance tests. We should design the same kind of program for the AutoML and Training WGs.
This document is based on the original proposal for the Kubeflow Pipelines conformance program.
The conformance program for the AutoML and Training WGs should follow the same goals as the Pipelines program:
- The tests should be fully automated and executable by anyone who has public access to the Kubeflow repository.
- The test results should be easy to verify by the Kubeflow Conformance Committee.
- The tests should not depend on a cloud provider (e.g. AWS or GCP).
- The tests should cover the basic functionality of Katib and the Training Operator. They will not cover all features.
- The tests are expected to evolve in future versions.
- The tests should have a short, well-documented list of set-up requirements.
- The tests should install and complete in a relatively short period of time with suggested minimum infrastructure requirements (e.g. 3 nodes, 24 vCPU, 64 GB RAM, 500 GB Disk).
Initially, Kubeflow conformance will include the CRD-based tests. In the future, API-based and UI-based tests may be added. Kubeflow conformance consists of three categories of tests:
- CRD-based tests

  Most of the Katib and Training Operator functionality is based on Kubernetes CRDs. This document defines a design for CRD-based tests for Katib and the Training Operator.

- API-based tests

  Currently, neither Katib nor the Training Operator has an API server that receives requests from users. However, Katib has the DB Manager component, which is responsible for writing and reading ML training metrics. In future versions, we should design a conformance program for the Katib API-based tests.

- UI-based tests

  UI tests are valuable but complex to design, document, and execute. In future versions, we should design a conformance program for the Katib UI-based tests.
The design is similar to the KFP conformance program for the API-based tests.
For Katib, tests will be based on the run-e2e-experiment.go script that we run for our e2e tests. This script will be converted to use the Katib SDK. Tracking issue: kubeflow#2024.
For the Training Operator, tests will be based on the SDK e2e test.
All tests will run in the kf-conformance namespace inside a separate container. This helps to avoid environment variance and improves fault tolerance. A driver is required to trigger the deployment and download the results.
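The core of such a test, for a Katib Experiment or a Training Operator job alike, is to create the resource and wait until it reaches a terminal condition. The following is a minimal sketch of that wait loop, not the actual conformance code. The names `wait_for_succeeded` and `fetch_conditions` are hypothetical, and the sketch assumes the resource reports progress through a list of conditions with `type`, `status`, and `message` fields. The function that fetches the conditions is passed in, so the loop does not depend on any particular SDK call.

```python
import time
from typing import Callable, Dict, List

Condition = Dict[str, str]


def wait_for_succeeded(
    fetch_conditions: Callable[[], List[Condition]],
    timeout_seconds: float = 600.0,
    poll_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Condition:
    """Poll until a Succeeded condition is True; raise on Failed or on timeout.

    fetch_conditions is a hypothetical callable that returns the current
    status conditions of the resource under test.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        for condition in fetch_conditions():
            # Only conditions that are currently true are relevant.
            if condition.get("status") != "True":
                continue
            if condition.get("type") == "Succeeded":
                return condition
            if condition.get("type") == "Failed":
                raise RuntimeError(condition.get("message", "resource failed"))
        if time.monotonic() >= deadline:
            raise TimeoutError("resource did not succeed before the timeout")
        sleep(poll_seconds)


if __name__ == "__main__":
    # Fake status snapshots stand in for a real cluster in this demo.
    snapshots = iter([
        [{"type": "Created", "status": "True"}],
        [{"type": "Running", "status": "True"}],
        [{"type": "Succeeded", "status": "True", "message": "demo finished"}],
    ])
    final = wait_for_succeeded(lambda: next(snapshots), sleep=lambda _: None)
    print(final["message"])
```

Injecting `fetch_conditions` and `sleep` keeps the loop testable without a cluster. In a real test, the callable would read the conditions of the Experiment or TFJob through whichever client the test uses.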
- We are going to use a unified Makefile for all Kubeflow conformance tests. Distributions (the driver in the diagram) need to run the following Makefile commands:

      # Run the conformance program.
      run: setup run-katib run-training-operator

      # Set up the Kubernetes resources (Kubeflow Profile, RBAC) needed to run the tests.
      # Create a temporary folder for the conformance report.
      setup:
      	kubectl apply -f ./setup.yaml
      	mkdir -p /tmp/kf-conformance

      # Create the deployments and run the e2e tests for Katib and the Training Operator.
      run-katib:
      	kubectl apply -f ./katib-conformance.yaml

      run-training-operator:
      	kubectl apply -f ./training-operator-conformance.yaml

      # Download the test deployment results to create a PR for the Kubeflow Conformance Committee.
      report:
      	./report-conformance.sh

      # Clean up the created resources and directories.
      cleanup:
      	kubectl delete -f ./setup.yaml
      	kubectl delete -f ./katib-conformance.yaml
      	kubectl delete -f ./training-operator-conformance.yaml
      	rm -rf /tmp/kf-conformance

- The Katib and Training Operator conformance deployments will have the appropriate RBAC permissions to create, read, and delete Katib Experiments and Training Operator jobs in the kf-conformance namespace.
- The distribution should have access to the internet to download the training datasets (e.g. MNIST) while the tests are running.
- When the job is finished, the script generates output.

  For a Katib Experiment, the output should be as follows:

      Test 1 - passed.
      Experiment name: random-search
      Experiment status: Experiment has succeeded because max trial count has reached

  For the Training Operator, the output should be as follows:

      Test 1 - passed.
      TFJob name: tfjob-mnist
      TFJob status: TFJob tfjob-mnist is successfully completed.

- The above report can be downloaded from the test deployment by running make report.
- When all reports have been collected, the distribution creates a PR to publish the reports and to update the Kubeflow documentation on conformant Kubeflow distributions. The Kubeflow Conformance Committee will verify the PR and make the distribution Certified Kubeflow.
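The report layout shown above can be produced and checked with a few lines of code. The following is a minimal sketch under stated assumptions, not the actual report script. The names `format_report` and `all_tests_passed` are hypothetical. The sketch assumes that the status line is the message of the resource's Succeeded condition, and that a failing test would be reported as "failed". This document specifies only the passing output.

```python
import re
from typing import Dict, List


def format_report(
    test_number: int, kind: str, name: str, conditions: List[Dict[str, str]]
) -> str:
    """Render one report entry in the three-line layout shown above."""
    succeeded = next(
        (
            c
            for c in conditions
            if c.get("type") == "Succeeded" and c.get("status") == "True"
        ),
        None,
    )
    # Assumption: the wording for a failing test is not defined in this document.
    verdict = "passed" if succeeded else "failed"
    message = succeeded.get("message", "") if succeeded else "no Succeeded condition found"
    return (
        f"Test {test_number} - {verdict}.\n"
        f"{kind} name: {name}\n"
        f"{kind} status: {message}"
    )


def all_tests_passed(report: str) -> bool:
    """Return True if the report has at least one test line and every one passed."""
    verdicts = re.findall(r"^Test \d+ - (\w+)\.$", report, flags=re.MULTILINE)
    return bool(verdicts) and all(verdict == "passed" for verdict in verdicts)


if __name__ == "__main__":
    entry = format_report(
        1,
        "TFJob",
        "tfjob-mnist",
        [
            {
                "type": "Succeeded",
                "status": "True",
                "message": "TFJob tfjob-mnist is successfully completed.",
            }
        ],
    )
    print(entry)
    print(all_tests_passed(entry))
```

A check like `all_tests_passed` could give reviewers a quick first pass over a submitted report. It does not replace the review by the Kubeflow Conformance Committee.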