Katib runs suggestions as long-running services in v1alpha3. These services need to communicate with the Katib DB manager to fetch experiments and trials through the Katib DB driver, which hurts high availability.
We therefore propose a new design that implements a CRD for suggestions and removes Katib DB communication from the main workflow. The new design simplifies the implementation of the experiment and trial controllers and makes Katib Kubernetes native.
This document describes the details of the new design.
- Propose the Suggestion CRD.
- Propose a new gRPC API for the Suggestion service.
- Suggest the approaches to implement suggestion algorithms.
- Metrics collection (See Metrics Collector Design Document)
- Database-related refactor
// SuggestionSpec defines the desired state of Suggestion
type SuggestionSpec struct {
	AlgorithmName string `json:"algorithmName"`
	// Number of suggestions requested
	Requests int32 `json:"requests,omitempty"`
}

// SuggestionStatus defines the observed state of Suggestion
type SuggestionStatus struct {
	// Algorithm settings set by the algorithm services.
	AlgorithmSettings []common.AlgorithmSetting `json:"algorithmSettings,omitempty"`
	// Number of suggestion results
	SuggestionCount int32 `json:"suggestionCount,omitempty"`
	// Suggestion results
	Suggestions []TrialAssignment `json:"suggestions,omitempty"`
}

// TrialAssignment is the assignment for one trial.
type TrialAssignment struct {
	// Suggestion results
	ParameterAssignments []common.ParameterAssignment `json:"parameterAssignments,omitempty"`
	// Name of the suggestion
	Name string `json:"name,omitempty"`
}
syntax = "proto3";

package api.v1.alpha3;

import "google/api/annotations.proto";

service Suggestion {
    rpc GetSuggestions(GetSuggestionsRequest) returns (GetSuggestionsReply);
}

message GetSuggestionsRequest {
    Experiment experiment = 1;
    repeated Trial trials = 2; // All completed trials owned by the experiment.
    int32 request_number = 3; /// The number of suggestions requested at one time. Setting request_number to 3 returns three suggestions.
}

message GetSuggestionsReply {
    message ParameterAssignments {
        repeated ParameterAssignment assignments = 1;
    }
    repeated ParameterAssignments parameter_assignments = 1;
    AlgorithmSpec algorithm = 2;
}

message Experiment {
    string name = 1;
    ExperimentSpec experiment_spec = 2;
}

message ExperimentSpec {
    AlgorithmSpec algorithm = 3;
    ParameterSpecs parameter_specs = 1;
    ObjectiveSpec objective = 2;
}

message ParameterSpecs {
    repeated ParameterSpec parameters = 1;
}

message AlgorithmSpec {
    string algorithm_name = 1;
    repeated AlgorithmSetting algorithm_settings = 2;
}

message AlgorithmSetting {
    string name = 1;
    string value = 2;
}

message ParameterSpec {
    string name = 1; /// Name of the parameter.
    ParameterType parameter_type = 2; /// Type of the parameter.
    FeasibleSpace feasible_space = 3; /// FeasibleSpace for the parameter.
}

message FeasibleSpace {
    string max = 1; /// Maximum value.
    string min = 2; /// Minimum value.
    repeated string list = 3; /// List of values.
    string step = 4; /// Step for a double or int parameter.
}

enum ParameterType {
    UNKNOWN_TYPE = 0; /// Undefined type and not used.
    DOUBLE = 1; /// Double float type. Use "Max/Min".
    INT = 2; /// Int type. Use "Max/Min".
    DISCRETE = 3; /// Discrete number type. Use "List" as float.
    CATEGORICAL = 4; /// Categorical type. Use "List" as string.
}

enum ObjectiveType {
    UNKNOWN = 0; /// Undefined type and not used.
    MINIMIZE = 1; /// Minimize
    MAXIMIZE = 2; /// Maximize
}

message ObjectiveSpec {
    ObjectiveType type = 1;
    double goal = 2;
    string objective_metric_name = 3;
}

message Trial {
    string name = 1;
    TrialSpec spec = 2;
    TrialStatus status = 3;
}

message TrialSpec {
    ParameterAssignments parameter_assignments = 2;
    string run_spec = 3;
}

message ParameterAssignments {
    repeated ParameterAssignment assignments = 1;
}

message ParameterAssignment {
    string name = 1;
    string value = 2;
}

message TrialStatus {
    Observation observation = 4; // The best observation in logs.
}

message Observation {
    repeated Metric metrics = 1;
}

message Metric {
    string name = 1;
    string value = 2;
}
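Because `FeasibleSpace` carries `max`, `min`, and `step` as strings, a suggestion service has to parse them before it can search the space. The following is a minimal Go sketch of that step for an `INT` parameter. The `FeasibleSpace` struct is a hand-written mirror of the message above rather than generated code, `intRange` is a hypothetical helper, and treating an empty `step` as 1 is an assumption, not something the API specifies.

```go
package main

import (
	"fmt"
	"strconv"
)

// FeasibleSpace is a hand-written mirror of the proto message above.
// All bounds arrive as strings.
type FeasibleSpace struct {
	Max  string
	Min  string
	List []string
	Step string
}

// intRange is a hypothetical helper: it parses Min, Max and Step of an INT
// parameter and returns every value from Min up to Max (inclusive).
// Assumption: an empty Step means a step of 1.
func intRange(fs FeasibleSpace) ([]int64, error) {
	lo, err := strconv.ParseInt(fs.Min, 10, 64)
	if err != nil {
		return nil, fmt.Errorf("invalid min %q: %w", fs.Min, err)
	}
	hi, err := strconv.ParseInt(fs.Max, 10, 64)
	if err != nil {
		return nil, fmt.Errorf("invalid max %q: %w", fs.Max, err)
	}
	step := int64(1)
	if fs.Step != "" {
		step, err = strconv.ParseInt(fs.Step, 10, 64)
		if err != nil {
			return nil, fmt.Errorf("invalid step %q: %w", fs.Step, err)
		}
	}
	if step <= 0 || lo > hi {
		return nil, fmt.Errorf("invalid feasible space: min=%d max=%d step=%d", lo, hi, step)
	}
	var values []int64
	for v := lo; v <= hi; v += step {
		values = append(values, v)
	}
	return values, nil
}

func main() {
	values, err := intRange(FeasibleSpace{Min: "2", Max: "5"})
	if err != nil {
		panic(err)
	}
	fmt.Println(values) // prints [2 3 4 5]
}
```

Returning an error for malformed bounds keeps validation inside the service, so a bad experiment definition surfaces as a failed request instead of an empty search space.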
When the user creates an Experiment, we create a Suggestion for the Experiment. When the Experiment needs some suggestions, the Experiment controller updates the Suggestion's `requests`, then the Suggestion controller communicates with the Suggestion service to get parameter assignments and sets them in the Suggestion status.
The following example illustrates the workflow.
apiVersion: ""
kind: Experiment
metadata:
  namespace: kubeflow
  name: random-experiment
spec:
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: Validation-accuracy
    additionalMetricNames:
      - accuracy
  algorithm:
    algorithmName: random
  trialTemplate:
    goTemplate:
      rawTemplate: |-
        apiVersion: batch/v1
        kind: Job
        metadata:
          name: {{.Trial}}
          namespace: {{.NameSpace}}
        spec:
          template:
            spec:
              containers:
              - name: {{.Trial}}
                image: katib/mxnet-mnist-example
                command:
                - "python"
                - "/mxnet/example/image-classification/"
                - "--batch-size=64"
                {{- with .HyperParameters}}
                {{- range .}}
                - "{{.Name}}={{.Value}}"
                {{- end}}
                {{- end}}
              restartPolicy: Never
  parameters:
    - name: --lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.03"
    - name: --num-layers
      parameterType: int
      feasibleSpace:
        min: "2"
        max: "5"
    - name: --optimizer
      parameterType: categorical
      feasibleSpace:
        list:
        - sgd
        - adam
        - ftrl
The Experiment controller then needs 3 trials to run in parallel, so it creates the Suggestion:
apiVersion: ""
kind: Suggestion
metadata:
  namespace: kubeflow
  name: random-experiment
spec:
  algorithmName: random
  requests: 3
After that, the Suggestion controller communicates with the Suggestion service via gRPC and updates the status:
apiVersion: ""
kind: Suggestion
metadata:
  namespace: kubeflow
  name: random-experiment
spec:
  algorithmName: random
  requests: 3
status:
  suggestions:
    - assignments:
      - name: --lr
        value: 0.02
      - name: --num-layers
        value: 4
      - name: --optimizer
        value: sgd
    - assignments:
      - name: --lr
        value: 0.021
      - name: --num-layers
        value: 3
      - name: --optimizer
        value: adam
    - assignments:
      - name: --lr
        value: 0.03
      - name: --num-layers
        value: 5
      - name: --optimizer
        value: adam
The Experiment controller then creates the trials. When one trial finishes, the Experiment controller asks the Suggestion controller for a new suggestion by increasing `requests`:
apiVersion: ""
kind: Suggestion
metadata:
  namespace: kubeflow
  name: random-experiment
spec:
  algorithmName: random
  requests: 4
status:
  suggestions:
    - assignments:
      - name: --lr
        value: 0.02
      - name: --num-layers
        value: 4
      - name: --optimizer
        value: sgd
    - assignments:
      - name: --lr
        value: 0.021
      - name: --num-layers
        value: 3
      - name: --optimizer
        value: adam
    - assignments:
      - name: --lr
        value: 0.03
      - name: --num-layers
        value: 5
      - name: --optimizer
        value: adam
    - assignments:
      - name: --lr
        value: 0.012
      - name: --num-layers
        value: 4
      - name: --optimizer
        value: adam
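The bookkeeping in the example above, where `requests` moves from 3 to 4 and the status grows by one entry, can be sketched in Go as follows. This is a minimal sketch under stated assumptions, not the controller's actual code: the flattened `Suggestion` struct, the `suggestFunc` type, and `fakeService` are hypothetical stand-ins for the CRD and for the `GetSuggestions` gRPC call, and asking the service only for the difference between `requests` and `suggestionCount` is an assumption.

```go
package main

import "fmt"

// ParameterAssignment and TrialAssignment mirror the CRD types above.
type ParameterAssignment struct{ Name, Value string }

type TrialAssignment struct {
	Name                 string
	ParameterAssignments []ParameterAssignment
}

// Suggestion flattens spec and status into one struct for brevity.
type Suggestion struct {
	Requests        int32             // spec.requests
	SuggestionCount int32             // status.suggestionCount
	Suggestions     []TrialAssignment // status.suggestions
}

// suggestFunc is a hypothetical stand-in for the GetSuggestions gRPC call.
type suggestFunc func(n int32) ([]TrialAssignment, error)

// reconcile asks the service only for the suggestions that are still
// missing and appends them to the status. It is a no-op when the status
// already satisfies the request.
func reconcile(s *Suggestion, get suggestFunc) error {
	missing := s.Requests - s.SuggestionCount
	if missing <= 0 {
		return nil
	}
	got, err := get(missing)
	if err != nil {
		return err
	}
	s.Suggestions = append(s.Suggestions, got...)
	s.SuggestionCount = int32(len(s.Suggestions))
	return nil
}

// fakeService returns n placeholder assignments instead of calling gRPC.
func fakeService(n int32) ([]TrialAssignment, error) {
	out := make([]TrialAssignment, n)
	for i := range out {
		out[i].ParameterAssignments = []ParameterAssignment{{Name: "--lr", Value: "0.02"}}
	}
	return out, nil
}

func main() {
	s := &Suggestion{Requests: 3}
	if err := reconcile(s, fakeService); err != nil {
		panic(err)
	}
	fmt.Println(s.SuggestionCount) // prints 3

	// One trial finished, so the Experiment controller bumps requests.
	s.Requests = 4
	if err := reconcile(s, fakeService); err != nil {
		panic(err)
	}
	fmt.Println(s.SuggestionCount) // prints 4
}
```

Keeping earlier suggestions in the status and only appending new ones makes the reconcile step idempotent: running it again without a change to `requests` does nothing.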
For random search, we can use the implementation in Katib or hyperopt.
For grid search, the number of trials received so far tells us which grid point comes next; see the implementation in advisor for reference.
Alternatively, we can use chocolate.
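The idea of deriving the grid position from the number of trials can be sketched as a mixed-radix decode: the trial count is treated as an index into the Cartesian product of all parameter values. This is a minimal sketch and not advisor's or chocolate's implementation. `Axis` and `gridPoint` are hypothetical names, and using the length of the `trials` list in the request as the index is an assumption.

```go
package main

import "fmt"

type ParameterAssignment struct{ Name, Value string }

// Axis holds the grid values of one parameter, already expanded to strings.
type Axis struct {
	Name   string
	Values []string
}

// gridPoint returns the index-th point of the grid spanned by axes, with the
// last axis varying fastest. The boolean is false when the index is out of
// range, which means the grid is exhausted.
func gridPoint(axes []Axis, index int) ([]ParameterAssignment, bool) {
	total := 1
	for _, a := range axes {
		total *= len(a.Values)
	}
	if len(axes) == 0 || index < 0 || index >= total {
		return nil, false
	}
	out := make([]ParameterAssignment, len(axes))
	for i := len(axes) - 1; i >= 0; i-- {
		n := len(axes[i].Values)
		out[i] = ParameterAssignment{Name: axes[i].Name, Value: axes[i].Values[index%n]}
		index /= n
	}
	return out, true
}

func main() {
	axes := []Axis{
		{Name: "--num-layers", Values: []string{"2", "3", "4", "5"}},
		{Name: "--optimizer", Values: []string{"sgd", "adam", "ftrl"}},
	}
	trialCount := 4 // assumed to be the length of the trials list in the request
	point, ok := gridPoint(axes, trialCount)
	fmt.Println(point, ok) // prints [{--num-layers 3} {--optimizer adam}] true
}
```

Because the position is recomputed from the request on every call, the service stays stateless. When several suggestions are requested at once, the same function can be called with consecutive indices.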
We can use skopt to run Bayesian optimization.
We can use HpBandSter to run HyperBand.
We can use HpBandSter to run BOHB.
We can use hyperopt to run TPE.
We can use SMAC3 to run SMAC.
We can use goptuna to run CMA-ES.
We can use goptuna to run Sobol.