Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Config API for Kubeflow Trainer controller manager #2428

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
54 changes: 24 additions & 30 deletions cmd/trainer-controller-manager/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,13 @@ import (
"net/http"
"os"

"github.com/kubeflow/trainer/pkg/apis/trainer/v1alpha1"
trainer "github.com/kubeflow/trainer/pkg/apis/trainer/v1alpha1"
"github.com/kubeflow/trainer/pkg/controller"
"github.com/kubeflow/trainer/pkg/runtime"
runtimecore "github.com/kubeflow/trainer/pkg/runtime/core"
"github.com/kubeflow/trainer/pkg/util/cert"
webhooks "github.com/kubeflow/trainer/pkg/webhooks"
zaplog "go.uber.org/zap"
"go.uber.org/zap/zapcore"
apiruntime "k8s.io/apimachinery/pkg/runtime"
Expand All @@ -36,13 +43,6 @@ import (
"sigs.k8s.io/controller-runtime/pkg/webhook"
jobsetv1alpha2 "sigs.k8s.io/jobset/api/jobset/v1alpha2"
schedulerpluginsv1alpha1 "sigs.k8s.io/scheduler-plugins/apis/scheduling/v1alpha1"

trainer "github.com/kubeflow/trainer/pkg/apis/trainer/v1alpha1"
"github.com/kubeflow/trainer/pkg/controller"
"github.com/kubeflow/trainer/pkg/runtime"
runtimecore "github.com/kubeflow/trainer/pkg/runtime/core"
"github.com/kubeflow/trainer/pkg/util/cert"
webhooks "github.com/kubeflow/trainer/pkg/webhooks"
)

const (
Expand All @@ -62,31 +62,25 @@ func init() {
}

func main() {
var metricsAddr string
var enableLeaderElection bool
var probeAddr string
var secureMetrics bool
var enableHTTP2 bool
var webhookServerPort int
var webhookServiceName string
var webhookSecretName string
var tlsOpts []func(*tls.Config)

flag.StringVar(&metricsAddr, "metrics-bind-address", "0", "The address the metrics endpoint binds to. "+
trainerCfg := &v1alpha1.TrainerControllerConfig{}

flag.StringVar(&trainerCfg.ControllerManager.MetricsBindAddress, "metrics-bind-address", "0", "The address the metrics endpoint binds to. "+
"Use :8443 for HTTPS or :8080 for HTTP, or leave as 0 to disable the metrics service.")
flag.StringVar(&probeAddr, "health-probe-bind-address", ":8081", "The address the probe endpoint binds to.")
flag.BoolVar(&enableLeaderElection, "leader-elect", false,
flag.StringVar(&trainerCfg.ControllerManager.HealthProbeBindAddress, "health-probe-bind-address", ":8081", "The address the probe endpoint binds to.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, you should read this config from the file, similar to this: https://github.com/kubernetes-sigs/jobset/blob/main/main.go#L84C18-L87
We might also want to have the default config which enables the internal Cert Generator:
https://github.com/kubernetes-sigs/jobset/blob/main/config/components/manager/controller_manager_config.yaml#L2-L3

flag.BoolVar(&trainerCfg.ControllerManager.LeaderElect, "leader-elect", false,
"Enable leader election for controller manager. "+
"Enabling this will ensure there is only one active controller manager.")
flag.BoolVar(&secureMetrics, "metrics-secure", true,
flag.BoolVar(&trainerCfg.ControllerManager.MetricsSecure, "metrics-secure", true,
"If set, the metrics endpoint is served securely via HTTPS. Use --metrics-secure=false to use HTTP instead.")
flag.BoolVar(&enableHTTP2, "enable-http2", false,
flag.BoolVar(&trainerCfg.ControllerManager.EnableHTTP2, "enable-http2", false,
"If set, HTTP/2 will be enabled for the metrics and webhook servers")

// Cert generation flags
flag.IntVar(&webhookServerPort, "webhook-server-port", 9443, "Endpoint port for the webhook server.")
flag.StringVar(&webhookServiceName, "webhook-service-name", "kubeflow-trainer-controller-manager", "Name of the Service used as part of the DNSName")
flag.StringVar(&webhookSecretName, "webhook-secret-name", "kubeflow-trainer-webhook-cert", "Name of the Secret to store CA and server certs")
flag.IntVar(&trainerCfg.CertGeneration.WebhookServerPort, "webhook-server-port", 9443, "Endpoint port for the webhook server.")
flag.StringVar(&trainerCfg.CertGeneration.WebhookServiceName, "webhook-service-name", "training-operator-v2", "Name of the Service used as part of the DNSName")
flag.StringVar(&trainerCfg.CertGeneration.WebhookSecretName, "webhook-secret-name", "training-operator-v2-webhook-cert", "Name of the Secret to store CA and server certs")

opts := zap.Options{
TimeEncoder: zapcore.RFC3339NanoTimeEncoder,
Expand All @@ -97,7 +91,7 @@ func main() {

ctrl.SetLogger(zap.New(zap.UseFlagOptions(&opts)))

if !enableHTTP2 {
if !trainerCfg.ControllerManager.EnableHTTP2 {
// if the enable-http2 flag is false (the default), http/2 should be disabled
// due to its vulnerabilities. More specifically, disabling http/2 will
// prevent from being vulnerable to the HTTP/2 Stream Cancellation and
Expand All @@ -112,15 +106,15 @@ func main() {
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
Scheme: scheme,
Metrics: metricsserver.Options{
BindAddress: metricsAddr,
SecureServing: secureMetrics,
BindAddress: trainerCfg.ControllerManager.MetricsBindAddress,
SecureServing: trainerCfg.ControllerManager.MetricsSecure,
TLSOpts: tlsOpts,
},
WebhookServer: webhook.NewServer(webhook.Options{
Port: webhookServerPort,
Port: trainerCfg.CertGeneration.WebhookServerPort,
TLSOpts: tlsOpts,
}),
HealthProbeBindAddress: probeAddr,
HealthProbeBindAddress: trainerCfg.ControllerManager.HealthProbeBindAddress,
})
if err != nil {
setupLog.Error(err, "unable to start manager")
Expand All @@ -129,8 +123,8 @@ func main() {

certsReady := make(chan struct{})
if err = cert.ManageCerts(mgr, cert.Config{
WebhookSecretName: webhookSecretName,
WebhookServiceName: webhookServiceName,
WebhookSecretName: trainerCfg.CertGeneration.WebhookSecretName,
WebhookServiceName: trainerCfg.CertGeneration.WebhookServiceName,
WebhookConfigurationName: webhookConfigurationName,
}, certsReady); err != nil {
setupLog.Error(err, "unable to set up cert rotation")
Expand Down
50 changes: 50 additions & 0 deletions pkg/apis/trainer/v1alpha1/types.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,50 @@
// Copyright 2024 The Kubeflow Authors
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have dedicate directory for Config APIs and place it in /apis/config/v1alpha1/configuration_types.go ?
Similar to Kueue: https://github.com/kubernetes-sigs/kueue/tree/main/apis/config/v1beta1

//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

// +k8s:defaulter-gen=TypeMeta
// +k8s:openapi-gen=true
// +k8s:deepcopy-gen=package

// Package v2alpha1 is the v2alpha1 version of the API.
// +groupName=trainer.kubeflow.org

package v1alpha1

import (
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// TrainerControllerConfig represents the configuration for the training operator.
type TrainerControllerConfig struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's call it type Configuration struct

Suggested change
type TrainerControllerConfig struct {
type Configuration struct {

metav1.TypeMeta `json:",inline"`

ControllerManager *ControllerManager `json:"controllerManager,omitempty"`
CertGeneration *CertGeneration `json:"certGeneration,omitempty"`
}
Comment on lines +31 to +34
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might need to fix Go formatting here.


// ControllerManager holds configuration related to the controller manager.
type ControllerManager struct {
MetricsBindAddress string `json:"metricsBindAddress,omitempty"`
HealthProbeBindAddress string `json:"healthProbeBindAddress,omitempty"`
LeaderElect bool `json:"leaderElect,omitempty"`
MetricsSecure bool `json:"metricsSecure,omitempty"`
EnableHTTP2 bool `json:"enableHttp2,omitempty"`
}
Comment on lines +37 to +43
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


// CertGeneration holds configuration related to webhook server certificate generation.
type CertGeneration struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also have flag to enable/disable Cert Generator ?

WebhookServerPort int `json:"webhookServerPort,omitempty"`
WebhookServiceName string `json:"webhookServiceName,omitempty"`
WebhookSecretName string `json:"webhookSecretName,omitempty"`
Comment on lines +47 to +49
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need to explain what does each value mean and add the Defaults values:
https://github.com/kubeflow/katib/blob/master/pkg/apis/config/v1beta1/types.go#L55

}
Loading