- katib/issues#685 (Katib metrics collector solution)
- katib/pull#697 (API for metricCollector)
- katib/pull#716 (Add pod level inject webhook)
- katib/pull#729 (Inject pod sidecar for specified namespace)
- katib/pull#730 (fix: Add build for sidecar)
Katib is a hyperparameter tuning (HPT) and neural architecture search (NAS) system built on Kubernetes. Metrics collection is an essential step of auto-training. In the current design, the metrics collector is pull-based: Katib runs a metrics collector cron job for each Trial, and the cron job periodically pulls the targeted pod's logs and then persists them into MySQL. However, the pull-based design has problems, such as deciding at what frequency to scrape the metrics.
To improve extensibility and support EarlyStopping, we propose a new design for the metrics collector. In the new design, Katib uses a mutating webhook to inject the metrics collector container as a sidecar into the Job/TFJob/PyTorchJob pod. The sidecar collects the master's metrics and then stores them on the persistence layer (e.g. katib-db-manager and the metadata server).
Fig. 1 Architecture of the new design
- A mutating webhook: injects the metrics collector as a sidecar into the master pod.
- A metrics collector: collects metrics and stores them on the persistence layer (katib-db-manager).
- The final metrics of worker pods are collected by the trial controller and then stored in the trial status.
For more detail, see here.
```go
type MetricsCollectorSpec struct {
	// Deprecated Retain
	Retain bool `json:"retain,omitempty"`
	// Deprecated GoTemplate
	GoTemplate GoTemplate `json:"goTemplate,omitempty"`

	Source    *SourceSpec    `json:"source,omitempty"`
	Collector *CollectorSpec `json:"collector,omitempty"`
}

type SourceSpec struct {
	// Model-train source code can expose metrics over HTTP, such as an HTTP
	// endpoint in Prometheus metric format
	HttpGet *v1.HTTPGetAction `json:"httpGet,omitempty"`
	// During model training, metrics may be persisted into a local file by the
	// source code, such as the tfEvent use case
	FileSystemPath *FileSystemPath `json:"fileSystemPath,omitempty"`
	// The default metric output format is {"metric": "<metric_name>",
	// "value": <int_or_float>, "epoch": <int>, "step": <int>}; if the output
	// doesn't follow the default format, please extend it here
	Filter *FilterSpec `json:"filter,omitempty"`
}

type FilterSpec struct {
	// When the metrics output follows the format this field specifies, the
	// metrics collector collects it and reports it to the metrics server;
	// it can be "<metric_name>: <float>" or similar
	MetricsFormat []string `json:"metricsFormat,omitempty"`
}

type FileSystemKind string

const (
	DirectoryKind FileSystemKind = "directory"
	FileKind      FileSystemKind = "file"
)

type FileSystemPath struct {
	Path string         `json:"path,omitempty"`
	Kind FileSystemKind `json:"kind,omitempty"`
}

type CollectorKind string

const (
	StdOutCollector           CollectorKind = "stdOutCollector"
	FileCollector             CollectorKind = "fileCollector"
	TfEventCollector          CollectorKind = "tfEventCollector"
	PrometheusMetricCollector CollectorKind = "prometheusMetricCollector"
	CustomCollector           CollectorKind = "customCollector"
	// When the model training source code persists metrics into the
	// persistence layer directly, no metrics collector is needed, and its
	// kind is "noneCollector"
	NoneCollector CollectorKind = "noneCollector"
)

type CollectorSpec struct {
	Kind CollectorKind `json:"kind"`
	// When kind is "customCollector", this field will be used
	CustomCollector *v1.Container `json:"customCollector,omitempty"`
}
```
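To make `FilterSpec` concrete, the sketch below shows how a collector might apply a `MetricsFormat`-style pattern such as `"<metric_name>: <float>"` to raw log lines. The regex and the `parseMetrics` helper are illustrative assumptions, not Katib's actual filter implementation:

```go
package main

import (
	"fmt"
	"regexp"
)

// parseMetrics scans raw log lines and keeps only those matching an
// assumed "<metric_name>: <float>" pattern, returning name -> value.
func parseMetrics(lines []string) map[string]string {
	re := regexp.MustCompile(`^([\w-]+)\s*:\s*([+-]?\d+(?:\.\d+)?)$`)
	out := map[string]string{}
	for _, l := range lines {
		if m := re.FindStringSubmatch(l); m != nil {
			out[m[1]] = m[2]
		}
	}
	return out
}

func main() {
	got := parseMetrics([]string{"epoch 1 done", "accuracy: 0.93", "loss: 1.25"})
	fmt.Println(got["accuracy"], got["loss"]) // 0.93 1.25
}
```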
To avoid collecting duplicate metrics, as discussed in kubeflow/katib#685, only one metrics collector sidecar is injected into the master pod per Experiment.
In the new design, the Katib mutating webhook injects the sidecar in one of two modes: Pod Level Injecting and Job Level Injecting.
The webhook decides which mode to use based on the katib.kubeflow.org/metrics-collector-injection=enabled label on the namespace. In a namespace carrying this label, the webhook injects the sidecar at the pod level; without the label, it injects at the job level.
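The mode decision above can be sketched as a small helper. The function name and return values are illustrative, not the webhook's actual code:

```go
package main

import "fmt"

// Namespace label the webhook checks (from the design above).
const injectionLabel = "katib.kubeflow.org/metrics-collector-injection"

// injectionMode returns "pod" when the namespace carries the
// metrics-collector-injection=enabled label, and "job" otherwise.
func injectionMode(nsLabels map[string]string) string {
	if nsLabels[injectionLabel] == "enabled" {
		return "pod"
	}
	return "job"
}

func main() {
	fmt.Println(injectionMode(map[string]string{injectionLabel: "enabled"})) // pod
	fmt.Println(injectionMode(nil))                                          // job
}
```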
In Pod Level Injecting,
- Job operators (e.g. TFJob/PyTorchJob) tag the training.kubeflow.org/job-role: master label (#1064) on the master pod.
- The webhook injects the metrics collector only if it recognizes this label.
- The webhook uses an ObjectSelector to skip irrelevant objects in order to optimize performance.
- ObjectSelector is only supported on Kubernetes v1.15 and above. Without this feature, the webhook may have performance issues; in that situation, the Job Level Injecting mode may be a better option.
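The pod-level decision reduces to a label check, which the ObjectSelector pushes in front of the webhook so non-master pods never reach it. A minimal sketch (the function name is hypothetical; the label is the one from the design above):

```go
package main

import "fmt"

// Label job operators tag on the master pod (kubeflow/tf-operator#1064).
const jobRoleLabel = "training.kubeflow.org/job-role"

// shouldInjectSidecar mirrors the pod-level rule: inject the metrics
// collector only when the pod carries the master job-role label.
func shouldInjectSidecar(podLabels map[string]string) bool {
	return podLabels[jobRoleLabel] == "master"
}

func main() {
	fmt.Println(shouldInjectSidecar(map[string]string{jobRoleLabel: "master"})) // true
	fmt.Println(shouldInjectSidecar(map[string]string{jobRoleLabel: "worker"})) // false
}
```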
In Job Level Injecting,
- The webhook uses different strategies to inject the sidecar for different job operators. For now, the webhook supports PyTorchJob and TFJob.
- For PyTorchJob, the metrics collector sidecar is injected into the master template.
- For TFJob, the metrics collector sidecar is injected into the master template if a master exists. Otherwise, the sidecar is injected into the worker template at index 0.
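The per-operator strategy above can be sketched as a replica-selection function. The replica names and the function itself are illustrative stand-ins, not the operators' exact API constants:

```go
package main

import "fmt"

// targetReplica picks where the sidecar goes: PyTorchJob always targets the
// Master template; TFJob targets Master when present, else Worker index 0.
func targetReplica(jobKind string, replicas []string) (string, int) {
	has := func(name string) bool {
		for _, r := range replicas {
			if r == name {
				return true
			}
		}
		return false
	}
	switch jobKind {
	case "PyTorchJob":
		return "Master", 0
	case "TFJob":
		if has("Master") {
			return "Master", 0
		}
		return "Worker", 0
	}
	return "", -1 // unsupported operator
}

func main() {
	r, i := targetReplica("TFJob", []string{"Worker", "PS"})
	fmt.Println(r, i) // Worker 0
}
```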
After injection, the sidecar collects the master's metrics and then stores them on the persistence layer (e.g. katib-db-manager and the metadata server).
In katib/pull#716, we implement a mutating webhook and a new sidecar metrics collector, which differs from the original metrics collector. The mutating webhook injects a sidecar metrics collector into every master pod.
While running as a sidecar, the collector container needs to keep running so that the master pod can train normally: if the sidecar container completes or errors before the master container finishes, the master pod with the sidecar becomes invisible to the other worker pods.
Therefore, in the implementation, the sidecar collector collects the logs from the Kubernetes client in Follow mode and keeps retrying whenever an error is encountered.
The final metrics of worker pods are collected by the trial controller and then stored in the trial status.
#WIP