katib metrics-collector: mention supported writers (#3999)
* katib metrics-collector: mention supported writers

See kubeflow/katib#2467

Signed-off-by: Gary Miguel <[email protected]>

* add 'metrics' word

Signed-off-by: Gary Miguel <[email protected]>

---------

Signed-off-by: Gary Miguel <[email protected]>
garymm authored Feb 15, 2025
1 parent 8ad90c5 commit 27e9998
Showing 1 changed file with 9 additions and 6 deletions.
@@ -12,7 +12,7 @@ Before running your hyperparameter tuning Katib Experiment with Python SDK,
ensure the namespace label `katib.kubeflow.org/metrics-collector-injection: enabled`
is present. This label enables the sidecar container injection for pull-based metrics collectors to collect metrics during the experiment.

-You can configure the namespace by adding the following label `katib.kubeflow.org/metrics-collector-injection: enabled`
+You can configure the namespace by adding the following label `katib.kubeflow.org/metrics-collector-injection: enabled`
as is shown in the sample code:

```yaml
@@ -44,7 +44,7 @@ define how Katib should collect the metrics from each Trial, such as the accurac

## Pull-based Metrics Collector

-Your training code can record the metrics into `StdOut` or into arbitrary output files.
+Your training code can record the metrics into `StdOut` or into arbitrary output files.

To define the pull-based metrics collector for your Experiment:

@@ -73,6 +73,9 @@ To define the pull-based metrics collector for your Experiment:

- `TensorFlowEvent`: Katib collects the metrics from a directory path
containing a [tf.Event](https://www.tensorflow.org/api_docs/python/tf/compat/v1/Event).
+These are typically written by [tensorflow.summary](https://www.tensorflow.org/api_docs/python/tf/summary).
+As of Katib 0.18, [torch.utils.tensorboard](https://pytorch.org/docs/stable/tensorboard.html) or
+[tensorboardX](https://tensorboardx.readthedocs.io/en/latest/index.html) may also be used to write metrics.
You should specify the path in the `.source.fileSystemPath.path` field. Check the
[TFJob example](https://github.com/kubeflow/katib/blob/ea46a7f2b73b2d316b6b7619f99eb440ede1909b/examples/v1beta1/kubeflow-training-operator/tfjob-mnist-with-summaries.yaml#L17-L23).
The default directory path is `/var/log/katib/tfevent/`.
@@ -110,10 +113,10 @@ To define the pull-based metrics collector for your Experiment:

## Push-based Metrics Collector

-Your training code needs to call [`report_metrics()`](https://github.com/kubeflow/katib/blob/e251a07cb9491e2d892db306d925dddf51cb0930/sdk/python/v1beta1/kubeflow/katib/api/report_metrics.py#L26) function in Python SDK to record metrics.
-The `report_metrics()` function works by parsing the metrics in `metrics` field into a gRPC request, automatically adding the current timestamp for users, and sending the request to Katib DB Manager.
+Your training code needs to call [`report_metrics()`](https://github.com/kubeflow/katib/blob/e251a07cb9491e2d892db306d925dddf51cb0930/sdk/python/v1beta1/kubeflow/katib/api/report_metrics.py#L26) function in Python SDK to record metrics.
+The `report_metrics()` function works by parsing the metrics in `metrics` field into a gRPC request, automatically adding the current timestamp for users, and sending the request to Katib DB Manager.

-But before that, `kubeflow-katib` package should be installed in your training container.
+But before that, `kubeflow-katib` package should be installed in your training container.

To define the push-based metrics collector for your Experiment, you have two options:

@@ -146,7 +149,7 @@ To define the push-based metrics collector for your Experiment, you have two opt
max_trial_count=2,
metrics_collector_config={"kind": "Push"},
# When SDK is released, replace it with packages_to_install=["kubeflow-katib==0.18.0"].
-# Currently, the training container should have `git` package to install this SDK.
+# Currently, the training container should have `git` package to install this SDK.
packages_to_install=["git+https://github.com/kubeflow/katib.git@master#subdirectory=sdk/python/v1beta1"],
)
```
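
For orientation, here is a minimal sketch of training code that reports metrics through the push-based collector touched by the last hunk. It assumes the `kubeflow-katib` package is installed in the training container and that the code runs inside a Katib Trial. The `train` function, its `report` parameter, the `accuracy` metric name, and the toy accuracy formula are hypothetical and exist only for illustration; only `report_metrics()` comes from the SDK.

```python
from typing import Callable, Dict, Optional


def train(
    epochs: int,
    report: Optional[Callable[[Dict[str, float]], None]] = None,
) -> float:
    """Toy training loop that reports one metric per epoch (illustrative only)."""
    if report is None:
        # Imported lazily so the sketch can be exercised without the SDK.
        # Assumption: this branch only runs inside a Katib Trial container.
        import kubeflow.katib as katib

        report = katib.report_metrics

    accuracy = 0.0
    for epoch in range(1, epochs + 1):
        # Stand-in for a real evaluation step (hypothetical formula).
        accuracy = 1.0 - 1.0 / (epoch + 1)
        # The key is assumed to match the Experiment's objective metric name.
        report({"accuracy": accuracy})
    return accuracy
```

Passing a plain callable such as a list's `append` method as `report` lets the loop be exercised locally without a cluster; inside a Trial, `train(epochs=3)` would fall back to `katib.report_metrics`.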
