Kubernetes Control Plane Monitoring
This section documents the specifics of monitoring the Kubernetes Control Plane and provides baseline configurations for each platform. You can then customize these configurations to add any Datadog feature.
With Datadog integrations for the API server, Etcd, Controller Manager, and Scheduler, you can collect key metrics from all four components of the Kubernetes Control Plane.
- Kubernetes with Kubeadm
- Kubernetes on Amazon EKS
- Kubernetes on OpenShift 4
- Kubernetes on OpenShift 3
- Kubernetes on Rancher Kubernetes Engine (v2.5+)
- Kubernetes on Rancher Kubernetes Engine (<v2.5)
- Kubernetes on Managed Services (AKS, GKE)
The following configurations are tested on Kubernetes v1.18+.
The API server integration is configured automatically: the Datadog Agent discovers it without any additional setup.
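To spot-check that the API server check was picked up, you can look at the Agent status on one of the Agent pods (the pod name below is a placeholder, and the check may be dispatched to a different Agent or Cluster Check Runner depending on your setup):
{{< code-block lang="bash" >}}
# <datadog-agent-pod> is a placeholder for one of your node Agent pods
kubectl exec -it <datadog-agent-pod> -- agent status | grep -A 5 kube_apiserver
{{< /code-block >}}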
By providing read access to the Etcd certificates located on the host, the Datadog Agent check can communicate with Etcd and start collecting Etcd metrics.
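On a kubeadm control plane node, the Etcd certificates referenced by the configurations below are typically found under `/etc/kubernetes/pki/etcd` (the kubeadm default; verify the paths on your own nodes):
{{< code-block lang="bash" >}}
# Run on a control plane node; kubeadm's default Etcd certificate location
ls -l /etc/kubernetes/pki/etcd/
# Expect ca.crt, server.crt, and server.key among the listed files
{{< /code-block >}}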
{{< tabs >}} {{% tab "Datadog Operator" %}}
{{< code-block lang="yaml" filename="datadog-agent.yaml" >}}
kind: DatadogAgent
apiVersion: datadoghq.com/v2alpha1
metadata:
  name: datadog
spec:
  global:
    credentials:
      apiKey: <DATADOG_API_KEY>
      appKey: <DATADOG_APP_KEY>
    clusterName: <CLUSTER_NAME>
    kubelet:
      tlsVerify: false
  override:
    clusterAgent:
      image:
        name: gcr.io/datadoghq/cluster-agent:latest
    nodeAgent:
      image:
        name: gcr.io/datadoghq/agent:latest
      extraConfd:
        configMap:
          name: datadog-checks
      containers:
        agent:
          volumeMounts:
            - name: etcd-certs
              readOnly: true
              mountPath: /host/etc/kubernetes/pki/etcd
            - name: disable-etcd-autoconf
              mountPath: /etc/datadog-agent/conf.d/etcd.d
      volumes:
        - name: etcd-certs
          hostPath:
            path: /etc/kubernetes/pki/etcd
        - name: disable-etcd-autoconf
          emptyDir: {}
      tolerations:
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: datadog-checks
data:
  etcd.yaml: |-
    ad_identifiers:
      - etcd
    init_config:
    instances:
      - prometheus_url: https://%%host%%:2379/metrics
        tls_ca_cert: /host/etc/kubernetes/pki/etcd/ca.crt
        tls_cert: /host/etc/kubernetes/pki/etcd/server.crt
        tls_private_key: /host/etc/kubernetes/pki/etcd/server.key
{{< /code-block >}}
{{% /tab %}} {{% tab "Helm" %}}
{{< code-block lang="yaml" filename="datadog-values.yaml" >}}
datadog:
  apiKey: <DATADOG_API_KEY>
  appKey: <DATADOG_APP_KEY>
  clusterName: <CLUSTER_NAME>
  kubelet:
    tlsVerify: false
  ignoreAutoConfig:
    - etcd
  confd:
    etcd.yaml: |-
      ad_identifiers:
        - etcd
      instances:
        - prometheus_url: https://%%host%%:2379/metrics
          tls_ca_cert: /host/etc/kubernetes/pki/etcd/ca.crt
          tls_cert: /host/etc/kubernetes/pki/etcd/server.crt
          tls_private_key: /host/etc/kubernetes/pki/etcd/server.key
agents:
  volumes:
    - hostPath:
        path: /etc/kubernetes/pki/etcd
      name: etcd-certs
  volumeMounts:
    - name: etcd-certs
      mountPath: /host/etc/kubernetes/pki/etcd
      readOnly: true
  tolerations:
    - effect: NoSchedule
      key: node-role.kubernetes.io/master
      operator: Exists
{{< /code-block >}}
{{% /tab %}}
{{< /tabs >}}
If the insecure ports of your Controller Manager and Scheduler instances are enabled, the Datadog Agent discovers the integrations and starts collecting metrics without any additional configuration.
Secure ports allow authentication and authorization to protect your Control Plane components. The Datadog Agent can collect Controller Manager and Scheduler metrics by targeting their secure ports.
{{< tabs >}} {{% tab "Datadog Operator" %}}
{{< code-block lang="yaml" filename="datadog-agent.yaml" >}}
kind: DatadogAgent
apiVersion: datadoghq.com/v2alpha1
metadata:
  name: datadog
spec:
  global:
    credentials:
      apiKey: <DATADOG_API_KEY>
      appKey: <DATADOG_APP_KEY>
    clusterName: <CLUSTER_NAME>
    kubelet:
      tlsVerify: false
  override:
    clusterAgent:
      image:
        name: gcr.io/datadoghq/cluster-agent:latest
    nodeAgent:
      image:
        name: gcr.io/datadoghq/agent:latest
      extraConfd:
        configMap:
          name: datadog-checks
      containers:
        agent:
          volumeMounts:
            - name: etcd-certs
              readOnly: true
              mountPath: /host/etc/kubernetes/pki/etcd
            - name: disable-etcd-autoconf
              mountPath: /etc/datadog-agent/conf.d/etcd.d
            - name: disable-scheduler-autoconf
              mountPath: /etc/datadog-agent/conf.d/kube_scheduler.d
            - name: disable-controller-manager-autoconf
              mountPath: /etc/datadog-agent/conf.d/kube_controller_manager.d
      volumes:
        - name: etcd-certs
          hostPath:
            path: /etc/kubernetes/pki/etcd
        - name: disable-etcd-autoconf
          emptyDir: {}
        - name: disable-scheduler-autoconf
          emptyDir: {}
        - name: disable-controller-manager-autoconf
          emptyDir: {}
      tolerations:
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: datadog-checks
data:
  etcd.yaml: |-
    ad_identifiers:
      - etcd
    init_config:
    instances:
      - prometheus_url: https://%%host%%:2379/metrics
        tls_ca_cert: /host/etc/kubernetes/pki/etcd/ca.crt
        tls_cert: /host/etc/kubernetes/pki/etcd/server.crt
        tls_private_key: /host/etc/kubernetes/pki/etcd/server.key
  kube_scheduler.yaml: |-
    ad_identifiers:
      - kube-scheduler
    instances:
      - prometheus_url: https://%%host%%:10259/metrics
        ssl_verify: false
        bearer_token_auth: true
  kube_controller_manager.yaml: |-
    ad_identifiers:
      - kube-controller-manager
    instances:
      - prometheus_url: https://%%host%%:10257/metrics
        ssl_verify: false
        bearer_token_auth: true
{{< /code-block >}}
{{% /tab %}} {{% tab "Helm" %}}
{{< code-block lang="yaml" filename="datadog-values.yaml" >}}
datadog:
  apiKey: <DATADOG_API_KEY>
  appKey: <DATADOG_APP_KEY>
  clusterName: <CLUSTER_NAME>
  kubelet:
    tlsVerify: false
  ignoreAutoConfig:
    - etcd
    - kube_scheduler
    - kube_controller_manager
  confd:
    etcd.yaml: |-
      ad_identifiers:
        - etcd
      instances:
        - prometheus_url: https://%%host%%:2379/metrics
          tls_ca_cert: /host/etc/kubernetes/pki/etcd/ca.crt
          tls_cert: /host/etc/kubernetes/pki/etcd/server.crt
          tls_private_key: /host/etc/kubernetes/pki/etcd/server.key
    kube_scheduler.yaml: |-
      ad_identifiers:
        - kube-scheduler
      instances:
        - prometheus_url: https://%%host%%:10259/metrics
          ssl_verify: false
          bearer_token_auth: true
    kube_controller_manager.yaml: |-
      ad_identifiers:
        - kube-controller-manager
      instances:
        - prometheus_url: https://%%host%%:10257/metrics
          ssl_verify: false
          bearer_token_auth: true
agents:
  volumes:
    - hostPath:
        path: /etc/kubernetes/pki/etcd
      name: etcd-certs
  volumeMounts:
    - name: etcd-certs
      mountPath: /host/etc/kubernetes/pki/etcd
      readOnly: true
  tolerations:
    - effect: NoSchedule
      key: node-role.kubernetes.io/master
      operator: Exists
{{< /code-block >}}
{{% /tab %}}
{{< /tabs >}}
Notes:
- The `ssl_verify` field in the `kube_controller_manager` and `kube_scheduler` configuration needs to be set to `false` when using self-signed certificates.
- When targeting secure ports, the `bind-address` option in your Controller Manager and Scheduler configuration must be reachable by the Datadog Agent. Example:
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
controllerManager:
extraArgs:
bind-address: 0.0.0.0
scheduler:
extraArgs:
bind-address: 0.0.0.0
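To confirm that the secure ports are reachable after this change, a quick hedged check from a node that runs the Agent (`<node-ip>` is a placeholder; 10257 and 10259 are the default Controller Manager and Scheduler secure ports, and `-k` skips verification of self-signed certificates):
{{< code-block lang="bash" >}}
# Any HTTP response code (including 401/403 without a token) confirms the port is reachable
curl -sk -o /dev/null -w '%{http_code}\n' https://<node-ip>:10257/metrics
curl -sk -o /dev/null -w '%{http_code}\n' https://<node-ip>:10259/metrics
{{< /code-block >}}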
Amazon Elastic Kubernetes Service (EKS) supports monitoring all control plane components using cluster checks.
- An EKS cluster running on Kubernetes version >= 1.28
- Deploy the Agent using one of:
  - Helm chart version >= `3.90.1`
  - Datadog Operator >= `v1.13.0`
- Enable the Datadog Cluster Agent
{{< tabs >}} {{% tab "Datadog Operator" %}}
Add the following annotations to the `default/kubernetes` service. Operator installations are limited to API server metrics; for `kube_controller_manager` and `kube_scheduler` metrics, use the Helm install.
annotations:
ad.datadoghq.com/endpoints.check_names: '["kube_apiserver_metrics"]'
ad.datadoghq.com/endpoints.init_configs: '[{}]'
ad.datadoghq.com/endpoints.instances:
'[{ "prometheus_url": "https://%%host%%:%%port%%/metrics", "bearer_token_auth": "true" }]'
{{% /tab %}} {{% tab "Helm" %}}
Add the following annotations to the `default/kubernetes` service:
annotations:
ad.datadoghq.com/endpoints.check_names: '["kube_apiserver_metrics"]'
ad.datadoghq.com/endpoints.init_configs: '[{}]'
ad.datadoghq.com/endpoints.instances:
'[{ "prometheus_url": "https://%%host%%:%%port%%/metrics", "bearer_token_auth": "true" }]'
ad.datadoghq.com/service.check_names: '["kube_controller_manager","kube_scheduler"]'
ad.datadoghq.com/service.init_configs: '[{},{}]'
ad.datadoghq.com/service.instances: '[{"prometheus_url":"https://%%host%%:%%port%%/apis/metrics.eks.amazonaws.com/v1/kcm/container/metrics","extra_headers":{"accept":"*/*"},"tls_ignore_warning":"true","tls_verify":"false","bearer_token_auth":"true"},{"prometheus_url":"https://%%host%%:%%port%%/apis/metrics.eks.amazonaws.com/v1/ksh/container/metrics","extra_headers":{"accept":"*/*"},"tls_ignore_warning":"true","tls_verify":"false","bearer_token_auth":"true"}]'
{{% /tab %}} {{< /tabs >}}
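One way to apply these annotations is with `kubectl annotate`; the sketch below covers only the API server annotations shared by both tabs, so extend it with the `service.*` annotations if you use the Helm variant:
{{< code-block lang="bash" >}}
kubectl annotate service kubernetes -n default \
  'ad.datadoghq.com/endpoints.check_names=["kube_apiserver_metrics"]' \
  'ad.datadoghq.com/endpoints.init_configs=[{}]' \
  'ad.datadoghq.com/endpoints.instances=[{"prometheus_url": "https://%%host%%:%%port%%/metrics", "bearer_token_auth": "true"}]'
{{< /code-block >}}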
Notes:
- Amazon exposes `kube_controller_manager` and `kube_scheduler` metrics under the `metrics.eks.amazonaws.com` API Group.
- The addition of `"extra_headers":{"accept":"*/*"}` prevents `HTTP 406` errors when querying the EKS metrics API.
On OpenShift 4, all control plane components can be monitored using endpoint checks.
- Enable the Datadog Cluster Agent
- Enable Cluster checks
- Enable Endpoint checks
- Ensure that you are logged in with sufficient permissions to edit services and create secrets.
The API server runs behind the service `kubernetes` in the `default` namespace. Annotate this service with the `kube_apiserver_metrics` configuration:
oc annotate service kubernetes -n default 'ad.datadoghq.com/endpoints.check_names=["kube_apiserver_metrics"]'
oc annotate service kubernetes -n default 'ad.datadoghq.com/endpoints.init_configs=[{}]'
oc annotate service kubernetes -n default 'ad.datadoghq.com/endpoints.instances=[{"prometheus_url": "https://%%host%%:%%port%%/metrics", "bearer_token_auth": "true"}]'
oc annotate service kubernetes -n default 'ad.datadoghq.com/endpoints.resolve=ip'
The last annotation, `ad.datadoghq.com/endpoints.resolve`, is needed because the service is in front of static pods. The Datadog Cluster Agent schedules the checks as endpoint checks and dispatches them to Cluster Check Runners. The nodes they are running on can be identified with:
oc exec -it <datadog cluster agent pod> -n <datadog ns> -- agent clusterchecks
{{% collapse-content title="Etcd OpenShift 4.0 - 4.13" level="h4" %}}
Certificates are needed to communicate with the Etcd service. They can be found in the secret `kube-etcd-client-certs` in the `openshift-monitoring` namespace. To give the Datadog Agent access to these certificates, first copy them into the namespace the Datadog Agent is running in:
oc get secret kube-etcd-client-certs -n openshift-monitoring -o yaml | sed 's/namespace: openshift-monitoring/namespace: <datadog agent namespace>/' | oc create -f -
These certificates should be mounted on the Cluster Check Runner pods by adding the volumes and volumeMounts as below.
Note: Mounts are also included to disable the Etcd check autoconfiguration file packaged with the agent.
{{< tabs >}} {{% tab "Datadog Operator" %}}
{{< code-block lang="yaml" filename="datadog-agent.yaml" >}}
kind: DatadogAgent
apiVersion: datadoghq.com/v2alpha1
metadata:
  name: datadog
spec:
  override:
    clusterChecksRunner:
      containers:
        agent:
          volumeMounts:
            - name: etcd-certs
              readOnly: true
              mountPath: /etc/etcd-certs
            - name: disable-etcd-autoconf
              mountPath: /etc/datadog-agent/conf.d/etcd.d
      volumes:
        - name: etcd-certs
          secret:
            secretName: kube-etcd-client-certs
        - name: disable-etcd-autoconf
          emptyDir: {}
{{< /code-block >}}
{{% /tab %}} {{% tab "Helm" %}}
{{< code-block lang="yaml" filename="datadog-values.yaml" >}}
...
clusterChecksRunner:
  volumes:
    - name: etcd-certs
      secret:
        secretName: kube-etcd-client-certs
    - name: disable-etcd-autoconf
      emptyDir: {}
  volumeMounts:
    - name: etcd-certs
      # Mount path matches the tls_* paths used in the service annotation below
      mountPath: /etc/etcd-certs
      readOnly: true
    - name: disable-etcd-autoconf
      mountPath: /etc/datadog-agent/conf.d/etcd.d
{{< /code-block >}}
{{% /tab %}}
{{< /tabs >}}
Then, annotate the service running in front of Etcd:
oc annotate service etcd -n openshift-etcd 'ad.datadoghq.com/endpoints.check_names=["etcd"]'
oc annotate service etcd -n openshift-etcd 'ad.datadoghq.com/endpoints.init_configs=[{}]'
oc annotate service etcd -n openshift-etcd 'ad.datadoghq.com/endpoints.instances=[{"prometheus_url": "https://%%host%%:%%port%%/metrics", "tls_ca_cert": "/etc/etcd-certs/etcd-client-ca.crt", "tls_cert": "/etc/etcd-certs/etcd-client.crt",
"tls_private_key": "/etc/etcd-certs/etcd-client.key"}]'
oc annotate service etcd -n openshift-etcd 'ad.datadoghq.com/endpoints.resolve=ip'
The Datadog Cluster Agent schedules the checks as endpoint checks and dispatches them to Cluster Check Runners.
{{% /collapse-content %}}
{{% collapse-content title="Etcd OpenShift 4.14 and higher" level="h4" %}}
Certificates are needed to communicate with the Etcd service. They can be found in the secret `etcd-metric-client` in the `openshift-etcd-operator` namespace. To give the Datadog Agent access to these certificates, first copy them into the namespace the Datadog Agent is running in:
oc get secret etcd-metric-client -n openshift-etcd-operator -o yaml | sed 's/namespace: openshift-etcd-operator/namespace: <datadog agent namespace>/' | oc create -f -
These certificates should be mounted on the Cluster Check Runner pods by adding the volumes and volumeMounts as below.
Note: Mounts are also included to disable the Etcd check autoconfiguration file packaged with the agent.
{{< tabs >}} {{% tab "Datadog Operator" %}}
{{< code-block lang="yaml" filename="datadog-agent.yaml" >}}
kind: DatadogAgent
apiVersion: datadoghq.com/v2alpha1
metadata:
  name: datadog
spec:
  override:
    clusterChecksRunner:
      containers:
        agent:
          volumeMounts:
            - name: etcd-certs
              readOnly: true
              mountPath: /etc/etcd-certs
            - name: disable-etcd-autoconf
              mountPath: /etc/datadog-agent/conf.d/etcd.d
      volumes:
        - name: etcd-certs
          secret:
            secretName: etcd-metric-client
        - name: disable-etcd-autoconf
          emptyDir: {}
{{< /code-block >}}
{{% /tab %}} {{% tab "Helm" %}}
{{< code-block lang="yaml" filename="datadog-values.yaml" >}}
...
clusterChecksRunner:
  volumes:
    - name: etcd-certs
      secret:
        secretName: etcd-metric-client
    - name: disable-etcd-autoconf
      emptyDir: {}
  volumeMounts:
    - name: etcd-certs
      # Mount path matches the tls_* paths used in the service annotation below
      mountPath: /etc/etcd-certs
      readOnly: true
    - name: disable-etcd-autoconf
      mountPath: /etc/datadog-agent/conf.d/etcd.d
{{< /code-block >}}
{{% /tab %}}
{{< /tabs >}}
Then, annotate the service running in front of Etcd:
oc annotate service etcd -n openshift-etcd 'ad.datadoghq.com/endpoints.check_names=["etcd"]'
oc annotate service etcd -n openshift-etcd 'ad.datadoghq.com/endpoints.init_configs=[{}]'
oc annotate service etcd -n openshift-etcd 'ad.datadoghq.com/endpoints.instances=[{"prometheus_url": "https://%%host%%:%%port%%/metrics", "tls_ca_cert": "/etc/etcd-certs/etcd-client-ca.crt", "tls_cert": "/etc/etcd-certs/etcd-client.crt",
"tls_private_key": "/etc/etcd-certs/etcd-client.key"}]'
oc annotate service etcd -n openshift-etcd 'ad.datadoghq.com/endpoints.resolve=ip'
The Datadog Cluster Agent schedules the checks as endpoint checks and dispatches them to Cluster Check Runners.
{{% /collapse-content %}}
The Controller Manager runs behind the service `kube-controller-manager` in the `openshift-kube-controller-manager` namespace. Annotate the service with the check configuration:
oc annotate service kube-controller-manager -n openshift-kube-controller-manager 'ad.datadoghq.com/endpoints.check_names=["kube_controller_manager"]'
oc annotate service kube-controller-manager -n openshift-kube-controller-manager 'ad.datadoghq.com/endpoints.init_configs=[{}]'
oc annotate service kube-controller-manager -n openshift-kube-controller-manager 'ad.datadoghq.com/endpoints.instances=[{"prometheus_url": "https://%%host%%:%%port%%/metrics", "ssl_verify": "false", "bearer_token_auth": "true"}]'
oc annotate service kube-controller-manager -n openshift-kube-controller-manager 'ad.datadoghq.com/endpoints.resolve=ip'
The Datadog Cluster Agent schedules the checks as endpoint checks and dispatches them to Cluster Check Runners.
The Scheduler runs behind the service `scheduler` in the `openshift-kube-scheduler` namespace. Annotate the service with the check configuration:
oc annotate service scheduler -n openshift-kube-scheduler 'ad.datadoghq.com/endpoints.check_names=["kube_scheduler"]'
oc annotate service scheduler -n openshift-kube-scheduler 'ad.datadoghq.com/endpoints.init_configs=[{}]'
oc annotate service scheduler -n openshift-kube-scheduler 'ad.datadoghq.com/endpoints.instances=[{"prometheus_url": "https://%%host%%:%%port%%/metrics", "ssl_verify": "false", "bearer_token_auth": "true"}]'
oc annotate service scheduler -n openshift-kube-scheduler 'ad.datadoghq.com/endpoints.resolve=ip'
The Datadog Cluster Agent schedules the checks as endpoint checks and dispatches them to Cluster Check Runners.
On OpenShift 3, all control plane components can be monitored using endpoint checks.
- Enable the Datadog Cluster Agent
- Enable Cluster checks
- Enable Endpoint checks
- Ensure that you are logged in with sufficient permissions to create and edit services.
The API server runs behind the service `kubernetes` in the `default` namespace. Annotate this service with the `kube_apiserver_metrics` configuration:
oc annotate service kubernetes -n default 'ad.datadoghq.com/endpoints.check_names=["kube_apiserver_metrics"]'
oc annotate service kubernetes -n default 'ad.datadoghq.com/endpoints.init_configs=[{}]'
oc annotate service kubernetes -n default 'ad.datadoghq.com/endpoints.instances=[{"prometheus_url": "https://%%host%%:%%port%%/metrics", "bearer_token_auth": "true"}]'
oc annotate service kubernetes -n default 'ad.datadoghq.com/endpoints.resolve=ip'
The last annotation, `ad.datadoghq.com/endpoints.resolve`, is needed because the service is in front of static pods. The Datadog Cluster Agent schedules the checks as endpoint checks and dispatches them to Cluster Check Runners. The nodes they are running on can be identified with:
oc exec -it <datadog cluster agent pod> -n <datadog ns> -- agent clusterchecks
Certificates are needed to communicate with the Etcd service, which are located on the host. These certificates should be mounted on the Cluster Check Runner pods by adding the volumes and volumeMounts as below.
Note: Mounts are also included to disable the Etcd check autoconfiguration file packaged with the agent.
{{< tabs >}} {{% tab "Datadog Operator" %}}
{{< code-block lang="yaml" filename="datadog-agent.yaml" >}}
kind: DatadogAgent
apiVersion: datadoghq.com/v2alpha1
metadata:
  name: datadog
spec:
  override:
    clusterChecksRunner:
      containers:
        agent:
          volumeMounts:
            - name: etcd-certs
              readOnly: true
              mountPath: /host/etc/etcd
            - name: disable-etcd-autoconf
              mountPath: /etc/datadog-agent/conf.d/etcd.d
      volumes:
        - name: etcd-certs
          hostPath:
            path: /etc/etcd
        - name: disable-etcd-autoconf
          emptyDir: {}
{{< /code-block >}}
{{% /tab %}} {{% tab "Helm" %}}
{{< code-block lang="yaml" filename="datadog-values.yaml" >}}
...
clusterChecksRunner:
  volumes:
    - hostPath:
        path: /etc/etcd
      name: etcd-certs
    - name: disable-etcd-autoconf
      emptyDir: {}
  volumeMounts:
    - name: etcd-certs
      mountPath: /host/etc/etcd
      readOnly: true
    - name: disable-etcd-autoconf
      mountPath: /etc/datadog-agent/conf.d/etcd.d
{{< /code-block >}}
{{% /tab %}} {{< /tabs >}}
Direct edits of this service are not persisted, so make a copy of the Etcd service:
oc get service etcd -n kube-system -o yaml | sed 's/name: etcd/name: etcd-copy/' | oc create -f -
Annotate the copied service with the check configuration:
oc annotate service etcd-copy -n openshift-etcd 'ad.datadoghq.com/endpoints.check_names=["etcd"]'
oc annotate service etcd-copy -n openshift-etcd 'ad.datadoghq.com/endpoints.init_configs=[{}]'
oc annotate service etcd-copy -n openshift-etcd 'ad.datadoghq.com/endpoints.instances=[{"prometheus_url": "https://%%host%%:%%port%%/metrics", "tls_ca_cert": "/host/etc/etcd/ca/ca.crt", "tls_cert": "/host/etc/etcd/server.crt",
"tls_private_key": "/host/etc/etcd/server.key"}]'
oc annotate service etcd-copy -n openshift-etcd 'ad.datadoghq.com/endpoints.resolve=ip'
The Datadog Cluster Agent schedules the checks as endpoint checks and dispatches them to Cluster Check Runners.
The Controller Manager and Scheduler run behind the same service, `kube-controllers`, in the `kube-system` namespace. Direct edits of the service are not persisted, so make a copy of the service:
oc get service kube-controllers -n kube-system -o yaml | sed 's/name: kube-controllers/name: kube-controllers-copy/' | oc create -f -
Annotate the copied service with the check configurations:
oc annotate service kube-controllers-copy -n kube-system 'ad.datadoghq.com/endpoints.check_names=["kube_controller_manager", "kube_scheduler"]'
oc annotate service kube-controllers-copy -n kube-system 'ad.datadoghq.com/endpoints.init_configs=[{}, {}]'
oc annotate service kube-controllers-copy -n kube-system 'ad.datadoghq.com/endpoints.instances=[{ "prometheus_url": "https://%%host%%:%%port%%/metrics",
"ssl_verify": "false", "bearer_token_auth": "true" }, { "prometheus_url": "https://%%host%%:%%port%%/metrics",
"ssl_verify": "false", "bearer_token_auth": "true" }]'
oc annotate service kube-controllers-copy -n kube-system 'ad.datadoghq.com/endpoints.resolve=ip'
The Datadog Cluster Agent schedules the checks as endpoint checks and dispatches them to Cluster Check Runners.
Rancher v2.5 relies on PushProx to expose control plane metric endpoints, which allows the Datadog Agent to run control plane checks and collect metrics.
- Install the Datadog Agent with the rancher-monitoring chart.
- The `pushprox` daemonsets are deployed with `rancher-monitoring` and run in the `cattle-monitoring-system` namespace. You can confirm this with the command shown below.
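A hedged spot-check for this prerequisite (daemonset names may vary slightly between `rancher-monitoring` versions):
{{< code-block lang="bash" >}}
# List the PushProx client daemonsets deployed by rancher-monitoring
kubectl get daemonsets -n cattle-monitoring-system | grep pushprox
# The k8s-app label below matches the selector used by the services in this guide
kubectl get pods -n cattle-monitoring-system -l k8s-app=pushprox-kube-etcd-client
{{< /code-block >}}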
To configure the `kube_apiserver_metrics` check, add the following annotations to the `default/kubernetes` service:
annotations:
ad.datadoghq.com/endpoints.check_names: '["kube_apiserver_metrics"]'
ad.datadoghq.com/endpoints.init_configs: '[{}]'
ad.datadoghq.com/endpoints.instances: '[{ "prometheus_url": "https://%%host%%:%%port%%/metrics", "bearer_token_auth": "true" }]'
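After adding them, you can confirm that the annotations are present on the service:
{{< code-block lang="bash" >}}
kubectl get service kubernetes -n default -o jsonpath='{.metadata.annotations}'
{{< /code-block >}}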
By adding headless Kubernetes services to define check configurations, the Datadog Agent is able to target the `pushprox` pods and collect metrics. Apply `rancher-control-plane-services.yaml`:
apiVersion: v1
kind: Service
metadata:
name: pushprox-kube-scheduler-datadog
namespace: cattle-monitoring-system
labels:
component: kube-scheduler
k8s-app: pushprox-kube-scheduler-client
annotations:
ad.datadoghq.com/endpoints.check_names: '["kube_scheduler"]'
ad.datadoghq.com/endpoints.init_configs: '[{}]'
ad.datadoghq.com/endpoints.instances: |
[
{
"prometheus_url": "http://%%host%%:10251/metrics"
}
]
spec:
clusterIP: None
selector:
k8s-app: pushprox-kube-scheduler-client
---
apiVersion: v1
kind: Service
metadata:
name: pushprox-kube-controller-manager-datadog
namespace: cattle-monitoring-system
labels:
component: kube-controller-manager
k8s-app: pushprox-kube-controller-manager-client
annotations:
ad.datadoghq.com/endpoints.check_names: '["kube_controller_manager"]'
ad.datadoghq.com/endpoints.init_configs: '[{}]'
ad.datadoghq.com/endpoints.instances: |
[
{
"prometheus_url": "http://%%host%%:10252/metrics"
}
]
spec:
clusterIP: None
selector:
k8s-app: pushprox-kube-controller-manager-client
---
apiVersion: v1
kind: Service
metadata:
name: pushprox-kube-etcd-datadog
namespace: cattle-monitoring-system
labels:
component: kube-etcd
k8s-app: pushprox-kube-etcd-client
annotations:
ad.datadoghq.com/endpoints.check_names: '["etcd"]'
ad.datadoghq.com/endpoints.init_configs: '[{}]'
ad.datadoghq.com/endpoints.instances: |
[
{
"prometheus_url": "https://%%host%%:2379/metrics",
"tls_ca_cert": "/host/opt/rke/etc/kubernetes/ssl/kube-ca.pem",
"tls_cert": "/host/opt/rke/etc/kubernetes/ssl/kube-etcd-<node-ip>.pem",
"tls_private_key": "/host/opt/rke/etc/kubernetes/ssl/kube-etcd-<node-ip>.pem"
}
]
spec:
clusterIP: None
selector:
k8s-app: pushprox-kube-etcd-client
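After replacing the `<node-ip>` placeholders with the values for your Etcd nodes, apply the manifest:
{{< code-block lang="bash" >}}
kubectl apply -f rancher-control-plane-services.yaml
{{< /code-block >}}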
Deploy the Datadog Agent with manifests based on the following configurations:
{{< tabs >}} {{% tab "Datadog Operator" %}}
{{< code-block lang="yaml" filename="datadog-agent.yaml" >}}
kind: DatadogAgent
apiVersion: datadoghq.com/v2alpha1
metadata:
  name: datadog
spec:
  features:
    clusterChecks:
      enabled: true
  global:
    credentials:
      apiKey: <DATADOG_API_KEY>
      appKey: <DATADOG_APP_KEY>
    clusterName: <CLUSTER_NAME>
    kubelet:
      tlsVerify: false
  override:
    nodeAgent:
      containers:
        agent:
          volumeMounts:
            - name: etcd-certs
              readOnly: true
              mountPath: /host/opt/rke/etc/kubernetes/ssl
      volumes:
        - name: etcd-certs
          hostPath:
            path: /opt/rke/etc/kubernetes/ssl
      tolerations:
        - key: node-role.kubernetes.io/controlplane
          operator: Exists
          effect: NoSchedule
        - key: node-role.kubernetes.io/etcd
          operator: Exists
          effect: NoExecute
{{< /code-block >}}
{{% /tab %}} {{% tab "Helm" %}}
{{< code-block lang="yaml" filename="datadog-values.yaml" >}}
datadog:
  apiKey: <DATADOG_API_KEY>
  appKey: <DATADOG_APP_KEY>
  clusterName: <CLUSTER_NAME>
  kubelet:
    tlsVerify: false
agents:
  volumes:
    - hostPath:
        path: /opt/rke/etc/kubernetes/ssl
      name: etcd-certs
  volumeMounts:
    - name: etcd-certs
      mountPath: /host/opt/rke/etc/kubernetes/ssl
      readOnly: true
  tolerations:
    - effect: NoSchedule
      key: node-role.kubernetes.io/controlplane
      operator: Exists
    - effect: NoExecute
      key: node-role.kubernetes.io/etcd
      operator: Exists
{{< /code-block >}}
{{% /tab %}} {{< /tabs >}}
Install the Datadog Agent with the rancher-monitoring chart.
The control plane components run on Docker outside of Kubernetes. Within Kubernetes, the `kubernetes` service in the `default` namespace targets the control plane node IP(s). You can confirm this by running `kubectl describe endpoints kubernetes`.
You can annotate this service with endpoint checks (managed by the Datadog Cluster Agent) to monitor the API Server, Controller Manager, and Scheduler:
kubectl edit service kubernetes
metadata:
annotations:
ad.datadoghq.com/endpoints.check_names: '["kube_apiserver_metrics", "kube_controller_manager", "kube_scheduler"]'
ad.datadoghq.com/endpoints.init_configs: '[{},{},{}]'
ad.datadoghq.com/endpoints.instances: '[{ "prometheus_url": "https://%%host%%:%%port%%/metrics", "bearer_token_auth": "true" },
{"prometheus_url": "http://%%host%%:10252/metrics"},
{"prometheus_url": "http://%%host%%:10251/metrics"}]'
Etcd is run in Docker outside of Kubernetes, and certificates are required to communicate with the Etcd service. The suggested steps to set up Etcd monitoring require SSH access to a control plane node running Etcd.
- SSH into the control plane node by following the Rancher documentation. Confirm that Etcd is running in a Docker container with `docker ps`, and then use `docker inspect etcd` to find the location of the certificates used in the run command (`"Cmd"`), as well as the host paths of the mounts. The three flags in the command to look for are:
  - `--trusted-ca-file`
  - `--cert-file`
  - `--key-file`
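For example, a hedged way to pull these values out of the `docker inspect` output (assuming the container is named `etcd`; adjust the filters to match your output):
{{< code-block lang="bash" >}}
# Show the flags passed to etcd; look for --trusted-ca-file, --cert-file, and --key-file
docker inspect etcd | grep -E 'trusted-ca-file|cert-file|key-file'
# Show the host paths mounted into the container ("Source") and where they appear inside it ("Destination")
docker inspect etcd | grep -E '"Source"|"Destination"'
{{< /code-block >}}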
- Using the mount information available in the `docker inspect etcd` output, set `volumes` and `volumeMounts` in the Datadog Agent configuration. Also include tolerations so that the Datadog Agent can run on the control plane nodes.
The following are examples of how to configure the Datadog Agent with Helm and the Datadog Operator:
{{< tabs >}} {{% tab "Datadog Operator" %}}
{{< code-block lang="yaml" filename="datadog-agent.yaml" >}} kind: DatadogAgent apiVersion: datadoghq.com/v2alpha1 metadata: name: datadog spec: features: clusterChecks: enabled: true global: credentials: apiKey: <DATADOG_API_KEY> appKey: <DATADOG_APP_KEY> clusterName: <CLUSTER_NAME> kubelet: tlsVerify: false override: nodeAgent: containers: agent: volumeMounts: - name: etcd-certs readOnly: true mountPath: /host/opt/rke/etc/kubernetes/ssl volumes: - name: etcd-certs hostPath: path: /opt/rke/etc/kubernetes/ssl tolerations: - key: node-role.kubernetes.io/controlplane operator: Exists effect: NoSchedule - key: node-role.kubernetes.io/etcd operator: Exists effect: NoExecute {{< /code-block >}}
{{% /tab %}} {{% tab "Helm" %}}
{{< code-block lang="yaml" filename="datadog-values.yaml" >}} datadog: apiKey: <DATADOG_API_KEY> appKey: <DATADOG_APP_KEY> clusterName: <CLUSTER_NAME> kubelet: tlsVerify: false agents: volumes: - hostPath: path: /opt/rke/etc/kubernetes/ssl name: etcd-certs volumeMounts: - name: etcd-certs mountPath: /host/opt/rke/etc/kubernetes/ssl readOnly: true tolerations: - effect: NoSchedule key: node-role.kubernetes.io/controlplane operator: Exists - effect: NoExecute key: node-role.kubernetes.io/etcd operator: Exists {{< /code-block >}}
{{% /tab %}} {{< /tabs >}}
- Set up a DaemonSet with a pause container to run the Etcd check on the nodes running Etcd. This DaemonSet runs on the host network so that it can access the Etcd service. It also has the check configuration and the tolerations needed to run on the control plane node(s). Make sure that the mounted certificate file paths match what you set up on your instance, and replace the `<...>` portions accordingly.
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: etcd-pause
spec:
selector:
matchLabels:
app: etcd-pause
updateStrategy:
type: RollingUpdate
template:
metadata:
annotations:
ad.datadoghq.com/pause.check_names: '["etcd"]'
ad.datadoghq.com/pause.init_configs: '[{}]'
ad.datadoghq.com/pause.instances: |
[{
"prometheus_url": "https://%%host%%:2379/metrics",
"tls_ca_cert": "/host/etc/kubernetes/ssl/kube-ca.pem",
"tls_cert": "/host/etc/kubernetes/ssl/kube-etcd-<...>.pem",
"tls_private_key": "/host/etc/kubernetes/ssl/kube-etcd-<...>-key.pem"
}]
labels:
app: etcd-pause
name: etcd-pause
spec:
hostNetwork: true
containers:
- name: pause
image: k8s.gcr.io/pause:3.0
tolerations:
- effect: NoExecute
key: node-role.kubernetes.io/etcd
operator: Exists
- effect: NoSchedule
key: node-role.kubernetes.io/controlplane
operator: Exists
To deploy the DaemonSet and the check configuration, run `kubectl apply -f <filename>`.
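You can then confirm that a pause pod is running on each Etcd node (the `app=etcd-pause` label comes from the DaemonSet above; add `-n <namespace>` if you applied it outside your current namespace):
{{< code-block lang="bash" >}}
kubectl get pods -l app=etcd-pause -o wide
{{< /code-block >}}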
On other managed services, such as Azure Kubernetes Service (AKS) and Google Kubernetes Engine (GKE), the user cannot access the control plane components. As a result, it is not possible to run the `kube_apiserver`, `kube_controller_manager`, `kube_scheduler`, or `etcd` checks in these environments.