Create Configurable Toggle To enable/disable falco_events prom metric #1132

@stefanamaerz

Motivation

👋 Hi sidekick project!

I'm not sure if this is a bug or a feature request. Our instances of sidekick are periodically OOMing.

Upon further investigation, I believe the issue is related to unbounded growth in Prometheus metric cardinality, sometimes called cardinality explosion or high churn rate. As Kubernetes hosts in autoscalers and Kubernetes pods come and go, more and more time series are created, consuming more and more resources.

As an example:

falco_events{hostname="falco-rw2n7",k8s_ns_name="<namespace name here>",k8s_pod_name="<pod name here>-6c66cccf94-nbkvb",priority="Notice",rule="<rule name here>",source="syscall"} 1
falco_events{hostname="falco-rw2n7",k8s_ns_name="<namespace name here>",k8s_pod_name="<pod name here>-6c66cccf94-ngr75",priority="Notice",rule="<rule name here>",source="syscall"} 3
falco_events{hostname="falco-rw2n7",k8s_ns_name="<namespace name here>",k8s_pod_name="<pod name here>-6c66cccf94-nhjlk",priority="Notice",rule="<rule name here>",source="syscall"} 3
falco_events{hostname="falco-rw2n7",k8s_ns_name="<namespace name here>",k8s_pod_name="<pod name here>-6c66cccf94-nmfct",priority="Notice",rule="<rule name here>",source="syscall"} 1
falco_events{hostname="falco-rw2n7",k8s_ns_name="<namespace name here>",k8s_pod_name="<pod name here>-6c66cccf94-nsvpk",priority="Notice",rule="<rule name here>",source="syscall"} 3
falco_events{hostname="falco-rw2n7",k8s_ns_name="<namespace name here>",k8s_pod_name="<pod name here>-6c66cccf94-nwqpn",priority="Notice",rule="<rule name here>",source="syscall"} 3
falco_events{hostname="falco-rw2n7",k8s_ns_name="<namespace name here>",k8s_pod_name="<pod name here>-6c66cccf94-p9k4f",priority="Notice",rule="<rule name here>",source="syscall"} 3
falco_events{hostname="falco-rw2n7",k8s_ns_name="<namespace name here>",k8s_pod_name="<pod name here>-6c66cccf94-pndt5",priority="Notice",rule="<rule name here>",source="syscall"} 3
falco_events{hostname="falco-rw2n7",k8s_ns_name="<namespace name here>",k8s_pod_name="<pod name here>-6c66cccf94-q8w9p",priority="Notice",rule="<rule name here>",source="syscall"} 3
falco_events{hostname="falco-rw2n7",k8s_ns_name="<namespace name here>",k8s_pod_name="<pod name here>-6c66cccf94-qr6mr",priority="Notice",rule="<rule name here>",source="syscall"} 1

The combination of the hostname label and k8s_pod_name label means the data has high cardinality and consumes a lot of resources. In our case, both hostname and k8s_pod_name are ephemeral.

In an ideal world the rule above would be tuned further. However, in large clusters where sidekick has run for a prolonged period, a rule that fires often can take down sidekick through its growing resource usage.

Feature

My proposed solution: for my use case, simply building a user-configurable toggle that lets the user disable the falco_events metric would eliminate the problem entirely. We do not use the alerts in the prom metrics; rather, we export alerts in log format via pub/sub. The prom metrics are interesting to us specifically for monitoring the health of the sidekick logging pipeline (notably metrics such as promhttp_metric_handler_requests_total, falcosidekick_outputs, falcosidekick_inputs, etc.). So it would be great if we could collect those metrics without having to deal with eventual falco pod resource exhaustion.

Perhaps a config option named like:
prometheus.disablefalcoeventsmetric / PROMETHEUS_DISABLE_FALCO_EVENTS_METRIC
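As a rough illustration of the behaviour I have in mind, here is a standard-library-only Go sketch. It is not sidekick's actual implementation. The `metrics` type, `newMetrics`, `recordEvent`, and `falcoEventsDisabled` are hypothetical names, and the plain map stands in for whatever labelled counter sidekick really uses. The only thing taken from this proposal is the environment variable name.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"sync"
)

// metrics is a hypothetical stand-in for sidekick's metric registry.
type metrics struct {
	mu          sync.Mutex
	inputs      int            // low-cardinality pipeline health counter
	falcoEvents map[string]int // per-label-set counter; nil when disabled
}

// falcoEventsDisabled reads the proposed toggle. Unset or unparsable
// values leave the metric enabled, preserving today's behaviour.
func falcoEventsDisabled(getenv func(string) string) bool {
	disabled, err := strconv.ParseBool(getenv("PROMETHEUS_DISABLE_FALCO_EVENTS_METRIC"))
	return err == nil && disabled
}

// newMetrics only allocates the per-event counter when it is enabled.
func newMetrics(disableFalcoEvents bool) *metrics {
	m := &metrics{}
	if !disableFalcoEvents {
		m.falcoEvents = make(map[string]int)
	}
	return m
}

// recordEvent always updates the health counter, and updates the
// per-event counter only if the toggle left it enabled.
func (m *metrics) recordEvent(seriesKey string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.inputs++
	if m.falcoEvents != nil {
		m.falcoEvents[seriesKey]++
	}
}

func main() {
	m := newMetrics(falcoEventsDisabled(os.Getenv))
	m.recordEvent(`k8s_pod_name="app-nbkvb"`)
	m.recordEvent(`k8s_pod_name="app-ngr75"`)
	fmt.Printf("inputs=%d falco_events series=%d\n", m.inputs, len(m.falcoEvents))
}
```

With the toggle on, the health counter still advances while no per-pod series are ever created, so memory stays flat regardless of pod churn.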

Alternatives

Best practice in Prometheus is to avoid high-cardinality labels such as IP address, URL, request ID, Kubernetes pod name, etc., since they cause cardinality to grow without bound. However, without that info these metrics are not especially useful, so I'm not sure what an alternative would be.

Additional context

LMK if this makes sense or if you have a recommended approach to this problem!
