Provide a way to prevent unbounded metric growth #197

Open
howardjohn opened this issue Apr 29, 2024 · 6 comments

@howardjohn
Contributor

We are interested in a mechanism to control unbounded growth of metrics. While we generally follow best practices around limiting cardinality, this is still problematic for extremely long-lived processes. For instance, it's common to record the binary version of something in a metric, but with hundreds of rollouts over days or months, the number of time series can explode if the process collecting the metrics is never restarted.

We would like some way to control this in our application.


Currently, there are .clear() and .remove(). These are good building blocks, but I am not sure they are sufficient on their own.

remove() is challenging on its own because we don't have any way to understand the entire set of labels stored in the metric at any point. In theory you could use EncodeMetric::encode and parse the results, but that is quite hacky.

clear() is also challenging, because it is all or nothing.


Ideally, I think we would have some interface like:

family.retain_if(|(labelset, metric)| {
 Instant::now().duration_since(metric.last_write()) < Duration::from_secs(3600)
})  

(remove any metrics not modified for an hour)

This would require a method on the family, but also maybe some changes on the metric type as well to make this easier to encode.
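To make the shape of this a bit more concrete, here is a minimal, self-contained sketch. It does not use prometheus-client at all: `SketchFamily`, `TrackedCounter`, `inc_at`, and `last_write` are hypothetical names made up for illustration, and the sketch simply assumes the metric records the time of its last write.

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical counter that remembers when it was last written.
pub struct TrackedCounter {
    value: u64,
    last_write: Instant,
}

impl TrackedCounter {
    fn new(now: Instant) -> Self {
        Self { value: 0, last_write: now }
    }

    pub fn value(&self) -> u64 {
        self.value
    }

    pub fn last_write(&self) -> Instant {
        self.last_write
    }
}

/// Simplified stand-in for a metric family: one metric per label set.
pub struct SketchFamily<L> {
    metrics: Mutex<HashMap<L, TrackedCounter>>,
}

impl<L: Eq + Hash> SketchFamily<L> {
    pub fn new() -> Self {
        Self { metrics: Mutex::new(HashMap::new()) }
    }

    /// Increment the counter for `labels`, creating it on first use.
    /// `now` is a parameter so the pruning behaviour is easy to test.
    pub fn inc_at(&self, labels: L, now: Instant) {
        let mut metrics = self.metrics.lock().unwrap();
        let counter = metrics
            .entry(labels)
            .or_insert_with(|| TrackedCounter::new(now));
        counter.value += 1;
        counter.last_write = now;
    }

    /// Keep only the series for which `keep` returns true.
    pub fn retain_if<F>(&self, mut keep: F)
    where
        F: FnMut(&L, &TrackedCounter) -> bool,
    {
        self.metrics
            .lock()
            .unwrap()
            .retain(|labels, metric| keep(labels, &*metric));
    }

    /// Number of series currently stored.
    pub fn len(&self) -> usize {
        self.metrics.lock().unwrap().len()
    }
}
```

With something like this, a periodic task could call `retain_if` with a closure comparing `last_write()` against a cutoff, so the application decides the pruning policy rather than the library.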

In #196 I have put up a small draft of what this could look like, but very open to alternatives

@mxinden
Member

mxinden commented May 13, 2024

> remove() is challenging on its own because we don't have any way to understand the entire set of labels stored in the metric at any point.

In your above example, rolling out new binaries, can't you add a hook, that on removing a binary, calls remove on the Family?

@mxinden
Member

mxinden commented May 13, 2024

Related discussion: prometheus/client_golang#920

@howardjohn
Contributor Author

> In your above example, rolling out new binaries, can't you add a hook, that on removing a binary, calls remove on the Family?

We could know a few of the labels, but not all of them. Consider a case of a network proxy acting on behalf of a few clients. We may have something like

{source=app-v1,http_code=5xx,destination_app=app-v2}
{source=app-v1,http_code=4xx,destination_app=app-v2}
{source=app-v1,http_code=2xx,destination_app=app-v1}
{source=app-v2,http_code=2xx,destination_app=app-v1}

etc.

In the real world it's more than 3 labels as well.

If we had something like retain, we definitely could, since we could do something like

family.retain_if(|labelset| {
 // keep everything except series from the removed app-v1
 labelset.source != "app-v1"
})

or similar

@howardjohn
Contributor Author

@mxinden wdyt of #196?

@mxinden
Member

mxinden commented Aug 31, 2024

Thank you for prototyping this idea in #196.

I am still not sure I fully understand either the use-case itself, or why that is not supported by prometheus-client today.

> remove() is challenging on its own because we don't have any way to understand the entire set of labels stored in the metric at any point. In theory you could use EncodeMetric::encode and parse the results, but that is quite hacky.

Given that you must have at some point called Family::get_or_create with a LabelSet, you can as well call Family::remove with that same LabelSet. Am I missing something @howardjohn?

@howardjohn
Contributor Author

> Given that you must have at some point called Family::get_or_create with a LabelSet, you can as well call Family::remove with that same LabelSet. Am I missing something @howardjohn?

I could, but then I need to effectively maintain my own mirror of Family to keep track of the label sets. At that point we might as well not use Family at all and instead fork family.rs with the extra remove functionality. If we don't, it's still a lot of work plus double the storage for label sets.
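As a rough illustration of what that mirror looks like, here is a self-contained sketch. `OpaqueFamily` is a hypothetical stand-in for a family that can create and remove series but cannot list them; it is not the prometheus-client type, and `MirroredFamily`, `prune`, and `labels` are likewise made-up names.

```rust
use std::collections::{HashMap, HashSet};

/// A label set as ordered key/value pairs (an assumption for this sketch).
pub type Labels = Vec<(String, String)>;

/// Small helper to build a label set from string pairs.
pub fn labels(pairs: &[(&str, &str)]) -> Labels {
    pairs
        .iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}

/// Stand-in for a family that supports create and remove but no enumeration.
#[derive(Default)]
pub struct OpaqueFamily {
    series: HashMap<Labels, u64>,
}

impl OpaqueFamily {
    pub fn inc(&mut self, labels: &Labels) {
        *self.series.entry(labels.clone()).or_insert(0) += 1;
    }

    pub fn remove(&mut self, labels: &Labels) -> bool {
        self.series.remove(labels).is_some()
    }

    pub fn len(&self) -> usize {
        self.series.len()
    }
}

/// Wrapper that records every label set it has seen so it can prune later.
#[derive(Default)]
pub struct MirroredFamily {
    family: OpaqueFamily,
    // Second copy of every label set: this is the doubled storage.
    seen: HashSet<Labels>,
}

impl MirroredFamily {
    pub fn inc(&mut self, labels: Labels) {
        self.family.inc(&labels);
        self.seen.insert(labels);
    }

    /// Remove every series whose label set satisfies `stale`.
    /// Returns the number of series removed.
    pub fn prune<F: Fn(&Labels) -> bool>(&mut self, stale: F) -> usize {
        let doomed: Vec<Labels> = self
            .seen
            .iter()
            .filter(|l| stale(*l))
            .cloned()
            .collect();
        for l in &doomed {
            self.family.remove(l);
            self.seen.remove(l);
        }
        doomed.len()
    }

    pub fn len(&self) -> usize {
        self.family.len()
    }
}
```

Every write path has to go through the wrapper for the mirror to stay accurate, which is the part that makes this feel like reimplementing the family.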

> I am still not sure I fully understand either the use-case itself,

To make it more concrete, I am building a Kubernetes node proxy. For all intents and purposes, you could think of it as something like kube-proxy.

Our metrics have source_X labels on them (not source_pod, which would have incredible cardinality, but things like source_namespace). We are acting as a proxy for whatever happens to be on our node, which can change, so we want the ability to prune old metrics. For example, if nothing from namespace foo remains on the node, we would drop all series labeled source_namespace=foo. Without this, metrics growth over a long period of time is effectively unbounded.
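A minimal sketch of that pruning rule, assuming the series live in something we can iterate over. The plain `HashMap` here is only a stand-in for the family, and `ProxyLabels` and `prune_departed` are hypothetical names for illustration:

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical label set for the proxy's request counter.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct ProxyLabels {
    pub source_namespace: String,
    pub http_code: String,
}

/// Drop every series whose source namespace no longer has anything
/// running on this node. Returns the number of series removed.
pub fn prune_departed(
    series: &mut HashMap<ProxyLabels, u64>,
    live_namespaces: &HashSet<String>,
) -> usize {
    let before = series.len();
    series.retain(|labels, _| live_namespaces.contains(&labels.source_namespace));
    before - series.len()
}
```

The proxy would call something like this whenever the set of namespaces on the node changes, passing in the namespaces that are still present.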
