
Sum bytes_downloaded and bytes_uploaded for duplicate common-names #15

Open · AvaxVPN-Team opened this issue Dec 25, 2020 · 8 comments · May be fixed by #18

@AvaxVPN-Team

When multiple clients with the same common-name are connected to the server, the collector panics.
The behaviour should be changed so that the bandwidth is summed and the connected_since field becomes some aggregation of the individual connections (for example the earliest connection date), or the metric is dropped entirely.
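
(For illustration, a minimal Go sketch of the summing behaviour described above; the client struct and field names are illustrative, not the exporter's actual types:)

package main

import "fmt"

// client is a hypothetical parsed entry from the OpenVPN status file.
type client struct {
	CommonName      string
	BytesDownloaded uint64
	BytesUploaded   uint64
}

// sumByCommonName folds duplicate common names into one aggregate entry,
// summing the byte counters instead of emitting duplicate label sets.
func sumByCommonName(clients []client) map[string]client {
	agg := make(map[string]client)
	for _, c := range clients {
		a := agg[c.CommonName]
		a.CommonName = c.CommonName
		a.BytesDownloaded += c.BytesDownloaded
		a.BytesUploaded += c.BytesUploaded
		agg[c.CommonName] = a
	}
	return agg
}

func main() {
	clients := []client{
		{"alice", 100, 10},
		{"alice", 50, 5}, // same common name, second device
	}
	fmt.Println(sumByCommonName(clients)["alice"]) // {alice 150 15}
}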

@patrickjahns (Owner)

From my point of view, aggregating the metrics is not a valid solution for the issue you are encountering.
These metrics are intended to be per client (or, in other terms, per connection) - if you aggregate them, you are combining the statistics of several clients and the original intention is lost.

Dropping duplicate clients also does not seem like a good solution: it would not be obvious why, at a given moment, the number of connected clients deviates from the number of client metrics provided - i.e. it would simply hide a possible underlying issue.

If you are allowing more than one connection for the same common name, I suggest disabling per-client metrics by starting the collector with --disable-client-metrics.

From prior discussion:

For Prometheus, all label (key/value) pairs must be unique per scrape iteration - the only way to change this would be to add an additional label for the case where two clients share the same common name.
One option would be to include the client IP, which adds another permutation - however, we are then only safe as long as no two clients connect from the same public IP (i.e. NAT) - in that case we have the same issue again.
One could now argue to use ip:port - but that will lead to a lot of permutations in the time series database.

Both of those added labels increase cardinality, which might explode with hundreds of clients and put a lot of stress on Prometheus. So I am hesitant to add those extra labels.
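
(For context, a rough sketch of why uniqueness matters in a custom prometheus/client_golang collector - the metric name and the real_address label here are illustrative, not necessarily what the exporter uses. Two connections emitting identical label sets make Gather fail; the extra label disambiguates them:)

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// receivedBytes carries common_name plus a disambiguating real_address label.
var receivedBytes = prometheus.NewDesc(
	"openvpn_client_received_bytes_total",
	"Bytes received per connection.",
	[]string{"common_name", "real_address"},
	nil,
)

type conn struct {
	commonName, realAddress string
	bytesReceived           float64
}

type collector struct{ conns []conn }

func (c collector) Describe(ch chan<- *prometheus.Desc) { ch <- receivedBytes }

func (c collector) Collect(ch chan<- prometheus.Metric) {
	for _, co := range c.conns {
		// Without real_address, two connections for the same common name
		// would emit identical label sets and Gather would report duplicates.
		ch <- prometheus.MustNewConstMetric(receivedBytes,
			prometheus.CounterValue, co.bytesReceived,
			co.commonName, co.realAddress)
	}
}

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(collector{conns: []conn{
		{"alice", "203.0.113.7:51820", 100},
		{"alice", "203.0.113.7:51821", 50}, // same CN, second device
	}})
	mfs, err := reg.Gather()
	fmt.Println(len(mfs), err) // 1 <nil>; without the extra label this errors
}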

Given the background and further information - what would be your use-case for aggregating or dropping metrics?

@AvaxVPN-Team (Author) commented Dec 26, 2020

@patrickjahns We are a commercial VPN provider team and we limit our customers' accounts by the amount of bandwidth they use. Say, for example, a client has 100GB of traffic; we track their bandwidth using a client-disconnect script, which gives us the exact amount of bandwidth the user consumed during their connection. We store this value in our central database and check it when the user connects: if the sum of all the bandwidth they used in their connections over the past 30 days exceeds their service's limit, they see the 'User auth failed' error.
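
(As an aside, this accounting step maps onto OpenVPN's client-disconnect environment: OpenVPN passes per-connection totals to the script via the bytes_received and bytes_sent environment variables. A minimal Go sketch, with the database write stubbed out:)

package main

import (
	"fmt"
	"os"
	"strconv"
)

// recordUsage would write to the provider's central database; stubbed here.
func recordUsage(commonName string, down, up uint64) {
	fmt.Printf("record %s: down=%d up=%d\n", commonName, down, up)
}

func main() {
	// OpenVPN sets these variables for client-disconnect scripts.
	cn := os.Getenv("common_name")
	down, _ := strconv.ParseUint(os.Getenv("bytes_received"), 10, 64)
	up, _ := strconv.ParseUint(os.Getenv("bytes_sent"), 10, 64)
	recordUsage(cn, down, up)
}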

But we also provide other protocols besides OpenVPN, and they all report their bandwidth via Prometheus. So it would be better for us to switch from the client-disconnect script to an OpenVPN Prometheus exporter as well - to be able to visualize everything in a central Grafana panel and do other useful things with the data.

The problem is that clients can connect to the same server from multiple devices, and we want to measure the bandwidth used by all of those connections, not only the first one (and we definitely do not want a panic in the duplicate-CN situation).

@AvaxVPN-Team (Author) commented Dec 26, 2020

I think the best possible approach would be adding the port number to the label set, disabled by default and enabled only with an extra flag (something like --support-duplicate-cn or --include-remote-port).

We use VictoriaMetrics instead of vanilla Prometheus for handling metrics, and VictoriaMetrics is pretty good at handling data with extremely high cardinality - I'm fairly sure that even at small scale, cardinality wouldn't be a big problem. Take, for example, 100 clients, each of which has used every port at some point in history: that is 100 * (65535 - 1024), about 6.5M series, which is pretty normal for VictoriaMetrics and a little high for Prometheus, but not a big problem for either.

Edit: I partially take that back; it looks like 6M is too high for Prometheus, where the safe boundary is around 100K, even though VictoriaMetrics can handle much more than that.

@patrickjahns (Owner)

Thanks for sharing the information 👍 - #16 addresses the panic for the time being.

I think the best possible approach would be adding the port number to the label set, disabled by default and enabled only with an extra flag (something like --support-duplicate-cn or --include-remote-port).

This sounds reasonable to me - another thought that crossed my mind: we could simply add a suffix to the client names, i.e. foo1, foo2, foo3 - that would limit the potential cardinality issue. The downside here is that we would need to find a way to consistently assign the names across several scrapes. Using the port seems the simpler and more straightforward approach.

Thanks for sharing the information on VictoriaMetrics - I wasn't aware that there are good solutions for high-cardinality labels 👍
Out of curiosity - do you have a link with further information on what VictoriaMetrics can handle?

@AvaxVPN-Team (Author) commented Dec 26, 2020

https://valyala.medium.com/high-cardinality-tsdb-benchmarks-victoriametrics-vs-timescaledb-vs-influxdb-13e6ee64dd6b

It's not just VictoriaMetrics; there are two other competitors for high-cardinality data (InfluxDB and TimescaleDB), but VM seems to beat them on performance.

And for the label problem, I'm thinking of another solution right now. It may be possible to have a global dictionary matching tuples of (common_name, port) to a counter, and to use that counter value in another label.

For example, a representation in Python:

from collections import defaultdict

# Initial definition of the globals
connection_label = {}                # (common_name, port) -> assigned counter value
connection_count = defaultdict(int)  # common_name -> connections seen so far

# This part runs when processing each log line
if (common_name, port) not in connection_label:
    connection_count[common_name] += 1
    connection_label[(common_name, port)] = connection_count[common_name]

Notice how I'm using the counter value as the label for that connection. It means that each time a user connects, the counter for that common_name increases by one, and that counter value gets assigned to that particular connection.

The only problem here is that the counter would keep growing, and if it goes above 65535 for a common_name then there is no benefit in using it instead of the port number. Maybe reset it somehow, or..?

P.S. I'm coming from the Python world and I don't know what the equivalents of a dictionary or a tuple are in Go 😅
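
(For reference, a rough Go equivalent of the sketch above: the tuple key becomes a comparable struct and the dictionary becomes a map; all names are illustrative. Go maps return the zero value for missing keys, so no defaultdict is needed:)

package main

import "fmt"

// connKey plays the role of the Python tuple (common_name, port).
type connKey struct {
	commonName string
	port       int
}

var (
	connectionLabel = map[connKey]int{} // (common_name, port) -> assigned counter value
	connectionCount = map[string]int{}  // common_name -> connections seen so far
)

// assignLabel mirrors the Python snippet: the first sighting of a
// (common_name, port) pair bumps the per-CN counter and records it.
func assignLabel(commonName string, port int) int {
	key := connKey{commonName, port}
	if _, ok := connectionLabel[key]; !ok {
		connectionCount[commonName]++
		connectionLabel[key] = connectionCount[commonName]
	}
	return connectionLabel[key]
}

func main() {
	fmt.Println(assignLabel("alice", 51820)) // 1
	fmt.Println(assignLabel("alice", 51821)) // 2
	fmt.Println(assignLabel("alice", 51820)) // still 1
}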

@patrickjahns (Owner)

Thanks for sharing the link about the time series databases - I will give it a read.

Interesting idea of yours.

The only problem here is that the counter would keep growing, and if it goes above 65535 for a common_name then there is no benefit in using it instead of the port number. Maybe reset it somehow, or..?

That's a typical caching and cache-invalidation problem - let me think about it a bit more.
However, I am inclined to go down the simple route of using the port for now, with your suggested opt-in flag.

@AvaxVPN-Team (Author) commented Dec 26, 2020

A better way to assign small numbers to connections would be something like a number pool.

What I mean is that on each new connection, start a loop with i = 0 and check whether the current i is already in use by another connection for that common_name. If it is in use, go to the next i; if not, assign the current i to this connection and break the loop (see the sketch below).

This method ensures the numbers remain unique for each connection's entire lifespan, and after a connection ends its number becomes usable again for the next connection.
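
(A rough sketch of that pool in Go - illustrative names, not exporter code:)

package main

import "fmt"

// inUse tracks which slot numbers are taken per common name.
var inUse = map[string]map[int]bool{}

// acquire returns the smallest free slot for the common name.
func acquire(commonName string) int {
	if inUse[commonName] == nil {
		inUse[commonName] = map[int]bool{}
	}
	i := 0
	for inUse[commonName][i] { // walk up until a free slot is found
		i++
	}
	inUse[commonName][i] = true
	return i
}

// release frees the slot on disconnect so the next connection can reuse it.
func release(commonName string, slot int) {
	delete(inUse[commonName], slot)
}

func main() {
	a := acquire("alice") // 0
	b := acquire("alice") // 1
	release("alice", a)
	fmt.Println(b, acquire("alice")) // 1 0 - slot 0 is reused
}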

However, I am inclined to go down the simple route of using the port for now, with your suggested opt-in flag.

I'm just brainstorming here - but of course, do whatever you think fits this situation best :)

(And by the way, thanks for your fast responses and action - that's rare to see on GitHub nowadays.)

@AvaxVPN-Team linked a pull request (#18) on Apr 13, 2021 that may close this issue

@matejzero

I like the number pool idea better, as it gives way lower cardinality than using the ports, where each user can theoretically have 65k metrics. With a number pool the theoretical upper limit is the same as with ports, but in reality it probably won't be higher than 5-10?
