Recently, we have been facing rate-limit issues with the Kafka plugin, which hits the Kafka REST v3 endpoint to send alerts.
We have a Druid datasource with some alert rules defined. Periodically, Grafana evaluates the rules against data fetched from Druid and sends an event, with all the necessary metadata, to one of our Kafka topics; that event is processed further downstream to send an alert.
For every (cluster, rule) pair we have, Grafana sends a unique event to Kafka. This can produce a large influx of events on our Kafka topic, which is why we are hitting rate-limit errors.
I took a look at the Grafana codebase; here are my observations:
Grafana sends the event to the Kafka topic using the Kafka REST v3/v2 API, as defined here (notifyWithAPIV3). We use v3 in our case.
This call eventually reaches this file and finally invokes sendWebRequestSync, which essentially makes an HTTP POST request.
The following client is used, along with the defined transport:
The http.Client internally maintains a pool of persistent TCP connections per host to improve efficiency of requests, which can be controlled using some transport parameters.
In Grafana's case, the transport does not set two of these parameters: MaxIdleConnsPerHost and MaxConnsPerHost.
Hence the defaults apply: MaxIdleConnsPerHost = 2 and MaxConnsPerHost = 0 (no limit).
From the documentation:

```go
// MaxIdleConns controls the maximum number of idle (keep-alive)
// connections across all hosts. Zero means no limit.
MaxIdleConns int

// MaxIdleConnsPerHost, if non-zero, controls the maximum idle
// (keep-alive) connections to keep per-host. If zero,
// DefaultMaxIdleConnsPerHost is used.
MaxIdleConnsPerHost int
```
When Grafana receives a large burst of events to send to our Kafka topic, the client dials as many connections to the Kafka REST host as it needs to serve the in-flight requests, since MaxConnsPerHost = 0 means no limit.
And since it can keep at most two connections in its idle pool, the extra connections are created once and never reused.
For example, with 15 concurrent requests the client opens 15 connections to the host, and 13 of them are closed soon after their requests complete, because only 2 idle connections per host are retained.
Our Kafka side has a rate limit of 25 connections per second, so the limit gets breached under a high volume of events. It is also recommended to close the connection after every request, but that may not be feasible here.
Tentative solutions:

Using the Kafka REST v3 streaming mode: the v3 API supports streaming, which allows up to 1000 requests per second. We could modify the code to send multiple request bodies over a single HTTP connection at the application level. I have implemented a basic form of streaming as a POC, but even this is not entirely correct: I am able to send requests over a single connection (streaming), however, I am not able to read the responses in streaming mode. The entire response is read all at once, so if I send 200 requests over the connection, the responses only arrive after all 200 requests have been sent, and I read all 200 together.
Looking at the Grafana code, integrating this with the current implementation would need a big overhaul: as of now, each thread receives a single alert and sends its own POST request.
We would also need to set MaxIdleConnsPerHost and MaxConnsPerHost to appropriate per-host limits that take the rate limits into account.
Looking for inputs on how to go about this. Thank you for your valuable time!