

Implementing streaming mode for kafka plugin #164

Open
abdulkk49 opened this issue Mar 1, 2024 · 1 comment
Labels
enhancement New feature or request good first issue Good for newcomers

Comments


abdulkk49 commented Mar 1, 2024

Recently we have been hitting rate limits with the Kafka plugin, which calls the Kafka REST v3 endpoint to send alerts.
We have a Druid datasource with a number of alert rules defined. Grafana periodically evaluates these rules against data fetched from Druid and sends an event, with all the necessary metadata, to one of our Kafka topics; the event is then processed further to produce an alert.
Grafana sends a unique event to Kafka for every (cluster, rule) pair we have, which can produce a large burst of events on our topic. Hence we are hitting rate limit errors on our Kafka topic.
I took a look at the Grafana codebase, and here are my observations:
Grafana sends the event to the Kafka topic using the Kafka REST V3/V2 API as defined here (notifyWithAPIV3). We use V3 in our case.
That call eventually reaches this file and ends up in sendWebRequestSync, which makes an HTTP POST request.

The following client is used, along with the defined transport:

var netTransport = &http.Transport{
	TLSClientConfig: &tls.Config{
		Renegotiation: tls.RenegotiateFreelyAsClient,
	},
	Proxy: http.ProxyFromEnvironment,
	Dial: (&net.Dialer{
		Timeout: 30 * time.Second,
	}).Dial,
	TLSHandshakeTimeout: 5 * time.Second,
}
var netClient WebhookClient = &http.Client{
	Timeout:   time.Second * 30,
	Transport: netTransport,
}

The http.Client internally maintains a pool of persistent TCP connections per host to make requests more efficient; the pool can be controlled through transport parameters.
In Grafana's case, the transport does not set two of these parameters: MaxIdleConnsPerHost and MaxConnsPerHost.
Hence the default values are used: MaxIdleConnsPerHost = 2 and MaxConnsPerHost = 0, i.e. no limit on connections per host.
From the documentation:

// MaxIdleConns controls the maximum number of idle (keep-alive)
// connections across all hosts. Zero means no limit.
MaxIdleConns int

// MaxIdleConnsPerHost, if non-zero, controls the maximum idle
// (keep-alive) connections to keep per-host. If zero,
// DefaultMaxIdleConnsPerHost is used.
MaxIdleConnsPerHost int

When Grafana receives a large number of events that it needs to send to our Kafka topic, it can open as many connections to the Kafka REST host as it needs to serve the requests, since MaxConnsPerHost is unlimited.
But because at most two connections are kept in the idle pool, most of those connections are created once and never reused.
For example, with 15 concurrent requests the client opens 15 connections to the host, and 13 of them are closed soon after their requests complete, since at most two can be kept idle for reuse.
Our Kafka endpoint has a rate limit of 25 connections per second, so the limit gets breached under a high volume of events. It is also recommended to close the connection after every request, but that may not be feasible.

Tentative solutions:
Using the Kafka V3 streaming mode: the V3 API supports streaming, which allows up to 1000 requests per second, so we could modify the code to send multiple request bodies over a single HTTP connection at the application level. I have implemented a basic form of streaming as a POC, but even this is not entirely correct: I am able to send requests over a single connection (streaming), however, I am not able to read responses in streaming mode. The entire response is read all at once. So if I send 200 requests over the connection, I always have to wait until all 200 are sent, and only then am I able to read the responses, all 200 together.

Looking at the Grafana code, integrating this with the current implementation would need a big overhaul. As of now, each thread receives a single alert request and sends one POST request.

Tuning the pool limits: we would also need to set MaxIdleConnsPerHost and MaxConnsPerHost to appropriate per-host limits that take the rate limits into consideration.

Looking for inputs on how to go about this. Thank you for your valuable time!
@KaplanSarah

We will accept PRs for these features but this is not planned work for the team.

This is two separate feature requests. Please move one of them into a separate issue.

@KaplanSarah KaplanSarah moved this to Waiting for input in Alerting Mar 7, 2024
@JacobsonMT JacobsonMT added enhancement New feature or request good first issue Good for newcomers labels Jun 27, 2024
@JacobsonMT JacobsonMT moved this from Waiting for input to Backlog in Alerting Jun 27, 2024
Projects
Status: Backlog
Development

No branches or pull requests

3 participants