Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ksqldb increase CPU and Memory overtime and Vetx Thread blocked. Cause Application failover #10440

Open
albinabel365 opened this issue Sep 21, 2024 · 0 comments

Comments

@albinabel365
Copy link

Describe the bug
We have deployed two nodes of Ksql version 0.29 in prod with number of standby replica=2; heart beat enabled and lag reporting enabled. We observe the memory increase over time and CPU usage as well. After a certain point get the error as follows
2024-09-20 13:21:18.805,"[2024-09-20 13:21:18,805] INFO Checking ksql's ability to contact broker (io.confluent.ksql.rest.healthcheck.HealthCheckAgent:146)"
2024-09-20 13:21:18.805,"[2024-09-20 13:21:18,802] WARN Thread Thread[vert.x-eventloop-thread-6,5,main] has been blocked for 7299 ms, time limit is 2000 ms (io.vertx.core.impl.BlockedThreadChecker:?)"

Then the application reaches 100% CPU and failover.
We also observe this on stage as well where there are minimum activity except the heartbeat and lag reporting call.

Searching over internet we found this issue #9869 . It says 0.29 the situation is improved but we get one failover every day

To Reproduce
Steps to reproduce the behavior, include:

  1. The version of KSQL. 0.29
    Run Ksql for day. It is happens.
    Our Ksql configuration
    {
    "name": "KSQL_KSQL_STREAMS_NUM_STANDBY_REPLICAS",
    "value": "2"
    },
    {
    "name": "KSQL_KSQL_LOGGING_PROCESSING_TOPIC_AUTO_CREATE",
    "value": "true"
    },
    {
    "name": "KSQL_KSQL_LOGGING_PROCESSING_STREAM_AUTO_CREATE",
    "value": "true"
    },
    {
    "name": "KSQL_SECURITY_PROTOCOL",
    "value": "SASL_SSL"
    },
    {
    "name": "KSQL_SASL_MECHANISM",
    "value": "SCRAM-SHA-512"
    },
    {
    "name": "KSQL_KSQL_LAG_REPORTING_ENABLE",
    "value": "true"
    },
    {
    "name": "KSQL_LISTENERS",
    "value": "http://0.0.0.0:8088"
    },
    {
    "name": "KSQL_KSQL_HEARTBEAT_ENABLE",
    "value": "true"
    },
    {
    "name": "KSQL_KSQL_SERVICE_ID",
    "value": "ksqldb-server-shliq-prod"
    },
    {
    "name": "KSQL_KSQL_QUERY_PULL_ENABLE_STANDBY_READS",
    "value": "true"
    }

Expected behavior
CPU and Memory usage should be minimum when there is not events.

Actual behaviour
Memmory and CPU increase over time
image
image

Stack trace
2024-09-20 13:21:18.805,"[2024-09-20 13:21:18,802] WARN Thread Thread[vert.x-eventloop-thread-6,5,main] has been blocked for 7299 ms, time limit is 2000 ms (io.vertx.core.impl.BlockedThreadChecker:?)"
2024-09-20 13:21:18.805,io.vertx.core.VertxException: Thread blocked
2024-09-20 13:21:18.805, at [email protected]/jdk.internal.misc.Unsafe.park(Native Method)
2024-09-20 13:21:18.805, at [email protected]/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
2024-09-20 13:21:18.805, at app//com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:537)
2024-09-20 13:21:18.805, at app//com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:104)
2024-09-20 13:21:18.805, at app//com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:240)
2024-09-20 13:21:18.805, at app//com.google.common.cache.LocalCache$LoadingValueReference.waitForValue(LocalCache.java:3582)
2024-09-20 13:21:18.805, at app//com.google.common.cache.LocalCache$Segment.waitForLoadingValue(LocalCache.java:2175)
2024-09-20 13:21:18.805, at app//com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2162)
2024-09-20 13:21:18.805, at app//com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2045)
2024-09-20 13:21:18.805, at app//com.google.common.cache.LocalCache.get(LocalCache.java:3962)
2024-09-20 13:21:18.805, at app//com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3985)
2024-09-20 13:21:18.805, at app//com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4946)
2024-09-20 13:21:18.805, at app//com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4952)
2024-09-20 13:21:18.805, at app//io.confluent.ksql.rest.server.resources.HealthCheckResource.getResponse(HealthCheckResource.java:60)
2024-09-20 13:21:18.805, at app//io.confluent.ksql.rest.server.resources.HealthCheckResource.checkHealth(HealthCheckResource.java:55)
2024-09-20 13:21:18.805, at app//io.confluent.ksql.rest.server.KsqlServerEndpoints.lambda$executeCheckHealth$18(KsqlServerEndpoints.java:283)
2024-09-20 13:21:18.805, at app//io.confluent.ksql.rest.server.KsqlServerEndpoints$$Lambda$1772/0x00000008009a4040.apply(Unknown Source)
2024-09-20 13:21:18.805, at app//io.confluent.ksql.rest.server.KsqlServerEndpoints.executeOldApiEndpoint(KsqlServerEndpoints.java:362)
2024-09-20 13:21:18.805, at app//io.confluent.ksql.rest.server.KsqlServerEndpoints.executeCheckHealth(KsqlServerEndpoints.java:282)
2024-09-20 13:21:18.805, at app//io.confluent.ksql.api.server.ServerVerticle.lambda$handleHealthcheckRequest$14(ServerVerticle.java:345)
2024-09-20 13:21:18.805, at app//io.confluent.ksql.api.server.ServerVerticle$$Lambda$1771/0x00000008009a3c40.apply(Unknown Source)
2024-09-20 13:21:18.805, at app//io.confluent.ksql.api.server.OldApiUtils.handleOldApiRequest(OldApiUtils.java:78)
2024-09-20 13:21:18.805, at app//io.confluent.ksql.api.server.ServerVerticle.handleHealthcheckRequest(ServerVerticle.java:343)
2024-09-20 13:21:18.805, at app//io.confluent.ksql.api.server.ServerVerticle$$Lambda$846/0x0000000800658840.handle(Unknown Source)
2024-09-20 13:21:18.805, at app//io.vertx.ext.web.impl.RouteState.handleContext(RouteState.java:1284)
2024-09-20 13:21:18.805, at app//io.vertx.ext.web.impl.RoutingContextImplBase.iterateNext(RoutingContextImplBase.java:177)
2024-09-20 13:21:18.805, at app//io.vertx.ext.web.impl.RoutingContextImpl.next(RoutingContextImpl.java:141)
2024-09-20 13:21:18.805, at app//io.confluent.ksql.api.server.ServerStateHandler.handle(ServerStateHandler.java:46)
2024-09-20 13:21:18.805, at app//io.confluent.ksql.api.server.ServerStateHandler.handle(ServerStateHandler.java:28)
2024-09-20 13:21:18.805, at app//io.vertx.ext.web.impl.RouteState.handleContext(RouteState.java:1284)
2024-09-20 13:21:18.805, at app//io.vertx.ext.web.impl.RoutingContextImplBase.iterateNext(RoutingContextImplBase.java:177)
2024-09-20 13:21:18.805, at app//io.vertx.ext.web.impl.RoutingContextImpl.next(RoutingContextImpl.java:141)
2024-09-20 13:21:18.805, at app//io.confluent.ksql.api.server.LoggingHandler.handle(LoggingHandler.java:135)
2024-09-20 13:21:18.805, at app//io.confluent.ksql.api.server.LoggingHandler.handle(LoggingHandler.java:35)
2024-09-20 13:21:18.805, at app//io.vertx.ext.web.impl.RouteState.handleContext(RouteState.java:1284)
2024-09-20 13:21:18.805, at app//io.vertx.ext.web.impl.RoutingContextImplBase.iterateNext(RoutingContextImplBase.java:177)
2024-09-20 13:21:18.805, at app//io.vertx.ext.web.impl.RoutingContextImpl.next(RoutingContextImpl.java:141)
2024-09-20 13:21:18.805, at app//io.vertx.ext.web.impl.RouterImpl.handle(RouterImpl.java:68)
2024-09-20 13:21:18.805, at app//io.vertx.ext.web.impl.RouterImpl.handle(RouterImpl.java:37)
2024-09-20 13:21:18.805, at app//io.vertx.core.http.impl.Http1xServerRequestHandler.handle(Http1xServerRequestHandler.java:67)
2024-09-20 13:21:18.805, at app//io.vertx.core.http.impl.Http1xServerRequestHandler.handle(Http1xServerRequestHandler.java:30)
2024-09-20 13:21:18.805, at app//io.vertx.core.impl.EventLoopContext.emit(EventLoopContext.java:55)
2024-09-20 13:21:18.805, at app//io.vertx.core.impl.DuplicatedContext.emit(DuplicatedContext.java:158)
2024-09-20 13:21:18.805, at app//io.vertx.core.http.impl.Http1xServerConnection.handleMessage(Http1xServerConnection.java:145)
2024-09-20 13:21:18.805, at app//io.vertx.core.net.impl.ConnectionBase.read(ConnectionBase.java:157)
2024-09-20 13:21:18.805, at app//io.vertx.core.net.impl.VertxHandler.channelRead(VertxHandler.java:153)
2024-09-20 13:21:18.805, at app//io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
2024-09-20 13:21:18.805, at app//io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
2024-09-20 13:21:18.805, at app//io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
2024-09-20 13:21:18.805, at app//io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:93)
2024-09-20 13:21:18.805, at app//io.netty.handler.codec.http.websocketx.extensions.WebSocketServerExtensionHandler.channelRead(WebSocketServerExtensionHandler.java:99)
2024-09-20 13:21:18.805, at app//io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
2024-09-20 13:21:18.805, at app//io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
2024-09-20 13:21:18.805, at app//io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
2024-09-20 13:21:18.805, at app//io.vertx.core.http.impl.Http1xUpgradeToH2CHandler.channelRead(Http1xUpgradeToH2CHandler.java:120)
2024-09-20 13:21:18.805, at app//io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
2024-09-20 13:21:18.805, at app//io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
2024-09-20 13:21:18.805, at app//io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
2024-09-20 13:21:18.805, at app//io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
2024-09-20 13:21:18.805, at app//io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
2024-09-20 13:21:18.805, at app//io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
2024-09-20 13:21:18.805, at app//io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
2024-09-20 13:21:18.805, at app//io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346)
2024-09-20 13:21:18.805, at app//io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318)
2024-09-20 13:21:18.805, at app//io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
2024-09-20 13:21:18.805, at app//io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
2024-09-20 13:21:18.805, at app//io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
2024-09-20 13:21:18.805, at app//io.vertx.core.http.impl.Http1xOrH2CHandler.end(Http1xOrH2CHandler.java:61)
2024-09-20 13:21:18.805, at app//io.vertx.core.http.impl.Http1xOrH2CHandler.channelRead(Http1xOrH2CHandler.java:38)
2024-09-20 13:21:18.805, at app//io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
2024-09-20 13:21:18.805, at app//io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
2024-09-20 13:21:18.805, at app//io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)

Additional context
Add any other context about the problem here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant