
Refreshing a remove brokers operation on KafkaRebalance resource while one rebalancing is already running can drive to a NotReady state #10571

Open
ppatierno opened this issue Sep 12, 2024 · 3 comments

@ppatierno
Member

Create a Kafka custom resource (for example with 7 brokers) with the cruiseControl field to run Cruise Control within the cluster deployment.
Run a rebalancing by creating a KafkaRebalance custom resource to remove brokers 5 and 6 (with auto-approval enabled), like this:

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: my-rebalance
  labels:
    strimzi.io/cluster: my-cluster
  annotations:
    strimzi.io/rebalance-auto-approval: "true"
# no goals specified, using the default goals from the Cruise Control configuration
spec:
  mode: remove-brokers
  brokers: [5, 6]

Wait for the rebalancing to go from PendingProposal to ProposalReady and then automatically (because auto-approval is enabled) to Rebalancing.
While the rebalancing is running, request a new rebalancing (by applying the "refresh" annotation to the already existing custom resource) that also includes nodes 3 and 4, so that brokers 3, 4, 5 and 6 are all removed, like this:

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: my-rebalance
  labels:
    strimzi.io/cluster: my-cluster
  annotations:
    strimzi.io/rebalance-auto-approval: "true"
    strimzi.io/rebalance: "refresh"
# no goals specified, using the default goals from the Cruise Control configuration
spec:
  mode: remove-brokers
  brokers: [3, 4, 5, 6]

Sometimes (depending on the timing and on how far Cruise Control has progressed with the current rebalancing), the operator logs the following error and the KafkaRebalance moves to the NotReady state:

2024-09-12 13:51:10 INFO  KafkaRebalanceAssemblyOperator:1113 - Reconciliation #59(watch) KafkaRebalance(myproject/my-rebalance): Requesting Cruise Control rebalance [dryrun=true]
2024-09-12 13:51:11 INFO  KafkaRebalanceAssemblyOperator:351 - Reconciliation #59(watch) KafkaRebalance(myproject/my-rebalance): KafkaRebalance state is now updated to [ProposalReady] with annotation strimzi.io/rebalance=refresh applied on the KafkaRebalance resource
2024-09-12 13:51:11 INFO  KafkaRebalanceAssemblyOperator:359 - Reconciliation #59(watch) KafkaRebalance(myproject/my-rebalance): Removing annotation strimzi.io/rebalance=refresh
2024-09-12 13:51:11 INFO  AbstractOperator:520 - Reconciliation #60(watch) KafkaRebalance(myproject/my-rebalance): KafkaRebalance my-rebalance in namespace myproject was MODIFIED
2024-09-12 13:51:11 INFO  AbstractOperator:520 - Reconciliation #61(watch) KafkaRebalance(myproject/my-rebalance): KafkaRebalance my-rebalance in namespace myproject was MODIFIED
2024-09-12 13:51:11 INFO  CrdOperator:123 - Reconciliation #59(watch) KafkaRebalance(myproject/my-rebalance): Status of KafkaRebalance my-rebalance in namespace myproject has been updated
2024-09-12 13:51:11 INFO  AbstractOperator:546 - Reconciliation #59(watch) KafkaRebalance(myproject/my-rebalance): reconciled
2024-09-12 13:51:11 INFO  AbstractOperator:266 - Reconciliation #60(watch) KafkaRebalance(myproject/my-rebalance): KafkaRebalance my-rebalance will be checked for creation or modification
2024-09-12 13:51:11 INFO  KafkaRebalanceAssemblyOperator:317 - Reconciliation #60(watch) KafkaRebalance(myproject/my-rebalance): Rebalance action is performed and KafkaRebalance resource is currently in [ProposalReady] state
2024-09-12 13:51:11 INFO  KafkaRebalanceAssemblyOperator:788 - Reconciliation #60(watch) KafkaRebalance(myproject/my-rebalance): Auto-approval set on the KafkaRebalance resource
2024-09-12 13:51:11 INFO  KafkaRebalanceAssemblyOperator:1113 - Reconciliation #60(watch) KafkaRebalance(myproject/my-rebalance): Requesting Cruise Control rebalance [dryrun=false]
2024-09-12 13:51:11 ERROR KafkaRebalanceAssemblyOperator:378 - Reconciliation #60(watch) KafkaRebalance(myproject/my-rebalance): Status updated to [NotReady] due to error: Error for request: my-cluster-cruise-control.myproject.svc:9090/kafkacruisecontrol/remove_broker?json=true&dryrun=false&verbose=true&skip_hard_goal_check=false&brokerid=3%2C4%2C5%2C6. Server returned: Error processing POST request '/remove_broker' due to: 'java.lang.IllegalStateException: Cannot start a new execution while there is an ongoing execution. Please use stop_ongoing_execution=true to stop ongoing execution and start a new one.'.

It seems that requesting a new rebalancing with a different set of nodes to remove requires stopping the currently running task by adding stop_ongoing_execution=true to the query string of the POST request to the REST API.
Maybe we should add this parameter to any POST operation for rebalancing when our intention is to start a new operation straight away rather than wait for the current one to finish.
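
To make the suggestion concrete, here is a minimal Python sketch of how the query string from the failing request would look with the extra parameter. The helper name remove_broker_url is hypothetical and is not part of the operator (which is written in Java); the other parameters are copied from the log line above.

```python
from urllib.parse import urlencode


def remove_broker_url(host, broker_ids, dryrun, stop_ongoing_execution=False):
    """Build a remove_broker URL shaped like the one in the operator log.

    Hypothetical helper for illustration only; not the operator's code.
    """
    params = {
        "json": "true",
        "dryrun": str(dryrun).lower(),
        "verbose": "true",
        "skip_hard_goal_check": "false",
        "brokerid": ",".join(str(b) for b in broker_ids),
    }
    # Only add the flag when the caller explicitly wants to pre-empt
    # whatever Cruise Control is currently executing.
    if stop_ongoing_execution:
        params["stop_ongoing_execution"] = "true"
    return f"{host}/kafkacruisecontrol/remove_broker?{urlencode(params)}"


url = remove_broker_url(
    "my-cluster-cruise-control.myproject.svc:9090",
    [3, 4, 5, 6],
    dryrun=False,
    stop_ongoing_execution=True,
)
print(url)
```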

@im-konge
Member

Triaged on 19.9.2024: This should be fixed.

@tinaselenge
Contributor

After some investigation, I came to the conclusion that it is not straightforward to solve this issue with the suggested flag because of our current rebalance reconciliation flow. The current flow is:

  • KafkaRebalance is in the Rebalancing state, as the rebalance is in progress
  • User updates the CR with a refresh annotation and with a new set of brokers to remove
  • The operator sends a request to /stop_proposal_execution endpoint to stop the current rebalance operation
  • The request completes successfully, however there are still in-progress batches for the current rebalance operation in CC
  • The operator sends a request for a new proposal for the updated set of brokers to remove
  • The new proposal is ready, therefore the KafkaRebalance is set to ProposalReady state
  • The refresh annotation is removed after this round of the reconciliation
  • In the next reconciliation, KafkaRebalance is in the ProposalReady state, therefore the operator sends a request to execute the removal of the updated set of brokers. When we send this request, we cannot set stop_ongoing_execution conditionally because the annotation was already removed in the previous round of reconciliation.
  • This request fails because there are still in-progress batches from the previous rebalance operation. The KafkaRebalance is set to the NotReady state.
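
The flow above can be replayed with a toy model. This is only a Python sketch under simplifying assumptions: FakeCruiseControl and reconcile are hypothetical stand-ins, not the operator's Java code, and the number of in-flight batches is arbitrary. It shows that by the time the execution request is sent, the refresh annotation is gone, so nothing tells the second reconciliation that the ongoing execution should be pre-empted.

```python
class FakeCruiseControl:
    """Toy stand-in: stopping an execution leaves batches in flight."""

    def __init__(self):
        self.in_flight_batches = 0

    def execute(self, brokers, stop_ongoing_execution=False):
        if self.in_flight_batches > 0 and not stop_ongoing_execution:
            raise RuntimeError(
                "Cannot start a new execution while there is an ongoing execution."
            )
        self.in_flight_batches = 2  # arbitrary value for the sketch

    def stop_proposal_execution(self):
        # The request succeeds, but in-flight batches are not cancelled here.
        pass

    def proposal(self, brokers):
        return {"brokers": list(brokers)}


def reconcile(resource, cc):
    """One simplified reconciliation round for a KafkaRebalance-like dict."""
    annotations = resource["annotations"]
    refresh = annotations.get("strimzi.io/rebalance") == "refresh"
    auto = annotations.get("strimzi.io/rebalance-auto-approval") == "true"
    if resource["state"] == "Rebalancing" and refresh:
        cc.stop_proposal_execution()
        cc.proposal(resource["brokers"])
        resource["state"] = "ProposalReady"
        # The annotation is removed at the end of this round.
        del annotations["strimzi.io/rebalance"]
    elif resource["state"] == "ProposalReady" and auto:
        try:
            # No annotation is left to justify stop_ongoing_execution here.
            cc.execute(resource["brokers"])
            resource["state"] = "Rebalancing"
        except RuntimeError as err:
            resource["state"] = "NotReady"
            resource["message"] = str(err)


cc = FakeCruiseControl()
cc.execute([5, 6])  # the first rebalance is running
resource = {
    "state": "Rebalancing",
    "brokers": [3, 4, 5, 6],
    "annotations": {
        "strimzi.io/rebalance-auto-approval": "true",
        "strimzi.io/rebalance": "refresh",
    },
}
reconcile(resource, cc)  # refresh round
state_after_refresh = resource["state"]
reconcile(resource, cc)  # auto-approval round
print(state_after_refresh, "->", resource["state"])
```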

Setting the stop_ongoing_execution flag to true whenever we request a full (non-dry-run) rebalance would stop all kinds of in-progress executions, including unrelated executions from topic operators. Currently it is also not straightforward to set this flag only when the refresh annotation is applied. The flag can only be set to true when dryrun is set to false (both cannot be true at the same time).
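
The constraint between the two flags can be written as a small guard. This is a Python sketch only; execution_flags is a hypothetical helper that encodes the rule stated above, not an existing operator or Cruise Control function.

```python
def execution_flags(dryrun: bool, stop_ongoing_execution: bool) -> dict:
    """Return query parameters, rejecting the combination described above."""
    if dryrun and stop_ongoing_execution:
        # Stopping an execution only makes sense for a real (non-dry-run) request.
        raise ValueError("stop_ongoing_execution=true requires dryrun=false")
    return {
        "dryrun": str(dryrun).lower(),
        "stop_ongoing_execution": str(stop_ongoing_execution).lower(),
    }


flags = execution_flags(dryrun=False, stop_ongoing_execution=True)
print(flags)
```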

Although it is not simple for the operator to automatically refresh the rebalance in this scenario, the user is notified of the reason for the NotReady status. The error message makes it clear that the current execution needs to complete before a new one is submitted, so the user can wait and then apply the refresh annotation again on the KafkaRebalance CR. This moves the KafkaRebalance state from NotReady to New.

 status:
    conditions:
    - lastTransitionTime: "2024-10-29T09:32:38.728781722Z"
      message: 'Error for request: my-cluster-cruise-control.default.svc:9090/kafkacruisecontrol/remove_broker?json=true&dryrun=false&verbose=true&skip_hard_goal_check=false&brokerid=2%2C4%2C5.
        Server returned: Error processing POST request ''/remove_broker'' due to:
        ''java.lang.IllegalStateException: Cannot start a new execution while there
        is an ongoing execution. Please use stop_ongoing_execution=true to stop ongoing
        execution and start a new one.''.'
      reason: CruiseControlRestException
      status: "True"
      type: NotReady

One improvement we could make is to handle this error and modify the error message slightly. Instead of the "Please use stop_ongoing_execution=true to stop ongoing execution and start a new one." part, it could say something like "Please wait for a few minutes until the ongoing execution is completed and then use the refresh annotation to ask for a new rebalance request again."
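
As a rough illustration of that idea, the following Python sketch swaps the Cruise Control hint for the proposed wording when the known error is detected and passes every other message through unchanged. The function user_facing_message and both constants are hypothetical; the actual change would live in the operator's Java error handling.

```python
CC_ERROR = "Cannot start a new execution while there is an ongoing execution"
CC_HINT = (
    "Please use stop_ongoing_execution=true to stop ongoing execution "
    "and start a new one."
)
USER_HINT = (
    "Please wait for a few minutes until the ongoing execution is completed "
    "and then use the refresh annotation to ask for a new rebalance request again."
)


def user_facing_message(cc_message: str) -> str:
    """Rewrite only the known 'ongoing execution' error; leave others as-is."""
    if CC_ERROR in cc_message:
        return cc_message.replace(CC_HINT, USER_HINT)
    return cc_message


original = (
    "java.lang.IllegalStateException: Cannot start a new execution while there "
    "is an ongoing execution. Please use stop_ongoing_execution=true to stop "
    "ongoing execution and start a new one."
)
rewritten = user_facing_message(original)
print(rewritten)
```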

@ppatierno please let me know what you think.

@ppatierno
Member Author

On the one hand, I would leave the message as it is because that's exactly what we get from Cruise Control, rather than starting to handle specific errors (we don't know how many others we might face in the future) and replacing each message with a more understandable one for the user.
On the other hand, this message doesn't really explain to the user what to do. They could just apply the 'refresh' annotation again, hoping that no execution is running and that it will go through.
I am interested to know what the other maintainers think as well @strimzi/maintainers ?
