Describe the bug
We have identified a race condition in the WorkflowSweeper class, which causes workflows to be in inconsistent states across different threads. This issue is critical as it affects the reliability and correctness of workflow execution and completion checks.
Details
Conductor version: 3.17
Persistence implementation: Postgres, OpenSearch
Queue implementation: RedisCluster
Lock: Redis
Steps to Reproduce:
Deploy the application with at least 30 replicas in a Kubernetes environment.
Use a high sweeper rate of about 25ms and a high thread count.
Use a Redis cluster with Redis lock for workflow execution.
Execute workflows at a rate of approximately 75-90 workflows per second.
Monitor the state of workflows and observe for inconsistencies.
Observed Behavior
Workflows are fetched from executionDaoFacade before acquiring a lock.
The verifyAndRepair method mutates the workflow state without proper synchronization.
The workflow lock is released before the workflow is removed from the queue.
Together, these conditions open a window of roughly 50 to 100 µs (microseconds) in which two threads can hold different states for the same workflow.
As a result, workflow listeners or completion checks may fail, and a workflow can remain erroneously marked as "Running" even after its completion has been triggered.
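To make the stale-read scenario above concrete, here is a minimal sketch that replays one possible interleaving on a single thread so that it is deterministic. It is a simplified model, not Conductor's actual code: `StaleReadDemo`, `Status`, `store`, and `replay` are hypothetical names, and the in-memory map merely stands in for the persistence layer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the race; not Conductor's real classes.
class StaleReadDemo {
    enum Status { RUNNING, COMPLETED }

    // Stands in for the persistence layer: workflowId -> status.
    static final Map<String, Status> store = new HashMap<>();

    // Replays the problematic interleaving step by step and returns
    // the status that ends up persisted.
    static Status replay() {
        store.put("wf-1", Status.RUNNING);

        // Sweeper thread: reads the workflow BEFORE taking the lock.
        Status sweeperSnapshot = store.get("wf-1");

        // Completion thread: takes the lock, finishes the workflow, releases the lock.
        store.put("wf-1", Status.COMPLETED);

        // Sweeper thread: now takes the lock and persists its stale copy.
        store.put("wf-1", sweeperSnapshot);

        return store.get("wf-1");
    }

    public static void main(String[] args) {
        System.out.println(replay()); // prints RUNNING
    }
}
```

The final write overwrites `COMPLETED` with the snapshot taken before the lock was held, which matches the symptom of a finished workflow still showing as "Running".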
Expected Behavior
Workflows should maintain consistent states across all threads.
Proper locking should be enforced to prevent state mutations without synchronization.
Workflow locks should only be released after the workflow is securely removed from the queue.
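The expected ordering could look roughly like the sketch below: acquire the lock, fetch the workflow while holding it, remove it from the queue, and only then release the lock. This is an illustration under assumptions, not a proposed patch: `SafeSweepSketch`, `sweep`, `store`, and `sweepQueue` are hypothetical names, and an in-process `ReentrantLock` stands in for the distributed Redis lock.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the expected ordering; not Conductor's real classes.
class SafeSweepSketch {
    enum Status { RUNNING, COMPLETED }

    // In-memory stand-ins for the persistence layer and the sweeper queue.
    final Map<String, Status> store = new ConcurrentHashMap<>();
    final Queue<String> sweepQueue = new ConcurrentLinkedQueue<>();
    // One in-process lock per workflow, standing in for the Redis lock.
    final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    // Returns false if another thread holds the lock; the workflow stays queued.
    boolean sweep(String workflowId) {
        ReentrantLock lock = locks.computeIfAbsent(workflowId, id -> new ReentrantLock());
        if (!lock.tryLock()) {
            return false;
        }
        try {
            // Fetch only AFTER the lock is held, so the copy cannot be stale.
            Status current = store.get(workflowId);
            if (current == Status.COMPLETED) {
                // Remove from the queue while still holding the lock.
                sweepQueue.remove(workflowId);
            }
            return true;
        } finally {
            // Release last, after the queue has been updated.
            lock.unlock();
        }
    }
}
```

Because the fetch and the queue removal both happen inside the locked region, no other thread that respects the same lock can observe the workflow between those steps.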