Bug: Race Condition in WorkflowSweeper Leading to Inconsistent Workflow States #213

Open
rq-dbrady opened this issue Jul 18, 2024 · 1 comment

rq-dbrady commented Jul 18, 2024

Describe the bug
We have identified a race condition in the WorkflowSweeper class, which causes workflows to be in inconsistent states across different threads. This issue is critical as it affects the reliability and correctness of workflow execution and completion checks.

Details
Conductor version: 3.17
Persistence implementation: Postgres, OpenSearch
Queue implementation: RedisCluster
Lock: Redis

Steps to Reproduce:
Deploy the application with at least 30 replicas in a Kubernetes environment.
Use a high sweep frequency (a sweep interval of about 25 ms) and a high sweeper thread count.
Use a Redis cluster with Redis lock for workflow execution.
Execute workflows at a rate of approximately 75-90 workflows per second.
Monitor workflow states and watch for inconsistencies.

Observed Behavior
Workflows are fetched from executionDaoFacade before acquiring a lock.
The verifyAndRepair method mutates the workflow state without proper synchronization.
The workflow lock is released before the workflow is removed from the queue.
These conditions create a window of roughly 50 to 100 µs in which a workflow can be in two different states on different threads at the same time.
As a result, workflow listeners or completion checks may fail, and workflows can remain erroneously marked as "Running" even after their completion has been triggered.
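
To make the first and last points concrete, here is a minimal sketch of one interleaving that would be consistent with the behavior described above. It is not taken from the Conductor source: the `store` map, the `Status` enum, and the class name are hypothetical stand-ins, and the two "threads" are replayed sequentially so the outcome is deterministic. It also assumes the sweeper persists the copy it fetched before locking.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class StaleSweepDemo {
    enum Status { RUNNING, COMPLETED }

    // Hypothetical stand-in for the persistence layer: workflowId -> status.
    static final Map<String, Status> store = new ConcurrentHashMap<>();

    /** Replays the suspected interleaving step by step and returns the stored status. */
    static Status replayInterleaving() {
        store.put("wf-1", Status.RUNNING);

        // Sweeper thread: fetches the workflow BEFORE acquiring the lock.
        Status sweeperSnapshot = store.get("wf-1");

        // Decider thread: completes the workflow inside the unprotected window.
        store.put("wf-1", Status.COMPLETED);

        // Sweeper thread: acquires the lock only now, then persists its stale copy
        // (assumption: the repair step writes the fetched state back).
        store.put("wf-1", sweeperSnapshot);

        return store.get("wf-1");
    }

    public static void main(String[] args) {
        // The completed workflow is reported as RUNNING again.
        System.out.println(replayInterleaving()); // prints "RUNNING"
    }
}
```

In this sketch the completion is silently overwritten, which would match the symptom of workflows staying "Running" after they have finished.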

Expected Behavior
Workflows should maintain consistent states across all threads.
Proper locking should be enforced to prevent state mutations without synchronization.
Workflow locks should only be released after the workflow is securely removed from the queue.
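
The ordering we would expect can be sketched as follows. This is an illustration only, under stated assumptions: `store`, `sweepQueue`, and `workflowLock` are hypothetical in-memory stand-ins for the persistence layer, the sweeper queue, and the Redis lock, and a single `ReentrantLock` is used for brevity where a real deployment would hold one distributed lock per workflow.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.ReentrantLock;

public class LockedSweepSketch {
    // Hypothetical stand-ins; none of these names come from Conductor.
    static final Map<String, String> store = new HashMap<>();      // workflowId -> status
    static final Set<String> sweepQueue = new HashSet<>();         // ids awaiting a sweep
    static final ReentrantLock workflowLock = new ReentrantLock(); // stand-in for the Redis lock

    /** Sweeps one workflow; returns true if it was removed from the queue. */
    static boolean sweep(String workflowId) {
        if (!workflowLock.tryLock()) {
            return false; // another thread owns the workflow; retry on a later sweep
        }
        try {
            // 1. Fetch only after the lock is held, so the copy cannot be stale.
            String status = store.get(workflowId);
            // 2. Any verify/repair logic would run here, still under the lock.
            // 3. Remove from the queue BEFORE releasing the lock.
            if ("COMPLETED".equals(status)) {
                return sweepQueue.remove(workflowId);
            }
            return false;
        } finally {
            // 4. Release the lock last.
            workflowLock.unlock();
        }
    }

    public static void main(String[] args) {
        store.put("wf-1", "COMPLETED");
        sweepQueue.add("wf-1");
        System.out.println(sweep("wf-1")); // prints "true"
    }
}
```

Keeping the fetch, the repair, and the queue removal inside one critical section means no other thread can observe or write the workflow between those steps.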

Screenshots
[Image: Screenshot 2024-07-18 at 10 58 28]

v1r3n commented Jul 26, 2024

Hi @rq-dbrady, we are investigating.
