Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: deadlock in table lock #16632

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open

Conversation

zhyass
Copy link
Member

@zhyass zhyass commented Oct 17, 2024

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

Problem Overview

This PR addresses a deadlock issue that occurs when max_running_queries is set to 1 in Databend, resulting in errors like "table locked by other session." The root cause of the deadlock is the inconsistent timing of acquiring table locks during query execution, which leads to conflicts in acquiring the necessary locks.

Root Cause

The deadlock happens because queries acquire table locks at different stages of query processing:

  • Some queries acquire the lock during the binder phase.
  • Others acquire it during the interpreter phase.

This inconsistency creates a situation where:

  • A query already in the execution queue (and actively running) may not be able to acquire a lock.
  • Meanwhile, a query still waiting in the queue may hold the required table lock, blocking the active query and causing a deadlock.

Conditions That Trigger the Issue

  • The issue is most likely to occur when the max_running_queries setting is set to 1.
  • Under these conditions, only one query is allowed to run at a time, exacerbating lock contention between queued and running queries.

Solution

To resolve the deadlock, this PR introduces a new lock acquisition strategy for queries that need require a table lock(merge into, recluster, update, delete, compact, truncate, replace, ..):

  • Queueing After SQL Parsing: For queries that require a table lock, they are queued immediately after the SQL statement is parsed, rather than waiting after binder.
  • Lock Acquisition After Queuing: The table lock is only acquired once the query exits the queue and begins execution. This ensures that the query can safely acquire the lock when needed, without blocking other queries.

This change ensures a more consistent and deadlock-free process of table lock acquisition across different types of queries.

Impact of the Fix

  • The solution avoids the race condition between queued queries and running queries by ensuring that the lock acquisition only happens after queuing, preventing deadlocks and improving overall query execution stability in scenarios with limited concurrency.

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

This change is Reviewable

@zhyass zhyass changed the title fixsplit parse and plan fix: table locked if max_running_queries = 1 Oct 17, 2024
@zhyass zhyass marked this pull request as draft October 17, 2024 07:41
@github-actions github-actions bot added the pr-bugfix this PR patches a bug in codebase label Oct 17, 2024
@zhang2014

This comment was marked as outdated.

@zhyass zhyass force-pushed the feat_stream branch 3 times, most recently from d174e15 to 294e8a3 Compare October 19, 2024 15:50
@zhyass zhyass added the ci-benchmark Benchmark: run all test label Oct 19, 2024
@zhyass zhyass added ci-benchmark Benchmark: run all test and removed ci-benchmark Benchmark: run all test labels Oct 20, 2024

This comment was marked as outdated.

This comment was marked as outdated.

@databendlabs databendlabs deleted a comment from github-actions bot Oct 20, 2024
@databendlabs databendlabs deleted a comment from github-actions bot Oct 20, 2024
@zhyass zhyass changed the title fix: table locked if max_running_queries = 1 fix: deadlock in table lock Oct 20, 2024
@zhyass zhyass marked this pull request as ready for review October 20, 2024 17:07
@zhyass zhyass marked this pull request as draft October 20, 2024 17:08
@zhyass zhyass added ci-benchmark Benchmark: run all test and removed ci-benchmark Benchmark: run all test labels Oct 20, 2024
Copy link
Contributor

Docker Image for PR

  • tag: pr-16632-f36fe72-1729447292

note: this image tag is only available for internal use,
please check the internal doc for more details.

@zhyass zhyass marked this pull request as ready for review October 21, 2024 04:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci-benchmark Benchmark: run all test pr-bugfix this PR patches a bug in codebase
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants