[BUG] Test case test_node_eviction_multiple_volume failed to reschedule replicas after volume detached #9857

Open
yangchiu opened this issue Nov 26, 2024 · 6 comments
Assignees
Labels
area/volume-replica-scheduling (Volume replica scheduling related), backport/1.6.4, backport/1.7.3, kind/bug, kind/regression (Regression which has worked before), priority/0 (Must be implemented or fixed in this release (managed by PO)), reproduce/always (100% reproducible), severity/1 (Function broken (a critical incident with very high impact, e.g. data corruption or failed upgrade))
Milestone

Comments

@yangchiu (Member) commented Nov 26, 2024

Describe the bug

Test case test_node_eviction_multiple_volume failed to reschedule replicas after volume detached:

https://ci.longhorn.io/job/public/job/master/job/sles/job/amd64/job/longhorn-tests-sles-amd64/1104/testReport/junit/tests/test_node/test_node_eviction_multiple_volume/

To Reproduce

  1. Disable scheduling on node 1.
  2. Create a PV, PVC, and pod with volume 1, which has 2 replicas.
  3. Set 'Eviction Requested' to 'true' and disable scheduling on node 2.
  4. Set 'Eviction Requested' to 'false' and enable scheduling on node 1.
  5. Check that the volume is 'healthy' and wait for replicas to run on nodes 1 and 3.
  6. Delete the pods to detach volume 1.
  7. Set 'Eviction Requested' to 'false' and enable scheduling on node 2.
  8. Set 'Eviction Requested' to 'true' and disable scheduling on node 1.
  9. Wait for replicas to run on nodes 2 and 3.

In v1.7.2, the detached volume automatically re-attaches in step 9 so that the replica on node 1 can be rescheduled to node 2.

But in master-head, the re-attachment and rescheduling never happen.
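
For readers unfamiliar with the test, here is a rough sketch of steps 1-9 written against the Longhorn Python client used by longhorn-tests. The call patterns (client.update(node, allowScheduling=..., evictionRequested=...) and client.create_volume(...)) follow that repo's conventions, but the function name, the volume size, and the omitted PV/PVC/pod and wait helpers are illustrative assumptions rather than the failing test's actual code.

```python
# Rough, illustrative sketch of the reproduce steps, assuming the Longhorn Python
# client conventions used by longhorn-tests. The wait/assert helpers and the
# PV/PVC/pod plumbing are only hinted at in comments; this is not the actual test.

def reproduce_eviction_reschedule(client, node1, node2, node3, volume_name="vol-1"):
    # 1. Disable scheduling on node 1.
    node1 = client.update(node1, allowScheduling=False)

    # 2. Create volume 1 with 2 replicas; PV/PVC/pod creation and attachment
    #    are omitted here for brevity.
    volume = client.create_volume(name=volume_name, size="1Gi", numberOfReplicas=2)

    # 3. Request eviction and disable scheduling on node 2.
    node2 = client.update(node2, allowScheduling=False, evictionRequested=True)

    # 4. Cancel eviction and re-enable scheduling on node 1.
    node1 = client.update(node1, allowScheduling=True, evictionRequested=False)

    # 5. Wait for the volume to become healthy with replicas on nodes 1 and 3
    #    (e.g. a wait_for_volume_healthy-style helper).

    # 6. Delete the pods so volume 1 detaches.

    # 7. Cancel eviction and re-enable scheduling on node 2.
    node2 = client.update(node2, allowScheduling=True, evictionRequested=False)

    # 8. Request eviction and disable scheduling on node 1.
    node1 = client.update(node1, allowScheduling=False, evictionRequested=True)

    # 9. Expected: the detached volume auto-reattaches so the replica on node 1 is
    #    rescheduled to node 2, ending with running replicas on nodes 2 and 3.
    return volume
```

The regression shows up at step 9: in master-head the volume stays detached, so the replica on the evicted node 1 is never rescheduled.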

Expected behavior

In step 9, the detached volume should automatically re-attach so that the replica on node 1 is rescheduled to node 2, leaving healthy replicas running on nodes 2 and 3, as in v1.7.2.

Support bundle for troubleshooting

Environment

  • Longhorn version: master-head
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.31.1+k3s1
    • Number of control plane nodes in the cluster:
    • Number of worker nodes in the cluster:
  • Node config
    • OS type and version: sles 15-sp6
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context

Workaround and Mitigation

@yangchiu added the kind/bug, severity/1, reproduce/always, priority/0, kind/regression, and area/volume-replica-scheduling labels on Nov 26, 2024
@yangchiu added this to the v1.8.0 milestone on Nov 26, 2024
The github-project-automation bot moved this to New Issues in Longhorn Sprint on Nov 26, 2024
@derekbit (Member) commented

@mantissahz Please help investigate the issue. Thank you.

@yangchiu (Member, Author) commented Nov 26, 2024

Could this be related to #9781?

@c3y1huang (Contributor) commented Nov 27, 2024

> Could this be related to #9781?

Yes, it seems to be a regression caused by it. I will handle this in #9781.

cc @derekbit @mantissahz

@longhorn-io-github-bot commented Nov 27, 2024

Pre Ready-For-Testing Checklist

  • Where are the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:

    • Issue description
  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
    The PR for the YAML change is at:
    The PR for the chart change is at:

  • Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore, etc.) (including backport-needed/*)?
    The PR is at:

  • Which areas/issues this PR might have potential impacts on?
    Area replica scheduling, node eviction
    Issues

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR been submitted?
    The LEP PR is at

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
    The UI issue/PR is at

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at
    The automation test case PR is at
    The issue of automation test case implementation is at (please create by the template)

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
    The engine automation PR is at

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at

@innobead (Member) commented

> Could this be related to #9781?
>
> Yes, it seems to be a regression caused by it. I will handle this in #9781.
>
> cc @derekbit @mantissahz

So this is not a regression in the existing versions, but is caused by the recent fix for #9781?

@c3y1huang (Contributor) commented

> So this is not a regression in the existing versions, but is caused by the recent fix for #9781?

Yes, this is caused by a recently merged PR: longhorn/longhorn-manager#3270.
