cloud-provider-config Secret is not updated on Shoot deletion -> deadlock on Shoot deletion #601

Open
ialidzhikov opened this issue Nov 23, 2022 · 0 comments
Labels
area/control-plane (Control plane related), kind/bug (Bug), lifecycle/rotten (Nobody worked on this for 12 months, final aging stage), platform/azure (Microsoft Azure), platform/infrastructure

How to categorize this issue?

/area control-plane
/kind bug
/platform azure

What happened:
The cloud-provider-config Secret holds the Azure credentials for the cloud-controller-manager. Currently, this Secret is created or updated only during ControlPlane reconciliation, not during ControlPlane deletion:

// Get config chart values
if a.configChart != nil {
    values, err := a.vp.GetConfigChartValues(ctx, cp, cluster)
    if err != nil {
        return false, err
    }
    // Apply config chart
    log.Info("Applying configuration chart")
    if err := a.configChart.Apply(ctx, a.chartApplier, cp.Namespace, nil, "", "", values); err != nil {
        return false, fmt.Errorf("could not apply configuration chart for controlplane '%s': %w", kutil.ObjectName(cp), err)
    }
}

This results in the following deadlock when a hibernated Shoot is deleted:

  1. A hibernated Shoot with invalid credentials is deleted.

  2. As the Shoot is hibernated, the deletion fails to destroy the ControlPlane with the following reason:

    task "Waiting until shoot control plane has been destroyed" failed: Failed to delete ControlPlane shoot--foo--test/test: Error deleting ControlPlane: error while waiting for managed resource containing shoot chart for controlplane 'shoot--foo--test/test' to be deleted: error while waiting for all resources to be deleted: retry failed with context deadline exceeded, last error: resource shoot--foo--test/extension-controlplane-shoot still exists:
    Could not clean all old resources: 2 errors occurred: [deletion of old resource "v1/Service/kube-system/allow-tcp-egress" is still pending, deletion of old resource "v1/Service/kube-system/allow-udp-egress" is still pending]
    

    The cloud-controller-manager (CCM) is in CrashLoopBackOff because of the invalid credentials and therefore cannot delete the allow-tcp-egress and allow-udp-egress Services.

  3. The Shoot owner replaces the credentials with valid ones.

  4. The deletion continues to fail with the error from step 2.

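    The cloud-provider-config Secret never gets updated because it is only written during ControlPlane reconciliation, so the CCM keeps starting with the invalid credentials.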
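
The deadlock in the steps above can be reproduced with a small toy model. This is only a sketch for illustration: the cluster type, its fields, and the refreshConfig flag are hypothetical and do not correspond to Gardener's real types or to an agreed fix. It assumes that re-applying the config chart at the start of deletion would be enough for the CCM to recover and remove the Services.

```go
package main

import (
	"errors"
	"fmt"
)

// cluster is a hypothetical, simplified stand-in for the state that matters
// in this issue. It does not mirror any real Gardener type.
type cluster struct {
	shootCredentials string // credentials currently provided by the Shoot owner
	configSecret     string // credentials copied into cloud-provider-config
	servicesPending  bool   // allow-tcp-egress / allow-udp-egress still exist
}

const validCredentials = "valid"

// applyConfig models applying the config chart: the current credentials are
// copied into the cloud-provider-config Secret.
func (c *cluster) applyConfig() { c.configSecret = c.shootCredentials }

// ccmCleanup models the CCM deleting the Services. It only succeeds when the
// Secret holds valid credentials; otherwise the CCM keeps crashing.
func (c *cluster) ccmCleanup() {
	if c.configSecret == validCredentials {
		c.servicesPending = false
	}
}

// deleteControlPlane models one deletion attempt. refreshConfig is the
// hypothetical change: re-apply the config before waiting for the cleanup.
func (c *cluster) deleteControlPlane(refreshConfig bool) error {
	if refreshConfig {
		c.applyConfig()
	}
	c.ccmCleanup()
	if c.servicesPending {
		return errors.New("deletion of old resources is still pending")
	}
	return nil
}

// newStuckCluster returns a cluster whose last reconciliation ran with
// invalid credentials (steps 1 and 2 of the scenario).
func newStuckCluster() *cluster {
	c := &cluster{shootCredentials: "invalid", servicesPending: true}
	c.applyConfig()
	return c
}

func main() {
	// Current behaviour: credentials are fixed (step 3), but the Secret is
	// never refreshed during deletion (step 4).
	c := newStuckCluster()
	c.shootCredentials = validCredentials
	fmt.Println("without refresh:", c.deleteControlPlane(false))

	// Hypothetical behaviour: the Secret is refreshed as part of deletion.
	c = newStuckCluster()
	c.shootCredentials = validCredentials
	fmt.Println("with refresh:", c.deleteControlPlane(true))
}
```

In this model the first attempt keeps returning the "still pending" error no matter how often it is retried, while the second attempt succeeds once the Secret carries the updated credentials.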

What you expected to happen:
Deletion of the hibernated Shoot succeeds once the credentials are replaced with valid ones.

How to reproduce it (as minimally and precisely as possible):
See above.

Anything else we need to know?:
N/A

Environment:

  • Gardener version (if relevant): v1.32.0
  • Extension version:
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • Others:
@gardener-robot added the area/control-plane, kind/bug, platform/azure, and platform/infrastructure labels on Nov 23, 2022
@gardener-robot added the lifecycle/stale label (Nobody worked on this for 6 months, will further age) on Aug 2, 2023
@gardener-robot added the lifecycle/rotten label (Nobody worked on this for 12 months, final aging stage) and removed the lifecycle/stale label on Apr 10, 2024