Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delete secondary VRG before primary #1689

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

nirs
Copy link
Member

@nirs nirs commented Dec 1, 2024

Since we added a VRG on the secondary cluster we have a random failure when deleting the DRPC after relocate. When this happens, we find the PVC in terminating state on the secondary cluster, and the VR and VRG are never deleted.

This change attempt to avoid this issue by deleting the secondary VRG first, and deleting the primary VRG only after the secondary VRG was deleted.

During DRPC delete we expect to see these logs one or more times:

2024-12-02T00:29:23.243Z	INFO	controllers.DRPlacementControl	controller/drplacementcontrol_controller.go:191	Deleting DRPC in progress	{"DRPC": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "rid": "d944591e-fcf5-4882-8143-2e2a8f0351fa", "reason": "secondary VRG manifestwork deletion in progress"}
2024-12-02T00:29:23.358Z	INFO	controllers.DRPlacementControl	controller/drplacementcontrol_controller.go:191	Deleting DRPC in progress	{"DRPC": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "rid": "364798af-15c0-4bba-829e-ae9928fe2383", "reason": "secondary VRG manifestwork deletion in progress"}
...
2024-12-02T00:30:05.230Z	INFO	controllers.DRPlacementControl	controller/drplacementcontrol_controller.go:191	Deleting DRPC in progress	{"DRPC": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "rid": "3353357f-3cf8-4a6f-bb16-8aa95f24aec1", "reason": "primary VRG manifestwork deletion in progress"}

Fixes #1659

@nirs nirs requested a review from BenamarMk December 1, 2024 18:26
@nirs nirs force-pushed the vrg-delete-order branch 2 times, most recently from efa29e0 to a684fe8 Compare December 1, 2024 18:34
@nirs
Copy link
Member Author

nirs commented Dec 1, 2024

Errors with the first commit - without using OperationInProgress

So far it works (4 successful builds in 4 tries), but we need to avoid the noisy errors when the reconcile fails:

We have 18 errors - 15 errors for the secondary vrg:

2024-12-01T19:37:37.543Z ERROR controller/controller.go:316 Reconciler error {"controller": "drplacementcontrol", "controllerGroup": "ramendr.openshift.io", "controllerKind": "DRPlacementControl", "DRPlacementControl": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "namespace": "subscr-deploy-rbd-busybox", "name": "subscr-deploy-rbd-busybox", "reconcileID": "3041f0da-e705-450c-bdd1-e159ffcecef7", "error": "secondary VRG manifestwork deletion in progress"}

And 3 errors for the primary vrg:

2024-12-01T19:38:32.151Z ERROR controller/controller.go:316 Reconciler error {"controller": "drplacementcontrol", "controllerGroup": "ramendr.openshift.io", "controllerKind": "DRPlacementControl", "DRPlacementControl": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "namespace": "subscr-deploy-rbd-busybox", "name": "subscr-deploy-rbd-busybox", "reconcileID": "0f352732-07e2-40d5-bd3f-a9a97181d94a", "error": "primary VRG manifestwork deletion in progress"}

Examples errors for single workload during e2e flow:

% grep -E 'ERROR.+subscr-deploy-rbd-busybox' hub.log | grep 'in progress'
2024-12-01T19:37:37.543Z	ERROR	controller/controller.go:316	Reconciler error	{"controller": "drplacementcontrol", "controllerGroup": "ramendr.openshift.io", "controllerKind": "DRPlacementControl", "DRPlacementControl": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "namespace": "subscr-deploy-rbd-busybox", "name": "subscr-deploy-rbd-busybox", "reconcileID": "3041f0da-e705-450c-bdd1-e159ffcecef7", "error": "secondary VRG manifestwork deletion in progress"}
2024-12-01T19:37:37.563Z	ERROR	controller/controller.go:316	Reconciler error	{"controller": "drplacementcontrol", "controllerGroup": "ramendr.openshift.io", "controllerKind": "DRPlacementControl", "DRPlacementControl": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "namespace": "subscr-deploy-rbd-busybox", "name": "subscr-deploy-rbd-busybox", "reconcileID": "54d47c7a-8cf9-4728-aa47-2f00c1fb8397", "error": "secondary VRG manifestwork deletion in progress"}
2024-12-01T19:37:37.577Z	ERROR	controller/controller.go:316	Reconciler error	{"controller": "drplacementcontrol", "controllerGroup": "ramendr.openshift.io", "controllerKind": "DRPlacementControl", "DRPlacementControl": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "namespace": "subscr-deploy-rbd-busybox", "name": "subscr-deploy-rbd-busybox", "reconcileID": "48d03d0f-6659-4a02-81ca-0fc52b80d298", "error": "secondary VRG manifestwork deletion in progress"}
2024-12-01T19:37:37.623Z	ERROR	controller/controller.go:316	Reconciler error	{"controller": "drplacementcontrol", "controllerGroup": "ramendr.openshift.io", "controllerKind": "DRPlacementControl", "DRPlacementControl": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "namespace": "subscr-deploy-rbd-busybox", "name": "subscr-deploy-rbd-busybox", "reconcileID": "22188374-06f7-4b6c-96c9-f108acd86632", "error": "secondary VRG manifestwork deletion in progress"}
2024-12-01T19:37:37.637Z	ERROR	controller/controller.go:316	Reconciler error	{"controller": "drplacementcontrol", "controllerGroup": "ramendr.openshift.io", "controllerKind": "DRPlacementControl", "DRPlacementControl": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "namespace": "subscr-deploy-rbd-busybox", "name": "subscr-deploy-rbd-busybox", "reconcileID": "84a59ea7-62a2-40d5-98c5-e25e9d56279a", "error": "secondary VRG manifestwork deletion in progress"}
2024-12-01T19:37:37.681Z	ERROR	controller/controller.go:316	Reconciler error	{"controller": "drplacementcontrol", "controllerGroup": "ramendr.openshift.io", "controllerKind": "DRPlacementControl", "DRPlacementControl": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "namespace": "subscr-deploy-rbd-busybox", "name": "subscr-deploy-rbd-busybox", "reconcileID": "3cca19b4-0747-4544-a3bb-ea34c9b727e5", "error": "secondary VRG manifestwork deletion in progress"}
2024-12-01T19:37:37.852Z	ERROR	controller/controller.go:316	Reconciler error	{"controller": "drplacementcontrol", "controllerGroup": "ramendr.openshift.io", "controllerKind": "DRPlacementControl", "DRPlacementControl": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "namespace": "subscr-deploy-rbd-busybox", "name": "subscr-deploy-rbd-busybox", "reconcileID": "76b0b07b-69aa-40a9-a94c-f3acd574945f", "error": "secondary VRG manifestwork deletion in progress"}
2024-12-01T19:37:38.184Z	ERROR	controller/controller.go:316	Reconciler error	{"controller": "drplacementcontrol", "controllerGroup": "ramendr.openshift.io", "controllerKind": "DRPlacementControl", "DRPlacementControl": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "namespace": "subscr-deploy-rbd-busybox", "name": "subscr-deploy-rbd-busybox", "reconcileID": "5e8cb110-d849-4664-9a5e-b002ed75433f", "error": "secondary VRG manifestwork deletion in progress"}
2024-12-01T19:37:38.837Z	ERROR	controller/controller.go:316	Reconciler error	{"controller": "drplacementcontrol", "controllerGroup": "ramendr.openshift.io", "controllerKind": "DRPlacementControl", "DRPlacementControl": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "namespace": "subscr-deploy-rbd-busybox", "name": "subscr-deploy-rbd-busybox", "reconcileID": "c789aa29-75ad-48c5-9914-6ec6d5e93e5a", "error": "secondary VRG manifestwork deletion in progress"}
2024-12-01T19:37:40.133Z	ERROR	controller/controller.go:316	Reconciler error	{"controller": "drplacementcontrol", "controllerGroup": "ramendr.openshift.io", "controllerKind": "DRPlacementControl", "DRPlacementControl": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "namespace": "subscr-deploy-rbd-busybox", "name": "subscr-deploy-rbd-busybox", "reconcileID": "60124950-d22f-4078-a0e2-385b6697e8bb", "error": "secondary VRG manifestwork deletion in progress"}
2024-12-01T19:37:42.712Z	ERROR	controller/controller.go:316	Reconciler error	{"controller": "drplacementcontrol", "controllerGroup": "ramendr.openshift.io", "controllerKind": "DRPlacementControl", "DRPlacementControl": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "namespace": "subscr-deploy-rbd-busybox", "name": "subscr-deploy-rbd-busybox", "reconcileID": "ed43d9d3-25b0-4839-b6b9-773dbf0a3361", "error": "secondary VRG manifestwork deletion in progress"}
2024-12-01T19:37:47.845Z	ERROR	controller/controller.go:316	Reconciler error	{"controller": "drplacementcontrol", "controllerGroup": "ramendr.openshift.io", "controllerKind": "DRPlacementControl", "DRPlacementControl": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "namespace": "subscr-deploy-rbd-busybox", "name": "subscr-deploy-rbd-busybox", "reconcileID": "f77bcf91-6d06-44af-8677-84c65ffcab94", "error": "secondary VRG manifestwork deletion in progress"}
2024-12-01T19:37:58.101Z	ERROR	controller/controller.go:316	Reconciler error	{"controller": "drplacementcontrol", "controllerGroup": "ramendr.openshift.io", "controllerKind": "DRPlacementControl", "DRPlacementControl": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "namespace": "subscr-deploy-rbd-busybox", "name": "subscr-deploy-rbd-busybox", "reconcileID": "b6baa97d-a743-4eed-9a4b-4f9e44241011", "error": "secondary VRG manifestwork deletion in progress"}
2024-12-01T19:38:02.134Z	ERROR	controller/controller.go:316	Reconciler error	{"controller": "drplacementcontrol", "controllerGroup": "ramendr.openshift.io", "controllerKind": "DRPlacementControl", "DRPlacementControl": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "namespace": "subscr-deploy-rbd-busybox", "name": "subscr-deploy-rbd-busybox", "reconcileID": "951f3169-cc3d-4719-9194-902e8aaf10c4", "error": "secondary VRG manifestwork deletion in progress"}
2024-12-01T19:38:18.591Z	ERROR	controller/controller.go:316	Reconciler error	{"controller": "drplacementcontrol", "controllerGroup": "ramendr.openshift.io", "controllerKind": "DRPlacementControl", "DRPlacementControl": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "namespace": "subscr-deploy-rbd-busybox", "name": "subscr-deploy-rbd-busybox", "reconcileID": "d9fd5fdb-3bab-45c4-9b10-8df9421f9f6e", "error": "secondary VRG manifestwork deletion in progress"}
2024-12-01T19:38:32.151Z	ERROR	controller/controller.go:316	Reconciler error	{"controller": "drplacementcontrol", "controllerGroup": "ramendr.openshift.io", "controllerKind": "DRPlacementControl", "DRPlacementControl": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "namespace": "subscr-deploy-rbd-busybox", "name": "subscr-deploy-rbd-busybox", "reconcileID": "0f352732-07e2-40d5-bd3f-a9a97181d94a", "error": "primary VRG manifestwork deletion in progress"}
2024-12-01T19:38:32.235Z	ERROR	controller/controller.go:316	Reconciler error	{"controller": "drplacementcontrol", "controllerGroup": "ramendr.openshift.io", "controllerKind": "DRPlacementControl", "DRPlacementControl": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "namespace": "subscr-deploy-rbd-busybox", "name": "subscr-deploy-rbd-busybox", "reconcileID": "f331b6f2-7d28-45b0-b535-55f9c7de4c7d", "error": "primary VRG manifestwork deletion in progress"}
2024-12-01T19:38:32.531Z	ERROR	controller/controller.go:316	Reconciler error	{"controller": "drplacementcontrol", "controllerGroup": "ramendr.openshift.io", "controllerKind": "DRPlacementControl", "DRPlacementControl": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "namespace": "subscr-deploy-rbd-busybox", "name": "subscr-deploy-rbd-busybox", "reconcileID": "44c5037e-1924-4d32-b052-3f87fc333520", "error": "primary VRG manifestwork deletion in progress"}

@nirs nirs requested a review from Madhu-1 December 1, 2024 20:03
@nirs nirs force-pushed the vrg-delete-order branch 2 times, most recently from 82b60e2 to ffc3b3b Compare December 1, 2024 22:57
@nirs

This comment was marked as outdated.

@nirs

This comment was marked as outdated.

@nirs
Copy link
Member Author

nirs commented Dec 2, 2024

Example run

--- PASS: TestSuites (0.05s)
    --- PASS: TestSuites/Validate (0.05s)
        --- PASS: TestSuites/Validate/hub (0.02s)
        --- PASS: TestSuites/Validate/c1 (0.02s)
        --- PASS: TestSuites/Validate/c2 (0.02s)
    --- PASS: TestSuites/Exhaustive (6.09s)
        --- SKIP: TestSuites/Exhaustive/disapp-deploy-cephfs-busybox (0.00s)
        --- PASS: TestSuites/Exhaustive/appset-deploy-rbd-busybox (522.23s)
            --- PASS: TestSuites/Exhaustive/appset-deploy-rbd-busybox/Deploy (0.21s)
            --- PASS: TestSuites/Exhaustive/appset-deploy-rbd-busybox/Enable (100.31s)
            --- PASS: TestSuites/Exhaustive/appset-deploy-rbd-busybox/Failover (210.24s)
            --- PASS: TestSuites/Exhaustive/appset-deploy-rbd-busybox/Relocate (150.37s)
            --- PASS: TestSuites/Exhaustive/appset-deploy-rbd-busybox/Disable (61.05s)
            --- PASS: TestSuites/Exhaustive/appset-deploy-rbd-busybox/Undeploy (0.06s)
        --- PASS: TestSuites/Exhaustive/subscr-deploy-cephfs-busybox (522.91s)
            --- PASS: TestSuites/Exhaustive/subscr-deploy-cephfs-busybox/Deploy (5.29s)
            --- PASS: TestSuites/Exhaustive/subscr-deploy-cephfs-busybox/Enable (150.29s)
            --- PASS: TestSuites/Exhaustive/subscr-deploy-cephfs-busybox/Failover (120.15s)
            --- PASS: TestSuites/Exhaustive/subscr-deploy-cephfs-busybox/Relocate (185.40s)
            --- PASS: TestSuites/Exhaustive/subscr-deploy-cephfs-busybox/Disable (55.71s)
            --- PASS: TestSuites/Exhaustive/subscr-deploy-cephfs-busybox/Undeploy (6.07s)
        --- PASS: TestSuites/Exhaustive/subscr-deploy-rbd-busybox (523.05s)
            --- PASS: TestSuites/Exhaustive/subscr-deploy-rbd-busybox/Deploy (5.26s)
            --- PASS: TestSuites/Exhaustive/subscr-deploy-rbd-busybox/Enable (90.17s)
            --- PASS: TestSuites/Exhaustive/subscr-deploy-rbd-busybox/Failover (210.32s)
            --- PASS: TestSuites/Exhaustive/subscr-deploy-rbd-busybox/Relocate (155.37s)
            --- PASS: TestSuites/Exhaustive/subscr-deploy-rbd-busybox/Disable (55.85s)
            --- PASS: TestSuites/Exhaustive/subscr-deploy-rbd-busybox/Undeploy (6.06s)
        --- PASS: TestSuites/Exhaustive/disapp-deploy-rbd-busybox (528.00s)
            --- PASS: TestSuites/Exhaustive/disapp-deploy-rbd-busybox/Deploy (2.27s)
            --- PASS: TestSuites/Exhaustive/disapp-deploy-rbd-busybox/Enable (90.26s)
            --- PASS: TestSuites/Exhaustive/disapp-deploy-rbd-busybox/Failover (211.04s)
            --- PASS: TestSuites/Exhaustive/disapp-deploy-rbd-busybox/Relocate (159.64s)
            --- PASS: TestSuites/Exhaustive/disapp-deploy-rbd-busybox/Disable (50.08s)
            --- PASS: TestSuites/Exhaustive/disapp-deploy-rbd-busybox/Undeploy (14.71s)
        --- PASS: TestSuites/Exhaustive/appset-deploy-cephfs-busybox (586.29s)
            --- PASS: TestSuites/Exhaustive/appset-deploy-cephfs-busybox/Deploy (0.20s)
            --- PASS: TestSuites/Exhaustive/appset-deploy-cephfs-busybox/Enable (135.34s)
            --- PASS: TestSuites/Exhaustive/appset-deploy-cephfs-busybox/Failover (120.16s)
            --- PASS: TestSuites/Exhaustive/appset-deploy-cephfs-busybox/Relocate (270.49s)
            --- PASS: TestSuites/Exhaustive/appset-deploy-cephfs-busybox/Disable (60.08s)
            --- PASS: TestSuites/Exhaustive/appset-deploy-cephfs-busybox/Undeploy (0.03s)
PASS
ok  	github.com/ramendr/ramen/e2e	593.046s

@nirs
Copy link
Member Author

nirs commented Dec 2, 2024

Example deleting drpc progress logs

% grep -E 'Deleting DRPC in progress.+subscr-deploy-rbd-' hub.log 
2024-12-02T00:29:23.243Z	INFO	controllers.DRPlacementControl	controller/drplacementcontrol_controller.go:191	Deleting DRPC in progress	{"DRPC": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "rid": "d944591e-fcf5-4882-8143-2e2a8f0351fa", "reason": "secondary VRG manifestwork deletion in progress"}
2024-12-02T00:29:23.358Z	INFO	controllers.DRPlacementControl	controller/drplacementcontrol_controller.go:191	Deleting DRPC in progress	{"DRPC": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "rid": "364798af-15c0-4bba-829e-ae9928fe2383", "reason": "secondary VRG manifestwork deletion in progress"}
2024-12-02T00:29:23.542Z	INFO	controllers.DRPlacementControl	controller/drplacementcontrol_controller.go:191	Deleting DRPC in progress	{"DRPC": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "rid": "54cce07c-3416-4935-b3e0-dc9b2015d1e1", "reason": "secondary VRG manifestwork deletion in progress"}
2024-12-02T00:29:23.641Z	INFO	controllers.DRPlacementControl	controller/drplacementcontrol_controller.go:191	Deleting DRPC in progress	{"DRPC": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "rid": "c6854a8d-3668-4bf2-ae3f-0afdf04a444d", "reason": "secondary VRG manifestwork deletion in progress"}
2024-12-02T00:29:23.667Z	INFO	controllers.DRPlacementControl	controller/drplacementcontrol_controller.go:191	Deleting DRPC in progress	{"DRPC": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "rid": "a822bf0a-d9fa-472e-b4df-54e0a69bf605", "reason": "secondary VRG manifestwork deletion in progress"}
2024-12-02T00:29:23.848Z	INFO	controllers.DRPlacementControl	controller/drplacementcontrol_controller.go:191	Deleting DRPC in progress	{"DRPC": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "rid": "94d38462-f725-407b-8045-5943af05f423", "reason": "secondary VRG manifestwork deletion in progress"}
2024-12-02T00:29:24.026Z	INFO	controllers.DRPlacementControl	controller/drplacementcontrol_controller.go:191	Deleting DRPC in progress	{"DRPC": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "rid": "52ab9cdc-a3e6-4cd6-b43b-1159e9280fcf", "reason": "secondary VRG manifestwork deletion in progress"}
2024-12-02T00:29:24.369Z	INFO	controllers.DRPlacementControl	controller/drplacementcontrol_controller.go:191	Deleting DRPC in progress	{"DRPC": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "rid": "08283771-99bb-4a97-8aa4-136eab0edf1c", "reason": "secondary VRG manifestwork deletion in progress"}
2024-12-02T00:29:25.149Z	INFO	controllers.DRPlacementControl	controller/drplacementcontrol_controller.go:191	Deleting DRPC in progress	{"DRPC": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "rid": "7d3bc103-246d-4d09-a98a-fc9bf77c35bf", "reason": "secondary VRG manifestwork deletion in progress"}
2024-12-02T00:29:26.560Z	INFO	controllers.DRPlacementControl	controller/drplacementcontrol_controller.go:191	Deleting DRPC in progress	{"DRPC": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "rid": "f6bec8be-13c7-45f4-8d8b-59e626fc2631", "reason": "secondary VRG manifestwork deletion in progress"}
2024-12-02T00:29:29.341Z	INFO	controllers.DRPlacementControl	controller/drplacementcontrol_controller.go:191	Deleting DRPC in progress	{"DRPC": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "rid": "a4345a4d-47c2-4321-a2c7-e722691109a1", "reason": "secondary VRG manifestwork deletion in progress"}
2024-12-02T00:29:34.480Z	INFO	controllers.DRPlacementControl	controller/drplacementcontrol_controller.go:191	Deleting DRPC in progress	{"DRPC": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "rid": "3ab27ce7-fb93-47c3-a0c0-cb939447729f", "reason": "secondary VRG manifestwork deletion in progress"}
2024-12-02T00:29:44.736Z	INFO	controllers.DRPlacementControl	controller/drplacementcontrol_controller.go:191	Deleting DRPC in progress	{"DRPC": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "rid": "bfb20d7c-2de2-4588-b425-7e0da9483d5f", "reason": "secondary VRG manifestwork deletion in progress"}
2024-12-02T00:29:47.470Z	INFO	controllers.DRPlacementControl	controller/drplacementcontrol_controller.go:191	Deleting DRPC in progress	{"DRPC": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "rid": "2d33dca8-11e6-4979-adfe-8486247f7b36", "reason": "primary VRG manifestwork deletion in progress"}
2024-12-02T00:29:47.764Z	INFO	controllers.DRPlacementControl	controller/drplacementcontrol_controller.go:191	Deleting DRPC in progress	{"DRPC": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "rid": "667ff490-205c-4f0f-aff3-6bfa8f2c18a1", "reason": "primary VRG manifestwork deletion in progress"}
2024-12-02T00:30:05.230Z	INFO	controllers.DRPlacementControl	controller/drplacementcontrol_controller.go:191	Deleting DRPC in progress	{"DRPC": {"name":"subscr-deploy-rbd-busybox","namespace":"subscr-deploy-rbd-busybox"}, "rid": "3353357f-3cf8-4a6f-bb16-8aa95f24aec1", "reason": "primary VRG manifestwork deletion in progress"}

internal/controller/drplacementcontrol_controller.go Outdated Show resolved Hide resolved
// DeleteInProgress error is returned during DRPC deletion when we need to wait and try later.
type deleteInProgress struct{ string }

func (e deleteInProgress) Error() string { return e.string }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach works and looks elegant, but why introduce the complexity of defining a new type? Why not simply use errors.New() similar to these ? One might argue it’s for errors.Is() support, but is it necessary to check for this specific error type at line 189? Instead, consider returning a custom error directly. The reconciler will eventually retry, and the error will still be logged.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach works and looks elegant, but why introduce the complexity of defining a new type? Why not simply use errors.New() similar to these ?

Using a sentinel error works when there one special error like io.EOF. But we have multiple reasons to requeue the request during delete:

  • adopting a vrg
  • waiting until secondary is deleted
  • waiting until primary is deleted
  • vrg count > 0 (maybe not possible now)

Using a a type matching any error of the same type, we can return multiple errors have the same handling.

We already use this in

func RequestProcessingErrorCreate(s string) RequestProcessingError { return RequestProcessingError{s} }

I started with a local error in this package, but this can move to a shared package and used in all controllers.

One might argue it’s for errors.Is() support, but is it necessary to check for this specific error type at line 189?

Yes, see comment bellow.

Instead, consider returning a custom error directly. The reconciler will eventually retry, and the error will still be logged.

This is what the first commit is doing, using fmt.Errorf() - the result is having 18 errors with stacktrace for every disable dr. This is huge amount of noise and it is incorrect to log a normal condition as an error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the errors generated by the first commit - each one has a 10 lines stacktrace that I did not copy:
#1689 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is huge amount of noise and it is incorrect to log a normal condition as an error

Since deleteInProgress implements the Error interface, it is inherently treated as an error object. This creates a bias in how it is perceived. While it may not represent an error condition for the reconcile server, within the context of Ramen, it is treated as one. Therefore, the classification as an error is intentional.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using an error is just a limit of the current code, that have single error return value for the entire finalizeDRPC code path. We can change this but this is a big change that will be hard to backport, and I don't think it will be better to return multiple arguments like we have in other places. Using specific error type for the common case of "not ready yet" seems better.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach works and looks elegant, but why introduce the complexity of defining a new type?

Using existing type now.

return ctrl.Result{Requeue: true}, nil
}

// Unexpected error.
return ctrl.Result{}, err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean just let this be returned and remove the block starting from line 189 to 195 inclusive.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The result will be ERROR log with a stacktrace - we see 18 of these for every delete operation. Since this is an expected condition, it is wrong to treat this as an error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is an expected condition, it is wrong to treat this as an error.

Same as my comment above

internal/controller/drplacementcontrol_controller.go Outdated Show resolved Hide resolved
Since we added a VRG on the secondary cluster we have a random failure
when deleting the DRPC after relocate. When this happens, we find the
PVC in terminating state on the secondary cluster, and the VR and VRG
are never deleted.

This change attempt to avoid this issue by deleting the secondary VRG
first, and deleting the primary VRG only after the secondary VRG was
deleted.

During DRPC delete we expect to see these errors one or more times:

    ERROR   Secondary VRG manifestwork deletion in progress
    very noisy
        stacktrace...

    ERROR Primary VRG manifestwork deletion in progress
    very noisy
        stacktrace...

Fixes: RamenDR#1659
Signed-off-by: Nir Soffer <[email protected]>
This error is used to signal that an operation is in progress, and the
caller should log the error in normal log level and try again later.
This is a common concept for reconciling, and we can sue the same error
in other packages.

Changes:
- Rename the error to OperationInProgress
- Simplify to a string instead of embedding a string
- update the tests to reflect the common usage with errors.Is().

Signed-off-by: Nir Soffer <[email protected]>
We want to use it more in ramen, to remove bugus errors for logs.
To encourage use this this pattern and make it easier to use correctly,
add util.IsOperationInProgress() helper.

Improve the test to use the new IsOperationInProgress() and test also
wrapped OperationInProgress error.

Signed-off-by: Nir Soffer <[email protected]>
When deleting the DRPC we may need to adopt the VRG, delete the
secondary VRG, wait until the secondary VRG is deleted, delete the
primary VRG, and wait until the primary VRG is deleted. This takes 60-90
seconds and many reconciles (18 seen in e2e test), and creates huge
amount of noise in the log.

Suppress the noise using util.OperationInProgress error. When the
reconcile is successful but it is still in progress, we return a
util.OperationInProgress error describing the current progression. The
top level error handler logs an INFO message and requeue the request.

With this change we will see multiple logs for the secondary VRG:

    INFO    Deleting DRPC in progress {"reason", "secondary VRG deletion in progress"}
    ...

And finally more logs for the primary VRG:

    INFO    Deleting DRPC in progress {"reason", "primary VRG deletion in progress"}
    ...

Notes:

- We logged errors during finalizeDRPC twice; once as INFO log, and once
  as ERROR with a stacktrace when we return error from the reconcile.
  Remove the duplicate INFO log.

- The linter is not happy about the new nested if. We can avoid this by
  extracting a helper to handle finalize errors, but I want to keep the
  change minimal for easy backport. We can improve this later upstream.

Signed-off-by: Nir Soffer <[email protected]>
Copy link
Member

@BenamarMk BenamarMk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the change.


// IsOperationInProgress returns true if err or error wrapped by it is an OperationInProgress error.
func IsOperationInProgress(err error) bool {
return errors.Is(err, OperationInProgress(""))
Copy link
Member

@BenamarMk BenamarMk Dec 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question though, is this call going to propagate back to line 15? I doubt it. errors.Is(...) is a free function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it does here:

if x, ok := err.(interface{ Is(error) bool }); ok && x.Is(target) {
			return true
		}

All good

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Disable DR after relocate times out randomly
2 participants