
970: Taint nodes even if reboot is currently blocked #971

Open

yaraskm wants to merge 2 commits into main
Conversation


@yaraskm commented Sep 3, 2024

  • If using the --prefer-no-schedule-taint flag, apply the taint even if the reboot is currently blocked. This will make it less likely for Pods to get scheduled to nodes that are pending a reboot, once the blocking Pods / alerts are cleared.

Resolves #970
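To make the effect concrete, here is a minimal client-go sketch of applying such a taint; the function name and wiring are invented for illustration and are not kured's actual implementation:

```go
package taintsketch

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// applyPreferNoScheduleTaint (hypothetical) adds a PreferNoSchedule taint to
// the node unless it is already present. With this PR, a call like this would
// happen even while blocking Pods / alerts are active.
func applyPreferNoScheduleTaint(ctx context.Context, client kubernetes.Interface, nodeName, taintKey string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("get node %s: %w", nodeName, err)
	}
	for _, t := range node.Spec.Taints {
		if t.Key == taintKey && t.Effect == v1.TaintEffectPreferNoSchedule {
			return nil // taint already present; nothing to do
		}
	}
	node.Spec.Taints = append(node.Spec.Taints, v1.Taint{
		Key:    taintKey,
		Value:  "true",
		Effect: v1.TaintEffectPreferNoSchedule, // scheduler avoids, but does not forbid, this node
	})
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```

Since PreferNoSchedule only biases the scheduler away from the node rather than forbidding placement, applying it early while blockers are still active is comparatively low-risk.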

- If using the --prefer-no-schedule-taint flag, apply the taint even if
  the reboot is currently blocked. This will make it less likely for
  Pods to get scheduled to nodes that are pending a reboot, once the
  blocking Pods / alerts are cleared.

Signed-off-by: Matt Yaraskavitch <[email protected]>
@yaraskm marked this pull request as ready for review September 6, 2024 14:00
@@ -761,16 +761,24 @@ func rebootAsRequired(nodeID string, booter reboot.Reboot, sentinelCommand []str
blockCheckers = append(blockCheckers, KubernetesBlockingChecker{client: client, nodename: nodeID, filter: podSelectors})
}

rebootBlocked := rebootBlocked(blockCheckers...)
@jackfrancis (Collaborator) commented:
I wonder if the easiest and most correct way to improve this is to move this right after L32 above, immediately after we determine that the node will need to be deleted.

As far as I can see there isn't any harm in always adding this "prefer no schedule" taint. Right now we're only doing it if we aren't able to rapidly progress to the drain operation, but why not do it no matter what?

This would solve for your current problem as well.

Thoughts?
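In code, the suggestion looks roughly like this (a runnable condensation with invented names — oneTick, fakeTaint, drainAndReboot — not kured's actual implementation):

```go
package main

import "fmt"

// fakeTaint stands in for kured's PreferNoSchedule taint helper; the real one
// patches the Node object via the Kubernetes API.
type fakeTaint struct{ applied bool }

func (t *fakeTaint) Enable()  { t.applied = true; fmt.Println("taint applied") }
func (t *fakeTaint) Disable() { t.applied = false; fmt.Println("taint removed") }

// oneTick condenses one pass of the rebootAsRequired loop with the suggested
// reordering: the taint is applied as soon as a reboot is required, before
// blockers are consulted, so it no longer depends on the block check outcome.
func oneTick(rebootRequired, rebootBlocked bool, taint *fakeTaint, drainAndReboot func()) {
	if !rebootRequired {
		taint.Disable() // nothing pending: let Pods schedule here normally
		return
	}
	taint.Enable() // always taint while a reboot is pending, even when blocked
	if rebootBlocked {
		return // keep the taint and retry on the next tick
	}
	drainAndReboot()
}

func main() {
	t := &fakeTaint{}
	// Reboot required but blocked: the taint is applied anyway.
	oneTick(true, true, t, func() { fmt.Println("drain + reboot") })
	// Blockers cleared on a later tick: drain and reboot proceed.
	oneTick(true, false, t, func() { fmt.Println("drain + reboot") })
}
```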

@yaraskm (Author) replied:
Sounds good to me, I'll make the change later today.

@yaraskm (Author) replied:
Hey @jackfrancis, which line is it you're referring to? I don't think L32 is correct.

I've been able to successfully run with this PR in prod for the last few weeks now and it definitely seems to have lowered the average time for a node to be blocked. I'm definitely open to adding the taint all of the time, regardless of whether we're in an acceptable reboot window or not.

Collaborator commented:
@jackfrancis do you truly believe that adding the PreferNoSchedule taint can be done in all cases?
Wouldn't it be a bit moot to have all nodes carrying PreferNoSchedule after a security update (for which every node gets the taint)? When Kubernetes schedules Pods and all the nodes have PreferNoSchedule, there is no distinction anymore... Am I missing something?

@evrardjp (Collaborator) commented Oct 14, 2024

I am not convinced this is a view everyone will have.

I believe some people will prefer adding the preferNoSchedule taint, while others will consider that the taint shouldn't be liberally applied to all nodes, including the ones that can't reboot.

I don't like changing the rebootAsRequired loop through a variable. It makes the loop very hard to understand and maintain in the long run. On top of that, most of the reasons people have adjusted rebootAsRequired over time came down to personal preference: "In my case, I would like to do x before y." It all made sense for some people, while it didn't make sense for others.

I truly believe your case makes sense. Yet, I am not certain that it's the case for everyone.

Two choices:

  • I will contact a few people and see what they think.
  • We start working on a change that makes the rebootAsRequired loop more flexible. (I am already working on that, by the way.)

@yaraskm (Author) commented Oct 15, 2024

I truly believe your case makes sense. Yet, I am not certain that it's the case for everyone.

I definitely agree with you here, and making it a configurable behaviour change makes the most sense to me as well. I'll wait for your feedback on the discussions and am happy to chip in on implementation changes if needed.
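For illustration only, one possible shape for such a knob (the flag name --taint-when-blocked is invented here, not an agreed interface):

```go
package sketch

import "flag"

// Hypothetical opt-in flag; kured's real flags are defined elsewhere and this
// name is illustrative only.
var taintWhenBlocked = flag.Bool("taint-when-blocked", false,
	"apply the PreferNoSchedule taint even while a pending reboot is blocked")

// The taint decision in the loop would then read roughly:
//
//	if rebootRequired && (!rebootBlocked || *taintWhenBlocked) {
//		preferNoScheduleTaint.Enable()
//	}
```

With the flag unset this preserves today's behaviour, while operators who share the use case described in #970 can opt in.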

Development

Successfully merging this pull request may close these issues.

Allow tainting Nodes even when blocker Pods exist
3 participants