Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry DHCP option 57 default value when failure happens #767

Closed
wants to merge 1 commit into from

Conversation

w13915984028
Copy link
Member

@w13915984028 w13915984028 commented Jul 18, 2024

Problem:

Harvester installer reported DHCP failed when the DHCP server was running on a Windows server.

Solution:

The DHCP client used by Harvester, is using value 1500 for option 57. It is not wrong, but as RFC only defined the min value 576, did not define the max value limitation, 1500 may be refused by some DHCP servers.

Give DHCP client an addtional try with the generally used DHCP option 57 values.

Related Issue:

Partially related to harvester/harvester#3428

Test plan:

Install Harvester with dhcp mode IP, it should get IP successfully.

image

The DHCP client function on new cluster worked as expected

 Expiry Time           MAC address         Protocol   IP address           Hostname   Client ID or DUID
------------------------------------------------------------------------------------------------------------------------------------------------
 2024-07-18 20:05:26   52:54:00:6e:5c:2a   ipv4       192.168.122.131/24   harv31     ff:00:6e:5c:2a:00:01:00:01:2e:2b:fe:c8:52:54:00:6e:5c:2a

@w13915984028
Copy link
Member Author

Why two times of DHCP processsing when applying for node IP from DHCP server stage?

image

@w13915984028
Copy link
Member Author

@starbops @mingshuoqiu
Please take a look the second commit 21cdbbb, thanks.

I tested with dhcp option57 value 576, the installer works well.

But to reduce risk, the code will only try value 576 when 1500 failed, give the dhcp client an additional try.

@starbops
Copy link
Member

As far as I know, option 57 specifies the maximum DHCP message size. The current default value insomniacslk/dhcp is using is 1500, which means the client expects the offer/ack message to be lower than 1500 octets. Do you know why a much larger upper limit would fail, but changing to a lower upper limit is successful?

@tserong
Copy link
Contributor

tserong commented Jul 22, 2024

Why two times of DHCP processsing when applying for node IP from DHCP server stage?

That first log message ("Network configuration is applied") comes from preGotoNextPage() on the network page:

logrus.Infof("Network configuration is applied: %s", output)

(Guessing) after leaving the network page, did you go back then forwards again?

@w13915984028
Copy link
Member Author

@starbops It is a good question.

The second commit comes from following considration:
(1) Currently, I have no exact windows server to reproduce and debug; from the error message, it failed at the first broadcast Discover message (harvester/harvester#3428 (comment)), the MSZ is a suspected difference
(2) RFC defines min value 576 but not defining max, maybe some DHCP server fail on big values;
(3) The code will give an additional try with 576 if 1500 fails

What I can say is that it does no harm, if it fails on Windows DHCP server again, the root cause is further narrowed down.

@w13915984028
Copy link
Member Author

@tserong The tricky comes from setupNetwork

https://github.com/harvester/harvester-installer/blob/e767167263acf062860252c24302ad3ac223b9e5/pkg/console/install_panels.go#L1383C2-L1383C14

which calls RestoreOriginalNetworkConfig

if err := config.RestoreOriginalNetworkConfig(); err != nil {

the Restore triggers an old & additional DHCP via wicked

@tserong
Copy link
Contributor

tserong commented Jul 22, 2024

the Restore triggers an old & additional DHCP via wicked

All I can see in here is writing various files under /etc/sysconfig/network/:

func RestoreOriginalNetworkConfig() error {
if len(originalNetworkConfigs) == 0 {
return nil
}
remove := func(pattern string) error {
paths, err := filepath.Glob(pattern)
if err != nil {
return err
}
for _, path := range paths {
if err := os.Remove(path); err != nil {
return err
}
}
return nil
}
if err := remove(ifrouteGlobPattern); err != nil {
return err
}
if err := remove(ifcfgGlobPattern); err != nil {
return err
}
for name, bytes := range originalNetworkConfigs {
if err := ioutil.WriteFile(fmt.Sprintf("/etc/sysconfig/network/%s", name), bytes, os.FileMode(0600)); err != nil {
return err
}
}
return nil
}

What is it that triggers wicked in this case?

Note also that the duplicate DHCP requests in the screenshot above are about 7 minutes apart (18:36 then 18:43) - surely one call to setupNetworks didn't take that long?

@tserong
Copy link
Contributor

tserong commented Jul 22, 2024

...and anyway (sorry, just thinking aloud) - the fact that the message "Network configuration is applied" appears twice must mean that preGotoNextPage() was invoked twice, because that's the only function that logs that exact error string. Or am I missing some further detail?

@w13915984028
Copy link
Member Author

w13915984028 commented Jul 22, 2024

sorry, the above screenshot is a bit misleading

suppose you already have a Harvester node installed on a KVM VM, and start a new installation of a new ISO, and then run sudo -i tcpdump -i virbr0 udp -e -n -vv on the kvm default network bridge

when you hit dhcp for mgmt network on the installation screen, the tcpdump will show you two times of dhcp request

I did not dig all the details, but we can easily reproduce and observe it.

@tserong
Copy link
Contributor

tserong commented Jul 22, 2024

when you hit dhcp for mgmt network on the installation screen, the tcpdump will show you two times of dhcp request

Ah! That'll be this thing:

addr, err := getIPThroughDHCP(config.MgmtInterfaceName)

It runs after setupNetwork() to test that DHCP works. I have no idea why we do that extra test, but I noticed the duplicate request as well some time ago, because it was causing trouble for my DHCP server when the server is configured to hand out addresses from a pool, rather than fixed addresses. Here's the bug I opened at the time: harvester/harvester#5926

Sorry for not realising that sooner!

@w13915984028
Copy link
Member Author

@tserong @starbops @mingshuoqiu

I rebased the code to address merge conflicts, please take a new look, thanks.

@w13915984028 w13915984028 changed the title Bump depended dhcp client package Retry DHCP option 57 default value when failure happens Jul 25, 2024
@w13915984028
Copy link
Member Author

The dhcp client package bumping is split to PR #779

Copy link
Contributor

@tserong tserong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed a couple of tiny little things (see comments below), and in general LGTM.

I'm wondering though, if msz=1500 might cause problems with some servers, why not just always set msz=576, rather than trying msz=1500 first? Then we'd only need to keep the implementation in getIPThroughDHCPWithOption57(), and could drop the original version.

pkg/console/vip.go Outdated Show resolved Hide resolved
pkg/console/vip.go Outdated Show resolved Hide resolved
Copy link
Member

@starbops starbops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a suggestion, do we want to upstream this enhancement to insomniacslk/dhcp? I saw you already created an issue there (insomniacslk/dhcp#542). If yes, perhaps we could first apply the enhancement to harvester/dhcp and file the PR to upstream at the same time. For harvester/harvester-installer, replace the dhcp module with ours. WDYT?

pkg/console/vip.go Outdated Show resolved Hide resolved
@bk201
Copy link
Member

bk201 commented Aug 5, 2024

@w13915984028 sorry to chime in here, can you reproduce the issue?

@w13915984028
Copy link
Member Author

@bk201 Unfortunately, I don't have the exact windows server version to run and test that dhcp server, and did not find one from surrounding.

@tserong The max value 1500 works in most cases, we only got this reported issue. Keep the original code and add a second try only when 1500 fails is the more reasonable solution for now.

@starbops You are right, the better way is to get supported from upstream, the project looks to be a semi-active state now, I will try to add a PR to upstream.

Thanks for all your kind reveiw.

@w13915984028
Copy link
Member Author

@starbops @tserong @mingshuoqiu

Please take a new look, if any new concerns about this PR. When no, we can merge it and have more time for testing, thanks.

@bk201
Copy link
Member

bk201 commented Aug 12, 2024

@w13915984028 I suggest we should reproduce the issue first to verify the suspicion.
From the discussion, it looks like this issue can be fixed in the dhcp module. Since we own harvester/dhcp, how about fixing the issue in it to omit the retry in the installer code?

@w13915984028
Copy link
Member Author

@bk201
(1) Windows server is something away from our daily developing, that also explians why the issue #3248 hangs here so long;
(2) The PR is an enhancement, even the issue is not reproduced by us, the code still does no harm, and if lucky, it may fix it;
(3) The dhcp code github.com/insomniacslk/dhcp/dhcpv4/nclient4 is not belonging to us

We can merge this PR, and if we have chance to get a Windows server & reproduce the issue, and know that the root cause is something else, we add new fix and revert this PR. WDYT?

thanks.

@w13915984028
Copy link
Member Author

Add more backgrounds:

(1) Unlike k8s/go, upstream can depricate features as long as he wants and the depricated features are removed quickly, user needs to bump and update daily. The internet protocols (e.g. IETF RFCs) are mostly keeping things, no one can force users to drop old devices or old systems.

(2) The DHCP option 57 defines min value 576, but no upper limit, thus different DHCP servers (or even specific version) can have it's understanding. 1500 is generally accepted by most modern DHCP servers, but some legacy (versioned) servers may refuse it. Indeed, in a 1500 MTU L2 network, the DHCP option itself is hardly to reach 1500, but arguing this makes little sense.

Why we add such a PR to retry 576 when 1500 fails?
(1)The DHCP option 1500 has worked for most cases, and probablly is the cause of the issue.
(2) Suppose it is. The best option is still using 1500 for default value, and retry only when 1500 fails. To avoid import new issues.
(3) Given such, retry on the caller is better than retry on the client directly.
(4) The upstream repository https://github.com/insomniacslk/dhcp/commits/master/ is still updating from time to time, it is not a good time for use to fork an repository.

Back to #3428 and this PR, even if the root cause is not this PR, this PR still makes Harvester installer more robust.

@tserong
Copy link
Contributor

tserong commented Aug 13, 2024

After reviewing the last few comments and thinking about this some more, I'm sorry, but I'm really not sure we should take this change:

  1. We do not actually know that it will fix the problem.
  2. This new code path will presumably never be tested by us, given the current code path always works in our various testing environments.

I just checked, and evaluation versions of Windows Server ISOs can apparently be downloaded from https://www.microsoft.com/en-us/evalcenter. What about installing that in a virtual machine, configure it to be a DHCP server, then boot the harvester ISO on a VM in the same network and see what happens?

@w13915984028
Copy link
Member Author

@tserong Thanks for your investigation, I will try to set and reproduce. Hopefully, this issue is not related to only certain versions Windows DHCP servers.

Meanwhile, it is not surprising that sometimes the defending codes are added, which are difficult to be tested on normal black box of view.

@tserong
Copy link
Contributor

tserong commented Aug 13, 2024

Hopefully, this issue is not related to only certain versions Windows DHCP servers.

Yeah, that would be really annoying!

I just checked the original issue, which says they used "windows DHCP server 2016" (harvester/harvester#3428 (comment)) and I had a further look on the MS site - thankfully they do seem to have that version there: https://www.microsoft.com/en-us/evalcenter/evaluate-windows-server-2016

@w13915984028
Copy link
Member Author

This PR was replaced by #879 which solved the root cause.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants