I stumbled upon a very weird problem and I am not able to sort it out myself.
I have a pretty fresh Cluster running AHV which is for our Lab.
I installed a Server 2022 DC - all fine. I installed a second one - all good.
I created the Active Directory on the first one, edited the IP settings on both machines including DNS, and tried to join the other server as a second Domain Controller. That is where I got stuck. It wasn’t able to join the domain. The error page said it was able to find the domain, but that either the A records are not correct or the server is not reachable.
Ping and nslookup work just fine though. I spent a day testing everything that came to mind, and after a while I tried to eliminate as many factors as possible. So I migrated both servers to the same node to rule out networking issues - and this does the trick. When I try to domain join a server from another node it does not work; as soon as it is on the same node, it does.
I then tried to domain join from another virtual environment - not from nutanix - same outcome as when it’s on different nodes. When trying to join the only additional step for the traffic is to go out the node on the physical interface on native vlan, to a switch and back to the other node - so no firewall, no packet inspection no nothing. Just plain raw networking, not even routing.
So for those reasons, and because it also doesn’t work coming from a source other than Nutanix, I believe that something happens to the packets entering the node. When I captured the traffic with Wireshark, the only difference was a delay of about 4 seconds and some retransmissions for the DNS queries when trying to domain join:
as opposed to a pretty clean DNS query when on the same node:
Do you have any ideas what is causing this? As a next step, I will make the server on the other non-Nutanix environment a DC and try joining it from Nutanix, to see whether this problem affects only outgoing traffic or both outgoing and incoming.
Any advice will be highly appreciated!
On a side note: I’m also unable to join the cluster to the domain - maybe for the same reason, because the CVM hosting Prism is on a different node than the DCs - but I have not verified that yet.
Same result when trying to join from Nutanix to the other environment - same switch, same VLAN:
Update: I installed a second server on the non-Nutanix environment and tried to join it to the test domain on the non-Nutanix side (we have both set up fairly recently for testing purposes), and we have the same picture here. No domain join across different servers, but on the same one it works. I will keep this thread posted if I’m allowed/supposed to and report what the solution was in the end, even though I assume it will be nothing Nutanix-specific but networking-related.
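The DNS delay seen in the captures can also be measured without Wireshark by timing a raw UDP query against the DC, once from a VM on the same node and once from a different node. The following is only a rough sketch under stated assumptions: the function names `build_query` and `time_query` are made up for illustration, and the server address and domain in the usage comment are placeholders, not values from this thread.

```python
import socket
import struct
import time


def build_query(name: str, qtype: int = 33, txid: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet (qtype 33 = SRV, 1 = A)."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # Encode the name as length-prefixed labels, terminated by a zero byte.
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack("!HH", qtype, 1)  # class IN
    return header + question


def time_query(server, name, qtype=33, timeout=2.0, attempts=3):
    """Send the query over UDP, resending on timeout.

    Returns (attempt_number, elapsed_seconds) for the first matching reply,
    or None if every attempt timed out.
    """
    query = build_query(name, qtype)
    start = time.monotonic()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for attempt in range(1, attempts + 1):
            sock.sendto(query, (server, 53))
            try:
                reply, _ = sock.recvfrom(4096)
            except socket.timeout:
                continue  # no answer in time: resend, like the retransmissions in the capture
            if reply[:2] == query[:2]:  # transaction ID matches our query
                return attempt, time.monotonic() - start
    return None


# Hypothetical usage (placeholder DC address and domain):
# print(time_query("192.0.2.10", "_ldap._tcp.dc._msdcs.lab.example"))
```

An answer on the first attempt from the same node, but only after one or more resends from another node, would match the retransmission pattern described above.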
You might enable full DNS logging on the DNS server to confirm things are OK.
Perform various diagnostics as a test before the promotion?
dcdiag /test:dns /v /s:<DCName> /DnsBasic /f:dcdiagreport.txt
dcdiag /test:dns /DnsRecordRegistration
dcdiag /test:dns /v /s:<DCName> /DnsDynamicUpdate
Wonder what you get when you try this:
nltest /dsgetdc:<DNS domain name> /force
I tried pretty much all of your suggestions but this is a great collection of troubleshooting options!
The solution I found yesterday is an issue I had never come across before. I checked the traffic that went over the firewall thoroughly and saw errors that said "Bad Checksum" coming from the VMs. So I googled this and came across a similar issue with Broadcom NICs. After narrowing the search to the Intel X722 NICs that are in the Lenovo hosts, I found this:
So apparently Windows introduced new options within the NIC settings that lead to these errors. None of the suggestions in that thread helped though, so I installed Server 2016 DCs and it works now. So I know the source of the issue now, but not the final solution yet. But it is definitely a Windows Server 2019 and above problem - that's why the other virtual environment had the same issues - Intel NICs.
I don't know if there are new drivers out yet to fix the problem, but as the problems are coming from the guest I don't even know if that would help. Apparently there are new drivers which have the said options disabled by default, but I guess that will only help when Windows is installed on bare metal with direct access to the NIC, not on top of a hypervisor.
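For readers wondering what a "Bad Checksum" error actually checks: IPv4, TCP and UDP use the 16-bit ones' complement checksum described in RFC 1071, and a receiver or capture tool flags a packet when recomputing it over the received bytes does not come out to zero. Below is a minimal sketch of that check for an IPv4 header. The helper names are made up for illustration, and the sample header is a generic textbook example, not taken from the captures in this thread.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum as described in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # sum big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def ipv4_header_ok(header: bytes) -> bool:
    """A header with a correct checksum field sums to zero."""
    return internet_checksum(header) == 0


# Sample 20-byte IPv4 header (illustrative values only).
sample = bytes.fromhex("450000730000400040" "11b861c0a80001c0a800c7")
print(ipv4_header_ok(sample))

# Flip one byte of the source address to simulate corruption.
corrupted = sample[:12] + b"\xc1" + sample[13:]
print(ipv4_header_ok(corrupted))
```

One caveat when reading captures: with checksum offload enabled, a capture taken on the sending machine itself can show bad checksums that the NIC fills in later, so only captures taken on another device along the path are conclusive.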
Can you share the domain name in use? Please also share your final queries.
Can you tell me why the domain name is relevant for this issue?
And what do you mean by my final queries?
I just wanted to check if we can help in any way to move forward
We are having the same issue with a Lenovo HX2320 cluster (AHV) with Intel X722 NICs. Only Win Server 2022 DCs are impacted (Win Server 2019 DCs are OK). Has a fix been found?
Install the new .17 Virt-io drivers on 2022. I had a similar issue and it resolved it.
We’ve actually done quite a bit of digging with the help of Nutanix support, and on our end the problem is that on VMs with the virtio 1.1.7 exe installed, installing NGT would revert the drivers to 22.214.171.124.
As a workaround, uninstalling either the virtio or NGT fixes the issue.