highflyer - we went through a CVM re-IP'ing because of a VLAN change. It is not pretty any way you do it. I will say that we basically rewrote support's document on how to do it properly. So if you contact them, they should be able to help you now. If not, send me a message and I'll see if we can help.
While working on changing IPs, my zk_server_config_file got a line added to it.
Even if you have 4 nodes, you should still have only three "zk lines" in the file.
My file had:
# Version 10
10.0.0.1 zk1 # DON'T TOUCH THIS LINE
10.0.0.2 zk2 # DON'T TOUCH THIS LINE
10.0.0.3 zk3 # DON'T TOUCH THIS LINE
After the IP change, it read:
# Version 10
192.168.0.1 zk1 # DON'T TOUCH THIS LINE
192.168.0.2 zk2 # DON'T TOUCH THIS LINE
192.168.0.3 zk3 # DON'T TOUCH THIS LINE
192.168.0.4 zk4 # DON'T TOUCH THIS LINE
When I ran svmips, I got:
Can't find a path to this node.
Once I changed the file in vi and rebooted the CVM (with Dell's and Nutanix's help), the cluster worked and Stargate was no longer down.
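As a quick sanity check after a re-IP, you can count the zk mapping lines; on a healthy cluster there should be exactly three, regardless of node count. This is only a sketch: the sample file below mirrors the broken config from the post, and the /tmp path is illustrative (use the real zk_server_config_file location on your CVM, and let support drive any actual edit).

```shell
# Recreate the broken post-re-IP config as sample data (illustrative path).
cat > /tmp/zk_server_config_file <<'EOF'
# Version 10
192.168.0.1 zk1 # DON'T TOUCH THIS LINE
192.168.0.2 zk2 # DON'T TOUCH THIS LINE
192.168.0.3 zk3 # DON'T TOUCH THIS LINE
192.168.0.4 zk4 # DON'T TOUCH THIS LINE
EOF

# Count zk mapping lines; exactly 3 is expected on a healthy cluster.
count=$(grep -c "DON'T TOUCH THIS LINE" /tmp/zk_server_config_file)
echo "zk mappings: $count"
if [ "$count" -ne 3 ]; then
  echo "WARNING: expected 3 zk lines, found $count -- check for a stray zk4 entry"
fi
```

With the extra zk4 entry above, this prints "zk mappings: 4" and the warning, which is exactly the symptom that broke svmips for me.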
We have found the issue with the CRC errors and the host time-outs in ESXi 6.0. The CRC errors came from an old ESXi 4.1 cluster. That cluster is now completely destroyed.
The time-out issue was interesting! It seems that our NVIDIA GRID GPU driver had some strange issues on ESXi 6.0. You saw in the log that hostd was crashing through the GPU driver. After an NVIDIA driver update, the issues were gone.
In order to identify whether the GPU driver was causing the hostd crash, we needed to extract the hostd zdump and then validate the backtrace.
Portion of the coredump backtrace:
#0 0x055fc092 in _dl_sysinfo_int80 () from /tmp/debug-uw.BVaHOwcq/lib/ld-linux.so.2
#1 0x0ad5bca5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:67
#2 0x0ad5d4e3 in abort () at abort.c:92
#3 0x05caecf8 in Vmacore::System::SignalTerminateHandler (info=0x446020a8, ctx=0x44602124) at bora/vim/lib/vmacore/posix/defSigHandlers.cpp:65
#4 0x00215002 in ?? ()
#5 0x0a2ebafc in XextFindDisplay (extinfo=0x42a01498, dpy=0x50c287b0) at /build/mts/release/bora-1914620/cayman_xorg/Xorg/libXext/src/extutil.c:231
#6 0x0a298ea0 in GPUFindDisplay (dpy=0x50c287b0) at /build/mts/release/bora-1914620/cayman_xorg/Xorg/libgpu/gpu.c:31
#7 0x0a299105 in GPUQueryPCIID (dpy=0x50c287b0, domain=0x446022f8, bus=0x446022fc, dev=0x44602300, func=0x44602304)
#8 0x0a25762e in VmkCtl::Graphics::GraphicsInfoImpl::GetGraphicsInfoById (this=0x4965d3e8, seg=0, bus=135 '\207', dev=0 '\000', func=0 '\000', numSharedDevices=8,
memorySizeInKB=@0x51ea8a0c, vmWorldIDs=...) at bora/lib/vmkctl/graphics/GraphicsInfoImpl.cpp:257
#9 0x0a257b37 in VmkCtl::Graphics::GraphicsInfoImpl::GetAllGraphicsDevices (this=0x4965d3e8) at bora/lib/vmkctl/graphics/GraphicsInfoImpl.cpp:132
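The tell-tale sign is frames from the GPU library (GPUFindDisplay, GPUQueryPCIID) sitting right under the abort. As a rough sketch of that check, you can grep the extracted backtrace text for those frames; the abridged trace below stands in for the real file, and the /tmp path is illustrative.

```shell
# Write an abridged copy of the backtrace above (frames trimmed for brevity).
cat > /tmp/hostd_backtrace.txt <<'EOF'
#5 XextFindDisplay
#6 GPUFindDisplay
#7 GPUQueryPCIID
EOF

# Presence of GPU-library frames points the crash at the GPU driver.
if grep -qE 'GPU(FindDisplay|QueryPCIID)' /tmp/hostd_backtrace.txt; then
  echo "GPU driver frames present in hostd backtrace"
fi
```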
Therefore I had to update the driver for the NVIDIA G1 GRID adapter as per the VMware HCL.
I also updated the ESXi host to the latest version, 6.0 U2 / Patch 03.
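To confirm which NVIDIA VIB is actually installed before and after the update, you can list the installed VIBs on the host. The sketch below uses a placeholder line in place of real esxcli output so the filter can be shown end to end; `<driver-version>` is a stand-in, not the version we ran.

```shell
# On a live ESXi host you would run:
#   esxcli software vib list | grep -i nvidia
# Here a placeholder line stands in for that output (version is a stand-in).
sample='NVIDIA-VMware_ESXi_6.0_Host_Driver  <driver-version>  NVIDIA  VMwareAccepted'
echo "$sample" | grep -ci 'nvidia'
```

A count of 1 (or more) means an NVIDIA driver VIB is present; check the version column against the VMware HCL entry for your card.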
I am really having issues getting a working iteration of remote-site replication. Apparently there is a known issue with dedupe and replication in which replication performance is seriously hampered. Anxiously waiting for the Asterix release to address this!
I'm working on migrating our existing workload from VMware to AHV. The migration process is actually pretty painless, and with the exception of a couple of 'missing features', I think it's going to work out great!