After a hard node failure, there is additional data in the replica copy.

Badge +7
Hi all, I ran a data resilience test, but the result is not what I expected. When the VM restarted on another node, I checked the text file that was being written to and found extra whitespace at the end of the file. Any feedback would be appreciated, and it would be great if someone else could run the same test and let me know how it went for them.

Here is the scenario:
I have a 3-node test cluster (node1, node2, node3) running AHV CE edition 2016-04-19.

On node1:
- I have a basic guest VM called "Win2012R2-vm-101" with a script that constantly appends data "date and time" to a txt file as a log file simulation.

I then simulate a node failure by forcing node1 to power off.
The Nutanix cluster recovers as expected, powering on the guest VM "Win2012R2-vm-101" on the next available node, in this case node2.

When I log on to the recovered "Win2012R2-vm-101" guest VM and open the txt file that was being written to, I notice that there is additional whitespace at the end of the file. That is odd, because the script only ever writes one line at a time to the file. See script below.

Test_RF2.bat - script that was running when testing hard node loss.
-----------------------
@echo off
rem Append "date time random" to the log file in an endless loop.
:repeat
set msg=%date% %time% %random%
echo %msg%
echo %msg% >> C:Last_write_to_file.txt
goto :repeat

I also took video of my test if anyone requires it.


12 replies

Userlevel 6
Badge +29
So, the only thing that happens is extra whitespace at the end of one file on one guest?

That seems extremely odd.
Badge +7
Hi Jon,
Only one guest had any write activity at the time and the extra data is odd, I agree.

Here is an unlisted video of the test:
(part1 - host loss and file check)
(part2 - host recovery)
Userlevel 6
Badge +29
We don't acknowledge data unless it's acked in more than one place, so data integrity is the #1 thing we do.

I was saying it was odd because if you WERE going to have some sort of corruption, a little bit of whitespace at the end of one log file, out of the entire file system, isn't the place you'd find it.

Normally things don't boot, and the system would go nuts complaining about corrupt data.
Badge +7
I have investigated further, still using Nutanix CE with AHV, and found that the extra data also appears if you force power off a guest VM while it is running. I don't know whether that gives any clues as to why this is happening. I'll post back here when I have performed the same test with Windows 2012 R2 installed natively on physical hardware.
Userlevel 1
Badge +10
Can you try with a slightly different script on Nutanix?
Instead of your script use this one:
@echo off
:repeat
echo "test">> C:Last_write_to_file.txt
goto :repeat

and see if you still get extra space at the end. Your script as written has a space after the msg variable, so if the variable is empty for some reason, only the space will be written to the file.

Badge +7
Hi, that is unlikely to change the result, as the video clearly shows that %msg% wasn't blank: it is echoed to the console before being appended to the file. In addition, if that were the case we would see a single space followed by a newline, repeated. Instead we see one long run of spaces followed by a newline.
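One way to narrow this down is to look at the raw bytes at the end of the file rather than at how an editor renders them. The following is a minimal Python sketch, not something used in the thread; the function names and the 64 KiB read window are illustrative choices. It counts the run of padding bytes at the end of the file and reports which byte values it contains. The working assumption, which would need to be verified, is that a run of NUL bytes (0x00, shown as blanks by many editors) would point to the file having been extended before its data reached disk, while real spaces (0x20) would point back at whatever wrote the file.

```python
import os
from collections import Counter


def describe_tail(data: bytes, padding: bytes = b" \x00\t") -> dict:
    """Count the run of padding bytes at the end of data.

    A final CR/LF sequence is ignored first, because the thread describes
    the padding as being followed by a newline.
    """
    body = data.rstrip(b"\r\n")
    stripped = body.rstrip(padding)
    tail = body[len(stripped):]
    return {
        "padding_length": len(tail),
        "byte_counts": {f"0x{b:02x}": n for b, n in Counter(tail).items()},
    }


def describe_file_tail(path: str, window: int = 65536) -> dict:
    """Apply describe_tail to the last `window` bytes of a file."""
    with open(path, "rb") as fh:
        fh.seek(0, os.SEEK_END)
        size = fh.tell()
        fh.seek(max(0, size - window))
        return describe_tail(fh.read())
```

Note that the original script writes one space before each newline ("echo %msg% >>"), so a padding length of 1 made up of 0x20 is the normal, healthy result for that file.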
Badge +7
OK, as I mentioned before, I reran the test after installing Windows Server directly onto the physical hardware. While the script was running, busily appending data to a file, I performed a hard power off, then powered the server back on to check the file. The file contained no additional whitespace. So I believe that data replication (NDFS) between hosts is working fine, and I can only assume that the problem lies with the combination of AHV and Windows Server 2012 R2. It might even be the virtio SCSI driver.
Userlevel 6
Badge +29
This still has me scratching my head a bit. Are you saying that something within AHV is causing an additional white space to be appended?

Normally, if that were the case, you'd be getting whitespace appended in all files that are open and being written, which would mean instantaneous OS corruption, and things would be really going downhill.

Given that doesn't seem to be happening, I'm not sure where to take this one.

Anyone else on the forum want to repeat this on their kit to see if we can get a repro?
Userlevel 4
Badge +20
I don't have a kit available to reproduce at the moment, but I'd side with Nemat on this one. My gut says it has to be linked with how the original script works. As the script makes calls to other routines (to get the time, date and random string), it may just be that there is a cache/flush issue somewhere in the script (as in a C program where printf buffers its output but a raw write does not). Unfortunately I don't know enough about the innards of Windows and bat scripts to be more definitive about that.
Userlevel 2
Badge +14
Well, I don't know how redirection works on text files, but I presume it writes a whole line, so this is not a "good" way to test.

You need to write something that opens a file handle, pushes data, and closes the file handle, so you can control exactly how the data gets written (with no linefeeds etc.).

The best would be a file where you just append data with no linefeeds/newlines at all. A simple way would be to write a small KiXtart, VBScript, or PowerShell script instead.
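As a rough illustration of that idea, here is a minimal sketch in Python rather than the scripting languages named above. The helper names and the fixed-width record format are assumptions made for the example. Each record is appended through an explicit file handle, flushed from the process buffer, and then pushed to the device with os.fsync before the handle is closed. Because every record has the same width and a predictable value, the file can be checked after a crash for the first byte offset where the content stops matching.

```python
import os

RECORD_WIDTH = 9  # 8 digits plus a ';' terminator, no newlines


def make_record(seq: int) -> str:
    """Build the fixed-width record for sequence number seq."""
    return f"{seq:08d};"


def append_record(path: str, record: str) -> None:
    """Append one record and force it out of process and OS caches."""
    with open(path, "ab") as fh:           # explicit handle, binary append
        fh.write(record.encode("ascii"))   # exactly the bytes given, no newline
        fh.flush()                         # drain Python's own buffer
        os.fsync(fh.fileno())              # ask the OS to commit to the device


def first_bad_offset(data: bytes) -> int:
    """Return the offset of the first record that is not the expected one.

    Returns -1 if every record matches. A torn (partial) final record or
    trailing padding is reported at the offset where it starts.
    """
    for offset in range(0, len(data), RECORD_WIDTH):
        expected = make_record(offset // RECORD_WIDTH).encode("ascii")
        if data[offset:offset + RECORD_WIDTH] != expected:
            return offset
    return -1
```

A test run would call append_record(path, make_record(n)) in a loop with increasing n, pull the power, and then pass the recovered file's bytes to first_bad_offset. Syncing on every record is slow, but it removes guest-side buffering as a variable.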

And by the way, I'm very sure that a missing block would cause a corrupt file rather than whitespace. Whitespace (or a newline) is actually a character, not just "nothing".

I have tested / simulated host failures 10-20 times, even under stress, and have never seen corruption or missing data. That was tested on SQL and many other types of software. And if I had just 1 wrong character in a SQL binary file, the server would never mount the database.
Badge +7
Yes, AHV or the virtio driver, that's all I can deduce, and the fact that I could repeat the test multiple times suggests that it is worth keeping an eye on.

Back to testing other platforms for me atm. As for the stability of the script: over the past 10 years it has been used on more than 1000 Windows machines and servers, and appending to a file in a batch script has never produced additional whitespace in the file after any number of disasters (power, network, RAID, force quit etc.).
Userlevel 6
Badge +29
Sure, and I'm not suggesting your script is wrong; it's a solid idea.

We'll keep an eye on this.