Issues with Ubuntu and Debian Kernels corrupting due to SSDs.


Badge +5
We have an NX-3050 and we frequently have to rebuild Linux VMs because their ext4 filesystems become corrupted and go read-only. Our research has pointed us to articles describing Linux kernel issues with SSDs. Has anyone else experienced this and, if so, how did you solve it?


Edit: We created a container that bypasses the SSDs and we have not yet seen the issue there, but we would love to re-engage the SSDs on our servers. The Linux distro/version is Ubuntu 12.04.3 LTS.

One of the articles we found relating to this is: http://askubuntu.com/questions/262717/ubuntu-12-04-ssd-root-frequent-random-read-only-file-system

I hope this helps if any of you have experienced this issue.
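For anyone trying to catch the read-only remount as it happens, here is a minimal sketch that lists ext4 filesystems currently mounted read-only inside a guest. It assumes the standard Linux /proc/mounts layout (device, mount point, type, options); the function name is just illustrative.

```python
def read_only_mounts(mounts_text, fstype="ext4"):
    """Return mount points of the given filesystem type that are mounted read-only.

    mounts_text is the content of /proc/mounts, one mount per line:
    device mountpoint fstype options dump pass
    """
    read_only = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, mount_type, options = fields[1], fields[2], fields[3]
        # Split on commas so "errors=remount-ro" is not mistaken for "ro".
        if mount_type == fstype and "ro" in options.split(","):
            read_only.append(mount_point)
    return read_only


if __name__ == "__main__":
    with open("/proc/mounts") as handle:
        for mount_point in read_only_mounts(handle.read()):
            print("read-only:", mount_point)
```

Running this from cron and alerting on any output would at least give a timestamp close to when the filesystem flipped, which helps when lining it up with the logs.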

8 replies

Badge +8
Did you see similar issues with Windows or CentOS? From the guest OS's perspective, the disk is an HDD and should not be seen as an SSD, so this kernel bug should not matter.

It would be surprising if bypassing the SSDs in a container fixed this issue.
Badge +4
I've noticed similar issues. In our case we have template VMs which are left shut off, and we clone from them as required to create usable VMs. When running Ubuntu 12.04 LTS we started getting regular disk I/O errors, which resulted in the VM being paused and, in bad cases, the drive becoming read-only as the OP reported.

We only needed Ubuntu for a single application which now runs on CentOS 6. We have not seen this issue with CentOS under identical circumstances.
Badge +5
@jerome

The issue confused us too, as I agree that the guest OS should have no idea the disks are SSDs, but the errors we found in the logs were consistent with the bug referred to above. Windows has yet to have a problem, and we haven't tried CentOS yet, as one of the blogs we found described a similar issue (if not the same one) on CentOS with SSDs. We only created the container that *should* ignore the SSDs a few weeks ago, but so far there haven't been any issues. If all else fails, we have a NAS presented to vSphere via NFS that we could put our Linux boxes on. With the new container and the updated kernel, things are looking good.

@kiboro

I am glad I'm not the only one, and I am very happy to hear about CentOS; we may have to switch. I am still fairly new to Linux (about 1 year of experience) and have been on Ubuntu and Debian from the start. Have you reported the issue to Nutanix at all? When I first talked to them about it, they hadn't heard of it.
Badge +4
Do you all have copies of the logs or a screenshot of the failure, e.g. /var/log/messages or /var/log/dmesg?

Since the hypervisor presents a disk, the underlying storage should be transparent to the VM, especially given how the distributed file system in a cluster works.

Is this KVM, vSphere, or Hyper-V?
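If it helps with digging through those logs, below is a rough sketch that pulls out the lines most likely to matter. The patterns are only illustrative guesses at typical ext4 and block-layer messages (the exact kernel wording varies by version), and the log path is an assumption: on Ubuntu the kernel messages usually land in /var/log/syslog or /var/log/kern.log rather than /var/log/messages.

```python
import re

# Illustrative patterns only; adjust to whatever the affected kernel actually logs.
SUSPECT_PATTERNS = [
    re.compile(r"EXT4-fs error", re.IGNORECASE),
    re.compile(r"remounting filesystem read-only", re.IGNORECASE),
    re.compile(r"I/O error", re.IGNORECASE),
    re.compile(r"aborting journal", re.IGNORECASE),
]


def find_suspect_lines(lines):
    """Return (line_number, text) pairs for lines matching any suspect pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(pattern.search(line) for pattern in SUSPECT_PATTERNS):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_file(path):
    """Scan one log file, tolerating any undecodable bytes."""
    with open(path, errors="replace") as handle:
        return find_suspect_lines(handle)


if __name__ == "__main__":
    # Assumed path; substitute the log file present on the affected VM.
    for number, text in scan_file("/var/log/syslog"):
        print(f"{number}: {text}")
```

The same function can be pointed at a saved copy of the logs from an archived VM, so the matching lines can be pasted into the thread or attached to a support case.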
Badge +4
In my case it was Ubuntu and, on one occasion, Windows under KVM. The VMs paused, and the only evidence at all was in /var/log/libvirt/qemu, which reported disk I/O errors; that was why the VMs had been paused. The VMs restarted OK without corruption.

I have never seen this issue with CentOS, and never with non-cloned drives.

If removing the SSD tier stops it (not that I've tried), then I suspect some sort of latency issue. I wonder if copy-on-write on a cloned disk causes the SSD cache layer to trip up briefly, whereas running off spinning disk is slow enough for this not to happen. At that point the difference might come down to kernel parameter tweaks, which might explain why CentOS doesn't exhibit the same behaviour.
Badge +4
All valid thought processes. It would be helpful to check the logs when the issue occurs. Was there any particular event going on, such as high disk I/O or some process on the VM?

It would be a good idea to contact support while a VM is paused so that, worst case, support may be able to find the root cause.
Badge +4
This was a while ago now. We needed to use Ubuntu for a particular application which has now been ported to our default distro, CentOS. I did raise a ticket and there was nothing of any note in any of the logs. Nutanix support also spent quite a while looking into it. The problem has never occurred under CentOS, even with a setup identical to the Ubuntu one.
Badge +5
I will have to see if I can scare up the logs. I have switched everything over to CentOS now but I might still have a couple of the VMs archived. I will add them to this thread if I can find them.

This is on vSphere 5.1 and the Ubuntu version was 12.04.3 LTS.

@kiboro Interesting that you mention cloned drives. I haven't been cloning the drives of our CentOS boxes, but I did with Ubuntu; I wonder if that is a factor.

Removing the SSD layer did not work; we had some fail even in the "noSSD" container.

@swatkins We have never had I/O that the Nutanix would consider high; I think the highest we have seen is about 2000 IOPS (during a spike).

I did contact Nutanix and after reviewing everything they recommended opening a case with VMware, but we haven't gotten around to that since we switched to CentOS (it was time to update anyways).

I had first thought it was a side effect of converting the machines (pre-Nutanix, we were on Hyper-V 2).

Anyways, I will try to find the logs, but we consider this resolved since CentOS is working for us.

Thanks everyone!!
