IOPS & Latency issue
Hello mates,

We have 2x NX-3350 and an Arista switch, which handle roughly 100 VMs.

Workloads are:

20x VMs with a heavy disk workload, used mainly for reporting and analysis. The R/W ratio is roughly 50/50. They generated a 20K IOPS workload when they lived on SAN storage with RAID10 and a tiny flash tier, and the average latency was always below 5 ms.

10x VMs used for application virtualization

70x VMs used for desktop virtualization.



Now that we have migrated to Nutanix, we get only 4K cluster IOPS and 20 ms latency, which does not seem very good to us.



Trying to resolve the issue, we enabled inline compression and increased CVM memory to 20 GB. We also tried changing the tier sequential write priority. Unfortunately, this did not help.



NCC, cluster status and Prism health all claim that everything is OK. Before we migrated our environment, we ran the diagnostics VM and the result was roughly 100K IOPS for reads.



Here is the current configuration of the cluster:

NOS version: 4.0.1.1

1 storage pool

7 containers with inline compression enabled



Also, please see the 2009/latency page output on one node as an example:



reads:

nfs_adapter:



Stage                   Avg Latency (us)   Op count   Latency % (total / component)   Op count % (total / component)

nfs_adapter component   42348              273618     100 / 100                       100 / 100

RangeLocksAcquired      0                  273618     0 / 0                           100 / 100

InodeLockAcquired       5                  273830     0 / 0                           100 / 100

SentToAdmctl            4                  273772     0 / 0                           100 / 100

stargate_admctl         42303              273772     99 / 99                         100 / 100

AdmctlDone              2                  273772     0 / 0                           100 / 100

Finish                  8                  273828     0 / 0                           100 / 100

writes:

nfs_adapter:



Stage                   Avg Latency (us)   Op count   Latency % (total / component)   Op count % (total / component)

nfs_adapter component   21420              150682     100 / 100                       100 / 100

RangeLocksAcquired      0                  150682     0 / 0                           100 / 100

InodeLockAcquired       16                 152838     0 / 0                           101 / 101

SentToAdmctl            16                 148526     0 / 0                           98 / 98

stargate_admctl         21506              148526     98 / 98                         98 / 98

AdmctlDone              2                  148526     0 / 0                           98 / 98

Finish                  182                152831     0 / 0                           101 / 101



Any ideas please?



Br,



Update

We ran the diagnostics VM today during the production workload. Here is the output:



Waiting for the hot cache to flush ........... done.

Running test 'Sequential write bandwidth' ...

Begin fio_seq_write: Wed Oct 1 11:52:11 2014



1475 MBps

End fio_seq_write: Wed Oct 1 11:53:07 2014



Duration fio_seq_write : 56 secs

*******************************************************************************



Waiting for the hot cache to flush ............. done.

Running test 'Sequential read bandwidth' ...

Begin fio_seq_read: Wed Oct 1 11:54:16 2014



5104 MBps

End fio_seq_read: Wed Oct 1 11:54:33 2014



Duration fio_seq_read : 17 secs

*******************************************************************************



Waiting for the hot cache to flush ......... done.

Running test 'Random read IOPS' ...

Begin fio_rand_read: Wed Oct 1 11:55:20 2014



123849 IOPS

End fio_rand_read: Wed Oct 1 11:57:02 2014



Duration fio_rand_read : 102 secs

*******************************************************************************



Waiting for the hot cache to flush ....... done.

Running test 'Random write IOPS' ...

Begin fio_rand_write: Wed Oct 1 11:57:38 2014



85467 IOPS

End fio_rand_write: Wed Oct 1 11:59:20 2014



Duration fio_rand_write : 102 secs

*******************************************************************************



Tests done.



Update

We have found that the problem is in the hypervisor datastore layout. If I mount the Nutanix container inside the test virtual machine, everything works as it should and we get nice results. Once the VM writes data through the hypervisor datastore, the latency and IOPS are bad again. What could be the reason for this?
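
For reference, mounting the container directly inside a Linux guest and running a quick test looks roughly like this (the IP address, container name and mount point are placeholders, not our real values, and the client IP has to be in the cluster's filesystem whitelist):

# Mount the Nutanix container over NFS inside the guest (placeholder IP/container)
sudo mount -t nfs 10.0.0.50:/ctr1 /mnt/ctr1

# Quick sequential-write check with direct I/O, bypassing the guest page cache
fio --name=seqwrite --directory=/mnt/ctr1 --rw=write --bs=1m --size=4g \
    --direct=1 --ioengine=libaio --iodepth=8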
Hi,



You should migrate to Supermicro and some really cool Unix system clone.



SY
Hi



Clearly, it's a configuration issue.



6x 3050 nodes should give you at least 120,000 / 60,000 random IOPS (hot tier).



Please open a support case; we will definitely help you fix the problem.
In fact, if our diagnostics VM runs fine, the issue is probably related to your guest VM configuration.



Multiple vdisks should be attached (they can be unified with LVM, for example) to get more performance out of a VM, as Nutanix OS limits the oplog size per vdisk (to avoid the "noisy neighbour" problem).
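
For illustration, a minimal sketch of such a layout inside a Linux guest, assuming the extra vdisks show up as /dev/sdb through /dev/sde (device names and sizes are hypothetical):

# Build one striped logical volume across four vdisks
pvcreate /dev/sdb /dev/sdc /dev/sdd /dev/sde
vgcreate vg_data /dev/sdb /dev/sdc /dev/sdd /dev/sde
lvcreate -i 4 -I 64 -l 100%FREE -n lv_data vg_data   # -i 4 = stripe over 4 PVs, -I 64 = 64 KiB stripe size
mkfs.ext4 /dev/vg_data/lv_data
mount /dev/vg_data/lv_data /data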



In fact, you can look inside our diagnostics VM (the password is the standard one); it's CentOS 6.5. There are multiple vdisks connected and managed by LVM. This way you can get much better performance: 15-20K IOPS per single VM.



It is not necessary to use the VMware paravirtual adapter for Linux VMs; the standard SCSI / SAS adapter will work fine.
Thanks for the reply,



>Multiple vdisks should be attached (they can be unified with LVM, for example) to get more performance out of a VM, as Nutanix OS limits the oplog size per vdisk (to avoid the "noisy neighbour" problem).



Why then does my test VM work fine when I mount the container inside it and run IO tests? It has only one vdisk, and when I run, for example, a sequential write test, the result is ~270 MB/s throughput.



Also, roughly half a year ago, we tested the same model, 1x NX-3350. I ran 4 VMs on each node:

1 with a random read load, 1 with a random write load, 1 with a sequential write load and 1 with a sequential read load. In total, I had 12 VMs on 3 nodes. And I did not notice any per-VM oplog limits, because at that time we got 50K read IOPS and 30K write IOPS during more than 20 hours of testing. The only difference was the NOS version: it was 3.5.
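
For context, those four workload types can be generated with fio invocations roughly like the ones below (file path, sizes and queue depths are illustrative, not the exact parameters we used back then); direct I/O keeps the guest RAM cache out of the picture:

# Illustrative fio jobs: 4K random read/write and 1M sequential read/write
fio --name=randread  --filename=/data/testfile --rw=randread  --bs=4k --direct=1 --ioengine=libaio --iodepth=32 --size=20g --runtime=600 --time_based
fio --name=randwrite --filename=/data/testfile --rw=randwrite --bs=4k --direct=1 --ioengine=libaio --iodepth=32 --size=20g --runtime=600 --time_based
fio --name=seqread   --filename=/data/testfile --rw=read      --bs=1m --direct=1 --ioengine=libaio --iodepth=8  --size=20g --runtime=600 --time_based
fio --name=seqwrite  --filename=/data/testfile --rw=write     --bs=1m --direct=1 --ioengine=libaio --iodepth=8  --size=20g --runtime=600 --time_based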
"time we got 50K read IOPS and 30K write IOPS"



I am not sure if it is possible at all :)



You can never get 50k IOPS on a single VM without multiple disks attached.

Again, this is normal / expected behaviour by design (covered by multiple guides, for example the MS SQL on Nutanix Best Practices).



Probably your tests were incorrect (RAM cache used, etc.).



Again, if you don't have any issues with the NFS-mounted container (and our diagnostics VM runs fine, showing perfect performance), it means that something is wrong with the guest VM configuration.



It is also possible that you've got some flapping / network problem.



The only fast way to fix this issue is to open a support case. Obviously, the situation is not normal.
Thanks for helping me,



>You can never get 50k IOPS on a single VM without multiple disks attached.

>Again, this is normal / expected behaviour by design (covered by multiple guides, for example the MS SQL on Nutanix Best Practices).



Not on a single VM, but we had 4 VMs on each node, one for each type of workload. In total, we had 12 VMs. And we got approximately 40K IOPS. Those VMs were simple Ubuntu VMs running fio with no advanced configuration.



>The only fast way to fix this issue is to open a support case. Obviously, the situation is not normal.

Thank you for the advice. I've already opened a case for this issue.



I am not saying we have trouble with the NFS container. Again, as I said before, when we mounted the NFS container inside the VM, we got perfect results. We are facing performance problems only when we read/write through the hypervisor datastore.
>In fact, you can look inside our diagnostics VM (the password is the standard one); it's CentOS 6.5. There are multiple vdisks connected and managed by LVM. This way you can get much better performance: 15-20K IOPS per single VM.



We've deployed the diagnostics VM with VMware Workstation, and it has only one vdisk, both in the VM settings and inside the VM. Please could you explain a little more how we can see multiple drives in the diagnostics VM?



Today we performed such a test with a spanned volume created from 8 vdisks. Unfortunately, we got only 1.4K IOPS for random reads. And by the way, we have moved all the heavy VMs off Nutanix, so it now handles only the VDI and thin-app VMs.
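
For reference, this is how we check the attached disks and the LVM layout inside the guest (standard Linux tools). One thing I am not sure about yet, just a guess: a spanned (linear) volume concatenates the vdisks, so if the test file fits inside the first vdisk most I/O still hits a single vdisk, whereas a striped volume (lvcreate -i) spreads it evenly:

# Inspect attached disks and the LVM layout inside the guest
lsblk
pvs
vgs
lvs -o +devices,stripes   # shows which vdisks back each LV and whether it is striped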
Can you share the fio script/parameters that you used? I would like to see if I can reproduce this.



From what I understand, you have an Ubuntu guest VM running on an ESX hypervisor. Running fio inside this VM on ESX gives much worse performance than if you mount the NFS container directly.
TenKe



I head Support for Nutanix, and I understand that my team is working with you to collect performance data on your cluster, which we have engineers standing by to analyze when we receive it. The diagnosis so far has not pointed to any issues in the product, but we need the data to dig deeper and get closer to the root cause.



Thanks for your support, and for being a Nutanix customer.
Which version of ESXi is being used?
Which version of ESXi is being used?

ESXi 5.1 U2



Thank you, mates, for your assistance. I will reply ASAP.