Question

How to "size" a Nutanix cluster based on performance requirement?

  • 22 April 2019
  • 3 replies
  • 4344 views

Hi guys, we're planning to migrate our MS SQL workload from a standalone physical host to a new Nutanix cluster, and we've already engaged our local Nutanix representative for this. However, we've hit a few performance hurdles during testing, and we can't be sure whether our sizing is right for our performance requirement.

Current Production Physical host:-
Processor: Intel Xeon E5-2690v3 @ 2.6GHz (single socket, 12C 24T)
RAM: 96GB (however, only 64GB is usable by the MS SQL 2008 R2 database)
Disks: the database MDF resides on a 6-disk SSD RAID 5 disk group; the OS, LDF, etc. reside on other disk groups (10K/15K SAS in RAID 10, RAID 1, etc.)


Test Unit (3 nodes):-
Processor: Intel Xeon E5-2650 v4 @ 2.2 GHz (dual socket, 24C 48T) per node
RAM: 512GB per node
Disks: 6 x 1.6TB Intel S3610 SATA SSDs per node (all-flash)
Hypervisor: AHV 5.9.2.4
Network: 10GbE RJ45

SQL Test VM created in Nutanix test cluster:-
12 vCPU
96GB RAM

CVM size:
12 vCPU
32GB RAM

Test result from MS SQL SET STATISTICS TIME ON:

[Screenshot of the timing comparison not reproduced.]

Questions:
We tested different types of queries and compared the results from the physical host and the test unit; however, the results were not what we expected.
As observed in the screenshot above, the initial run of a SQL query on Nutanix took much longer than on the physical host.
The initial run always reads from disk, after which the query results are cached in RAM.
Hence, the second run of the query produces almost the same timing on both Nutanix and the physical host.
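
For anyone who wants to reproduce the comparison, this is roughly how we time the cold and warm runs. It's only a sketch: dbo.BigTable and the WHERE clause stand in for our real tables and queries, and DBCC DROPCLEANBUFFERS should only be run on a test system.

```sql
SET STATISTICS TIME ON;   -- print CPU time and elapsed time per statement
SET STATISTICS IO ON;     -- print logical vs. physical reads per table

CHECKPOINT;               -- flush dirty pages so the drop below is complete
DBCC DROPCLEANBUFFERS;    -- evict the buffer cache (test systems only)

-- Run 1 (cold): physical reads from the disks
SELECT COUNT(*) FROM dbo.BigTable WHERE SomeColumn = 42;

-- Run 2 (warm): logical reads served from RAM
SELECT COUNT(*) FROM dbo.BigTable WHERE SomeColumn = 42;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```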

We have a rough suspicion that the lower clock speed of the CPU could be the reason for the difference in speed, but we don't have concrete proof of this.

1) How can we be sure that it's due to the lower clock speed of the CPU?
2) How do we know what's the maximum performance we can yield from the cluster?


3 replies

Userlevel 3
Badge +10
Realistically, when we increased our core count, it helped performance a lot. If you've already done so, I would say your issue isn't CPU-based. One quick way to check is a purely CPU-bound query, as in the sketch below.
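
A minimal sketch: a loop that does floating-point work and never touches a table, so storage and caching are out of the picture. Run it on both hosts with SET STATISTICS TIME ON; if the CPU-time ratio roughly tracks the clock-speed difference, that supports the clock-speed theory. The iteration count is arbitrary; raise it until a run takes a few seconds.

```sql
SET STATISTICS TIME ON;

DECLARE @i BIGINT = 0, @x FLOAT = 0;
WHILE @i < 5000000
BEGIN
    SET @x = SQRT(@i) + LOG(@i + 1);  -- some floating-point busywork
    SET @i += 1;
END;

SET STATISTICS TIME OFF;
```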

As for stress-testing a SQL Server, there are several internet resources that can help you simulate load.

e.g.

https://www.brentozar.com/archive/2015/05/how-to-fake-load-tests-with-sqlquerystress/


To simulate high storage I/O you can use a test tool called IOMeter:
https://www.itprotoday.com/mobile-management-and-security/using-iometer-measure-disk-performance
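
If you'd rather generate read I/O from inside SQL Server itself, a rough sketch like the following will keep the disks busy (test systems only, since it evicts the buffer cache each pass and requires sysadmin; dbo.BigTable is a placeholder for any large table):

```sql
DECLARE @pass INT = 0;
WHILE @pass < 10
BEGIN
    DBCC DROPCLEANBUFFERS;                  -- force the next scan to hit disk
    SELECT COUNT_BIG(*) FROM dbo.BigTable;  -- full scan, physical reads
    SET @pass += 1;
END;
```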

I'm not sure what you're using to measure your current query times, but I assume it's SQL Server Management Studio. Have the queries changed since moving over?

Above all else, if this is POC hardware, I would engage directly with your sales engineer and/or support to understand what might be going on. There may be some optimizations they recommend specific to your situation.
Hi, we're running AHV 5.9.2.4 on our test cluster.

May I know what kind of differences you experienced when moving to AHV?
Does a slower CPU affect vDisk speed?

We're starting to wonder whether a lower CPU frequency could actually reduce the throughput of the CVM.

If you're unable to do that then I would increase the CPU count to at least 14 or 16 cores and split the CPU into 2 sockets and 7 or 8 cores per CPU.

> Already done that in the beginning.


How many databases are on your data drive and how busy are they?

> The current test cluster runs only the single MS SQL VM, as I mentioned earlier, and we only run one query at a time.

I have even tried increasing the CVM RAM from 32GB to 48GB, but there was no performance improvement.
CVM vCPU utilization during high load is around 50%.

The golden question: how do we know that the new cluster we're going to purchase will deliver the speed/performance we want?
Userlevel 3
Badge +10
So my first question here is:

Are you running AHV or ESXi?

If you're running ESXi, I would look at CPU load-balancing optimizations, such as ensuring you have RSS (Receive Side Scaling) turned on. If you're using AHV, there is no need for RSS, as the VirtIO driver requires no such enhancement.

We recently ran into this on a SharePoint farm we moved over to AHV. In terms of hardware, the only difference was the CPU clock speed. Utilization was minimal (never exceeding 60%), but the decrease in clock speed appeared to make a sizable difference. We dropped 500MHz per vCPU; you're dropping 400MHz per vCPU, which across your 12 vCPUs is effectively 4.8GHz in aggregate.

Not to mention you're going from dedicated CPU threads to virtualized ones, so there is some hypervisor overhead on top of that.

My recommendation: take a look at my storage recommendations below and see if that helps. If you're unable to do that, then I would increase the CPU count to at least 14 or 16 cores and split the CPU into 2 sockets with 7 or 8 cores per socket. This should help you quite a bit, with more CPU threads available, and get you closer to what you experienced on dedicated hardware.

Dedicated hardware will always beat virtual resources in a 1:1 comparison. The behavior is completely different from a CPU and storage utilization perspective.

Second question:

How many databases are on your data drive and how busy are they?

One of the lesser-known best practices is to dedicate vdisks to your busier databases.

Think of it this way: if you have 20 databases with queries constantly reading from disk to build views, or really anything with a higher amount of IO, which would be more efficient: 20 databases sharing the storage streams of a single vdisk, or 20 databases split between two vdisks, each with its own storage streams?

All in all, if you have multiple high-IO databases, I would highly recommend dedicated vdisks for them. I know this may seem ridiculous given you're running on SSDs, but it's not always the speed of your disk; it's the available streams that make your storage fast. Having vdisks that sit on different SSDs will drive your speed up and help you take advantage of AHV Turbo (if you're using AHV). This should also reduce overall CPU overhead, so you may not need to increase your cores. A sketch of what this looks like at the SQL level is below.
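
At the SQL level, spreading a busy database over dedicated vdisks just means placing its files on different drives. A minimal sketch, assuming E:\, F:\, and L:\ are each backed by their own vdisk (all names and paths here are placeholders):

```sql
CREATE DATABASE BusyDb
ON PRIMARY
    (NAME = BusyDb_data1, FILENAME = 'E:\SQLData\BusyDb_data1.mdf'),
    (NAME = BusyDb_data2, FILENAME = 'F:\SQLData\BusyDb_data2.ndf')
LOG ON
    (NAME = BusyDb_log,   FILENAME = 'L:\SQLLogs\BusyDb_log.ldf');
```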

Separate your logs and tempdb onto dedicated vdisks as well; you'll see great improvement if you follow these practices. I'm very fortunate to have one of the very best SQL DBAs on my team (Brad McGehee):
https://bradmcgehee.com/

We split the logs and tempdb as part of our standard deployment, and we've worked closely with him over the years to improve our SQL environment. Given his standing with Microsoft and in the SQL community, I don't argue.
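
For the tempdb piece specifically, relocating its files is a one-time change. A sketch, assuming T:\ is the dedicated vdisk (the path is a placeholder, and the new locations take effect after the SQL Server service restarts):

```sql
ALTER DATABASE tempdb MODIFY FILE
    (NAME = tempdev, FILENAME = 'T:\TempDB\tempdb.mdf');
ALTER DATABASE tempdb MODIFY FILE
    (NAME = templog, FILENAME = 'T:\TempDB\templog.ldf');
```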

I hope this helps some and gets you toward what you're looking for.