HCI and the Art of Performance Measurement - Part II - Microsoft SQL Server
This blog was authored by Andy Daniel, Sr. Technical Marketing Engineer at Nutanix
In Part I of this series, I shared some of the background on our collaborative work with Enterprise Strategy Group (ESG) to test the performance consistency, predictability, and scalability of four enterprise-class, mission-critical application workloads on the Nutanix Enterprise Cloud Platform. You’ll recall that unlike the “hero number” tests from other vendors that simply generate synthetic I/O, we focused on testing realistic workloads using industry-standard application testing tools. In Part II, I’ll dive deeper into our testing of Microsoft SQL Server.
When choosing workloads for testing, we relied heavily on data and feedback from our customers. Not surprisingly, SQL Server was at the top of the list of most deployed enterprise applications. Although it’s extensively used, its performance requirements can vary widely, so we needed a reference standard for testing. Fortunately, great tools have been developed for common database workloads like online transaction processing (OLTP) and they’re relatively easy to execute once you understand the testing parameters and how they apply to real-world workloads. We ultimately turned to Quest’s Benchmark Factory, and specifically, its representation of the industry-standard TPC-E workload. Quest documentation references the TPC’s own description for the test:
“The TPC-E benchmark uses a database to model a brokerage firm with customers who generate transactions related to trades, account inquiries, and market research. The brokerage firm in turn interacts with financial markets to execute orders on behalf of the customers and updates relevant account information.
The benchmark is “scalable”, meaning that the number of customers defined for the brokerage firm can be varied to represent the workloads of different-size businesses. The benchmark defines the required mix of transactions the benchmark must maintain. The TPC-E metric is given in transactions per second (tps). It specifically refers to the number of Trade-Result transactions the server can sustain over a period of time.”
After deploying four Windows Server 2012 R2 VMs running SQL Server 2016 (one per physical node), we referenced the application’s best practice configuration on Nutanix and configured each VM accordingly. For SQL Server, guidance includes items like creating databases with more than one data file and allocating each data file on a different drive. Full details can be found in the Microsoft SQL Server Best Practices Guide. Because OLTP database tests are typically CPU bound when optimally configured on Nutanix all-flash clusters, we focused heavily on determining the optimal VM vCPU configuration to consistently deliver the most transactions per second while leaving adequate host CPU headroom. We considered 80% host CPU usage to be a typical maximum for most production systems and tuned our testing to not surpass that threshold.
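The vCPU tuning described above amounts to a simple selection rule: among the candidate configurations, keep the one with the highest tps whose host CPU usage stays at or below the 80% ceiling. The Python sketch below is illustrative only. The `TrialResult` type, the `pick_vcpu_config` helper, and the sample numbers are hypothetical and are not taken from the ESG testing.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

HOST_CPU_CEILING = 80.0  # percent; the production headroom threshold from the text


@dataclass(frozen=True)
class TrialResult:
    vcpus: int            # vCPUs assigned to the SQL Server VM (hypothetical field)
    tps: float            # sustained transactions per second for the trial
    peak_host_cpu: float  # highest host CPU utilization observed, in percent


def pick_vcpu_config(trials: Iterable[TrialResult],
                     ceiling: float = HOST_CPU_CEILING) -> Optional[TrialResult]:
    """Return the highest-tps trial whose host CPU stayed within the ceiling."""
    eligible = [t for t in trials if t.peak_host_cpu <= ceiling]
    return max(eligible, key=lambda t: t.tps, default=None)


# Made-up sample data, for illustration only.
sample_trials = [
    TrialResult(vcpus=8, tps=1900.0, peak_host_cpu=55.0),
    TrialResult(vcpus=12, tps=2500.0, peak_host_cpu=72.0),
    TrialResult(vcpus=16, tps=2700.0, peak_host_cpu=91.0),  # exceeds the ceiling
]

best = pick_vcpu_config(sample_trials)
print(best)
```

In this made-up data set the 16-vCPU trial posts the highest tps but is rejected because it breaches the ceiling, so the 12-vCPU trial is selected.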
When testing OLTP databases, there are several relevant test configuration parameters that can also significantly affect results. One of these is database size, configured as “scale factor” in Benchmark Factory. Rather than test a very small database to inflate headline tps scores, we tested a more production-like size of 300 GB (scale factor of 32) per database instance. To fully utilize the database, we ultimately settled on an agent VM generating 80 concurrent users per database and selected the “no think time” test parameter.
Since scalable performance is so important when evaluating hyperconverged infrastructures, ESG was particularly interested in evaluating OLTP performance scalability and workload distribution as more VMs and instances were added to the cluster. So, after completing several preliminary tests to determine the optimal configuration of a single VM and SQL Server instance, test runs were completed for each VM count (one to four) to ensure predictable performance scalability as we added VMs and instances. As you can see in the figure below, not only did tps scale linearly, but average transaction response time remained both low and remarkably steady.
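One common way to quantify “linear” scaling is to divide the cluster-wide tps at each VM count by the VM count times the single-VM baseline. The Python sketch below shows that calculation under stated assumptions: the `scaling_efficiency` helper is hypothetical, and the totals are round, made-up numbers rather than the measured results.

```python
from typing import Dict


def scaling_efficiency(total_tps_by_vm_count: Dict[int, float]) -> Dict[int, float]:
    """Map each VM count n to total_tps(n) / (n * total_tps(1)).

    A value of 1.0 means perfectly linear scaling relative to the
    single-VM baseline; values below 1.0 indicate sub-linear scaling.
    """
    baseline = total_tps_by_vm_count[1]
    return {
        n: total / (n * baseline)
        for n, total in sorted(total_tps_by_vm_count.items())
    }


# Hypothetical cluster-wide tps totals for one to four VMs (illustrative only).
example_totals = {1: 1000.0, 2: 1990.0, 3: 2970.0, 4: 3960.0}

efficiency = scaling_efficiency(example_totals)
for n, e in efficiency.items():
    print(f"{n} VM(s): efficiency {e:.3f}")
```

An efficiency that stays close to 1.0 at every VM count is what a straight line on a tps-versus-VMs chart represents.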
At the completion of testing, the total number of transactions per second per instance stayed within a very narrow band, from 2,635 to 2,703, with an average of 2,658. As we scaled the workload, ESG was particularly impressed with response times: “Even more impressive to ESG was the average transaction response time, which accounted for more than just storage latency, factoring in database responsiveness with compute being a factor. The Nutanix solution consistently delivered ultra-fast speeds of .031 seconds per transaction with all four nodes running the workload.”
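To put the width of that band in perspective, the short Python calculation below expresses it as a percentage of the reported average. It uses only the three per-instance figures quoted above; the variable names are my own, and the individual per-instance results behind the average are not reproduced here.

```python
LOW_TPS = 2635      # lowest per-instance tps reported
HIGH_TPS = 2703     # highest per-instance tps reported
AVERAGE_TPS = 2658  # reported per-instance average

band_width = HIGH_TPS - LOW_TPS
band_pct_of_avg = 100.0 * band_width / AVERAGE_TPS
low_dev_pct = 100.0 * (LOW_TPS - AVERAGE_TPS) / AVERAGE_TPS
high_dev_pct = 100.0 * (HIGH_TPS - AVERAGE_TPS) / AVERAGE_TPS

print(f"band width: {band_width} tps ({band_pct_of_avg:.2f}% of average)")
print(f"deviation from average: {low_dev_pct:+.2f}% to {high_dev_pct:+.2f}%")
```

The band is 68 tps wide, roughly 2.6% of the average, with the extremes sitting about 0.9% below and 1.7% above it.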
As I mentioned in Part I, although application-level performance was the focus of our testing, we did monitor underlying storage performance throughout and noted average storage read and write latencies of 0.95 ms and 1.59 ms, respectively. This low latency is what allowed us to focus on optimally driving CPU for the best results. To put these results in better context, keep in mind that for OLTP testing, all results are sustained. In other words, target performance is the highest steady-state performance with little variability. If properly configured, tps results typically “settle in” within a narrow range, at which point they should run consistently until you stop the test. This makes them a realistic measure of real-world performance capability.
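One way to make “settle in” concrete is to look for the first window of consecutive tps samples whose coefficient of variation (standard deviation divided by mean) falls below a small threshold. The Python sketch below is an assumption-laden illustration, not the method used in the testing: the `first_steady_index` helper, the window size, the 2% threshold, and the sample values are all hypothetical.

```python
from statistics import mean, pstdev
from typing import Optional, Sequence


def first_steady_index(samples: Sequence[float],
                       window: int = 5,
                       max_cv: float = 0.02) -> Optional[int]:
    """Return the start index of the first window whose coefficient of
    variation (population std dev / mean) is at or below max_cv,
    or None if no such window exists."""
    for start in range(len(samples) - window + 1):
        chunk = samples[start:start + window]
        avg = mean(chunk)
        if avg > 0 and pstdev(chunk) / avg <= max_cv:
            return start
    return None


# Made-up tps samples: a ramp-up followed by a plateau (illustrative only).
tps_samples = [400, 1200, 2100, 2500, 2640, 2660, 2655, 2670, 2650, 2662]

idx = first_steady_index(tps_samples)
print(f"steady state begins at sample index {idx}")
```

With this sample data the ramp-up windows are rejected and the plateau is first detected at index 4; results reported as “sustained” would then be averaged only over samples from that point onward.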
As a precursor to our testing, ESG also studied the current hyperconverged landscape and came to several conclusions, one of which was: “The perception is that for transaction-heavy applications that exercise both compute and storage, hyperconverged solutions will be unable to deliver predictable and consistent performance at scale.” Based on the results of our testing, they confirmed that this is an unfortunate misconception and that “…[Nutanix] OLTP database environments delivered predictable performance as the number of database instances doubled, including an even distribution of cluster-wide IOPS and consistently low response times, even at higher scalability data points.”
Reviewing the results personally, I also had a few additional thoughts. Having spent my recent career in the field of storage performance and flash, I knew that the introduction of all-flash solutions in the datacenter could make this kind of consistent, sustained performance a possibility. However, fully utilizing this media type also requires more CPU cycles than traditional media. Pairing that requirement with CPU-hungry applications like SQL Server could potentially be a drag on performance. Ultimately, we found that the ultra-low latency of in-host storage (data locality) more than outweighed any additional overhead. Combining that with near-linear IOPS scaling and consistently low latency is the secret performance sauce of the Nutanix platform.
Grab a copy of the ESG report for a detailed look into SQL testing and performance metrics, then let us know what you think, and continue the conversation on the Nutanix Next Community.