Since I’ve now written a few posts about NetApp, it is time to switch gears. While I am still quite a noob with Nutanix, I’d like to share something about Nutanix as well.
I received a demo unit from Nutanix a while ago. One way to get familiar with a product is to put some load onto it and see what happens.
Because I am going to show some performance figures, and the Nutanix EULA forbids publishing benchmarking results, I am not going to disclose the configuration of the Nutanix box. This way the performance figures are just numbers, not benchmarking results, and hopefully I am not breaching the EULA. Furthermore, without disclosing all the workload parameters and the configuration of the box, metrics such as “IOPS” and “latency” are just numbers without relevance and should not be used in comparisons with other products.
The point of this post is NOT to show benchmarking or “hero” numbers. The point is to show that artificial load-generating tools can produce overly positive, false results, and to demonstrate how Nutanix’s analytics capabilities can catch them.
For example, I am going to show ways to get a fairly high number of IOPS (random data) or throughput in MB/s (sequential data) with unrealistically low latencies, and how to catch those false positive results.
The workloads in these tests were generated with DiskSPD, a storage testing tool developed by Microsoft and the successor of the SQLIO testing tool. DiskSPD is a command-line tool with rich features for testing storage performance. I also used a separate GUI for DiskSPD for convenience, since it is easier to save workload profiles and launch tests with a GUI. When doing performance testing, one test machine is typically not enough to explore the full potential of a box, so I used a handful of virtual machines running Windows Server 2012, each running one instance of the DiskSPD test.
Out of the box, Nutanix comes with rich built-in analytics tools in an HTML5-based user interface called “Prism”. With Prism you can get in-depth analysis of your environment at the virtual machine or virtual disk level. Several metrics are collected, and these can be grouped together and displayed on the “Analysis” page for further inspection. For everyday usage, the Prism analysis is quite enough.
On top of the normal Prism analysis functionality, there is a way to get even more detailed information about workloads. A Nutanix system is built from several software components; one of them is called “Stargate”. Nutanixbible.com describes it as follows:
- Key Role: Data I/O manager
- Description: Stargate is responsible for all data management and I/O operations and is the main interface from the hypervisor (via NFS, iSCSI, or SMB). This service runs on every node in the cluster in order to serve localized I/O.
As Stargate serves I/O, it also collects performance metrics, some of which are available in Prism. If you want more information about the I/O, there is an alternative way to see the statistics Stargate collects: a separate Stargate web page. This page is not available by default; visit Josh Odgers’ blog for instructions on how to access and use it.
So what about the tests then?
For the first test I used one of the built-in tests in the DiskSPD GUI, a simulation of a database server doing all-random 8k I/O with a 70% read / 30% write ratio. I tweaked a few parameters and launched a few additional virtual machines running the same test. The active data set, or working set, was larger than the DRAM cache of the Nutanix Storage Controller Virtual Machine, so I expected the workload to be served from SSDs or HDDs, not from the controller VM’s DRAM cache. On the other hand, the working set was small enough to fit in the SSD layer of the machine. HDDs are typically used only to store cold data, and a Nutanix system should be sized with enough SSD capacity to accommodate the hot data.
I waited a while to let the system stabilize and then started to explore the analytics functionality of the Prism interface. While you can get per-VM statistics, I am only showing summary statistics, so this doesn’t constitute a benchmark.
Screenshot from the Prism VM page summary, showing I/O from the virtual machine’s perspective:
The numbers looked quite normal for such a test; throughput was around 30,000 IOPS. Latency (not shown here, since it is not relevant to my point) was quite typical for a hybrid solution where both SSDs and HDDs serve data.
There is also a way to see what the hardware is doing, for example how many IOPS the disks are doing. This information can be found on the “Hardware” page in Prism.
Screenshot from the Prism Hardware page disk summary:
This is the point where things started to look weird. My virtual machines were pushing 30k IOPS, but only 12k IOPS landed on disk. Where did the other 18k IOPS go? Some of the I/O could have been served from the DRAM cache, but that much? Not likely, since my working sets were too large to fit completely into the DRAM cache.
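The gap is easy to quantify. A quick sketch using the rounded numbers from the two Prism pages (these are the only assumptions in it):

```python
# Rounded figures from the Prism VM page and Hardware page.
vm_iops = 30_000    # what the virtual machines reported pushing
disk_iops = 12_000  # what actually landed on the physical disks

missing = vm_iops - disk_iops
fraction = missing / vm_iops
print(f"{missing} IOPS ({fraction:.0%}) never reached the disks")
# → 18000 IOPS (60%) never reached the disks
```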
Without good analytics capabilities it would have been hard to find more information. With some systems I might not even have noticed that the disks were serving much less data than the virtual servers were pushing; the information might be available, but it is usually not as accessible as with Nutanix Prism.
I scratched my head for a while and then, by googling, found instructions for getting the extra analytics from the Stargate web page. With this additional information, things started to make sense. The numbers below are shown per vdisk, in this case statistics for just one VM running DiskSPD. Looking at the statistics for a single vdisk, the numbers won’t match the summary numbers exactly, but you will get a rough idea of what is going on.
First of all, I was able to verify my working set size. With traditional storage systems this is a very hard number to find: you might find it at the volume or LUN level, but since most traditional storage systems don’t have a clue what the virtual machines are doing, it is very unlikely you’ll find such statistics at the VM or vdisk level. Since Nutanix is virtual-machine-aware, you can get statistics at the VM or vdisk level. There are some software packages that can pull this number from running VMware environments, such as PernixData Architect. I’ve used it on a few occasions when sizing traditional storage solutions for VMware, and it is quite a cool product.
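As a rough illustration of what “working set” means here (not how Nutanix or any of these products actually computes it), one can estimate it by counting the distinct blocks a trace of I/O offsets touches. The offsets and block size below are made up:

```python
BLOCK_SIZE = 8 * 1024  # 8 KiB, matching the DiskSPD test profile

def working_set_bytes(offsets, block_size=BLOCK_SIZE):
    """Estimate working set size: unique blocks touched times block size."""
    unique_blocks = {off // block_size for off in offsets}
    return len(unique_blocks) * block_size

# Toy trace of byte offsets: the first two I/Os hit the same 8 KiB block.
trace = [0, 4096, 81920]
print(working_set_bytes(trace))  # → 16384 (two unique blocks)
```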
My Controller Virtual Machines were configured with 32GB of DRAM, so there was not enough room to fit even a single working set completely into memory. With Nutanix you can turn on performance-tier deduplication, which can increase the amount of data that can be stored in the DRAM/SSD layer; how much more depends on the data.
I had turned off performance-tier dedupe on the container hosting the DiskSPD virtual machines. Compression and cold-tier dedupe were also turned off. While it might be a good idea to turn on some or all of these features in production environments, in testing they can distort results, especially when using artificial or unknown data.
The source of I/O is divided between read sources and write destinations; let’s look at read sources first:
Since Data (MB/s) is not relevant here, I’ve blocked it out. There are several buckets from which reads can be served: some of them are in DRAM, some on SSD and some on HDD.
The DRAM cache was serving only 6.8% of the read I/O, so that is not the complete solution to our mystery.
A large percentage of the reads was served from the SSD layer: 50% from local SSDs and 3.4% from SSDs on other machines. By the way, this is one of the advantages of a Nutanix system: a single virtual machine can utilize not only the local SSDs but all SSDs in the same distributed storage fabric cluster. This gives tremendous advantages in performance scaling, as you are not limited by a single box’s ability to deliver I/O. However, this does not help solve our problem, as those IOPS are shown in the disk IOPS statistics. Likewise, the OpLog sits on top of SSD and shows up in the SSD disk statistics.
None of the I/O lands on HDD. This is typical for Nutanix with small random I/O, which is served from SSD and migrated automatically to HDD once it cools down. As long as the data is hot, it is kept on SSD.
There are two special buckets with “Zero” in their names, and quite a lot of I/O is served from them: 33% from Estore Zero and 2.7% from Oplog Zero, close to 36% of all reads.
Hmm, what are these special “Zero” buckets?
As it turns out, Nutanix has a special data-avoidance mechanism that detects blocks containing only zeroes. Instead of writing these blocks to disk, it stores only metadata stating that the block is full of zeroes; no data is written to disk. Since all-zero blocks are never written to back-end storage, the I/Os related to them are not visible in the disk statistics. The virtual machine, however, does not know what is going on under the covers: it only sends I/Os to storage and gets acknowledgements back. From the virtual machine’s perspective, I/Os against all-zero blocks are just like any other I/Os, so they are counted as normal I/O operations and shown in the VM statistics.
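The idea can be sketched in a few lines. This is only an illustration of the general zero-detection technique, not Nutanix’s actual implementation; the store/metadata dictionaries stand in for back-end storage and the metadata service:

```python
BLOCK = 8 * 1024  # 8 KiB blocks, as in the DiskSPD test

def write_block(store, metadata, addr, data):
    """Zero detection on the write path: all-zero blocks become metadata only."""
    if data == b"\x00" * len(data):
        metadata[addr] = "zero"   # record the fact, write nothing
        store.pop(addr, None)     # no I/O hits the back end
    else:
        metadata[addr] = "data"
        store[addr] = data        # normal write path

def read_block(store, metadata, addr):
    """Reads of zero blocks are served from metadata, without a disk read."""
    if metadata.get(addr) == "zero":
        return b"\x00" * BLOCK
    return store[addr]

store, metadata = {}, {}
write_block(store, metadata, 0, b"\x00" * BLOCK)  # all zeroes
write_block(store, metadata, 1, b"\x01" * BLOCK)  # real data
print(len(store))  # → 1: only the non-zero block was actually stored
```

The VM sees two acknowledged writes and two successful reads, but the back end only ever stored one block; that is exactly the VM-vs-disk statistics gap described above.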
It seems that DiskSPD was reading previously written all-zero blocks, and/or the test file was initialized with blocks containing only zeroes.
Let’s look at write destination stats:
There are fewer buckets in the write destination statistics.
All random data lands in the Oplog, which resides on SSD and acts like the NVRAM in other solutions. There are also buckets for SSD and HDD; under normal operation, only large, typically sequential I/O lands directly in these buckets, bypassing the Oplog cache.
Since we are doing all-random I/O, nothing should land on SSD or HDD directly, and that is what the statistics show. 60% of the data contains something other than all zeroes and ends up in the Oplog. However, 40% of the data is all-zero blocks; these end up in Oplog Zero, where only metadata is stored and nothing gets written to storage.
By looking at the other vdisk statistics I was able to verify that the test sets generated by DiskSPD contained a lot of all-zero blocks, and that was why the VM-level and disk-level statistics showed different numbers. Mystery solved 🙂
Now, is this zero-detection mechanism a bad thing? No, it is a good thing: it can improve your performance and save some disk space with real live production data as well. How much depends on your data. For example, databases like Oracle can reserve disk space by writing a bunch of blocks filled with zeroes. With Nutanix you can write these files faster and more space-efficiently than with storage systems that don’t have zero detection and write these “empty” blocks to disk.
So who is at fault here? Well: me. I was performing tests on bad test data, using the default values provided by the load-generating tool. This could easily happen to you as well if you don’t pay attention when doing performance tests with artificial load generators.
If you are testing with data containing a lot of all-zero blocks, or data that is highly repetitive, other data-avoidance or data-reduction mechanisms such as deduplication or compression can distort the results quite badly. So having the ability to turn off dedupe or compression while testing is a good thing. With Nutanix you can at least turn off dedupe and compression; I am not sure how to turn off zero detection.
With DiskSPD there are ways to prevent all-zero blocks in the test data. With the “-Z” command-line option you can either tell the tests to run with randomized write data or, better yet, provide a file containing actual production data from an existing environment.
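You can also sanity-check your test data yourself before running a test. A minimal sketch that generates a small random-data file and verifies it contains no all-zero blocks; the file name and sizes are made up for the example:

```python
import os
import tempfile

BLOCK = 8 * 1024  # match the 8k I/O size of the test
BLOCKS = 16       # keep the example file tiny

# Write random data: an all-zero 8 KiB block is astronomically unlikely.
path = os.path.join(tempfile.gettempdir(), "testdata.bin")
with open(path, "wb") as f:
    for _ in range(BLOCKS):
        f.write(os.urandom(BLOCK))

# Verify that every block contains at least one non-zero byte.
with open(path, "rb") as f:
    blocks = [f.read(BLOCK) for _ in range(BLOCKS)]
print(all(any(blk) for blk in blocks))  # → True
```

The same verification loop, pointed at the load generator’s own test file, would have caught the zero-filled data that tripped me up here.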
In conclusion: if your storage platform provides good analytics tools, you can catch false results when doing performance testing and sizing. Without good analytics tools, sizing based on bad tests might give horribly wrong results, and you could end up with a seriously undersized production system. Analytics tools are also the basis for automation: with good analytics, systems can make better decisions. With bad or non-existent analytics it is hard to make good decisions or build automation, since you don’t have a reliable picture of what is going on.
Now that we have found ways to get bogus test results, let’s see how far I can push the system to show silly results in part 2 of this series, called “How to create unrealistic hero numbers while showcasing storage performance”.