From the Nutanix Bible.... Your SSDs are not full enough for ILM to kick in and move capacity to the HDDs.
The other case is when the overall tier utilization breaches a specific threshold [curator_tier_usage_ilm_threshold_percent (Default=75)] where DSF ILM will kick in and as part of a Curator job will down-migrate data from the SSD tier to the HDD tier. This will bring utilization within the threshold mentioned above or free up space by the following amount [curator_tier_free_up_percent_by_ilm (Default=15)], whichever is greater. The data for down-migration is chosen using last access time. In the case where the SSD tier utilization is 95%, 20% of the data in the SSD tier will be moved to the HDD tier (95% –> 75%).
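The "whichever is greater" rule above is easy to sketch. Here's a minimal Python illustration of the down-migration math, using the gflag defaults quoted above (the function name and inputs are illustrative, not part of any Nutanix tooling):

```python
THRESHOLD = 75  # curator_tier_usage_ilm_threshold_percent (default)
FREE_UP = 15    # curator_tier_free_up_percent_by_ilm (default)

def percent_to_down_migrate(ssd_usage_percent):
    """Percentage of SSD-tier data Curator would down-migrate to HDD."""
    if ssd_usage_percent <= THRESHOLD:
        return 0  # below the threshold, ILM does not kick in
    # Move whichever is greater: enough to get back under the
    # threshold, or the fixed free-up percentage.
    return max(ssd_usage_percent - THRESHOLD, FREE_UP)

print(percent_to_down_migrate(95))  # 20 (95% -> 75%, as in the example)
print(percent_to_down_migrate(80))  # 15 (only 5% over, so free_up wins)
```

At 95% utilization the threshold rule dominates (20% > 15%); just over the threshold, the fixed 15% free-up amount dominates instead.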
vDisk Create Spec:
$vdiskcreate = New-NTNXObject -Name VmDiskSpecCreateDTO
$vdiskcreate.containerUuid = "Provide the container UUID where you need to place the vDisk"
$vdiskcreate.sizeMb = "vDisk size in MB"
New-NTNXVolumeDisk -Uuid "VG UUID" -CreateSpec $vdiskcreate
There's usually a "give-and-take" to the flushing of "episodes" for a vDisk. This can take some time to "flush" if the system (or particular vDisk) is very busy.
So, the first recommendation is to monitor for a bit and re-run the "ncc health_checks stargate_checks oplog_episode_count_check" command for progress.
If that specific NCC check does not clear after a few attempts, say after several minutes or even hours, the issue could be due to a few causes:
1) As KB 1541 indicates, a very heavy load (bursts) on the vDisk
2) NCC itself can hang or time out, as seen in previous NOS/AOS versions
3) The vDisk may have underlying issues which would require Support (and possibly Engineering) to be involved.
But first, let's look at finding the file associated:
- KB 1541 indicates how to identify the associated file(s) using the following command on one of the CVMs:
"ncli vm ls | grep -B 9 13334737"
- That should show the VM "Name" and its associated vDisk ID(s) as a list, including the "NFS:" preamble for the vDisk ID(s).
- The associated vDisk ID(s) may be a long string including the UUID, but the "NFS:13334737" portion is the important part and can be used to monitor whether the "episode" count is reducing over time.
If you know this VM to be extremely busy, that should also be investigated from within the Guest OS and Application.
Apart from the NCC check, I have used the following to count the "episodes" associated with a vDisk ID:
"medusa_printer -lookup oplog -vdisk_id=13334737 | grep -c episode_seq"
Let's see if that produces a similar count to NCC, and hopefully, both will be decreasing over time.
If it is not decreasing by either of these checks, we may need to have a Support case to look further into the situation.
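For reference, the "grep -c" count above simply counts the lines of medusa_printer output that mention episode_seq. A rough Python equivalent, using fabricated sample output (not real medusa_printer output):

```python
# Fabricated sample resembling medusa_printer oplog output, with one
# "episode_seq" field per oplog episode.
sample_output = """\
oplog {
  episode_seq: 101
  episode_seq: 102
  episode_seq: 103
}
"""

def count_episodes(text):
    # Count the lines mentioning episode_seq, like `grep -c episode_seq`.
    return sum(1 for line in text.splitlines() if "episode_seq" in line)

print(count_episodes(sample_output))  # 3
```

Running either count a few minutes apart and comparing the numbers is enough to tell whether the episodes are draining.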
Hope this helps!
I've been part of running a proof-of-concept SAS Grid cluster on Nutanix. The results were good. It definitely helped having more than one vDisk on each of the VMs. I would speak to your local Nutanix rep; there are internal resources that could help.
This has a tendency to change between releases, so you may see some different behavior as you upgrade and we add/remove services. Here's an example from a functioning cluster running AOS 22.214.171.124:
2017-06-14 13:38:02.376528: Services running on this node:
abac: [21180, 21208, 21209, 21210]
acropolis: [14691, 26299, 26324, 26325]
alert_manager: [6570, 6627, 6628, 6672]
aplos: [21236, 21263, 21264, 21265, 21345, 21361]
aplos_engine: [14391, 14416, 14417, 14418]
arithmos: [6606, 6660, 6661, 6694]
cassandra: [5331, 5462, 5463, 5490, 5713]
cerebro: [6309, 6372, 6373, 6530]
chronos: [6360, 6415, 6416, 6441]
cim_service: [6518, 6586, 6587, 6647]
cluster_config: [7283, 7322, 7323, 7324]
cluster_health: [5251, 5314, 5315, 5489, 11259, 11288, 11289, 29300, 29301]
cluster_sync: []
curator: [6449, 6474, 6475, 6478]
dynamic_ring_changer: [5960, 5994, 5996, 6050]
ergon: [6239, 6323, 6324, 6327]
foundation: []
genesis: [3061, 3081, 3103, 3104, 4195, 4196]
hera: [6007, 6044, 6045, 6046]
insights_data_transfer: [6209, 6300, 6301, 6365, 6366]
insights_server: [6198, 6245, 6246, 6378]
janus: [6992, 7046, 7047]
lazan: [7489, 7764, 7790, 7791]
minerva_cvm: [7237, 7286, 7287, 7288, 7864]
nutanix_guest_tools: [7060, 7101, 7102, 7148]
orion: [7800, 7861, 7862, 7941]
pithos: [5966, 6018, 6019, 6076]
prism: [31664, 31692, 31693, 31696, 32131, 32134]
scavenger: [3662, 3689, 3690, 3691]
secure_file_sync: [5076, 5124, 5125, 5126]
snmp_manager: [1138, 6880, 6925, 6926]
ssl_terminator: [5069, 5100, 5101, 5102]
stargate: [6173, 6203, 6204, 6216, 6219]
sys_stat_collector: [6903, 6949, 6950, 6952]
tunnel_manager: [6937, 6974, 6975]
uhura: [6856, 6891, 6892, 6894]
zookeeper: [2810, 2839, 2840, 2841, 2893, 2909]
In the case that a process is offline, you would see empty brackets next to the process name (similar to what you see for foundation in the above example). In this example, we expect foundation and cluster_sync to be offline during normal operation; all other processes should be running.
Additionally, there should be alerts in Prism when a particular service goes down. This is a fairly recent addition (I believe recent versions of NCC include this check). If you set up alert emails in Prism, you should get a notice any time a particular process is offline. I just wanted to make sure you were aware of this functionality before you went through the effort of writing a PowerShell script, which would need caveats to account for transient states, processes going up/down expectedly, etc.
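If you do decide to script this, the core of it is just spotting the empty brackets in the genesis status output. A hypothetical Python sketch (the sample text and the expected-offline list are assumptions for illustration):

```python
import re

# Services expected to show empty brackets during normal operation,
# per the example above (this set is an assumption; adjust per release).
EXPECTED_OFFLINE = {"foundation", "cluster_sync"}

def offline_services(genesis_output):
    """Return services with an empty PID list, excluding expected ones."""
    offline = []
    for line in genesis_output.splitlines():
        # Matches lines like "  stargate: [6173, 6203]" or "  foundation: []"
        m = re.match(r"\s*(\w+): \[(.*)\]", line)
        if m and not m.group(2).strip():
            offline.append(m.group(1))
    return [s for s in offline if s not in EXPECTED_OFFLINE]

# Fabricated sample output for demonstration.
sample = """\
  foundation: []
  stargate: [6173, 6203]
  uhura: []
"""
print(offline_services(sample))  # ['uhura']
```

A real script would also need to tolerate transient states, e.g. a service briefly restarting during an upgrade.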
Let me know if you would like to discuss further. I would be happy to set up a call if you have additional questions.
EDIT: Updated where the functionality was added for this alert from AOS to NCC.
We have an open ticket to add this functionality, but it's not in the product today. What I'd recommend is to slightly modify the disk sizes to help identify specific drives. For reference, the internal tickets are PM-599 and ENG-53402, for any future conversation with Support or your account team.