Stack for Nvidia Docker GPU compute?

We’re standing up some Docker-based NVIDIA GPU compute workloads for the rapids.ai ecosystem to replace/accelerate Spark and friends. However, we’re lost in the Nutanix GPU virtualization docs, so we’re curious whether folks have ideas on the pieces needed to make this work on Nutanix.


Right now, we’re thinking P100/V100 GPU → AHV/ESXi → RHEL 8.x → Docker, and as an optional stretch goal, seeing whether multiple guest OSes can share the same GPU(s). We’ve successfully done GPU → Ubuntu/RHEL → Docker, but without AHV/ESXi in the mix (a rough sketch of that setup is below). Most AHV/ESXi GPU articles seem more about VDI than compute, so we’re uncertain.
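For concreteness, our working bare-metal setup boils down to roughly this (image tags are illustrative; rapids.ai publishes the current ones):

    # Sanity check: GPU visible inside a container at all
    # (requires Docker 19.03+ and the NVIDIA Container Toolkit)
    docker run --gpus all --rm nvidia/cuda:11.0-base nvidia-smi

    # Then the actual RAPIDS image (tag is illustrative)
    docker run --gpus all --rm -it rapidsai/rapidsai:cuda11.0-runtime-ubuntu18.04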


Experiences? Ideas? Tips?

What you are looking for is NVIDIA Virtual Compute Server (vCS) vGPU. This lets you share an NVIDIA GPU across multiple VMs for GPU-accelerated compute workloads like AI. Nutanix supports that, as well as the inverse, multi-vGPU, where you assign multiple vGPUs to a single VM. We added support for multiple vGPUs per VM in AOS 5.18, and live migration of vGPU-enabled VMs in AOS 5.18.1.
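As a first sanity check once the vGPU host software is installed, you can see what the host knows about vGPU support. A minimal sketch, assuming the NVIDIA vGPU manager is already installed on the AHV or ESXi host (these are standard nvidia-smi subcommands, not Nutanix-specific):

    # On the hypervisor host: list vGPUs currently attached to running VMs
    nvidia-smi vgpu

    # List the vGPU types this host's GPUs support, and which are
    # still creatable given what's already been allocated
    nvidia-smi vgpu -s
    nvidia-smi vgpu -c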

NVIDIA introduced support for Virtual Compute Server vGPU in GRID v9, and RHEL 8.x guest support in GRID v10. Support on Nutanix follows from that: you’re covered as long as you are running GRID v10 or later, on either ESXi 6.5 or later or AOS+AHV 5.10 or later.
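One thing to verify early: the guest driver and the host vGPU manager must come from compatible vGPU software branches, or the guest won’t initialize the vGPU. A quick check on both sides (assumes the drivers are already installed; NVIDIA’s vGPU release notes have the exact compatibility matrix):

    # Run on the hypervisor host, then again inside the guest;
    # the two driver branches must be compatible per NVIDIA's
    # vGPU release notes
    nvidia-smi --query-gpu=driver_version --format=csv,noheader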

The V100 is usually the best card for AI/ML workloads. We support one or two of them on several server models across the NX, DX, XC, HX, and UCS lines.

You’d want to read the NVIDIA documentation, starting with the vGPU user guide:
https://docs.nvidia.com/grid/10.0/grid-vgpu-user-guide/index.html 
You want to look at the “C” vGPU profiles, as those are the compute profiles. Each has one display head, and you can allocate 4 GB, 8 GB, 16 GB, or 32 GB of framebuffer per VM, which shares a V100-32 8 ways, 4 ways, 2 ways, or 1 way, respectively.
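Once a C profile is attached to a VM, you can confirm from inside the guest what it actually received, since the device name reflects the profile. A minimal sketch (the profile name shown is just an example):

    # Inside the guest: a V100-32 shared 4 ways shows up as a single
    # device named something like "GRID V100-8C" with 8 GB of memory
    nvidia-smi --query-gpu=name,memory.total --format=csv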

Do you already have a Nutanix node with a GPU to begin testing? If not, we have some available for testing.

Let me know what other questions you have.