Hi, we're starting to look at using Spark on our Nutanix cluster. Not in a huge way, but to run some ETL processes in parallel. I'm under pressure to install Hadoop, or at least HDFS, on the cluster, but the entire concept of adding a distributed, resilient "filesystem" (actually I think it's more of an object store) on top of the one already provided by Nutanix seems somewhat off.
Is there a recommended way of doing this? I know that containers are exported to ESXi via NFS. Would that be usable? Could it leverage Stargate so that it is accessible from any node? All I really need is a globally available volume shared between all my nodes.
I've moved your post from the CE forums to our production product forums.
In general, for Hadoop on Nutanix, I'd recommend checking out these three assets, which you can cherry-pick information from.
We don't have a specific Spark on Nutanix guide out yet; however, those assets are rich with content for the type of solution that you might want to roll out.
That said, you are correct that HDFS (in general) is designed for non-redundant storage (like bare metal), so it has a lot of the same constructs that Nutanix already does. It is worth noting that you can (or should be able to) configure the number of replication copies in Hadoop itself, so that you don't have many copies in Hadoop on top of many copies on Nutanix. That's generally where "the rub" comes from when we discuss this with customers.
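As a rough illustration of dialing down the Hadoop-side copies: the HDFS replication factor is controlled by the `dfs.replication` property, and Spark forwards any `spark.hadoop.*` setting to the Hadoop configuration it uses. The sketch below only builds the `spark-submit` command line; the helper name and the `etl_job.py` script are hypothetical, and the right value for your cluster is something to confirm against the Nutanix and Hadoop distribution guidance rather than take from here.

```python
import shlex


def spark_submit_command(app, hadoop_replication=2):
    """Build a spark-submit command that sets the HDFS replication factor
    for files written by this job. Hypothetical helper for illustration."""
    if hadoop_replication < 1:
        raise ValueError("replication factor must be at least 1")
    args = [
        "spark-submit",
        # spark.hadoop.* properties are passed through to the Hadoop config
        "--conf", f"spark.hadoop.dfs.replication={hadoop_replication}",
        app,
    ]
    # Quote each argument so the result is safe to paste into a shell
    return " ".join(shlex.quote(a) for a in args)


# "etl_job.py" is a placeholder script name
print(spark_submit_command("etl_job.py", hadoop_replication=2))
```

The same property can also be set cluster-wide in `hdfs-site.xml`; setting it per job like this only affects files that job writes.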
Still, we've got customers doing Hadoop RF2 + Nutanix RF2 (such as in the Cloudera case) and it works just fine; it just imposes a bit of overhead.
To be clear though, you can't expose HDFS directly from Stargate, so you'd always have something like a Hadoop data node (or data nodes, plural) in between Nutanix and Spark.
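To put a rough number on that overhead, here is a back-of-the-envelope sketch. It assumes every HDFS replica is itself stored once per Nutanix replica, and it ignores compression, deduplication, erasure coding, and metadata, so treat it as an upper-level estimate only. The 100 TB raw capacity is a made-up figure, and the function names are just for illustration.

```python
def physical_copies(hdfs_rf, nutanix_rf):
    """Copies of each block on disk when HDFS replication is layered
    on top of Nutanix replication (simplified model)."""
    return hdfs_rf * nutanix_rf


def usable_tb(raw_tb, hdfs_rf, nutanix_rf):
    """Approximate capacity left for unique data, given raw capacity in TB."""
    return raw_tb / physical_copies(hdfs_rf, nutanix_rf)


RAW_TB = 100  # hypothetical raw cluster capacity

# Compare a few combinations of Hadoop RF and Nutanix RF
for hdfs_rf, nutanix_rf in [(3, 2), (2, 2), (1, 2)]:
    copies = physical_copies(hdfs_rf, nutanix_rf)
    usable = usable_tb(RAW_TB, hdfs_rf, nutanix_rf)
    print(f"HDFS RF{hdfs_rf} + Nutanix RF{nutanix_rf}: "
          f"{copies} copies, about {usable:.1f} TB usable")
```

Under this simplified model, Hadoop RF2 on Nutanix RF2 keeps four copies of each block, which is why lowering the Hadoop-side replication is usually the first thing to look at.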