Hi, we're starting to look at using Spark on our Nutanix cluster. Not in a huge way, just to run some ETL processes in parallel. I'm under pressure to install Hadoop, or at least HDFS, on the cluster, but the entire concept of adding a distributed, resilient "filesystem" (actually I think it's more of an object store) on top of the one already provided by Nutanix seems somewhat off.
Is there a recommended way of doing this? I know that containers are exported to ESXi via NFS. Would that be usable? Would that be able to leverage Stargate so the data is accessible from anywhere? All I really need is a globally available volume shared between all my nodes.
Distributed object storage on Nutanix
Hey Kevin,
I've moved your post from the CE forums to our production product forums.
In general, for Hadoop on Nutanix, I'd recommend checking out these assets, which you can cherry-pick data from:
https://portal.nutanix.com/#/page/solutions/details?targetId=RA-2078-Cloudera-with-Nutanix:RA-2078-Cloudera-with-Nutanix
https://portal.nutanix.com/#/page/solutions/details?targetId=RA-2030_Hadoop_with_AHV:RA-2030_Hadoop_with_AHV
We don't specifically have a Spark on Nutanix guide out yet; however, those two are rich with content for the type of solution that you might want to roll out.
That said, you are correct that HDFS (in general) is designed for non-redundant storage (like bare metal), so it has a lot of the same constructs that Nutanix already has. It is worth noting that you can (or should be able to) configure the number of replication copies in Hadoop itself, so that you don't have many copies in Hadoop on top of many copies on Nutanix. That's generally where "the rub" comes from when we discuss this with customers.
That said, we've got customers doing Hadoop RF2 + Nutanix RF2 (such as in the Cloudera case) and it works just fine; it just imposes a bit of overhead.
To be clear though, you can't expose HDFS directly from Stargate, so you'd always have something like a Hadoop data node (or data nodes, plural) in between Nutanix and Spark.
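To put a rough number on that overhead, here is a minimal sketch of the arithmetic, assuming each HDFS block replica lands on a Nutanix container that in turn keeps its own RF copies, so the two factors multiply. The function name and the 10 TB figure are made up for illustration, and the sketch ignores compression, deduplication, and erasure coding, which would change the real numbers. On the Hadoop side the replica count is normally set with the `dfs.replication` property.

```python
def raw_capacity_tb(logical_tb: float, hdfs_replication: int, nutanix_rf: int) -> float:
    """Estimate raw disk consumed for a given amount of logical data.

    Assumption: every HDFS replica is stored on Nutanix, which keeps
    `nutanix_rf` copies of it, so physical copies = hdfs_replication * nutanix_rf.
    Compression, dedup, and erasure coding are deliberately ignored.
    """
    if hdfs_replication < 1 or nutanix_rf < 1:
        raise ValueError("replication factors must be at least 1")
    return logical_tb * hdfs_replication * nutanix_rf


if __name__ == "__main__":
    logical_tb = 10  # hypothetical dataset size, for illustration only
    for hdfs_rf in (1, 2, 3):
        raw = raw_capacity_tb(logical_tb, hdfs_rf, nutanix_rf=2)
        print(f"HDFS replication {hdfs_rf} on Nutanix RF2: {raw} TB raw for {logical_tb} TB of data")
```

Under these assumptions, the Hadoop RF2 + Nutanix RF2 case stores four physical copies of every block, while dropping HDFS replication to 1 and relying on Nutanix RF2 alone would store two.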
Thanks for that. I was hoping not to have to install a full Hadoop cluster just yet. At the moment it's only for a few Spark jobs. It's looking like I might be able to get away with running those on Spark on its own for now, but I will need a full Hadoop setup, probably HDP, in the near future. It's just the scaling that scares me. It's only a small part of what we do, and I only have 7 NX3000 nodes to play with, and they're nearly full anyway.