Latest features expand capabilities for data lakes

  • 13 June 2022

The Nutanix Objects™ storage solution lets you separate compute and storage in distributed big data applications, so you can scale each layer independently in both your on-premises and cloud datacenters. As part of the Nutanix Unified Storage™ portfolio, Objects forms a centralized repository for all the unstructured and semi-structured data in your data lake.

A key value proposition of Nutanix Objects is that capacity scales as needed, so you can start small and grow on demand. It’s a cost-efficient approach in which you pay only for the storage you actually need. Data lake management is also streamlined through the Nutanix Cloud Manager™ (NCM) console, which handles monitoring, analysis, and reporting for your Objects cluster infrastructure. This functionality makes an on-premises data lake built on Nutanix Objects an efficient solution for companies of any size.

Improved Time-To-Value for Analytics Workloads

When we look at data written by cloud-native and big data solutions, it’s common for the data stored in object stores to be semi-structured (for example, CSV or JSON files). Starting with Objects 3.3, we introduced S3 SELECT functionality that indexes the data stored inside objects. Nutanix Objects then provides an SQL-like interface that allows applications, including many analytics and big data solutions, to query and read only the relevant data rather than the entire underlying object, shifting compute resources closer to the data. This optimized SQL pushdown approach results in faster access to data and reduces both network bandwidth and index server workload. It suits analytics workloads of all types but is especially efficient for queries where structured data is stored inside objects.

Figure 1: Optimized SQL pushdown – S3 SELECT
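
To make the idea concrete, here is a minimal Python sketch of what an S3 SELECT request can look like when sent straight to an S3-compatible endpoint with the boto3 SDK. It is an illustration, not the configuration used in this post: the bucket, object key, and endpoint URL are hypothetical, the column names are borrowed from the example table later in this post, and it assumes the CSV object has a header row and that the endpoint accepts the standard `SelectObjectContent` parameters.

```python
from typing import Any, Dict, Iterator


def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_select_request(bucket: str, key: str, kind: str, gender: str) -> Dict[str, Any]:
    """Build keyword arguments for boto3's S3 client select_object_content()."""
    expression = (
        "SELECT * FROM S3Object s "
        f"WHERE s.kind = {sql_literal(kind)} AND s.gender = {sql_literal(gender)}"
    )
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        # Assumption: the CSV object has a header row naming its columns.
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
        "OutputSerialization": {"CSV": {}},
    }


def run_select(endpoint_url: str, request: Dict[str, Any]) -> Iterator[bytes]:
    """Send the request and yield chunks of matching records only."""
    import boto3  # third-party SDK, imported lazily so the builder works without it

    client = boto3.client("s3", endpoint_url=endpoint_url)
    response = client.select_object_content(**request)
    for event in response["Payload"]:
        if "Records" in event:
            yield event["Records"]["Payload"]


# Hypothetical bucket and key, used only to show the request shape.
request = build_select_request("example-bucket", "pets/demo_10.csv", "Cat", "male")
print(request["Expression"])
```

Calling `run_select("https://objects.example.com", request)` would stream back only the rows that satisfy the WHERE clause, which is the filtering work that pushdown moves from the query engine to the object store. Credentials are assumed to come from the usual boto3 sources, such as environment variables.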

Consider the following external table in Apache Hive. It was created from a CSV file stored in a Nutanix Objects bucket directory:

hive> desc demo_10;
OK
# col_name      data_type      comment
petid           string
name            string
kind            string
gender          string
age             int
ownerid         int
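
The post does not show the statement that created this table. As a rough sketch, the Python snippet below renders a Hive DDL statement that would produce the same column layout over a comma-delimited text file. The `s3a://example-bucket/demo_10/` location is hypothetical, and the delimiter and storage clauses are assumptions about how the CSV was laid out.

```python
from typing import List, Tuple

# Column names and types as reported by `desc demo_10`.
COLUMNS: List[Tuple[str, str]] = [
    ("petid", "string"),
    ("name", "string"),
    ("kind", "string"),
    ("gender", "string"),
    ("age", "int"),
    ("ownerid", "int"),
]


def external_table_ddl(table: str, columns: List[Tuple[str, str]], location: str) -> str:
    """Render a Hive CREATE EXTERNAL TABLE statement for a comma-delimited text file."""
    column_sql = ",\n".join(f"  {name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n{column_sql}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


# Hypothetical bucket directory; substitute the bucket that holds the CSV file.
print(external_table_ddl("demo_10", COLUMNS, "s3a://example-bucket/demo_10/"))
```

Because the table is external, Hive stores only the metadata. The data itself stays in the bucket directory, where the query engine reads it at query time.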

We can then run an SQL SELECT statement with a WHERE clause against the table from the Presto command line:

presto:default> select * from demo_10 where Kind='Cat' and Gender='male';
 petid   | name      | kind | gender | age | ownerid
---------+-----------+------+--------+-----+---------
 Q0-2001 | Roomba    | Cat  | male   |   9 |    5508
 M0-2904 | Simba     | Cat  | male   |   1 |    3086
 Z4-4045 | Simba     | Cat  | male   |   0 |    2700
 Z6-3226 | Simba     | Cat  | male   |  11 |    4793
 U4-5113 | Tiger     | Cat  | male   |  12 |    7772
 U0-5987 | Ebenezer  | Cat  | male   |   0 |    5508
 L4-2594 | Newcastle | Cat  | male   |   6 |    6405
 W8-5750 | Simba     | Cat  | male   |  15 |    6102
 M2-1131 | Rumba     | Cat  | male   |   8 |    1915
 M4-9675 | Jeep      | Cat  | male   |   6 |    6923
 T0-3277 | Humbert   | Cat  | male   |  12 |    8133
 P7-2443 | Rumba     | Cat  | male   |  10 |    7219
 G6-6501 | Jake      | Cat  | male   |   2 |    3089
 G9-0817 | Kashi     | Cat  | male   |   5 |    2722
 S4-2254 | Draper    | Cat  | male   |   3 |    8619
 P1-2578 | Tiger     | Cat  | male   |  14 |    3034
 L4-4205 | Rumba     | Cat  | male   |   5 |    1312
 L8-0046 | Rumba     | Cat  | male   |   7 |    9828
 N0-9539 | Swiffer   | Cat  | male   |  14 |    9365
(19 rows)

Query 20220210_024051_00005_e4z5d, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:02 [100 rows, 3.3KB] [50 rows/s, 1.68KB/s]

Note the number of rows read and the time taken in the query statistics above: Presto read 100 rows (3.3KB) in about two seconds to return 19 matching rows. The same figures appear in the Presto GUI query overview.

Figure 2: Presto GUI query overview without S3 SELECT

Next, we enable the S3 SELECT pushdown capability in the configuration files:

hive.s3select-pushdown.enabled=true

When we re-run the original query, notice the difference in the rows retrieved and the completion time, both in the query statistics and in the Presto query overview screen:

presto:default> select * from demo_10 where Kind='Cat' and Gender='male';
 petid   | name      | kind | gender | age | ownerid
---------+-----------+------+--------+-----+---------
 Q0-2001 | Roomba    | Cat  | male   |   9 |    5508
 M0-2904 | Simba     | Cat  | male   |   1 |    3086
 Z4-4045 | Simba     | Cat  | male   |   0 |    2700
 Z6-3226 | Simba     | Cat  | male   |  11 |    4793
 U4-5113 | Tiger     | Cat  | male   |  12 |    7772
 U0-5987 | Ebenezer  | Cat  | male   |   0 |    5508
 L4-2594 | Newcastle | Cat  | male   |   6 |    6405
 W8-5750 | Simba     | Cat  | male   |  15 |    6102
 M2-1131 | Rumba     | Cat  | male   |   8 |    1915
 M4-9675 | Jeep      | Cat  | male   |   6 |    6923
 T0-3277 | Humbert   | Cat  | male   |  12 |    8133
 P7-2443 | Rumba     | Cat  | male   |  10 |    7219
 G6-6501 | Jake      | Cat  | male   |   2 |    3089
 G9-0817 | Kashi     | Cat  | male   |   5 |    2722
 S4-2254 | Draper    | Cat  | male   |   3 |    8619
 P1-2578 | Tiger     | Cat  | male   |  14 |    3034
 L4-4205 | Rumba     | Cat  | male   |   5 |    1312
 L8-0046 | Rumba     | Cat  | male   |   7 |    9828
 N0-9539 | Swiffer   | Cat  | male   |  14 |    9365
(19 rows)

Query 20220210_021607_00005_tdfmh, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:00 [19 rows, 588B] [69 rows/s, 2.09KB/s]

Figure 3: Presto GUI query overview with S3 SELECT
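
To put numbers on the difference, the short Python sketch below parses the two statistics lines printed by the Presto CLI in the runs above and compares the rows and bytes read. The parser is a hypothetical helper written for this comparison, and it assumes the CLI reports sizes in binary units (1KB = 1024 bytes).

```python
import re
from typing import Tuple

_UNITS = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def parse_stats(line: str) -> Tuple[int, float]:
    """Extract (rows read, bytes read) from a Presto CLI statistics line."""
    match = re.search(r"\[(\d+) rows, ([\d.]+)([KMG]?B)\]", line)
    if match is None:
        raise ValueError(f"unrecognized statistics line: {line!r}")
    rows = int(match.group(1))
    size = float(match.group(2)) * _UNITS[match.group(3)]
    return rows, size


# Statistics lines copied from the two query runs in this post.
without_pushdown = parse_stats("0:02 [100 rows, 3.3KB] [50 rows/s, 1.68KB/s]")
with_pushdown = parse_stats("0:00 [19 rows, 588B] [69 rows/s, 2.09KB/s]")

row_reduction = 1 - with_pushdown[0] / without_pushdown[0]
byte_reduction = 1 - with_pushdown[1] / without_pushdown[1]

print(f"rows read:  {without_pushdown[0]} -> {with_pushdown[0]} ({row_reduction:.0%} fewer)")
print(f"bytes read: {without_pushdown[1]:.0f} -> {with_pushdown[1]:.0f} ({byte_reduction:.0%} fewer)")
```

With pushdown enabled, Presto reads only the 19 rows that match the WHERE clause instead of all 100 rows in the file, because the filtering happens in the object store before any data crosses the network. On a table this small the absolute saving is tiny, but the same proportion applies as objects grow.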

To read more about configuring your environment to use Nutanix Objects features, see the S3 configuration details for various use cases in our solutions document: Nutanix Objects: Additional Use Cases. If you’re working with S3-compatible object stores that you want to run as part of a hybrid cloud environment, we would love to talk more and share experiences. Please get in touch with us via our social media channels or the Nutanix Community Forums.


© 2022 Nutanix, Inc. All rights reserved. Nutanix, the Nutanix logo and all Nutanix product, feature and service names mentioned herein are registered trademarks or trademarks of Nutanix, Inc. in the United States and other countries. Other brand names mentioned herein are for identification purposes only and may be the trademarks of their respective holder(s). This post may contain links to external websites that are not part of Nutanix.com. Nutanix does not control these sites and disclaims all responsibility for the content or accuracy of any external site. Our decision to link to an external site should not be considered an endorsement of any content on such a site. Certain information contained in this post may relate to or be based on studies, publications, surveys and other data obtained from third-party sources and our own internal estimates and research. While we believe these third-party studies, publications, surveys and other data are reliable as of the date of this post, they have not been independently verified, and we make no representation as to the adequacy, fairness, accuracy, or completeness of any information obtained from third-party sources.

This post may contain express and implied forward-looking statements, which are not historical facts and are instead based on our current expectations, estimates and beliefs. The accuracy of such statements involves risks and uncertainties and depends upon future events, including those that may be beyond our control, and actual results may differ materially and adversely from those anticipated or implied by such statements. Any forward-looking statements included herein speak only as of the date hereof and, except as required by law, we assume no obligation to update or otherwise revise any of such forward-looking statements to reflect subsequent events or circumstances.

