Thursday, November 3, 2016

MPP Analytical Database vs SQL on Hadoop

Users find lower licensing costs when storing data in Hadoop—although they often do pay for subscriptions. Storing data efficiently in a cluster of nodes is the table stakes for data management today. However, it’s important to remember what happens next. The next step is often about performing analytics on the data as it sits in the Hadoop cluster. When it comes to this, our internal benchmarking testing reveals limitations of the Apache Hadoop platform.

Since I work there, I recently got some metrics from a team of Vertica engineers who set up a 5-node cluster of Hewlett Packard Enterprise DL380 ProLiant servers. They created 3 TBs of data in ORC, Parquet, and our own ROS format. Then, they put the TPC-DS benchmarks to the test with Vertica, Impala, Hive on Tez, and even Apache Spark. They took a look at CDH 5.7.1 and Impala 2.5.0 and HDP 2.4.2 Hawq 2.0 in comparison to Vertica.

Performing Complex Analytics
They first took note of whether all the benchmarks would run. This becomes important when you’re thinking about the analytical workload. Do you plan to perform any complex analytics? In these benchmarks, Vertica completed 100% of the TPC-DS benchmarks while all others could not.



For example, if you want to perform time series analytics and the queries are not available, how much will it cost you to engineer a solution? How many lines of code will you have to write and maintain to accomplish the desired analytics. Hadoop-based solutions do not often have out-of-the-box geo-spatial, pattern matching, machine learning, and data preparation – these types of analytics are not part of the benchmark.

Achieving Top Speed
Let’s assume that you don’t need to run all of the TPC-DS queries or that you can spend the time and resources to modify them. In our testing, I compared the performance metrics on just the queries that would run, Hadoop-based solutions were not comparable in performance either. For example, the numbers for Impala were as follows:



This was actually the closest result. On the 59 queries that Hive on Tez could complete, it took about 21 hours to Vertica’s 90 minutes. Apache Spark took 11 hours to complete 29 queries while Vertica took 25 minutes.

Handling Concurrent Queries
In the metrics I received, I also found that Hadoop-based solutions had limitations on the number of concurrent queries they can run before the query fails. In the tests, the engineers continually and simultaneously ran 1 long, 2 medium, and 5 short-running queries. In most cases, the Hadoop-based solutions choked on the long query and sped through the short queries in a reasonable time. Vertica completed all queries, every time.

An Analytical Database is the Right Way to Perform Big Data Analytics
Although they are an inexpensive way to store data, Hadoop-based solutions are no match for columnar analytical databases like Vertica for big data analytics.

Hadoop-based solutions cannot:
  • Perform at the same level of the ANSI SQL analytics, often failing on the TPC-DS benchmark queries
  • Deliver analytics as fast, sometimes significantly slower than a column store
  • Offer the concurrency of an analytical database for a combination of long, medium and short running queries

Hadoop is a good platform for low-cost storage for big data and data transformation. It delivers some level of analytics for a small number of users or data scientists. But if you need to provide your organization with robust advanced analytics for hundreds or thousands of concurrent users and achieve excellent performance, a big data analytics database is the best solution.

Disclaimer: The opinions expressed here are my own and don't necessarily reflect the opinion of my employer. The material written here is copyright (c) 2010 by Steve Sarsfield. To request permission to reuse, please e-mail me.