Classifying the SQL-on-Hadoop Solutions

Almost a year and a half ago on this blog, I went on what is probably best described as an anti-DBMS/Hadoop-connector rant. There was then (as there still is now) an incredible number of use cases that require the combination of DBMS and Hadoop technologies, and at the time, both the Hadoop vendors and the DBMS vendors were pushing a “connector” approach, where the customer buys both a Hadoop product and a DBMS product and data is passed back and forth between the two systems. I explained the architectural wastefulness associated with this approach, and why, given the way that parallel database systems and Hadoop are designed, it is relatively easy to combine them (architecturally speaking) into a single system. At the time there were only two solutions that took the combined system approach: Hive and Hadapt.

Since that post was written, it is good to see that several vendors have abandoned the connector approach and have instead launched initiatives (such as Stinger, Impala, and Drill) that, while still immature, are following (or extending) Hive and Hadapt, and going in the direction of bringing SQL technologies directly to Hadoop clusters. In my opinion, this is absolutely the right direction for the market, and will further Hadoop’s dominance in the data processing and analysis space.

Given the rapid entrance of these new “SQL-on-Hadoop” initiatives, now is a good time to classify them and study the similarities and differences between these approaches.

Before comparing and contrasting six approaches to SQL-on-Hadoop (Hive, Hadapt, Stinger, Impala, Polybase, and Drill), I should explain why these are the only approaches compared in this post: since the DBMS/Hadoop connector approach is so fundamentally flawed from an architectural perspective, vendors that use this approach remain in a different category and are not directly competitive with the direct approaches to SQL-on-Hadoop. Even recent attempts from Greenplum and Aster Data to retrofit their MPP databases to work on Hadoop clusters, through the HAWQ and SQL-H projects respectively, still fundamentally use the connector approach: at query time, data is extracted out of HDFS and sent over the network into their MPP execution engines for further processing. Even if the MPP execution engine sits on the same physical cluster as HDFS, if processing is not pushed down to the same nodes that store the data, the MPP database is essentially treating HDFS as a large (cheap) shared-disk storage system, and inherits the scalability constraints and network bottlenecks that are associated with this approach. Shared-disk architectures are fundamentally antithetical to the Google-made-famous “shared-nothing” design that Hadoop emulates, where processing is pushed as close to the data as possible. This is why these MPP+Hadoop vendors typically bundle hardware with software, so that high-end and expensive networking gear can be integrated into the cluster, in order to hide the fundamental limitations of the shared-disk architecture.
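
To make the network argument concrete, here is a toy sketch in Python. It is not taken from any of the products named above: the three-node cluster, the (region, amount) rows, and both function names are assumptions made up for illustration. It counts how many records cross the network when every raw row is shipped to a separate engine, versus when each storage node aggregates locally and ships only its partial sums.

```python
from collections import Counter

# Hypothetical cluster: each node stores a local partition of (region, amount) rows.
partitions = {
    "node1": [("east", 10), ("west", 5), ("east", 7)],
    "node2": [("west", 3), ("east", 1)],
    "node3": [("west", 8), ("west", 2), ("east", 4)],
}

def connector_style(partitions):
    """Ship every raw row to a separate engine, then aggregate there."""
    shipped = [row for rows in partitions.values() for row in rows]
    totals = Counter()
    for region, amount in shipped:
        totals[region] += amount
    return dict(totals), len(shipped)

def pushdown_style(partitions):
    """Aggregate on each storage node; ship only per-node partial sums."""
    shipped = []
    for rows in partitions.values():
        local = Counter()
        for region, amount in rows:
            local[region] += amount
        shipped.extend(local.items())
    # The coordinator only has to merge the small partial results.
    totals = Counter()
    for region, partial in shipped:
        totals[region] += partial
    return dict(totals), len(shipped)

connector_totals, connector_shipped = connector_style(partitions)
pushdown_totals, pushdown_shipped = pushdown_style(partitions)
```

Both functions compute the same totals. In the pushdown version the number of shipped records is bounded by (nodes × distinct groups) rather than by the number of rows, so the gap widens as the partitions grow.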

Therefore, we are left with the above-mentioned six technologies to compare. (It’s possible that there are additional SQL-on-Hadoop solutions that I’m not aware of – if so, please add them via the comment thread below). They are best divided into three categories, with two technologies placed inside each category:

(1)   SQL translated to MapReduce jobs over a Hadoop cluster. Both Hive and Stinger (without Tez) fall into this category. A SQL query that is sent to a Hadoop cluster is translated into a series of MapReduce jobs which are then processed by the cluster. A major advantage of this approach is that by integrating with Hadoop’s version of MapReduce, queries are run with Hadoop’s dynamic scheduler and are therefore highly tolerant of unexpected performance issues and other forms of heterogeneous performance across the cluster. Furthermore, they leverage MapReduce’s mid-query fault tolerance so that nodes that fail in the middle of query processing do not cause the entire query to fail. Combined, these two properties lead to consistent and reliable execution of queries across clusters containing thousands of nodes. Disadvantages include: (a) in order to facilitate the translation of SQL to MapReduce jobs, the dialect of SQL that is spoken by these systems is not quite standard SQL, which complicates integration with third party tools; (b) due to the need to automatically generate MapReduce jobs for any type of SQL clause, SQL coverage is coming along slowly; and (c) due to processing exclusively using the MapReduce framework (Stinger with Tez falls in a different category), the per-query MapReduce overhead prevents these technologies from processing queries interactively (this category is fundamentally a “batch processing” category).
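
As a rough illustration of what “translated into MapReduce jobs” means, the sketch below hand-translates one query, SELECT region, SUM(amount) FROM sales GROUP BY region, into map, shuffle, and reduce steps in plain Python. This is not Hive’s actual code generator; the table, its columns, and the function names are assumptions for the example, and a real system runs each phase in parallel across the cluster.

```python
from collections import defaultdict

def map_phase(rows):
    """Map: emit (group-by key, value to aggregate) for every input row."""
    for row in rows:
        yield row["region"], row["amount"]

def shuffle(pairs):
    """Shuffle: bring together all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: apply the aggregate (SUM) to each key's values."""
    return {key: sum(values) for key, values in groups.items()}

def run_query(rows):
    """Chain the three phases, as a single MapReduce job would."""
    return reduce_phase(shuffle(map_phase(rows)))

# Hypothetical contents of the sales table.
sales = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 7},
]
result = run_query(sales)
```

A query with a join followed by an aggregation would typically need more than one such job chained together, which is where the per-query overhead described in (c) accumulates.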

(2)   SQL processed by a specialized (Google-inspired) SQL engine that sits on a Hadoop cluster. Both Impala and Drill fall into this category. Impala is inspired by Google’s F1 project and Drill by Google’s Dremel project. Both push SQL (or, in the case of Drill/Dremel, SQL-like) operators down to where the data is stored in the distributed file system (HDFS) and therefore have the advantage of collocating data with data processing. However, since both systems are building the SQL query execution engine from scratch, both suffer from disadvantages (a) and (b) of category (1): non-standard SQL and poor SQL coverage. Furthermore, by completely eschewing MapReduce, they do not get the associated fault tolerance and dynamic scheduling (and therefore scalability) benefits that are inherent in MapReduce.

(3)   Processing of SQL queries is split between MapReduce and storage that natively speaks SQL. Both Hadapt and Polybase fall into this category. These systems attempt to get the best of both worlds, doing some processing in MapReduce and some processing in native SQL operators. When a SQL query is submitted to the Hadoop cluster, an optimizer analyzes the query, and decides what parts should be performed via MapReduce, and what parts via SQL operators. For queries that require interactive (sub-second) time, MapReduce is typically avoided, and the entire query is performed via native SQL. But for queries that require massive scale and mid-query fault tolerance, more work is left for the MapReduce engine.
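
The kind of decision such an optimizer makes can be sketched as follows. This is only a minimal illustration under stated assumptions, not how Hadapt or Polybase actually plan queries: the operator names, the linear plan, and the rule that node-local operators always run as native SQL are all hypothetical.

```python
# Hypothetical operator names; real optimizers work on richer plan trees.
NODE_LOCAL_OPS = {"scan", "filter", "local_aggregate"}

def assign_engines(plan, interactive):
    """Return a list of (operator, engine) pairs for a linear plan.

    Assumed rule: interactive queries avoid MapReduce entirely; otherwise
    node-local operators run as native SQL and the remaining (cross-node)
    operators are left to MapReduce for its mid-query fault tolerance.
    """
    assignments = []
    for op in plan:
        if interactive or op in NODE_LOCAL_OPS:
            engine = "native_sql"
        else:
            engine = "mapreduce"
        assignments.append((op, engine))
    return assignments

plan = ["scan", "filter", "repartition_join", "global_aggregate"]
batch_plan = assign_engines(plan, interactive=False)
interactive_plan = assign_engines(plan, interactive=True)
```

With interactive=False the scan and filter stay in native SQL while the join and final aggregate go to MapReduce; with interactive=True every operator is assigned to native SQL.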

Although each of these “SQL-on-Hadoop” categories has different advantages and disadvantages, as a group, they significantly bring Hadoop forward from where it was a year ago, and greatly expand the use cases for which Hadoop technology can be used. As vendors continue to abandon the DBMS-connector approach, customers win through cleaner architectures, fewer data silos, and simplified systems administration.

8 Responses to “Classifying the SQL-on-Hadoop Solutions”

  1. Eli Singer

    Daniel,

    Thanks for a very clear mapping of current SQL-on-Hadoop approaches and pointing out the architectural differences between them. I would like to suggest adding JethroData to this list.

    JethroData is an SQL and Indexing engine for Hadoop. It works by automatically indexing data as it is written into Hadoop. SQL queries use indexes to access only the data they need instead of performing a full-scan of the entire dataset. Both indexes and column data generated by Jethro are stored as standard HDFS files. Jethro Query nodes are used to process SQL requests and access data directly, bypassing MapReduce. Query nodes are typically separate servers from HDFS storage nodes.

    Jethro’s solution probably fits in your 2nd category, as it’s an SQL engine running natively on HDFS data, although it is not inspired by Google’s “commandments”.

  2. Daniel Abadi

    Thanks Eli. Sorry for forgetting about you guys. I totally agree that JethroData fits in category 2. In retrospect, I should probably not have put “Google-inspired” in the category name — it’s too restrictive. I appreciate the feedback.

  3. Sadu Hegde

    Hi Daniel,

    How do category (3) tools address disadvantages (a) and (b) of category (1) – non-standard SQL and poor SQL coverage?

    Thanks,
    Sadu

  4. Daniel Abadi

    Hi Sadu,

    Good question. There is nothing fundamental about category (3) that solves the non-standard SQL / poor coverage issue. However, given that category (1) derives from Hive (which is the source of the non-standard SQL problem) and category (2) derives from Google’s internal database products (Google only needed to focus on a 4-star Google employee user base, so SQL had not been a priority), category (3) doesn’t have the historical baggage that led to the shortcomings of categories (1) and (2). Furthermore, both Hadapt and Polybase are based on relational technology, so SQL coverage is an easier task for them.

  5. Vincent

    Daniel,

    Nice article. What category does IBM’s new Big SQL initiative fall under? Would that be category 3, since it claims to leverage MapReduce or point queries as necessary for quicker response time? “Big SQL provides support for large ad hoc queries by using MapReduce parallelism and point queries, which are low-latency queries that return information quickly to reduce response time and provide improved access to data.”

    Any thoughts on IBM’s technology, given they are both a Hadoop as well as DBMS/Warehouse player?

    Thanks

  6. Daniel Abadi

    Hi Vincent,

    I’m not 100% comfortable commenting on Big SQL, since I haven’t seen a full paper about it published in an academic conference that gives details about how it works (both Hadapt and Polybase have such papers in SIGMOD/VLDB); but from the quote you gave in your comment and also the marketing material I’ve found online, it certainly sounds like Big SQL would be category 3.

  7. Frank

    Daniel, thank you for this excellent article.
    I wonder: what were Hive’s motivations and justifications for creating a new SQL dialect that does not comply with the SQL standard?
    Is there any performance study or benchmark comparison of these tools?
    Can somebody recommend a link to satisfy my curiosity?

  8. Sean

    Others that fit in this list are Presto (Facebook), Kiji (WibiData), Apache Tajo, and Phoenix (Salesforce.com).

