As most readers of this blog are already aware, today we announced Hadapt 2.0, which brings several major upgrades to our software, including “interactive” SQL on Hadoop and a new Hadapt Development Kit (HDK) that greatly expands the variety and reusability of analytical applications that can be built on top of Hadapt. Curt Monash (http://bit.ly/Hadapt20DBMS2) and Derrick Harris (http://bit.ly/Hadapt20GigaOM) have already posted some nice commentary and further details about Hadapt 2.0. In this post I want to expand on the last paragraph of Derrick’s post, which includes a quote from me highlighting the interactive query part of Hadapt 2.0, the feasibility of building this inside Hadoop, and Hadapt’s historical mission to achieve this.
Hadoop started out as an open source effort to replicate the system described in the MapReduce research paper that was published by Google in 2004. It started gaining steam in 2006 and was finally adopted by several major Web enterprises for use in production in 2008. By 2009 it became clear that Hadoop was going to be a major force to be reckoned with for processing unstructured data. Between then and now, just about everybody in the industry has agreed that Hadoop and database systems are perfectly complementary: Hadoop can be used for processing unstructured data, ETL-style transformations, and one-off data processing jobs, while database systems can be used for fast SQL access to structured data. Data can be shipped between Hadoop and relational database systems over a connector. For example, a Hadoop job can be run to structure the data, after which it is sent to a relational database system (which may be bundled together with Hadoop in the same cluster “appliance”) where it can be queried using SQL.
Since the vast majority of the world has spent the past 4 years agreeing that MapReduce and database systems are complementary, very few people perceived the need for high performance SQL on Hadoop. If database systems and Hadoop are going to be deployed side by side (for example, in an appliance that includes “Hadoop nodes” and “database nodes”), it is totally redundant to give Hadoop a high quality SQL interface, since the complementary database system can be used for SQL access. Therefore projects like Hive have languished in mediocrity, with far fewer active developers than other more strategic elements of the Hadoop ecosystem (such as HDFS).
We believed differently, and for the last 4 years we have been espousing a vision contrary to this conventional wisdom. Instead of viewing Hadoop and database systems as complementary, we have viewed them as competitive, and have championed the idea of bringing high performance SQL to Hadoop in order to create a single system that can handle both structured and unstructured data processing. In 2008 we started building a system called HadoopDB that does exactly this, and by March 2009 we completed our initial prototype and submitted our work to VLDB. The work was accepted and published at VLDB, and we founded Hadapt shortly afterwards (in 2010) to productize this defiant vision.
Over the past several years, we have been laser-focused on turning Hadoop into an all-purpose analytical platform for both unstructured and structured data while providing high performance SQL access to it. We have worked hard on getting high performance for joins, improving optimization and scheduling of SQL queries, and delivering good performance on complex, ad-hoc, data warehousing-style queries. With Hadapt 2.0, we have even managed to remove the Hadoop start-up overhead for the shorter, simpler queries, so that these queries can run in less than a second.
Despite our focus from the beginning on bringing high performance SQL to Hadoop, only now are we willing to call ourselves “interactive”. To us, interactivity implies a truly fluid and engaging experience for the user with the system. It must include both of the following characteristics:
- Simple queries that involve selections, projections, and aggregations should have response times measured in milliseconds.
- More complex ad-hoc queries that may involve multiple joins should finish quickly enough that the user does not have to switch to something else while the query runs in the background.
Building a truly interactive system that includes both of the above characteristics is highly nontrivial. Our robust foundation began with adding fundamental relational database technology to Hadoop. Had we not started working on this 4 years ago and focused our entire engineering effort on bringing relational database technology to Hadoop, we would not be able to offer anything near the quality of the software in Hadapt 2.0. As customers increasingly demand a single unified system for multi-structured analytics, as opposed to multiple systems with connectors between them, Hadapt is the leading innovator and extremely well positioned to meet these customer demands.
Interesting. Since Hive already offers an almost SQL-like language, I am wondering how Hadapt differs from Hive and how it provides better performance. If the data is completely unstructured and huge, would Hadapt still perform better than Hive?