For several years, Hadapt and Hive were the only SQL-on-Hadoop solutions. However, as more and more people began to understand the problems and inherent waste of the database connector approach, a preponderance of SQL-on-Hadoop solutions entered the market, starting around one year ago this week (at Hadoop World last year). There are so many now, I dedicated a blog post a month ago to classifying them all.
All of these SQL-on-Hadoop solutions are undoubtedly a good thing for Hadoop, and are critical for Hadoop to integrate with other tools in the analytical ecosystem and adopted by enterprises with deep SQL experience and knowledge.
There’s just one problem: all current SQL-on-Hadoop solutions are completely missing the point of Hadoop.
To defend this assertion, first we have to agree on the ‘point of Hadoop’. Obviously, there could be many opinions on this matter. Hadoop does many things well: it can serve as a platform for data processing and analysis, it can scale to enormous data sets, it is fault tolerant and dynamic, and it automatically parallelizes complex tasks. However, traditional parallel database technology can do all of those things as well. What makes Hadoop unique, and gives it an opening in the market, is its flexibility. It can perform analysis and other data processing tasks over unstructured data and semi-structured data just as well as it can handle structured data. Furthermore, “time to analysis” is very low — no schema or ETL process is required before loading data into Hadoop — data can simply be dumped into HDFS and accessed via MapReduce jobs, higher level scripts, or other schema-on-read access methods. It is Hadoop’s flexibility with regard to data structure and schema that has resulted in its momentum and popularity in the Big Data market, and I would therefore argue that this “flexibility” is the point of Hadoop.
Unfortunately, all current SQL-on-Hadoop solutions, including Hive, Impala, Stinger, and even Hadapt (until recently) significantly restrict this flexibility. For all of these solutions, if you want to issue queries against your data in Hadoop, the first thing you need to do is define a schema for your data. If you don’t understand your data well enough to define a schema, or you don’t know what attributes or types you have, or generally want to further process your data before defining a schema, none of these SQL-on-Hadoop solutions will work for you. No schema means no querying.
Just to be clear, this restriction is not unreasonable. SQL stands for “structured query language”, and queries in SQL are expressed against an assumed schema — issuing SQL against data without a schema is somewhat nonsensical. SQL queries fundamentally need to refer to attributes and tables from a schema.
Thus, we have a problem. Hadoop is famous for its flexibility and ability to handle data that’s not structured or has a dynamic or unknown schema. Yet SQL requires a schema for query processing and therefore SQL-on-Hadoop solutions make users define a schema before they can issue SQL queries. SQL-on-Hadoop solutions are therefore missing the point of Hadoop.
At Hadapt, last year we realized that this requirement is unnecessary. While it is true that SQL requires a schema, it is entirely untrue that the user has to define this schema in advance before query processing. There are many data sets out there, including JSON, XML, and generic key-value data sets that are self-describing — each value is associated with some key that describes what entity attribute this value is associated with. If these data sets are stored in Hadoop, there is no reason why Hadoop cannot automatically generate a virtual schema against which SQL queries can be issued. And if this is true, users should not be forced to define a schema before using a SQL-on-Hadoop solution — they should be able to effortlessly issue SQL against a schema that was automatically generated for them when data was loaded into Hadoop.
Thanks to the hard work of many people at Hadapt from several different groups, including the science team who developed an initial design of the feature, the engineering team who continued to refine the design and integrate it into Hadapt’s SQL-on-Hadoop solution, and the customer solutions team who worked with early customers to test and collect feedback on the functionality of this feature, this feature is now available in Hadapt.
To understand the feature better, let’s look at a (simplified) example of a retail company that has a relational table containing information about customers, and a click-stream of customers interacting with their website (represented as a series of JSON documents) shown below:
Let’s say this company wants to send an email to all customers who viewed the ‘window’ product (at least those who were logged in and whose userId is known). In order to find this list of email addresses, information from both data sets must be extracted — the list of customer userIDs who viewed the ‘window’ product from the clickstream, and the email address of these customers from the relational dataset.
For standard SQL-on-Hadoop solutions, extracting email addresses for particular customers from the relational data is fairly straightforward. However extracting the users from the clickstream who viewed the ‘window’ product is much harder, both because the JSON documents are not in relational form, and also because each click has a different set of attributes associated with it, depending on which page was accessed by the user (so converting the JSON into relational format with a queryable schema is not straightforward). Consequently, the JSON data is either (1) completely inaccessible (2) accessible only via complex UDFs along with a JSON SerDe or (3) accessible after an expert who understands the data creates an ETL process that relationalizes the JSON data as much as possible (which drops attributes that don’t fit into the relationalized schema). None of these options are desirable, and as a result, the clickstream data of the form shown above — data that should be in Hadoop’s wheelhouse — is generally underutilized by SQL-on-Hadoop solutions.
With Hadapt, any data that can be parsed into key-value pairs (such as the JSON clickstream shown above) can be queried via SQL in a much more straightforward way. When such “self-describing” data is loaded into a Hadoop cluster running Hadapt, a schema is automatically created with one virtual column for every unique key in the data set. For example, the JSON data shown above will result in the automatically generated schema shown below:
Due to the sparsity of the relational view of the key-value data (some keys appear very rarely in the data set), Hadapt does not physically store the data according to the relational view. Nonetheless, the view to the user is exactly the above relation, and SQL queries can be issued against it. For example, if the relational table containing information about customers is called customers and the above view of the JSON click data is called clicks then the following SQL can be issued to execute the example query discussed above:
FROM customers, clicks
WHERE customers.userID = clicks.userID
AND action = ‘view’
AND productName = ‘window’
Obviously the above example is simple, but hopefully the power of issuing standard SQL against data whose schema is not known in advance is clear.
The great thing about this approach is that since the schema is automatically generated, it can be just as dynamic as the data itself. As new JSON documents are added that contain previously unseen attributes, these attributes are automatically added to the generated schema with no customer involvement whatsoever.
At the end of the day, requiring a user to define a schema before data can be queried in SQL is simply too restrictive. SQL-on-Hadoop solutions need to be just as flexible and dynamic as the Hadoop platform itself. Anything short of this seems to be missing the point of Hadoop.
(I will go into more detail about the approach in my talk tomorrow at Strata/HadoopWorld — October 29 at 11:50AM in the Rhinelander Center room at the conference).