Before Hadoop, the Big Data market was dominated by a large variety of proprietary relational databases. These relational databases were focused on achieving high-performance query processing of structured data. Although most relational databases do not have unlimited scalability (even the parallel “MPP” relational databases), they were still scalable enough to control the vast majority of the Big Data market. However, starting around 2008, Hadoop began to disrupt this market in three major ways:
- Hadoop uses a more flexible programming framework to enable high-performance query processing, even for unstructured data.
- Hadoop is open source, whereas every other parallel database system is built on proprietary code.
- Hadoop is more scalable than even the most scalable relational database system.
Fearful of Hadoop’s emergence, relational database vendors initially reacted by (correctly) blasting Hadoop’s technical deficiencies in an attempt to stymie its growing popularity. Most of the criticism centered on the fact that Hadoop is not optimized for processing traditional, structured, relational data, and that it is “batch-oriented” (as opposed to real-time).
As Hadoop continued its ascent within the enterprise, the relational vendors had no choice but to resort to a coexistence strategy — pigeonholing Hadoop into the role of ETL (extract, transform, load) and processing of unstructured data, leaving relational databases to process all structured data.
At the same time, new start-ups were emerging to commercialize Hadoop, mostly by adding services, support, and management tools around the Hadoop open source core. From a pragmatic standpoint (to accelerate enterprise adoption of Hadoop and collect the services and support dollars that come with such adoption), these start-ups were willing to let the relational database vendors pigeonhole Hadoop, which enabled them to partner with the much bigger relational database providers. Although this confined Hadoop to a small subset of possible enterprise use cases, the partnerships produced immediate revenue for the small Hadoop start-ups and higher valuations in follow-on venture capital rounds.
Consequently, due to entirely pragmatic thinking and short-term motivations, every Big Data vendor that matters (both in the relational database space and in the Hadoop space) has advocated a two-system approach to processing Big Data: Hadoop for the unstructured data and relational databases for the structured data, with a connector between them shipping data back and forth over a network connection.
Every single one of these vendors is incorrect. This is a poor architectural vision for the future of Big Data processing.
Many people don’t realize that Hadoop and parallel relational databases have an extremely similar design. Both are capable of storing large data sets by breaking the data into pieces and storing them on multiple independent (“shared-nothing”) machines in a cluster. Both scale processing over these large data sets by parallelizing the processing of the data over these independent machines. Both do as much independent processing as possible across individual partitions of data, in order to reduce the amount of data that must be exchanged between machines. Both store data redundantly in order to increase fault tolerance. The algorithms for scaling operations like selecting data, projecting data, grouping data, aggregating data, sorting data, and even joining data are the same. If you squint, the basic data processing technology of Hadoop and parallel database systems is identical.
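The shared design can be made concrete. The sketch below is my own illustration (not code from either system): it hash-partitions rows across simulated “nodes”, runs a local SUM … GROUP BY on each partition, and then merges the small partial results. This is the same pattern whether you call it a MapReduce combiner followed by a reduce, or an MPP database’s local aggregation followed by a global merge.

```python
# Illustrative sketch of shared-nothing parallel aggregation, the
# pattern common to both Hadoop and MPP relational databases:
# aggregate locally per partition, then merge the small partial
# results so little data crosses the network.
from collections import defaultdict

def partition(rows, num_nodes, key):
    """Hash-partition rows across nodes, as both systems do on load."""
    parts = [[] for _ in range(num_nodes)]
    for row in rows:
        parts[hash(row[key]) % num_nodes].append(row)
    return parts

def local_aggregate(part, key, value):
    """Per-node partial SUM(value) GROUP BY key -- Hadoop's combiner,
    or a parallel database's local aggregation step."""
    acc = defaultdict(int)
    for row in part:
        acc[row[key]] += row[value]
    return acc

def merge(partials):
    """Final merge of the per-node partial results."""
    total = defaultdict(int)
    for p in partials:
        for k, v in p.items():
            total[k] += v
    return dict(total)

rows = [{"dept": "a", "sales": 10}, {"dept": "b", "sales": 5},
        {"dept": "a", "sales": 7},  {"dept": "b", "sales": 3}]
parts = partition(rows, num_nodes=3, key="dept")
result = merge(local_aggregate(p, "dept", "sales") for p in parts)
print(result)  # dept totals: a -> 17, b -> 8
```

The key property is that only the tiny per-node partial results move between machines, while the bulk of the work happens where the data already lives; both camps converged on this design for exactly that reason.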
There is absolutely no technical reason why there needs to be two separate systems doing the exact same type of parallel processing. While it is true that, today, Hadoop is lacking some of the important features that are available in relational database systems, this gap is closing over time. And while it is true that primary storage in Hadoop (HDFS) is a file system that is optimized for unstructured data, and the primary storage of parallel database systems is a set of relational tables that are optimized for structured data, there is no reason why the file storage and relational storage can’t sit side by side on the same physical machines and even on the same disk drives.
There is no reason why you need two different systems, sitting in two different clusters, that are architected in a fundamentally similar way. There’s no reason to pay the increased management costs of having two different systems built by two different vendors. There’s no reason to pay the organizational costs that are incurred by the data silos that are created through having multiple systems. And there’s certainly no reason to pay the networking costs of getting a decent-sized communication pipe between them.
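A back-of-envelope calculation shows why the networking cost bites. The numbers below are illustrative assumptions of my own (the text names no figures): a 10 TB table and a dedicated 10 Gbit/s link between the two clusters.

```python
# Back-of-envelope sketch of the connector's networking cost.
# The table size and link speed are illustrative assumptions.
table_bytes = 10 * 1024**4              # a 10 TB structured fact table
link_bits_per_s = 10 * 10**9            # a dedicated 10 Gbit/s pipe
link_bytes_per_s = link_bits_per_s / 8  # = 1.25 GB/s of raw bandwidth

transfer_hours = table_bytes / link_bytes_per_s / 3600
print(f"{transfer_hours:.1f} hours just to move the data")  # ~2.4 hours
```

Even at full line rate, hours elapse before the receiving system can begin any actual processing, which is the overhead a co-located single system never pays.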
Connectors between databases and Hadoop are entirely the wrong way to think about scalable data processing of enterprise data. Even the upgraded connectors that have been announced in recent months by database and Hadoop vendors (e.g. making a connector “transactional”, or leveraging projects like HCatalog to make the connector more intelligent at the end points) are counterproductive and only serve to propagate a data processing design that is a fundamentally poor long term strategy for an organization. A super-charged connector is still a connector, and that’s the wrong architectural choice moving forward.
[Image: “This is What Happens to a Bridge with Heavy Traffic”]
In the future, it is clear that a single Hadoop installation will be enough to process both structured and unstructured data in the same system. The only question is how long it will be before the Hadoop vendors will be willing to abandon their short-term partnership strategy with relational database vendors and attack them head-on. It likely won’t be very long.