In my last post on this blog, I outlined the flaws of the ubiquitous Hadoop-DBMS “connector” technology that unnecessarily links together two different systems that have an extremely similar architecture.
In this post, I will discuss a new loading technology we call invisible loading that addresses many other pain points that people encounter when using a big data solution that involves structured and unstructured components. This (Hadapt-owned) patent-pending technology was presented by the lead author, Azza Abouzied, last week, and we are finally releasing the academic paper behind this technology today as part of this blog post.
The basic idea is the following: an unstructured data store such as HDFS is great for raw data sets since it does not require the raw data to conform to any predefined schema, and can handle large amounts of data at extremely low cost. This raw data can be scrubbed, cleaned, and refined via MapReduce jobs, Pig scripts, and other useful tools in the Hadoop ecosystem. Over time, this raw data becomes increasingly structured, at which point the “transformation” part of its lifetime comes to an end and the “query” part of its lifetime commences: the data set is now repeatedly accessed and queried via tools and languages like Hadapt, Hive, and SQL. This common data processing workflow is illustrated in the diagram below.
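To make the "transformation" phase concrete, here is a toy scrub step over a hypothetical log format (the field layout and function name are illustrative, not from the paper) — the kind of cleaning a Map function would apply to raw HDFS data:

```python
def scrub(raw_line):
    """Map-style scrub: drop malformed lines, normalize the rest
    into (timestamp, user, bytes) tuples."""
    parts = raw_line.strip().split()
    if len(parts) != 3:
        return None                      # discard malformed input
    ts, user, nbytes = parts
    if not nbytes.isdigit():
        return None                      # discard non-numeric byte counts
    return (ts, user.lower(), int(nbytes))

raw = ["2013-01-01 Alice 512", "garbage line", "2013-01-02 BOB 1024"]
records = [r for line in raw if (r := scrub(line)) is not None]
print(records)  # [('2013-01-01', 'alice', 512), ('2013-01-02', 'bob', 1024)]
```

After enough passes like this, the output is uniform enough that each line parses into the same tuple shape — which is exactly the signal that the "query" phase of the data's lifetime has begun.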
In existing big data solutions, this transition between the transformation and query phases of the data lifetime needs to be detected by a human, who will then initiate a load from one shared-nothing scalable parallel analysis platform (Hadoop) to another shared-nothing scalable parallel analysis platform (an MPP database system). In addition to the design flaws of moving data between architecturally similar data processing platforms (as discussed in my previous post), this workflow causes the following pain points:
- Design expertise and efforts required in separating transformation and query phases
- ETL expertise and efforts needed in moving data from unstructured to structured store
- Performance overhead involved in executing the aforementioned ETL jobs
However, it is possible to automatically detect when data becomes structured enough to fit in a structured store. For example, there exist tools like PADS and RecordBreaker that can discover structure in datasets. Alternatively, it is possible to use static code analysis to detect when MapReduce jobs issued over data in HDFS assume a certain structure (e.g., if there is parsing code at the beginning of the Map phase of a MapReduce job). Furthermore, if the user is using Pig parsing libraries or has already created a schema for Hadapt, Hive, or SQL, then it is trivial to discover the structure in the data.
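As a rough illustration of structure discovery (a minimal sketch, nothing like the actual PADS or RecordBreaker algorithms — the helper names are hypothetical), a system can sample records, guess the narrowest type for each field, and widen to text wherever samples disagree:

```python
def infer_type(value):
    """Guess the narrowest type that can represent one field value."""
    for cast, name in ((int, "INT"), (float, "FLOAT")):
        try:
            cast(value)
            return name
        except ValueError:
            pass
    return "TEXT"

def infer_schema(sample_lines, delimiter=","):
    """Infer per-column types from sample records;
    widen a column to TEXT if the samples disagree."""
    columns = None
    for line in sample_lines:
        types = [infer_type(f) for f in line.rstrip("\n").split(delimiter)]
        if columns is None:
            columns = types
        else:
            columns = [a if a == b else "TEXT" for a, b in zip(columns, types)]
    return columns

sample = ["42,3.14,alice", "7,2.71,bob"]
print(infer_schema(sample))  # ['INT', 'FLOAT', 'TEXT']
```

When a stable schema like this emerges across samples, the data is a candidate for structured storage.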
Once the system has discovered that the data has a certain structure to it, there can be huge performance benefits to storing that data in a structured format that can leverage knowledge of the repeated structure in the data. This structured data store need not be an MPP database that sits across the network on a different set of servers; rather, data can simply be structured into relational storage on the same physical servers as the raw data.
The key contribution of our research is that this shifting of data from unstructured data stores such as HDFS to structured storage sitting on the same physical machine can happen invisibly and incrementally. The data scientist (or data analyst, or BI client, etc.) can access the data via MapReduce jobs, Pig scripts, or any other standard interface to Hadoop. These jobs read data from HDFS just like normal, and return results just like normal. However, since the data needs to be read anyway in order to process the job/query, a subset of the data that is read is moved into structured storage. Future jobs/queries over the same input data set automatically merge the data that is still in HDFS with the data that is in structured storage (with reads from structured storage being much faster than reads from HDFS). The user/client is completely unaware of the invisible data movement — all that is observable is a steady improvement in query performance as more and more data is read from structured storage.
What’s cool about this process is that the incremental nature of the data shifting allows for the human to be eliminated from the loading process. If the data is still being continuously transformed and refined, then very little progress will be made in moving data to structured storage. However, once the data becomes stable, and is continuously queried, incremental progress will be made for loading the data into faster, structured storage, until the entire data set ends up there. Meanwhile, the cost of the load is nearly invisible, since the reading of the data was required anyway to process the early queries.
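The mechanism can be modeled in a few lines. In this toy sketch (my own simplification, not Hadapt's implementation), a Python list stands in for raw HDFS files and an in-memory SQLite table stands in for co-located relational storage; every scan answers the query over both stores and, as a side effect of the read that was required anyway, migrates one batch of parsed rows into the structured store:

```python
import sqlite3

class InvisibleLoader:
    """Toy model of invisible loading: each full scan merges structured
    and raw rows, and invisibly migrates a batch of raw rows."""

    def __init__(self, raw_lines, batch_size=2):
        self.raw = list(raw_lines)           # stand-in for raw HDFS data
        self.batch_size = batch_size
        self.db = sqlite3.connect(":memory:")  # stand-in for local relational storage
        self.db.execute("CREATE TABLE records (id INTEGER, name TEXT)")

    def scan(self):
        # Fast path: rows already migrated to structured storage.
        structured = self.db.execute("SELECT id, name FROM records").fetchall()
        # Slow path: parse whatever is still raw (this read was needed anyway).
        parsed = [tuple(line.split(",")) for line in self.raw]
        parsed = [(int(i), name) for i, name in parsed]
        # Invisible load: keep a batch of the rows we just parsed.
        batch, self.raw = parsed[:self.batch_size], self.raw[self.batch_size:]
        self.db.executemany("INSERT INTO records VALUES (?, ?)", batch)
        # The caller always sees the complete result set.
        return structured + parsed

loader = InvisibleLoader(["1,alice", "2,bob", "3,carol"], batch_size=2)
print(len(loader.raw))  # 3 raw lines before any query
loader.scan()           # answers the query; migrates 2 rows as a side effect
print(len(loader.raw))  # 1 raw line left
loader.scan()           # the next scan migrates the rest
print(len(loader.raw))  # 0 -- fully loaded, invisibly
```

Every scan returns identical, complete results; the only thing the caller could observe is that successive scans read less and less raw data, which is exactly the "steady improvement in query performance" described above.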
The Hadapt Advantage
Technologies such as invisible loading will enable Hadapt to function as a data refinery as well as a full-service Big Data analytical platform. The raw data starts in HDFS (Hadapt is integrated with Hadoop and leverages various Hadoop components), and as it becomes refined and structured, it automatically gets moved into optimized structured storage for fast querying and structured analysis (via languages such as SQL). This process happens automatically, without human intervention, and also as a side effect of standard interactions with the Hadapt/Hadoop platform, so that the client is unable to detect this data movement. Furthermore, it occurs within the same physical hardware, so that the network is not burdened by this loading process (otherwise the loading would certainly not be invisible).
While the invisible loading technology can potentially be adapted to other big data systems, the unified (connectorless) analytics nature of Hadapt makes it the only platform in which the full power of invisible loading can be unleashed.
There are obviously a lot of important details to make all of this work. We encourage readers of this blog to read both the original research paper and the slides that Azza used a couple of days ago to present this work. We also expect to follow up this post with additional details in the future.