Google Cloud Drops MapReduce: The Big Data Wheel Keeps Turning
by Brad Johnson, on Sep 27, 2019 12:35:28 PM
Yesterday, Google Cloud’s SVP of Technical Infrastructure, Urs Hölzle, announced that they’d “removed the remaining [MapReduce] internal codebase for good.” He went on to elaborate that GCP was ditching MapReduce in favor of Google Cloud Dataflow, which “expresses pipelines more naturally with less code, and you get both batch and streaming from the same code.” It seems like a logical evolution for GCP, and probably a long time in the making. Kudos to the GCP team for another successful evolution of their cloud platform!
More than the announcement itself, I couldn’t help but be amused by some of the reactions. The general sentiment celebrated the move as both inevitable and positive. Others ranged from “Spark FTW!!!” to fanfare for data pipeline architectures. The reaction of NATS.io’s Derek Collison will likely prove prescient, as he suggested that “most outside trends follow about 7-8 yrs behind Google. So start the clock for the end of Hadoop. I think it will happen faster though.”
Lastly, I wanted to include Urs Hölzle’s closing thoughts on the matter, which serve as an apt summary of GCP’s perspective: “R.I.P. MapReduce, but long live cloud data analytics!”
It Seems to Me Like We’re Missing the Bigger Shift Here…
While the phasing out of the venerable MapReduce is certainly news, it seems like we’re at risk of missing a larger shift occurring here. The cloud data analytics landscape is rapidly evolving, and I’d argue MapReduce is just as much a victim of decentralization as of more efficient data processing alternatives. Just as monolithic MapReduce architectures have evolved into composable pipeline architectures, we’re going to continue on a path toward the total decentralization of data processing (and then, of course, the pendulum will ultimately swing back again…). Call it edge computing, distributed cloud computing, whatever; to me, that’s the bigger story here.
Swim.ai’s founder and Chief Architect, Chris Sachs, suggested this shift is akin to an earlier one in the hardware landscape. GPUs used to run fixed-function rendering pipelines. Those fixed-function pipelines were later replaced with programmable shaders, which enabled more flexible use of GPUs. With programmable shading, graphics pipelines became but one use case for general-purpose GPU hardware.
The problem with the pipeline approach to cloud data analytics is that each pipeline has a fixed function. Pipelines treat data like cattle, funneling it through a homogeneous meat grinder. Because they are fixed-function, data pipelines have no choice but to operate this way. In reality, data is rarely so homogeneous.
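To make the “fixed function” point concrete, here’s a minimal Python sketch of the pattern (the stage names are illustrative, not any particular framework’s API): every record, whatever it represents, is pushed through the same predetermined sequence of stages.

```python
# Minimal sketch of a fixed-function data pipeline: every record,
# regardless of what it represents, passes through identical stages.

def parse(record):
    # Stage 1: split a "key,value" string into a structured event.
    key, _, value = record.partition(",")
    return {"key": key, "value": float(value)}

def enrich(event):
    # Stage 2: the same transformation is applied to all events.
    event["squared"] = event["value"] ** 2
    return event

STAGES = [parse, enrich]  # the pipeline's function is fixed up front

def run_pipeline(records):
    for record in records:
        event = record
        for stage in STAGES:  # identical treatment for all data
            event = stage(event)
        yield event

results = list(run_pipeline(["a,2", "b,3"]))
```

The pipeline can be efficient precisely because it is rigid: no stage consults what kind of entity a record belongs to, or what has been seen before.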
Of Pixels and Data
Like fixed-function rendering GPUs, engines like Spark treat all data the same. While Spark may be an improvement over MapReduce, it's only an incremental one. The future of data analytics will be aggressively specialized based on the complex, dynamically changing context in which individual data entities exist. This will require "intelligent" data processing within data pipelines, something that Spark and pipeline data architectures can't account for today. To do so efficiently, statefulness will be a defining characteristic of tomorrow's cloud analytics architectures.
Oh yeah, and we're already doing all that today with swimOS.