I have a hypothesis: a brain twice as massive as a human brain would only be marginally more intelligent. And a single brain with 7 billion times the mass of a human brain would be orders of magnitude less capable than all of humanity. This feels analogous to Haldane’s famous 1928 piece “On Being the Right Size”, mapped into the domain of learning.
It's in the math. It would take something like 2 × 10^28 times longer to train one brain with the mass of 7 billion human brains than it takes to train 7 billion single-human-mass brains. Measured in human lifetimes, training the big brain would take about 10^20 times the present age of the universe.
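The arithmetic can be sketched in a few lines. The exponent is my assumption – I've taken training cost to grow roughly cubically with brain mass, which lands in the same ballpark as the figures above – and the 7 billion small brains are assumed to train in parallel:

```python
# Back-of-envelope sketch; assumptions mine: training cost grows
# roughly cubically with brain mass, and the 7 billion small brains
# train in parallel.
SCALE = 7e9                 # big-brain mass, in human-brain units
EXPONENT = 3                # assumed superlinear training-cost exponent

big_brain_cost = SCALE ** EXPONENT       # relative cost of one giant brain
parallel_cost = 1.0                      # 7e9 human brains, trained in parallel
ratio = big_brain_cost / parallel_cost   # ~3.4e29, same ballpark as ~2e28

# Express the big brain's training time as multiples of the universe's age
HUMAN_TRAINING_YEARS = 20                # rough time to "train" one human
UNIVERSE_AGE_YEARS = 1.38e10
ages_of_universe = ratio * HUMAN_TRAINING_YEARS / UNIVERSE_AGE_YEARS
```

With an exponent nearer 2.9 the ratio comes out at the quoted ~2 × 10^28; the qualitative point – a superlinear polynomial buries the big brain – survives any reasonable choice of exponent.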
The point is that intelligence doesn't scale up. But it does scale out: narrow intelligence (lots of individual humans) scales well when networked together with a framework that permits distributed learning.
I've been thinking about this in the context of the "problem" of neural networks overfitting their data. According to the literature, it's a “bad thing” if the Swim traffic-prediction neural networks get too good at fitting Tuesday-morning traffic conditions, because that likely makes them worse at predicting Saturday-evening data. But wait – why is that a problem? Why not just run the Tuesday-morning net on Tuesday mornings, and the Saturday-evening net on Saturday evenings? Scale like economies do: with division of labor. It goes further: why try to solve the problem of predicting traffic behavior for the Bay Area, or even for Palo Alto, when each intersection can efficiently learn locally from its own behavior and make accurate predictions based on a narrow view? The aggregation of the predicted behavior of each intersection is the predicted behavior of the entire city, and that is a problem that is easy to solve on low-powered hardware.
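A toy sketch of the per-intersection idea (the names and the running-mean "model" are mine, standing in for real local nets): each intersection learns only from its own observations, and the city-wide prediction is just the aggregation of the local predictions:

```python
# Toy sketch (names hypothetical): each intersection learns its own
# traffic rate from local observations; the city-wide prediction is
# simply the sum of the local predictions.
from collections import defaultdict

class LocalPredictor:
    """Running-mean 'model' standing in for a tiny per-intersection net."""
    def __init__(self):
        self.n, self.mean = 0, 0.0
    def learn(self, observed_rate):
        self.n += 1
        self.mean += (observed_rate - self.mean) / self.n  # incremental mean
    def predict(self):
        return self.mean

city = defaultdict(LocalPredictor)
# Feed each intersection only its own observations (vehicles/hour)
for intersection, rate in [("university_ave", 120), ("university_ave", 140),
                           ("embarcadero", 80), ("embarcadero", 100)]:
    city[intersection].learn(rate)

# City-wide prediction = aggregation of narrow, local predictions
city_wide = sum(p.predict() for p in city.values())
```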
Overfitting is considered a problem because most AI is trained offline – only inference happens online. The giant “deep mind” approaches taken by the major cloud and vertical-solution vendors are built to try to understand everything, all the time – and they're on the losing slope of a polynomial. By learning locally, on the fly, on time-series data as it happens – just as a human would if they were standing there observing the local environment – we can solve the problem in a different way. Distributed, local, low-powered learning is the way to go.
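Learning online on the stream can be as simple as one gradient step per observation. A minimal sketch, with a one-weight linear model standing in for a real net (everything here is illustrative):

```python
# Online, local learning: one SGD step per arriving observation, so
# there is no separate offline training phase. (Toy model: y ≈ w*x.)
def online_sgd(stream, lr=0.1):
    w = 0.0
    for x, y in stream:               # (feature, observed outcome) pairs
        pred = w * x
        w -= lr * (pred - y) * x      # gradient step on squared error
    return w

w = online_sgd([(1.0, 2.0)] * 100)    # stream repeatedly says y = 2x
```

The model never sees the whole dataset at once – it tracks the stream as it happens, which is exactly what makes a per-slot, per-location net cheap to keep current.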
It's much cheaper to have a "CEO neural network" that decides how to arrange cheaper, distributed local learning activities than it is to try to model the entire system. For large environments, it’s potentially trillions of times cheaper.
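The "trillions" figure is consistent with simple arithmetic under the same assumption of (say) cubic training cost. With a million locations, one monolithic model over everything versus a million local models differ by a factor of N² (the exponent and the location count are my assumptions, not figures from the text):

```python
# Rough arithmetic behind "trillions of times cheaper" (cubic exponent
# and location count assumed): a monolithic model over N locations of
# size m costs (N*m)**3, while N local models cost N * m**3.
N = 1_000_000                       # locations, e.g. intersections

def monolithic_cost(m):
    return (N * m) ** 3             # one model covering everything

def distributed_cost(m):
    return N * m ** 3               # N independent local models

savings = monolithic_cost(1) / distributed_cost(1)   # N**2 = 1e12
```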
Even a relatively low-powered, ARM-based NXP board can train a Tuesday-morning neural net for a few hundred intersections. A "society" of small AIs seems much more mathematically plausible to me at this point than a centralized superintelligence. Given that big AIs have rapidly diminishing returns – again, it's manifest in the math – why all the emphasis on big AI? I suspect it's because AI is driven by data scientists, and big brains are a data-science problem, whereas societies of little AIs are largely a brass-tacks engineering problem. Lucky for us, we have an ideal distributed environment in which to run a whole lot of little neural networks. For us, it's trivial to arrange to run the Tuesday-morning network on Tuesday mornings. Heck, we could have a separate network for every minute of every day of every week if we needed to. And critically, we have the self-awareness to watch these things run, so we can do in-the-loop meta-learning to figure out the best division of labor.
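The time-slot dispatch amounts to a dictionary lookup. A minimal sketch – the bucketing granularity and the stand-in "model" are hypothetical, not Swim's actual machinery:

```python
# One model per time slot: bucket timestamps by (weekday, hour) and
# dispatch each observation to the model that owns that slot.
from datetime import datetime

models = {}  # (weekday, hour) -> per-slot model state

def slot(ts: datetime):
    """Bucket a timestamp: weekday 0=Monday..6=Sunday, hour 0-23."""
    return (ts.weekday(), ts.hour)

def model_for(ts: datetime):
    # A running sample count stands in for a real per-slot neural net.
    return models.setdefault(slot(ts), {"samples": 0})

# Tuesday-morning traffic reaches only the Tuesday-morning model.
tue_8am = datetime(2024, 1, 2, 8, 30)   # 2024-01-02 was a Tuesday
sat_8pm = datetime(2024, 1, 6, 20, 15)  # 2024-01-06 was a Saturday
model_for(tue_8am)["samples"] += 1
model_for(sat_8pm)["samples"] += 1
```

Finer buckets – down to a net per minute of the week – are just a longer key.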
We could take it a step further and use genetic algorithms to pit neural networks against each other, killing off the ones that do poorly. That might give us an even cheaper way to coarsely search the solution space than backpropagation.
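A minimal sketch of that tournament, with a single scalar parameter standing in for a network's weights (the fitness function, population size, and mutation scale are all placeholders):

```python
# Toy genetic search: random candidates compete on a loss; the worst
# half are killed each generation and replaced by mutated survivors.
import random

random.seed(0)                        # reproducible toy run

def loss(params, target=0.7):
    return abs(params - target)       # stand-in for prediction error

population = [random.random() for _ in range(20)]
for generation in range(50):
    population.sort(key=loss)                     # best first
    survivors = population[:10]                   # kill the worst half
    children = [p + random.gauss(0, 0.05) for p in survivors]  # mutate copies
    population = survivors + children

best = min(population, key=loss)      # converges near the target
```

No gradients anywhere – selection pressure alone does the coarse search, which is what makes it a candidate for very cheap hardware.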
The rapid growth of fast-data streams demands that we stay on the cheap end of the polynomial and run a million “Average Joe” neural nets that process data on the fly, learning locally on commodity edge hardware. Combining the output of the edge learning functions is easy given the enormous capacity of the cloud – in both storage and computation – so the overall system is learned as many cheap sub-tasks: learning and predicting local conditions.