Time-series Machine Learning in the Cloud? Nah
by Simon Crosby, on Mar 6, 2018 12:17:28 PM
I love spending time with analysts / data scientists / quants. It takes me back to my days as a grad student when gnuplot was just becoming a thing. It’s still cool, and I’m still amazed at the rich, freely available tools that let data owners quickly get a handle on what their data means.
But I have to admit I’m struggling with the tech industry’s “received wisdom” about learning on large volumes of continuous data. There are tons of great open source components, and cloud vendors of course want to store your data and offer these tools as a service for your ML and analytics needs. That’s fine for cloud-native apps like Twitter, but if you already operate systems that deliver streams of data to on-prem infrastructure, a cloud-first architecture won’t cut it. Bandwidth, storage and processing sound cheap, but the costs add up fast. And despite the appeal of infinite storage, scale-out processing and powerful analytical tools offered as services, you still need to clean and label the data, and develop and train your models manually, with a data scientist who is also a domain expert.
To be clear, this is not a pitch in favor of the private cloud. Putting together an on-prem stack for learning from streaming data is a challenge: you’ll likely end up with an inferior tool set that demands substantial investment ahead of any return, and you’ll be saddled with maintenance.
Instead, I find myself questioning the value of the cloud (public or private) for learning and analysis on streaming data. There are just too many fundamental assumptions that are wrong:
- Substantial effort and investment are required before the process can deliver insights of any value.
- Learning in the cloud is batch-style: collect data, store, clean, model, train, deploy. This is time-consuming, and the insights gained are of value only if the system behavior does not change in the meantime. If the system behavior changes, the process must be repeated with new data.
- The idea that a general-purpose, centrally trained model can deliver value in specific use cases at the edge is untested, and I believe it will be shown to be wrong.
- Models learned this way use cleaned data and are naturally less rich than a model trained on the full “high def” data available at the edge.
Finally, this all presumes that users have the skills to go through the effort of learning on their data. But the majority of enterprises lack both data science and cloud skill sets.
Fortunately there is a better way, and that is to learn on data continuously, at the edge. More on that very soon.
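To make the batch-versus-continuous point concrete, here is a minimal sketch in Python. It is not SWIM's method, and everything in it is an assumption chosen for illustration: the synthetic stream, the regime shift from a mean of 10 to a mean of 20, and the hypothetical names `make_stream`, `EwmaModel` and `compare`. It pits a "model" trained once on stored data (a frozen mean) against one that updates on every sample (an exponentially weighted moving average), and measures how well each predicts the stream after the system's behavior has changed.

```python
import random
from statistics import fmean


def make_stream(n_per_regime=1000, seed=0):
    """Synthetic stream (an assumption): the mean shifts from 10 to 20 halfway."""
    rng = random.Random(seed)
    before = [rng.gauss(10.0, 1.0) for _ in range(n_per_regime)]
    after = [rng.gauss(20.0, 1.0) for _ in range(n_per_regime)]
    return before + after


class EwmaModel:
    """Exponentially weighted moving average, updated one sample at a time."""

    def __init__(self, alpha=0.05):
        self.alpha = alpha  # weight given to the newest sample
        self.mean = None    # current estimate; None until the first sample

    def update(self, x):
        if self.mean is None:
            self.mean = x
        else:
            self.mean += self.alpha * (x - self.mean)
        return self.mean


def compare(stream, n_train=1000, n_eval=500, alpha=0.05):
    """Return (batch_error, online_error) over the last n_eval samples.

    The batch model is the mean of the first n_train samples and is never
    refreshed. The online model predicts each sample before learning from it.
    """
    batch_mean = fmean(stream[:n_train])
    online = EwmaModel(alpha)
    batch_errs, online_errs = [], []
    eval_start = len(stream) - n_eval
    for i, x in enumerate(stream):
        if i >= eval_start:
            batch_errs.append(abs(x - batch_mean))
            online_errs.append(abs(x - online.mean))
        online.update(x)  # learn from the sample only after predicting it
    return fmean(batch_errs), fmean(online_errs)


if __name__ == "__main__":
    batch_err, online_err = compare(make_stream())
    print(f"frozen batch model error: {batch_err:.2f}")
    print(f"continuous model error:   {online_err:.2f}")
```

The frozen model keeps predicting the old regime until someone repeats the collect, clean and train cycle, while the continuous one tracks the change on its own. A real edge learner would be far richer than a moving average, but the trade-off has the same shape.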
Learn how SWIM uses edge intelligence to deliver real-time insights from the dark data generated by connected enterprise systems.