At the highest, simplest level, a machine learning (ML) model is an algorithm that ingests data and spits out insights, predictions or recommendations. It has two important phases: first you train the model on existing data (training), then you let the trained model run on new data (inference).
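To make the two phases concrete, here is a minimal sketch in plain Python. It assumes the simplest possible "model", a straight line fitted by least squares, and the cookie numbers are made up purely for illustration; a real project would more likely lean on an ML library.

```python
from statistics import mean

def train(xs, ys):
    """Training phase: fit y = slope * x + intercept by least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_bar - slope * x_bar
    return slope, intercept

def predict(model, x):
    """Inference phase: apply the already-trained model to a new input."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical training data: cookies baked vs. cookies eaten.
cookies_baked = [1, 2, 3, 4]
cookies_eaten = [2, 4, 6, 8]

model = train(cookies_baked, cookies_eaten)   # training happens once
prediction = predict(model, 5)                # inference happens many times
```

The point of the split is that `train` is the expensive, data-hungry step, while `predict` only reuses what training already learned.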
Recently, I was trying to explain how important data is to machine learning success and eventually came up with five ways that your machine learning model is like the big, blue, always-hungry Cookie Monster who has appeared on Sesame Street for over 50 years.
1) Data consumption
Cookie Monster is a voracious consumer of cookies; the more cookies he gets the happier he is. In fact, once he realizes he’s getting tasty cookies, he starts consuming them faster than anyone could imagine.
Your machine learning model is like Cookie as it becomes a voracious consumer of data in the training phase. The more data it gets, the happier it is.
2) Data availability
Just as Cookie Monster needs lots of cookies on a regular schedule (like constantly!) to satisfy his hunger, your model needs lots of data. As a rule, the more relevant data it gets, the better the output, and a continuous stream of new data is even better.
3) Data accessibility
In the same way Cookie Monster wonders how easy it is to get access to cookies, your data scientist wants to know how easy it is to get access to the data the system needs. Having a well-staffed, efficient cookie-baking process means a steady supply of cookies; the last thing you want is Cookie rummaging around in the kitchen trying to bake his own cookies.
The same applies to data. When data isn’t easily accessible, the process of searching for it and integrating it manually takes up staff time, slows the path to machine learning success and leads to higher costs.
4) Data quality
Just as Cookie Monster isn’t keen on burnt cookies, in the training phase your model depends on clean data. However, unlike Cookie, during the training phase your model isn’t very good at avoiding burnt cookies (in other words, sourcing clean data) on its own.
Having expensive data scientists spend much of their time (up to 80 percent, by some often-cited estimates) cleaning data instead of modeling adds both time and cost to your ROI equation. It’s interesting to note that once Cookie realizes cookies are good, and is consuming them at speed, he doesn’t seem to mind that there are a few burnt ones in the batch; similarly, once a well-trained model reaches the inference phase, it can usually tolerate a small amount of noisy data.
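As a rough illustration of what "picking out the burnt cookies" looks like before training, here is a small sketch. The record layout, the field names and the `clean` helper are all hypothetical, and real cleaning usually involves far more than checking for missing values.

```python
def clean(records, required_fields):
    """Split records into usable ones and ones with missing required values."""
    kept, dropped = [], []
    for record in records:
        # A record is usable only if every required field is present and not None.
        if all(record.get(field) is not None for field in required_fields):
            kept.append(record)
        else:
            dropped.append(record)
    return kept, dropped

# Hypothetical raw batch with two "burnt" records.
raw_batch = [
    {"flavor": "chocolate chip", "diameter_cm": 7.5},
    {"flavor": None, "diameter_cm": 8.0},   # missing value
    {"flavor": "oatmeal"},                  # missing field
]

clean_batch, burnt = clean(raw_batch, ["flavor", "diameter_cm"])
```

Keeping the dropped records around, rather than silently discarding them, makes it easier to see how much of each batch is unusable and to fix the problem at the source.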
5) Data consistency
And lastly, strangely enough, Cookie Monster doesn’t like it when you change from chocolate chips to caramel chips in his cookies. Consistency in what you feed him is important. Likewise, your model doesn’t like data inputs that change their schemas all the time.
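One way to catch a switch from chocolate chips to caramel before it reaches the model is a simple schema check on incoming records. The sketch below is only an assumption about how that might look: `EXPECTED_SCHEMA`, its field names and the `schema_problems` helper are invented for illustration.

```python
# Hypothetical schema: field name -> expected Python type.
EXPECTED_SCHEMA = {"flavor": str, "chip_count": int}

def schema_problems(record, schema=EXPECTED_SCHEMA):
    """Return a list of ways the record deviates from the expected schema."""
    problems = []
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            problems.append(f"wrong type for {field}: {actual}")
    for field in record:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems

ok = schema_problems({"flavor": "chocolate chip", "chip_count": 12})
drifted = schema_problems({"flavour": "caramel", "chip_count": "12"})
```

Here `ok` comes back empty, while `drifted` reports the renamed field and the count that arrived as text, which is exactly the kind of quiet change that upsets a model.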
Accessible, clean, integrated, consistent data in large quantities is the key to data scientist productivity. Poor-quality data has a huge negative impact on ML project ROI, second only to picking the wrong use case.
What does this story boil down to? Always make sure you have lots of cookies on hand. Wait, no — if you’re going to go down the ML path, invest in your data estate.