Much of the current discourse around artificial intelligence (AI) and machine learning (ML) is inherently model-centric, focusing on the latest advancements in ML and deep learning. This model-first approach often comes with, at best, little regard for and, at worst, total disregard of the data being used to train said models. Fields like MLOps are exploding with ways to systematically train and utilize ML models with as little human interference as possible to “free up” the engineer’s time.
Many prominent AI figures are urging data scientists to place more focus on a data-centric view of ML that focuses less on the model selection and hyperparameter-tuning process and more on techniques that enhance the data being ingested and used to train our models. Andrew Ng is on record saying that “machine learning is basically feature engineering” and that we need to be moving more toward a data-centric approach. Adopting a data-centric approach is especially useful when the following are true: