Representation is one of the most complex and compelling tasks in machine learning and computer science in general. Pedro Domingos, a computer science professor at the University of Washington, published an article [Domingos, 2012] in which he decomposed machine learning into three main components: representation, evaluation, and optimization. Representation specifically affects three core aspects of a machine learning project’s life cycle:
- The formal language (or schema) in which a training dataset is expressed before it is passed as input to the learning process
- The way in which the result of the learning process—the predictive model—is stored
- How the training data and the predictive model are accessed during the prediction phase
All these aspects are influenced by the learning algorithm used to infer the generalization from the observed examples in the training dataset, and they affect overall performance: both forecast accuracy and the speed of training and prediction.
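To make the three aspects concrete, the following is a minimal sketch and not a method taken from this book. It assumes a toy item co-occurrence recommender, and every name in it (`purchases`, `index_by_user`, `train`, `recommend`) is hypothetical. The training dataset is expressed as (user, item) pairs, the model is stored as a nested dictionary of counts, and prediction reads both the training data and the model.

```python
from collections import defaultdict
from itertools import combinations

# Aspect 1: the schema of the training dataset (here, plain (user, item) pairs).
# The data is invented purely for illustration.
purchases = [
    ("alice", "book"), ("alice", "pen"),
    ("bob", "book"), ("bob", "pen"), ("bob", "lamp"),
    ("carol", "book"), ("carol", "lamp"),
]


def index_by_user(pairs):
    """Re-express the training data as user -> set of items for fast lookup."""
    by_user = defaultdict(set)
    for user, item in pairs:
        by_user[user].add(item)
    return by_user


def train(by_user):
    """Aspect 2: store the model as a nested dict of item co-occurrence counts."""
    model = defaultdict(lambda: defaultdict(int))
    for items in by_user.values():
        for a, b in combinations(sorted(items), 2):
            model[a][b] += 1
            model[b][a] += 1
    return model


def recommend(user, by_user, model):
    """Aspect 3: prediction accesses both the training data and the model."""
    owned = by_user.get(user, set())
    scores = defaultdict(int)
    for item in owned:
        for other, count in model.get(item, {}).items():
            if other not in owned:
                scores[other] += count
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda i: (-scores[i], i))


by_user = index_by_user(purchases)
model = train(by_user)
print(recommend("alice", by_user, model))
```

Even in this small case the representation choices matter: indexing the training data by user and the model by item turns each prediction into a handful of dictionary lookups rather than a scan over all purchases.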
Starting with the second part, this book focuses on data modeling: the formal structures used to represent the training dataset and the inferred model (the result of the learning process) so that a computer program (the learning agent) can process and access them to provide forecasting or analysis to end users. Hence, two different models are taken into account: