7 An end-to-end example using XGBoost

This chapter covers

  • Gathering and preparing data from the internet, using generative AI to help
  • Building a baseline and a first tentative model, then optimizing it
  • Inspecting the model and figuring out how it works

This chapter concludes our overview of classical machine learning for tabular data. To wrap things up, we’ll work through a complete example from the field of data journalism, summarizing along the way all the concepts and techniques we’ve used so far. We will also use a generative AI tool, ChatGPT, to help get the job done and to demonstrate a few use cases where a large language model (LLM) can improve your work with tabular data.

Finally, we will build a model to predict prices, this time using a regression-based approach. We will then examine how the model works and why it performs the way it does, gaining further insight into the pricing dynamics of Airbnb listings and challenging our initial hypothesis about how prices are set for short-term rentals.
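As a preview of where the chapter is heading, the sketch below shows the general shape of such a setup: an XGBoost regressor evaluated with cross-validation on a log-transformed price target. It is only a minimal sketch; the file name, feature columns, and hyperparameters are placeholders rather than the chapter's actual dataset or final configuration.

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from xgboost import XGBRegressor

# Hypothetical listings file and feature columns, for illustration only
listings = pd.read_csv("listings.csv")
X = listings[["accommodates", "bedrooms", "minimum_nights"]]
y = np.log1p(listings["price"])   # log-transform the skewed price target

# A baseline XGBoost regressor scored with 5-fold cross-validation
model = XGBRegressor(n_estimators=500, learning_rate=0.05, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="neg_root_mean_squared_error")
print(f"Cross-validated RMSE on log price: {-scores.mean():.3f}")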

7.1 Preparing and exploring your data

7.1.1 Using generative AI to help prepare data

7.1.2 Getting and preparing your data

7.1.3 Engineering more complex features

7.1.4 Finalizing your data

7.1.5 Exploring and fixing your data

7.1.6 Exploring your target

7.2 Building and optimizing your model

7.2.1 Preparing a cross-validation strategy

7.2.2 Preparing your pipeline

7.2.3 Building a baseline model

7.2.4 Building a first tentative model

7.2.5 Optimizing your model

7.2.6 Training the final model

7.3 Explaining your model with SHAP

Summary