chapter four

4 LLM-driven data science

 

This chapter covers

  • Finding and exploring business data using AI
  • Avoiding the pitfalls of synthetic data in analytics
  • Building data notebooks with zero coding
  • Comparing direct prompting with open exploration using LLMs
  • Solving business problems using simplified machine learning
  • Understanding the trade-off between precision and recall

Let’s imagine that we need to answer a simple question: Which manufacturer dominates a given market? To answer this question, we first need to identify the datasets that provide the necessary information.

The internet is full of data sources, both synthetic and non-synthetic, but selecting the right one is challenging. Once we’ve met that challenge and have the right dataset, we need to clean, parse, and normalize the data. Only after that can we conduct a simple statistical analysis to answer our original question. Before LLMs, such a task required a substantial amount of time and a data analyst trained in the appropriate technology for the problem, most often Python and Jupyter Notebook. Fortunately, with the power of LLMs, anyone with basic programming knowledge (you don’t even need to know Python) can create an analytics notebook that answers virtually any question about a given set of data. You only need to focus on what is most important: the actual data and the business domain.
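
To make the clean, normalize, and analyze sequence concrete, here is a minimal sketch of what such a notebook cell might look like. It uses only the Python standard library, and everything in it is an assumption made for illustration: the column names (`manufacturer`, `market`, `units_sold`), the sample rows, and the helper `market_shares` are hypothetical and do not come from the dataset used later in this chapter.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data, invented purely for illustration.
RAW_CSV = """manufacturer,market,units_sold
 Acme Pharma ,painkillers,120
acme pharma,painkillers,80
Beta Labs,painkillers,150
Beta Labs,antibiotics,60
,painkillers,40
Gamma Health,painkillers,n/a
"""


def market_shares(raw_csv: str, market: str) -> dict[str, float]:
    """Return each manufacturer's share of units sold in one market."""
    totals = Counter()
    for row in csv.DictReader(io.StringIO(raw_csv)):
        # Cleaning: skip rows with a missing name or a non-numeric count.
        name = row["manufacturer"].strip()
        units = row["units_sold"].strip()
        if not name or not units.isdigit():
            continue
        if row["market"].strip().lower() != market:
            continue
        # Normalization: unify spacing and capitalization of names.
        totals[" ".join(name.split()).title()] += int(units)
    grand_total = sum(totals.values())
    return {name: units / grand_total for name, units in totals.items()}


shares = market_shares(RAW_CSV, "painkillers")
# The "dominant" manufacturer is simply the one with the largest share.
leader = max(shares, key=shares.get)
print(leader, round(shares[leader], 2))
```

Even in this toy version, most of the lines deal with cleaning and normalization rather than with the analysis itself, which is why the choice of dataset matters so much.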

4.1 Choosing the proper dataset

4.2 Set up the environment

4.3 Exploratory data analysis

4.3.1 Data analysis when we know the questions

4.4 Applying feature engineering

4.5 Predicting the price of drugs with machine learning techniques

4.5.1 Supervised learning for predicting a label

4.5.2 The train/test data split

4.5.3 The full predict-the-price ML pipeline

4.5.4 Vibe engineer the price prediction

4.5.5 Accessing the results

4.5.6 Improving the model

4.6 Building a digital guardian angel

4.6.1 Correlating long text with predicted class

4.6.2 Tokenization of medicine description

4.6.3 Deep learning based classification with vibe engineering

4.6.4 How to use the model?

4.6.5 Picking the proper threshold

4.6.6 Understanding the architecture

4.7 Summary