1 Building with GenAI

MEAP v1

In this chapter

  • You will learn what’s needed to develop useful GenAI applications.
  • You will get an intuition of how Large Language Models work.

GenAI is a strange animal, truly one of a kind. While it feels natural to use a chatbot these days, you only realize how strange a GenAI (short for Generative AI) application can be when you try to build one.

How do AI systems communicate in fluent, human-like language? Does ChatGPT truly think like we do, or is something else at play? In this book, we’ll explore what makes GenAI so unique, mysterious, and fascinating.

Our goal is to help you understand how GenAI applications work. We’ll consider what tasks they can handle well and where you should steer clear. Along the way, we’ll roll up our sleeves and build a few applications of our own using a low-code tool called LangFlow. To begin our journey, let’s take a look at the fundamental characteristics that set GenAI applications apart and make our ride worth the effort.

The uniqueness of GenAI programming

Let’s consider the key differences between programming a traditional application and a GenAI one. Both aim to define the behavior of an application that does something helpful for the user, generating a useful output given an input. But they get there in very different ways.

In traditional programming, you instruct a machine to perform certain steps in a specific order, forming a logical workflow that transforms your input into the desired output. These steps can be described through code because the application’s behavior is mainly deterministic. The codification happens by means of a formal language, such as Python or C, whose specific syntax and semantics the programmer must follow.

In GenAI programming, you have an additional element in the picture: the Large Language Model, or LLM. At some point in your workflow you will pass through this “magic black box” that takes text as input and generates text as output. As you design the workflow, you will have to take care of crafting the right text to feed the LLM as input. That text is not written in a programming language but in a natural language, such as English or Italian, the kind humans naturally use to communicate. And because the LLM is a black box, no matter how carefully you craft your prompt, you cannot be 100% sure of the exact output you will get in return. The magic black box brings inherent randomness into the GenAI application’s behavior.

GenAI apps have an LLM (Magic Black Box) somewhere.
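To make the contrast concrete, here is a minimal sketch in Python. The traditional routine is fully deterministic, while the GenAI routine hands a natural-language instruction to the black box and accepts whatever comes back. The call_llm helper uses the OpenAI Python SDK purely as an example; the model name is an arbitrary choice, and any provider would do.

from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable is set

# Traditional programming: the output is fully determined by the code.
def shout(text: str) -> str:
    return text.upper() + "!"

# GenAI programming: we craft a natural-language prompt and delegate the work
# to the LLM. The same input may produce different outputs on different runs.
def call_llm(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name, swap in whatever you use
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def shout_enthusiastically(text: str) -> str:
    prompt = f"Rewrite the following sentence so that it sounds enthusiastic:\n{text}"
    return call_llm(prompt)

Later sketches in this chapter reuse this call_llm helper wherever a call to the magic black box is needed.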

What a strange animal, indeed. If we want to “tame the beast,” we must learn how to effectively deal with the randomness of non-deterministic behaviors and the subtleties of natural languages.

Of course, one could say that traditional programming might also include some of these elements of strangeness. For example, in AI applications (not specifically of a generative nature), we also must come to terms with the wild behavior of probabilistic models. In GenAI, however, we always have to deal with both elements at once: probabilistic behavior and natural-language instructions. So, we’d better get used to the idea that we need to get very good at managing them effectively, as we will learn to do throughout this book.

Dealing with LLMs

These elements are unavoidable in GenAI because of the component that turns a generic application into a GenAI application: the LLM. We will take a closer look at how LLMs work later in this chapter. For now, it is enough to imagine an LLM as a seemingly magic black box that takes some text as input and generates some other text as output.

The magic box. Gets text as input and generates text as output.

This box has certain features that affect the way the output text is generated.

  1. The behavior of an LLM is inherently non-deterministic. You cannot be sure what you’ll get in return for any given input because LLMs work with probabilities, not with set logical rules.
  2. LLMs are not explainable. You don’t know what’s going on inside them, and you will never know why you got a specific output. In the end, it is a black box; what else did you expect?
  3. LLMs are stateless. Every time they are called (and produce an output), they go back to their original state as if nothing happened. They have no memory and start from a clean slate each time.
  4. LLMs are pre-trained. They have fixed knowledge hardwired into their “digital brains” that cannot evolve or change as they are used. The way an LLM works is fixed once and for all during its learning phase: from then on, it will not be able to absorb any new knowledge (unless you build a different LLM).

Now, let’s pause for a second. If we compare our daily experience as GenAI users with what we just said above, something doesn’t add up. It looks as if the LLM limitations we just listed, being stateless, pre-trained, and so on, are not present in real-world applications.

For example, the iconic ChatGPT, released by OpenAI in 2022, does not seem to be affected by such restrictions, although it is clearly based on LLMs. ChatGPT appears to have some memory: it remembers what we said in the earlier part of a conversation and builds upon it. It doesn’t look quite stateless, really. The OpenAI engineers working on ChatGPT clearly managed to “tame the beast” and overcome the memoryless limitation.

Also, GenAI applications used in the workplace, like Microsoft Copilot, released in 2023, seem to learn from your documents and answer questions about specific knowledge that wasn’t available when the underlying LLM was built. In other words, their understanding of the world doesn’t seem limited to what they were once pre-trained on; it seems to be expandable.

Building a GenAI application

The reality is that when you build GenAI applications, you have to find ways to overcome the inherent barriers of LLMs. Engineers at OpenAI and Microsoft clearly managed to do so with ChatGPT and Copilot. As we become GenAI application designers ourselves, it is our duty to do so, too. Let’s get a feel for what we need to do through some examples.

Overcoming the “Stateless” Barrier

LLMs have no built-in memory; they start from a blank slate each time you ask a question. If you want the application to appear as though it “remembers” earlier parts of a conversation, you must supply that memory. In other words, your app must keep track of previous user inputs and LLM outputs and then package them together each time you make a new request.

Let’s see how this works through an example. Suppose you open a chat and introduce yourself by name; the model greets you accordingly.

Now, if you open a brand-new chat (with all memory features disabled) and simply ask what your name is, the model will have no idea.

Notice that the second question was asked in isolation, without referencing the initial introduction. The LLM had no memory of the previous conversation because, under the hood, a new call to the LLM was made from a clean slate.

LLM relationships: every chat is a first date.

When ChatGPT appears to “remember” your name, it’s actually the ChatGPT application re-sending your entire prior conversation each time to the LLM. As a GenAI application designer, you’ll do something similar: your system will capture and forward relevant snippets to preserve context between interactions.

For example, imagine that the application silently prepends a block labeled Previous conversation history, containing the earlier exchange, before forwarding your new question to the LLM.

By appending the Previous conversation history block to our input, the LLM could answer correctly. This is what needs to happen in GenAI applications: we need to manage the memory so that each call is self-sufficient for the LLM to answer.
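Here is a minimal sketch of that memory management, reusing the call_llm helper from the earlier sketch. The plain-text history format is just one of many possible choices:

# A minimal memory manager: the history is stored by the application, not by the LLM.
history: list[str] = []

def chat(user_message: str) -> str:
    # Package the previous exchanges together with the new question, so that
    # this single call is self-sufficient for the LLM to answer.
    prompt = (
        "Previous conversation history:\n" + "\n".join(history)
        + f"\n\nUser: {user_message}\nAssistant:"
    )
    answer = call_llm(prompt)
    # Record both sides of the exchange for the next call.
    history.append(f"User: {user_message}")
    history.append(f"Assistant: {answer}")
    return answer

chat("Hi, my name is Laura.")
print(chat("What is my name?"))  # the name travels inside the forwarded history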

Providing Fresh Knowledge in Prompts

LLMs are pre-trained on a fixed dataset and don’t automatically know about new or private information. Suppose you need answers based on up-to-date company policies or newly released product details—information that certainly wasn’t in the LLM’s original training data. The only way to incorporate these details is by including them in your prompt each time they’re needed.

Let’s have a look at an example: suppose a user asks the LLM about the company’s office hours.

Nice try, LLM! The answer we got was generic, since the LLM didn’t have the specific knowledge required to answer. But let’s see what happens when we pack our input text with some knowledge by attaching a Relevant knowledge block to the prompt before the question.

Here, the LLM uses the extra Relevant knowledge text we provided to craft a relevant response. Without that extra snippet in the prompt, the LLM would have no idea about your specific office hours. Notice that what we provided had even more information than what the user needed in this particular case (Relevant knowledge also included the address and the telephone number): fortunately, the LLM is smart enough to recognize which part of the provided context is helpful and act accordingly.

Still, you might ask yourself: how can the application equip the input for the LLM with the proper knowledge? We will learn what techniques are available for doing so in the following chapters. For now, consider two simplistic but fair options. One is to always attach the full knowledge to each query: if the knowledge base is not very large, you can afford to do so in every call to the LLM. Another option would be to run a simple, deterministic search query on a document repository, find all the sentences that include relevant keywords (like “office” in this case), and attach only those, as sketched below. We’ll come back to this; for now, you get the gist.
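Here is a rough sketch of that second option: a keyword-based retrieval step followed by prompt assembly, again reusing the hypothetical call_llm helper. The documents and the filtering rule are made up for illustration:

# A deliberately simple knowledge repository, standing in for your own documents.
documents = [
    "Our office hours are 9:00-17:00, Monday to Friday.",
    "Shipping is free for orders above 50 EUR.",
    "Returns are accepted within 30 days of purchase.",
]

def retrieve(question: str) -> list[str]:
    # Keep only sentences that share a meaningful keyword with the question
    # (very short words are ignored to skip "the", "are", and so on).
    keywords = {w.lower().strip("?.,") for w in question.split() if len(w) > 3}
    return [doc for doc in documents if keywords & set(doc.lower().split())]

def answer_with_knowledge(question: str) -> str:
    relevant = "\n".join(retrieve(question))
    prompt = (
        f"Relevant knowledge:\n{relevant}\n\n"
        f"Using the knowledge above, answer this question:\n{question}"
    )
    return call_llm(prompt)

print(answer_with_knowledge("What are your office hours?"))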

Why you need this book to build real-world applications

These two basic examples show how building a GenAI application requires more than just forwarding a user message to our magic black box, the LLM. For instance, you need to:

  • Manage Conversation State: Store and forward past user messages and model outputs so that subsequent requests have context, as we saw in the first case. We will learn how to do this later in our journey.
  • Leverage the Right Prior Knowledge: Pull in the right snippets—company data, user-specific info, or any external resources—to keep the model’s responses accurate and updated, as we’ve done in the second case. We will have to learn how to ingest the right knowledge and store it properly so the right piece of info is retrieved. Easier said than done—we’ll need several chapters to perfect this ability.

But there’s more to building real-world applications than these two skills. For example, you’ll also need to:

  • Trigger Actions: We need to learn how to handle any tasks the LLM suggests, such as sending an email or retrieving an extra piece of data, by integrating our application with external tools (for the programmers among you, we need to deal with APIs).
  • Customize the Behavior: It’s crucial to ensure that the LLM’s responses align with the desired tone, style, and objectives of your application. This is where the art and science of writing a good prompt (we call it prompt engineering) comes in, helping you fine-tune and control the outputs effectively. Chapter 3 will be devoted to this.
  • Design Agentic Structures: One of the most exciting potentials of LLMs is creating systems with multiple specialized agents that collaborate. This involves assigning roles, responsibilities, and organizational structures—akin to how an HR manager designs an org chart. It’s both challenging and fun, and we’ll explore how to do this efficiently in the last chapters of this book.

The GenAI solutions you see, like ChatGPT, Microsoft Copilot, and others, do some of this behind the scenes. As you progress through this book, you’ll learn how to replicate these strategies, tailoring them to your specific domain or industry. By the end, you’ll be fully equipped to tame the beast of GenAI yourself and design a truly helpful, context-rich, and interactive user experience.

Inside the LLM

The source of all the strangeness of GenAI programming is the magic box, the LLM. Given its centrality, it makes sense to understand what happens under the hood and how the LLM can generate coherent text.

Just as we don't need to know how an engine works to drive a car, we don't need to delve into every intricate detail of an LLM to build generative AI applications. Still, in the following pages you will learn, at an intuitive level, how large language models work. We will break down complex concepts into simple, digestible explanations.

Learning how to speak

First, we must clarify that an LLM is built by applying several Machine Learning algorithms in series. Each of these algorithms plays a distinct role in processing and generating language. For example, let's consider ChatGPT, the pioneer of GenAI applications that enchanted the world. ChatGPT is the result of multiple stages of machine learning training, each refining its ability to hold meaningful conversations with human users.

We can understand the learning process behind a tool like ChatGPT by considering three machine learning stages, as illustrated below. Intuitively, we can say that the first component of ChatGPT is GPT (which stands for Generative Pre-trained Transformer), an unsupervised model capable of completing sentences by identifying the best-fitting next word. Additional fine-tuning steps then enabled the model to sustain conversations with humans. These refinements were achieved through supervised and reinforcement learning, in which the output of the initial generative model is adjusted to enhance its conversational capabilities, making it ideal for use as a chatbot.

The learning stages of ChatGPT.

This is what happens across the three learning stages:

  1. The first stage involves GPT learning how to "speak" human language by reading a massive library of documents. This is done through unsupervised learning, where the model learns grammar rules and what makes a word the best completion without direct supervision. Essentially, the machine learning algorithms find their own rules for creating coherent sentences and predicting the best next word.
  2. In the following phase, the model undergoes supervised refinement. Here, it is fine-tuned to have meaningful dialogues by providing it with examples of human conversations. This supervised learning process helps the model operate in a question-and-answer (Q&A) fashion. The model, originally able to generically complete sentences, is now trained to find completions that result in coherent and relevant dialogues, enhancing its ability to engage in effective conversations.
  3. The final step involves human feedback. Human reviewers provide feedback on the model’s responses, indicating which answers are effective and which are not. This feedback loop helps the model improve through trial and error, becoming progressively better at having human-like conversations. Additionally, when users interact with chatbots, they often provide feedback on the responses, further refining the model to produce better answers over time.

This combination of unsupervised learning, supervised refinement, and reinforcement through human feedback enables ChatGPT to understand and generate human language effectively.

Where everything starts: sentence completions

Let's dive deep into the first component of ChatGPT, called GPT. This model excels at generating coherent and contextually appropriate continuations of a given text. As mentioned, it can effectively predict and complete sentences by anticipating the “next best word.”

You can extend this output beyond a single word to complete entire phrases and paragraphs: if you run the completion step repeatedly, you get one additional word each time. Such a model is called autoregressive because it is applied over and over, using the previous iteration's output as part of the new input, building up complete sentences, paragraphs, poems, songs, or entire reports.
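A minimal sketch of this autoregressive loop follows. The next_token function is a stand-in that returns canned output, so the focus stays on the loop itself:

# Autoregressive generation: a next-token step is applied repeatedly, and each
# output token is appended to the prompt used for the following step.
def next_token(prompt: str) -> str:
    """Stand-in for a real model: returns canned output just to demo the loop."""
    canned = {"The capital of Italy is": " Rome", "The capital of Italy is Rome": "."}
    return canned.get(prompt, "")

def generate(prompt: str, max_tokens: int = 10) -> str:
    for _ in range(max_tokens):
        token = next_token(prompt)
        if not token:       # stop when the model has nothing more to add
            break
        prompt += token     # the completion becomes part of the new prompt
    return prompt

print(generate("The capital of Italy is"))  # -> "The capital of Italy is Rome."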

Let's build an intuition for how GPT works by acquainting ourselves with its way of representing text. In the world of GPT, phrases are "tokenized," which means they are split into units called tokens. We can think of a token as a word, a part of a word (some more complex English words are represented as a composition of multiple tokens), or a punctuation mark. These tokens are then translated into numbers through a process called embedding.

Let’s build a simplified example and imagine a basic GPT that uses a dictionary of just eight words: "THE, CAPITAL, OF, ITALY, IS, PARIS, ROME, LONDON". Each word in the dictionary is associated with three numbers (its embeddings) and can be visualized in a three-dimensional space, as the figure below displays. These tuples of numbers, the embeddings, will be the atomic units used for computations.

Words are numbers in the eyes of an LLM.
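As a concrete illustration, here is how such a toy dictionary could be written down in Python. The eight tokens match the example above, but the specific numbers are made up (and they are reused in the next sketch):

# A toy embedding dictionary: each token maps to three made-up numbers.
embeddings = {
    "THE":     (0.2, 0.1, 0.3),
    "CAPITAL": (0.7, 0.5, 0.2),
    "OF":      (0.1, 0.2, 0.1),
    "ITALY":   (0.8, 0.3, 0.9),
    "IS":      (0.3, 0.1, 0.2),
    "PARIS":   (0.9, 0.6, 0.1),
    "ROME":    (0.8, 0.4, 0.8),
    "LONDON":  (0.9, 0.5, 0.2),
}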

Let's move to the next step in achieving our goal: completing a sentence. For example, imagine we want to complete the sentence "The capital of Italy is." In large language models, the sentence to be completed is called a prompt. So, the prompt is the sentence that needs completion.

What does the GPT algorithm do? First, it needs to build a context starting from the prompt, using some calculation based on the embedding coordinates. For instance, we can imagine that by calculating the arithmetic mean, coordinate by coordinate, of the words constituting the prompt, we obtain a new tuple of numbers that we will call the context.

As shown here, the x-coordinate of the context is computed by averaging out the x-coordinates of the words composing the prompt. This is, of course, a simplification of the actual calculation process happening within GPT, but let’s use it for now to help us build an intuition about how this seemingly magical process works.

Given a prompt, you can calculate the context.

Now you have a tuple representing the context of the sentence to complete and some candidate words in the dictionary that can become its completion. For simplicity, let’s imagine that only “Paris,” “Rome,” and “London” are our candidate words (typically, every word in the dictionary is considered as a possible completion). By combining the embeddings (representing the candidate words) with the context (representing the prompt to be completed), you can find out which word needs to be added as a completion.

See the example below: the coordinates of the candidate words are combined with the coordinates of the context using a combination—technically, it would be called an inner product—of the two sets of numbers representing each candidate word and the context. The word obtaining the highest result wins. In this case, “The capital of Italy is” obtains “Rome” as its best completion, as the resulting probability for “Rome”, 0.47, is higher than what the other candidate words get. Well done to our simplified GPT for having guessed the right completion!

Guess the next best word by combining embeddings with context.
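Here is a minimal sketch of the whole simplified procedure, using the made-up embeddings from the previous sketch: average the prompt embeddings to get the context, score each candidate with an inner product, and turn the scores into probabilities. Because the numbers are illustrative, the probabilities will not match the figure exactly, but Rome still wins:

import math

# Reusing the toy embeddings from the previous sketch.
embeddings = {
    "THE":     (0.2, 0.1, 0.3), "CAPITAL": (0.7, 0.5, 0.2),
    "OF":      (0.1, 0.2, 0.1), "ITALY":   (0.8, 0.3, 0.9),
    "IS":      (0.3, 0.1, 0.2), "PARIS":   (0.9, 0.6, 0.1),
    "ROME":    (0.8, 0.4, 0.8), "LONDON":  (0.9, 0.5, 0.2),
}

prompt = ["THE", "CAPITAL", "OF", "ITALY", "IS"]
candidates = ["PARIS", "ROME", "LONDON"]

# 1. Context: coordinate-by-coordinate average of the prompt embeddings.
context = [sum(embeddings[w][i] for w in prompt) / len(prompt) for i in range(3)]

# 2. Score each candidate with an inner product against the context.
scores = {w: sum(c * e for c, e in zip(context, embeddings[w])) for w in candidates}

# 3. Turn scores into probabilities (softmax) and pick the winner.
total = sum(math.exp(s) for s in scores.values())
probabilities = {w: math.exp(s) / total for w, s in scores.items()}
print(max(probabilities, key=probabilities.get), probabilities)  # ROME wins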

Let’s summarize step by step how GPT was able to find out the best completion of a sentence:

  • The text is split into atomic units called tokens. A token corresponds to a word or a part of it. The possible tokens are included in a dictionary.
  • Every token in the dictionary is represented with an embedding, which is a set of numbers. In our case, we had a dictionary of 8 tokens, each described with embeddings corresponding to a set of 3 decimal numbers.
  • The phrase to complete is called a prompt. By combining the embeddings of the words included in the prompt, you obtained a set of numbers called a context.
  • By combining the context with every token in the dictionary you can calculate the probability that each token completes the prompt. The token with the highest likelihood is chosen by GPT as the output and becomes the completion of the prompt.
  • You can continue generating longer completions by repeating the steps above: this time, the previous completion is added as part of the new prompt. You obtain an updated context, calculate the next completion, and keep going.

This process of breaking down phrases, converting them to numerical representations, and iteratively predicting the next token forms the backbone of how LLMs generate text.

Understanding GPT-3: how a real LLM works

Now, this looks too simple to be true. And, indeed, it is.

The simplified process we just saw employed a handful of elementary calculations that could, in theory, be done with pen and paper. The linear combinations use only multiplications and sums, and a dictionary of only eight tokens, each embedded as three numbers, gives us just 24 numbers to start from: it’s too simple to be true.

In fact, the reality of modern LLMs, and specifically of GPT-3, the model behind the 2022 version of ChatGPT, is much more complex. Let's delve into the complex reality of GPT-3 by examining three key aspects of the simplified methodology we discussed earlier: the dictionary of embeddings, the calculation of the context, and the calculation of the next best word.

You’ll encounter many new concepts and technical terms in the next few pages. But don’t worry—you don’t need to become an expert on every detail of modern LLM architectures. Think of this as a window into what’s happening under the hood: it’s here to give you a taste of how these advanced tools actually function. If you find some parts too dense, feel free to skim or even skip ahead to the final section, “Keeping the Knowledge Up to Date,” without missing the bigger picture.

Embeddings

In our simplified methodology, we used a dictionary with three numbers per word and an 8-word vocabulary. This minimal setup is indeed straightforward and easy to grasp. However, GPT-3 operates on an entirely different scale.

  • Simplified Example: 3 numbers per token, 8-token dictionary.
  • GPT-3: 12,288 numbers per token, 50,000-token dictionary.

To put this into a visual perspective, while a three-dimensional chart could easily represent our simplified dictionary, it would be impossible to visualize GPT-3's embeddings in the same way: you would need more than 12,000 dimensions to describe a single word. Consider that, in the space of embeddings, words with similar meanings sit close to each other, while unrelated words are far apart. By having so many dimensions in which to place words, the embedding mechanism of GPT-3 captures intricate nuances of meaning and context far beyond the capabilities of simpler models.
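To get a feel for what “close to each other” means, here is a sketch that compares made-up embedding vectors using cosine similarity, a measure commonly used for this purpose (the words, vectors, and tiny dimensionality are all illustrative; real GPT-3 embeddings have 12,288 dimensions):

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical low-dimensional embeddings, just for illustration.
dog = [0.8, 0.1, 0.6]
puppy = [0.7, 0.2, 0.7]
invoice = [0.1, 0.9, 0.0]

print(cosine_similarity(dog, puppy))    # high: related concepts sit close together
print(cosine_similarity(dog, invoice))  # low: unrelated concepts sit far apart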

Context Calculation

In the simplified example, the context of a prompt is calculated as the simple average of the words. This straightforward approach works well for basic illustrations but falls short in real-world applications.

  • Simplified Example: The context of a prompt is calculated as the simple average of the word embeddings. The relative position of each word in the sentence has no influence.
  • GPT-3: Context is the result of applying prompt tokens to deep neural networks with 96 layers and 175 billion parameters. The position of each word in the prompt largely affects the calculation of the context.

A piece of background here for newcomers to deep learning: neural networks are complex systems inspired by the human brain, consisting of layers of interconnected nodes called neurons. Each neuron takes inputs from the neurons in the preceding layer, combines them with its own set of parameters, and relays its output to the neurons of the following layer. Layers are stacked one on top of the other, forming a deep hierarchical architecture (this is why the subfield of Machine Learning that develops neural networks is called deep learning). The context calculation in GPT-3 leverages these deep learning techniques, taking into account not only the words but also their positions within the text. This positional awareness is achieved through an architecture known as the transformer.
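As a rough sketch of what “combines them with its own set of parameters” means, here is a single layer of neurons implemented with NumPy. The sizes and random weights are arbitrary, just as a real network’s weights are before training:

import numpy as np

rng = np.random.default_rng(0)

# One layer: 3 inputs feeding 4 neurons. Each neuron has its own weights and bias.
weights = rng.normal(size=(3, 4))
biases = rng.normal(size=4)

def layer(inputs: np.ndarray) -> np.ndarray:
    # Every neuron combines all inputs with its parameters, then applies a
    # non-linearity (ReLU here) before relaying the result to the next layer.
    return np.maximum(0, inputs @ weights + biases)

print(layer(np.array([0.42, 0.24, 0.34])))  # 4 outputs, one per neuron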

For completeness, we should say that the number of tokens you can use in a single prompt is limited by the number of token positions the neural network architecture can process at once. In the case of GPT-3, you could use a maximum of 2,048 tokens in your prompt: this number is called the context window size and limits the number of “pages” you can feed to the LLM at once. For an “old” model such as GPT-3, that meant roughly three pages of text, although modern models’ context windows at the time of writing can host entire books, and such a limitation might disappear entirely over time.
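In practice, you can check whether a prompt fits inside a given context window by counting its tokens. Here is a sketch using the tiktoken library; treating r50k_base as the GPT-3-era encoding is an assumption on my part:

import tiktoken

encoding = tiktoken.get_encoding("r50k_base")  # assumed GPT-3-era tokenizer

prompt = "The capital of Italy is"
tokens = encoding.encode(prompt)

print(len(tokens))          # number of tokens this prompt consumes
print(len(tokens) <= 2048)  # does it fit in a 2,048-token context window?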

You can clearly see that sending the ordered sequence of prompt tokens through deep neural networks with billions of parameters is a far more complicated way to obtain the context than taking a simple average of the same tokens. This is what it takes to build an LLM that seems confident about whatever subject it is talking about.

Combination Process

The final step in our simplified process involves combining the context with candidate words to find the next best word. In the simplified example, this combination is just a dot product.

  • Simplified Example: The combination is just a dot product.
  • GPT-3: The combination is performed by another neural network with 2 layers and 1.2 billion parameters.

This additional neural network layer adds another level of depth and sophistication. It ensures that the selected word not only fits the immediate context but also aligns with the broader narrative the model is constructing. This layered approach allows GPT-3 to generate coherent and contextually appropriate continuations, from sentences to entire paragraphs.

In summary, the figure below shows how a GPT architecture can be broken down into three main blocks. First, incoming words or tokens are converted into embeddings—large sets of numbers that capture their meaning. Next, these embeddings pass through deep neural networks (the Context Calculation and the Combination steps), which use attention and positional information to produce a probability distribution over all possible next tokens. Finally, the winning token—the one with the highest probability—is selected and added to the prompt. That expanded prompt then re-enters the cycle, generating new embeddings, an updated probability distribution, and so on.

How a GPT architecture generates sentences.

Reality proves to be way more complex than our naïve example. GPT-3 uses a matrix of around 12,000 rows and 50,000 columns for its dictionary: that is more than half a billion numbers, or over 1 GB of data when stored in a single file, and this is still the simplest part of the process, the conversion of text into numbers through a static dictionary. The subsequent neural networks come packed with parameters: we are talking about 175 billion numbers this time. In short, if you stored all the numbers GPT-3 needs to complete sentences, you would need around 350 GB of disk space. You would also need solid computing power (your laptop wouldn’t make it, I’m afraid) to combine all these billions of numbers every time a single new word needs to appear on the screen as the completion of your prompt.

Pre-training the model

We need to realize that there are two separate stages in the lifecycle of an LLM like GPT-3: the first is model training, and the second is inference. During model training, you calculate the set of parameters that best predicts word completions. During inference, you apply those parameters to a given prompt and obtain the actual completion.

In unveiling step by step what happens under the hood of GPT-3, we have been moving backward, starting from the second stage. Across the steps we have seen so far (translating words into numbers, calculating the context, and combining it with the candidate words), a given set of parameters is applied to an input (the prompt) by means of some complex computations. But where do the parameters used during inference come from? In other words, how do we obtain the precise set of numbers that produces “Rome” as the completion of our sentence on the capital of Italy?

This is what happens during the first stage of an LLM lifecycle, the model training. The ‘P’ of GPT stands for pre-trained because the parameters able to make such a marvel happen during inference are found once and for all during the preliminary training process. Let’s see how this stage works.

To start, we should build an intuition of how neural network models are trained. Neural networks are built through a process that starts with random parameters, which are numbers that influence how the network makes decisions. Think of it like a student starting with no knowledge and guessing answers. Initially, these guesses are often wrong, but with each new piece of information (or training example), the student adjusts their guesses to get closer to the correct answer.

This adjustment happens through a series of iterative learning passes, often described as "backward" learning or backpropagation. Here's how it works: the neural network makes a prediction based on its current parameters. If the prediction is wrong, the network goes back and tweaks the parameters slightly to improve future predictions. This process repeats many times, with the network constantly adjusting its parameters to reduce errors.
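The following toy example mimics that loop on the smallest possible scale: a single parameter, a handful of examples, and repeated small corrections that shrink the prediction error. Real backpropagation does conceptually the same thing for billions of parameters at once:

# Toy training loop: learn the single parameter w in y = w * x from examples.
examples = [(1, 2), (2, 4), (3, 6)]    # the "right answer" is w = 2

w = 0.5                                # start from a rough initial guess
learning_rate = 0.05

for step in range(100):
    for x, y in examples:
        prediction = w * x
        error = prediction - y          # how wrong the current guess is
        w -= learning_rate * error * x  # nudge the parameter to reduce the error

print(round(w, 3))  # close to 2.0 after many small corrections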

Training requires a lot of human-intelligible text. Each time the network sees a new piece of text, it learns to better predict and complete sentences. To train effectively, the text must cover diverse subjects, from different disciplines. This helps the network learn to handle a wide variety of contexts and topics.

For GPT-3, the OpenAI team used a massive collection of text: a gargantuan 45 TB of textual archive. This learning dataset included text from various sources, such as websites (pages “scraped” from blogs, forums, news portals, and so on), books, the whole of English Wikipedia, and academic articles, providing a rich and varied dataset for training. By processing this extensive and diverse text, the neural network parameters gradually converged to values that enable accurate and contextually appropriate text completions.

Now we have the full picture of how GPT-3 works. As shown here, the process begins with a massive collection of training text, which the model uses to learn and develop its parameters. This model training phase equips GPT-3 with the ability to understand and generate human-like text. Then, when a prompt is given to the trained model during inference, it processes the input using the learned parameters and produces a generated completion.

The two stages of GPT-3. First, it gets trained, and then the sentence completion is inferred.

Keeping the knowledge up to date

Now that we have the full picture of how a real LLM works, we can better understand the inherent limitation related to its static knowledge and envision how to solve it. It is important to distinguish between what is static and what is dynamic in an LLM. The training of the model happens once and for all, which is why GPT is termed "pre-trained." This static training means that the model's knowledge is fixed up to a specific date, reflecting the latest documents in its training set. Consequently, when interacting with an LLM, you might encounter disclaimers stating that the model's knowledge is current only up to a certain date, referred to as the knowledge cutoff date. This implies that to incorporate recent or real-time knowledge into our GenAI applications, we must enable the model to retrieve up-to-date information from the web, a topic we will explore further in this book.

Conversely, the inference process is dynamic; it occurs in real time based on the user's prompt. Before calling the LLM, the prompt must be enriched with any additional or updated information, leveraging specific prompting techniques, as we will learn in the book. Essentially, we need to equip the LLM with additional mechanisms, devices, and customization to be able to interact with all the knowledge it needs, integrating its pre-trained knowledge with whatever else it needs.

Another important consideration is the concept of fine-tuning, which allows us to enhance or specialize a pre-trained model. Fine-tuning involves taking a pre-trained neural network like our GPT-3 and conducting additional learning passes with new, specific, or more relevant data. This process can yield multiple benefits: it can make the LLM more adept at specific tasks by tailoring it to particular sets of instructions, or it can integrate specialized knowledge from domains that were underrepresented in the initial training data, such as a specific branch of medicine. Additionally, fine-tuning can improve the model's ability to maintain dialogues with humans, as demonstrated by ChatGPT, which fine-tuned GPT-3 for better conversational capabilities. This fine-tuning process follows a typical supervised learning approach: the model is provided with numerous prompts and the desired completions so that it can learn from those examples. The neural network adjusts its parameters to converge towards the desired completions through backpropagation-like learning passes, enhancing its performance and applicability.
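To make “numerous prompts and the desired completions” concrete, here is what a tiny supervised fine-tuning dataset could look like. The format is a generic illustration; every fine-tuning toolkit defines its own exact schema:

# A handful of (prompt, desired completion) pairs for supervised fine-tuning.
# Real datasets contain thousands of such examples, often stored as JSONL files.
fine_tuning_examples = [
    {
        "prompt": "Customer: My parcel arrived damaged. What should I do?",
        "completion": "I'm sorry to hear that. Please reply with a photo of the "
                      "damage and your order number, and we will send a replacement.",
    },
    {
        "prompt": "Customer: Can I change the delivery address after ordering?",
        "completion": "Yes, as long as the order has not shipped yet. Share your "
                      "order number and the new address and we will update it.",
    },
]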

Summary

  • GenAI programming is fundamentally different from traditional coding, requiring us to handle non-deterministic outputs and natural-language inputs.
  • Building real-world GenAI applications involves orchestrating multiple supporting components (like memory management and updated knowledge retrieval) around the core LLM.
  • Transformer models, like GPT-3, generate text one token at a time by leveraging embeddings, huge neural networks, and massive pre-training.
  • LLMs have fixed, pre-trained knowledge and are stateless by default, but we can work around these limits with fine-tuning and additional architectural elements.

Guided notes

  • To understand how transformers really work, I recommend: Tsourakis, Nikos. Machine Learning Techniques for Text: Apply modern techniques with Python for text processing, dimensionality reduction, classification, and evaluation. Packt Publishing Ltd, 2022.
  • To learn more about how GPT-3 works and was trained, you can read the OpenAI paper published at its launch: Brown, Tom, et al. "Language models are few-shot learners." Advances in Neural Information Processing Systems 33 (2020): 1877-1901.
Up next...
  • You learn how to use the low-code development platform Langflow to create GenAI applications.
  • You create your first working chatbot.