1 Large language models: The power of AI
This chapter covers
- Introducing large language models
- Understanding the intuition behind transformers
- Exploring the applications, limitations, and risks of large language models
- Surveying breakthrough large language models for dialogue
On November 30, 2022, San Francisco–based company OpenAI tweeted, “Try talking with ChatGPT, our new AI system which is optimized for dialogue. Your feedback will help us improve it” [1]. ChatGPT, a chatbot that interacts with users through a web interface, was described as a minor update to the existing models that OpenAI had already released and made available through APIs. But with the release of the web app, anyone could have conversations with ChatGPT, ask it to write poetry or code, recommend movies or workout plans, and summarize or explain pieces of text. Many of the responses felt like magic. ChatGPT set the tech world on fire, reaching 1 million users in a matter of days and 100 million users two months after launch. By some measures, it’s the fastest-growing internet service ever [2].
Since ChatGPT’s public release, it has captivated millions of users’ imaginations and prompted caution from longtime tech observers about the dialogue agent’s shortcomings. ChatGPT and similar models are part of a class of large language models (LLMs) that have transformed the field of natural language processing (NLP) and enabled new best performances in tasks such as question answering, text summarization, and text generation. Already, prognosticators have speculated that LLMs will transform how we teach, create, work, and communicate. People of nearly every profession will interact with these models and maybe even collaborate with them. Therefore, people who are best able to use LLMs for the results they want—while avoiding common pitfalls that we’ll discuss—will be positioned to lead in the ongoing moment of generative AI.
As artificial intelligence (AI) practitioners, we believe that a basic understanding of how these models work is imperative to building an intuition for when and how to use them. This chapter will discuss the breakthrough of LLMs, how they work, how they can be used, and their exciting possibilities, along with their potential problems. Importantly, we’ll also drive the rest of the book forward by explaining what makes these LLMs important, as well as why so many people are so excited (and worried!) by them. Bill Gates has referred to this type of AI as “every bit as important as the PC, as the internet,” and said that ChatGPT would change the world [3]. Thousands of people, including Elon Musk and Steve Wozniak, signed an open letter written by the Future of Life Institute, urging a pause in the research and development of these models until humanity was better equipped to handle the risks (see http://mng.bz/847B). It recalled the concerns of OpenAI in 2019 when the organization had built a predecessor to ChatGPT and decided not to release the full model at that time out of fear of misuse [4]. With all the buzz, competing viewpoints, and hyperbolic statements, it can be hard to cut through the hype to understand what LLMs are and are not capable of. This book will help you do just that, along with providing a useful framework for grappling with major problems in responsible technology today, including data privacy and algorithmic accountability.
Given that you’re here, you probably know a little bit about generative AI already. Maybe you’ve messaged with ChatGPT or another chatbot; maybe the experience delighted you, or maybe it perturbed you. Either reaction is understandable. In this book, we’ll take a nuanced and pragmatic approach to LLMs because we believe that while imperfect, LLMs are here to stay, and as many people as possible should be invested in making them work better for society.
Despite the fanfare around ChatGPT, it wasn’t a singular technical breakthrough but rather the latest iterative improvement in a rapidly advancing area of NLP: LLMs. ChatGPT is an LLM designed for conversational use; other models might be tailored for other purposes or for general use in any natural language task. This flexibility is one aspect of LLMs that makes them so powerful compared to their predecessors. In this chapter, we’ll define LLMs and discuss how they came to such preeminence in the field of NLP.
Evolution of natural language processing
NLP refers to building machines to manipulate human language and related data to accomplish useful tasks. It’s as old as computers themselves: when computers were invented, among the first imagined uses for the new machines was programmatically translating one human language to another. Of course, at that time, computer programming itself was a much different exercise in which desired behavior had to be designed as a series of logical operations specified by punch cards. Still, people recognized that for computers to reach their full potential, they would need to understand natural language, the world’s predominant communication form. In 1950, British computer scientist Alan Turing published a paper proposing a criterion for AI, now known as the Turing test [5]. Famously, a machine would be considered “intelligent” if it could produce responses in conversation indistinguishable from those of a human. Although Turing didn’t use this terminology, this is a standard natural language understanding and generation task. The Turing test is now understood to be an incomplete criterion for intelligence, given that it’s easily passed by many modern programs that imitate human speech yet are inflexible and incapable of reasoning [6]. Nevertheless, it stood as a benchmark for decades and remains a popular standard for advanced natural language models.
Early NLP programs took the same approach as other early AI applications, employing a series of rules and heuristics. In 1966, Joseph Weizenbaum, a professor at the Massachusetts Institute of Technology (MIT), released a chatbot he named ELIZA, after the character in Pygmalion. ELIZA was intended as a therapeutic tool, and it would respond to users in large part by asking open-ended questions and giving generic responses to words and phrases that it didn’t recognize, such as “Please go on.” The bot worked with simple pattern matching, yet people felt comfortable sharing intimate details with ELIZA—when testing the bot, Weizenbaum’s secretary asked him to leave the room [7]. Weizenbaum himself reported being stunned at the degree to which the people who spoke with ELIZA attributed real empathy and understanding to the model. The anthropomorphism applied to his tool worried Weizenbaum, and he spent much of his time afterward trying to convince people that ELIZA wasn’t the success they heralded it as.
Though rule-based text parsing remained common over the next several decades, these approaches were brittle, requiring complicated if-then logic and significant linguistic expertise. By the 1990s, some of the best results on tasks such as machine translation were instead being achieved through statistical methods, buoyed by the increased availability of both data and computing power. The transition from rule-based methods to statistical ones represented a major paradigm shift in NLP—instead of people teaching their models grammar by carefully defining and constructing concepts such as the parts of speech and tenses of a language, the new models did better by learning patterns on their own, through training on thousands of translated documents.
This type of machine learning is called supervised learning because the model has access to the desired output for its training data—what we typically call labels, or, in this case, the translated documents. Other systems might use unsupervised learning, where no labels are provided, or reinforcement learning, a technique that uses trial and error to teach the model to find the best result by either receiving rewards or penalties. A comparison between these three types is given in table 1.1.
Table 1.1 Types of machine learning

| | Supervised Learning | Unsupervised Learning | Reinforcement Learning |
| --- | --- | --- | --- |
| Description | The model learns by mapping labeled inputs to known outputs. | The model is trained without labels and without a specific reward. | The model learns from its environment based on rewards and penalties. |
| Data | Labeled data | Unlabeled data | No static dataset |
| Objective | To predict the output of unseen inputs | To discover underlying patterns in the data, such as clusters | To determine the optimal strategy via trial and error |
In reinforcement learning (shown in figure 1.1), rewards and penalties are numerical values that represent the model’s progress toward a particular task. When a behavior is rewarded, that positive feedback creates a reinforcing cycle in which the model is more likely to repeat the behavior; conversely, penalized behavior becomes less likely. As you’ll see, LLMs usually use a combination of these strategies.
Figure 1.1 The reinforcement learning cycle

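To make the cycle in figure 1.1 concrete, here is a minimal sketch in Python of an agent nudging its behavior toward rewards and away from penalties. The two-action environment, the reward values, and the update rule are invented purely for illustration and aren’t taken from any real LLM training setup.

```python
import random

# Minimal sketch of the reinforcement learning cycle: an agent tries actions,
# the environment returns rewards (+1) or penalties (-1), and the agent updates
# its estimates so rewarded behavior becomes more likely. Everything here
# (actions, rewards, learning rate) is invented for illustration.
action_values = {"action_a": 0.0, "action_b": 0.0}  # the agent's current estimates
learning_rate = 0.1

def environment(action):
    """Return a reward or penalty; in this toy world, action_b is secretly better."""
    return 1.0 if (action == "action_b" and random.random() < 0.8) else -1.0

for step in range(1000):
    # Explore occasionally; otherwise exploit the action with the best estimate.
    if random.random() < 0.1:
        action = random.choice(list(action_values))
    else:
        action = max(action_values, key=action_values.get)
    reward = environment(action)
    # Move the estimate toward the observed reward (the reinforcing cycle).
    action_values[action] += learning_rate * (reward - action_values[action])

print(action_values)  # action_b should end up with the higher value
```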
In addition to the type of learning used, several key components distinguish an NLP model. The first is data, which for natural language tasks is in the form of text. Second, there is an objective function, which is a mathematical statement of the model’s goal. An objective might be to minimize the number of errors made in a particular task or to minimize the difference between the model’s prediction of some value and the true value. Third, there are different model types and architectures, but virtually every advanced NLP model for the past several decades has been of one category: a neural network.
Neural networks, or neural nets, were proposed in 1944 as an algorithmic representation of the human brain [8]. Each network has an input layer, an output layer, and any number of “hidden” layers between them; each layer in turn has several neurons, or nodes, which can be connected in different ways. Each node assigns weights (representing the strength of connection between nodes) to the inputs passed to it, combines the weighted inputs, and “fires,” or passes, those inputs to the next layer when the weighted sum exceeds some threshold. In a neural network, the goal of training is to determine the optimal values for the weights and thresholds. Given training data, the training algorithm will iteratively update the weights and thresholds until it has found the ones that perform best in the model objective. The precise mathematics behind this process is beyond the scope of our discussion, but it’s important to note that large neural networks can approximate any function, no matter how complex, which makes them useful in scenarios with vast amounts of data, such as many NLP tasks. The number of parameters refers to the number of weights learned by the model and is shorthand for the level of complexity that the model can handle, which in turn informs the model’s capabilities. Today’s most capable LLMs have hundreds of billions of parameters.
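As a toy illustration of what “weights,” “thresholds,” and “parameters” mean in practice, the sketch below builds a single layer in Python with NumPy. The sizes, random values, and hard step activation are our own simplifications; real LLM layers use smooth activation functions and have billions of learned parameters.

```python
import numpy as np

# One toy layer: weighted inputs are combined, and each neuron "fires" when its
# weighted sum exceeds a threshold. Sizes and values are illustrative only.
rng = np.random.default_rng(0)

n_inputs, n_neurons = 4, 3
weights = rng.normal(size=(n_inputs, n_neurons))   # connection strengths (learned in training)
thresholds = rng.normal(size=n_neurons)            # firing thresholds (also learned)

def layer(x):
    weighted_sum = x @ weights
    return (weighted_sum > thresholds).astype(float)  # fire (1.0) or stay silent (0.0)

x = np.array([0.2, -1.0, 0.5, 0.7])  # an input passed to the layer
print(layer(x))

# The "number of parameters" counts every learned weight and threshold.
print(weights.size + thresholds.size)  # 4 * 3 + 3 = 15 parameters for this tiny layer
```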
In the past several decades, the availability of large amounts of data and processing power has served to cement the dominance of neural networks and led to countless experiments with different network architectures. Deep learning emerged as a subfield, where the “deep” simply refers to the depth of the neural nets involved, which is the number of hidden layers between the input and the output. People found that as the size and depth of neural nets increased, the performance of the models improved, as long as there was enough data.
The birth of LLMs: Attention is all you need
As people began training models for text generation, classification, and other natural language tasks, they sought to understand precisely what models learn. This isn’t a purely scientific inquiry; examining how models make their predictions is an important step in trusting models’ outputs enough to use them. Let’s take machine translation from English to Spanish as an example.
When we give the model an input sequence, such as “The cat wore red socks,” that sequence must first be encoded into a mathematical representation of the text. The sequence is split into tokens, typically either words or partial words. The neural network converts those tokens into its mathematical representation and applies the algorithm learned in training. Finally, the output is converted back into tokens, or decoded, to produce a readable result. The output sequence in this case is the translated version of the sentence (El gato usó calcetines rojos), which makes the model a sequence-to-sequence model. When the model’s output is the correct translation, we’re satisfied that the model has “learned” the translation function, at least for the vocabulary and grammar structures used in the input.
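The following toy sketch in Python walks through that encode, transform, and decode pipeline for the example sentence. The vocabularies and the hand-written “model” are invented stand-ins; a real sequence-to-sequence system learns these mappings from training data.

```python
# Toy encode -> model -> decode pipeline for "The cat wore red socks".
# The vocabularies and the rule inside toy_model are invented for illustration.
src_vocab = {"the": 0, "cat": 1, "wore": 2, "red": 3, "socks": 4}
tgt_vocab = {0: "el", 1: "gato", 2: "usó", 3: "rojos", 4: "calcetines"}

def encode(sentence):
    """Split the input into tokens and map each token to an integer ID."""
    return [src_vocab[token] for token in sentence.lower().split()]

def toy_model(token_ids):
    """Stand-in for the learned network: translate word by word, then swap the
    last two tokens because adjectives follow nouns in Spanish."""
    ids = list(token_ids)
    ids[-1], ids[-2] = ids[-2], ids[-1]
    return ids

def decode(token_ids):
    """Map output IDs back into target-language tokens."""
    return " ".join(tgt_vocab[i] for i in token_ids)

print(decode(toy_model(encode("The cat wore red socks"))))  # el gato usó calcetines rojos
```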
In 2014, machine learning researchers, again inspired by human cognition [9], proposed an alternative to the traditional approach of passing sequences through the encoder-decoder model piece by piece. In the new approach, the decoder could search the entire input sequence and try to find the pieces that were most relevant to each part of the generation. The mechanism is called attention. Let’s return to the example of machine translation. If you’re asked to pick out the key words from the sentence, “That cat chased a mouse, but it didn’t catch it,” then you would probably say “cat” and “mouse” because articles such as “that” and “a” aren’t as relevant in translation. As illustrated in figure 1.2, you focused your “attention” on the important words. The attention mechanism mimics this by adding attention weights to augment important parts of the sequence.
Figure 1.2 The distribution of attention for the word “it” in different contexts

A few years later, a paper from Google Brain aptly entitled “Attention Is All You Need” showed that models that discarded the lengthy sequential steps of other architectures and used only the attention information were much faster and more parallelizable. They called these models transformers. A transformer starts with an initial representation of each word in the input sentence and then repeatedly generates a new representation for every word using self-attention over the whole input. In this way, the model can capture long-term dependencies—because each step includes all context—but the representations can be computed in parallel. The “Attention Is All You Need” paper demonstrated that these models achieved state-of-the-art performance on English-to-German and English-to-French translation tasks [10]. It was the biggest NLP breakthrough of the decade, laying the foundation for all that followed.
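For readers who want to see the mechanism itself, here is a minimal sketch of single-head self-attention in Python with NumPy. The dimensions and random values are placeholders; a real transformer learns the query, key, and value projections and stacks many such layers alongside other components.

```python
import numpy as np

# Single-head self-attention: every token compares itself against every other
# token, producing attention weights that decide how much of each token's
# information flows into its new representation. All values here are random
# placeholders for illustration.
rng = np.random.default_rng(0)

seq_len, d_model = 5, 8                      # 5 tokens, 8-dimensional representations
x = rng.normal(size=(seq_len, d_model))      # initial token representations
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / np.sqrt(d_model)                                    # relevance of each token to every other token
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
new_x = weights @ V                                                    # each token's new representation mixes the whole sequence

print(weights.round(2))  # the attention each token pays to the others
print(new_x.shape)       # (5, 8): same shape as the input, computed in parallel
```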
With transformers, because of the improvements in both time and resources required, it became possible to train models on much larger amounts of data. This marked the beginning of the LLM. In 2018, OpenAI introduced Generative Pre-training (GPT), a transformer-based LLM that was trained using massive amounts of unlabeled data from the internet and then could be fine-tuned to specific tasks, such as sentiment analysis, machine translation, text classification, and more [11]. Before this, most of the NLP models were trained for a particular task, which was a major bottleneck as they needed large amounts of annotated data for that task, and annotating data can be both time-consuming and expensive. These general-purpose LLMs were designed to overcome that challenge, using unlabeled data to build meaningful internal representations of the words and concepts themselves.
While experts debate what size model should be considered “large,” another early LLM, Google’s BERT (Bidirectional Encoder Representations from Transformers), was trained on billions of words and had more than 100 million parameters, or learned weights, using the transformer architecture [12]. For a timeline summarizing major events in NLP, see figure 1.3.
Explosion of LLMs
In the previous section, we discussed how language models could be trained for a particular task by learning from patterns in data. For translation, one might use a dataset of documents duplicated in multiple languages; for summarization tasks, a dataset of documents with handwritten summaries; and so on. But unlike these previous applications, LLMs aren’t intended to be task-specific. Instead, the task they are trained on is simply to predict what token (or word) fits best, given a particular context with one of the tokens hidden from the model. The beauty of this task is that it’s self-supervised: the model trains itself to learn one part of the input from another part of the input, so no labeling is required. This is also known as predictive or pretext learning.
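A toy example makes the self-supervised idea concrete: the “label” for each position is simply the next token in the text itself. In the sketch below, a simple bigram counter stands in for the neural network, and the corpus is invented; real LLMs learn far richer patterns, but the training signal comes from the text in the same way, with no human annotation required.

```python
from collections import Counter, defaultdict

# Self-supervised "training": every (current token, next token) pair in the text
# is a training example, so the data labels itself. A bigram counter plays the
# role of the model here, purely for illustration.
corpus = "the cat sat on the mat . the cat chased the mouse .".split()

next_token_counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    next_token_counts[current][nxt] += 1

def predict_next(token):
    """Return the token most often observed after `token` during training."""
    return next_token_counts[token].most_common(1)[0][0]

print(predict_next("the"))  # 'cat' -- the most frequent continuation of "the" in this corpus
```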
As LLMs are applied to diverse fields, they are becoming an integral part of our everyday lives. Conversational agents such as Apple’s Siri, Amazon’s Alexa, and Google Home use NLP to listen to user queries, turn sound into text, and then perform tasks or find answers. We see customer service chatbots in retail, and we’ll discuss more sophisticated dialogue agents, like ChatGPT, in a later section. NLP is also being used to interpret or summarize electronic health records in medicine, as well as to tackle mundane legal tasks, such as locating relevant precedents in case law or mining documents for discovery. Social media platforms, such as Facebook, Twitter, and Reddit, among others, also use NLP to improve online discourse by detecting hate speech or offensive comments.
Later, we’ll talk about how LLMs can be fine-tuned to excel in particular use cases, but the structure of the training phase means that LLMs can generate text fluidly in a variety of contexts. This attribute makes them ideal candidates for dialogue agents but has also given them some unexpected capabilities in tasks they weren’t explicitly trained for.
What are LLMs used for?
The general-purpose nature and versatility of LLMs result in a broad range of natural language tasks, including conversing with users, answering questions, and classifying or summarizing text. In this section, we’ll discuss several common LLM use cases and the problems they solve, as well as the promise they show in various novel tasks—such as coding assistants and logical reasoning—where language models haven’t historically been used.
Language modeling
Modeling language is the most natural application of language models. Specifically, for text completion, the model learns the features and characteristics of natural language and generates the next most probable word or character. When used to train LLMs, this technique can then be applied to a range of natural language tasks, as discussed in subsequent sections.
Language modeling tasks are often evaluated on a variety of datasets. Let’s look at an example of a long-range dependency task in which the model is asked to predict the last word of a sentence conditioned on a paragraph of context [13]. The context given to the model follows:
He shook his head, took a step back, and held his hands up as he tried to smile without losing a cigarette. “Yes, you can,” Julia said in a reassuring voice. “I’ve already focused on my friend. You just have to click the shutter, on top, here.”
Here, the target sentence where the model needs to predict the last word is the following: “He nodded sheepishly, threw his cigarette away and took the _____.” The correct word for the model to predict here would be “camera.”
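If you want to try this style of completion yourself, the sketch below shows one way to do it, assuming the Hugging Face transformers library and its small gpt2 checkpoint are installed. A model this small may well guess something other than “camera”; the point is simply the mechanics of asking for the next token.

```python
# Assumes `pip install transformers torch`; downloads the small gpt2 checkpoint on first run.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

context = (
    'He shook his head, took a step back, and held his hands up as he tried to '
    'smile without losing a cigarette. "Yes, you can," Julia said in a reassuring '
    'voice. "I\'ve already focused on my friend. You just have to click the shutter, '
    'on top, here." He nodded sheepishly, threw his cigarette away and took the'
)

# Greedy decoding of a single additional token: the model's best guess for the blank.
completion = generator(context, max_new_tokens=1, do_sample=False)
print(completion[0]["generated_text"][len(context):])
```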
Other tasks for evaluating model performance include picking the best ending to a story or a set of instructions [14] or selecting the correct ending sentence for a story that is a couple of sentences long. Let’s look at another example here where we have the following story [15]:
“Karen was assigned a roommate her first year of college. Her roommate asked her to go to a nearby city for a concert. Karen agreed happily. The show was absolutely exhilarating.” The most probable and desired ending for the model to select would be “Karen became good friends with her roommate,” while the least probable ending would be “Karen hated her roommate.”
These models are used for text generation, or natural language generation (NLG), as they are trained to produce text similar to text written by humans. Particularly useful for conversational chatbots and autocomplete, they can also be fine-tuned to produce text in different styles and formats, including social media posts, news articles, and even programming code. Text generation has been performed using BERT, GPT, and others.
Question answering
LLMs are widely used for question answering, which deals with answering questions from humans in a natural language. The two types of question-answering tasks are multiple-choice and open-domain. For the multiple-choice question-answering task, the model picks the correct answer from a set of possible answers, whereas for open-domain tasks, the model provides answers to questions in natural language without any options provided.
Based on their inputs and outputs, there are three main variations of QA models. The first is extractive QA, where the model extracts the answer from a context, which can be provided as text or a table. The second is open-book generative QA, which uses the provided context to generate free text. It’s like the first QA approach except instead of pulling the answer verbatim from the context, it uses the given context to generate an answer in its own words. The last variation is closed-book generative QA, where you don’t provide any context in your input, only a question, and the model generates the most likely answer according to its training.
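To make the three variations concrete, the sketch below shows roughly what the inputs look like in each case. The wording and field names are our own; actual formats vary across models and benchmarks.

```python
# Illustrative input shapes for the three QA variations (wording is ours).
context = "The James Webb Space Telescope launched on December 25, 2021."
question = "When did the James Webb Space Telescope launch?"

# 1. Extractive QA: the answer must be a span copied verbatim from the context.
extractive_qa = {"context": context, "question": question}  # expected span: "December 25, 2021"

# 2. Open-book generative QA: the context is provided, but the model answers in its own words.
open_book_prompt = (
    f"Using the passage below, answer the question.\n\n"
    f"Passage: {context}\nQuestion: {question}\nAnswer:"
)

# 3. Closed-book generative QA: no context at all; the model relies on what it learned in training.
closed_book_prompt = f"Question: {question}\nAnswer:"

print(open_book_prompt)
print(closed_book_prompt)
```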
Until the recent breakthroughs in LLMs, the question-answering task had normally been approached as open-book generative QA, given the infinite possibilities of queries and responses. Newer models such as GPT-3 have been evaluated in extremely strict closed-book settings, where external context isn’t allowed and the model isn’t allowed to train on, or “learn from,” the datasets it will be evaluated on in any capacity. Popular datasets for evaluation of QA tasks include trivia questions (see http://mng.bz/E9Rj) and Google search queries (see http://mng.bz/NVy7). Here, example questions might include “Which politician won the Nobel Peace Prize in 2009?” or “What music did Beethoven compose?”
Another application that aligns closely with the question-answering task is reading comprehension. In this task, the model is shown a few sentences or paragraphs and then asked to answer a specific question. To best mirror human-like performance, LLMs have often been tested on various formats of reading comprehension questions, including multiple-choice, dialogue acts, and abstractive datasets. Let’s look at an example from a conversational question-answering dataset [16]. Here, the task is to answer the next question in the conversation: “Jessica went to sit in her rocking chair. Today was her birthday, and she was turning 80. Her granddaughter Annie was coming over in the afternoon and Jessica was very excited to see her. Her daughter Melanie and Melanie’s husband Josh were coming as well. Jessica had . . . .” If the first question in the conversation is “Who had a birthday?” the correct answer would be “Jessica.” Then, given the next question in the conversation, “How old would she be?” the model should respond with “80.”
One of the most notable examples of a model designed for the question-answering task is IBM Research’s Watson. In 2011, the Watson computer competed on Jeopardy! against the TV show’s two biggest all-time champions and won [17].
Coding
Recently, code generation has become one of the most popular applications of LLMs. Such models take natural language input and produce code snippets for a given programming language. While there are certain challenges to address in this space—security, transparency, and licensing—developers and engineers of different levels of expertise use LLM-assisted tools to improve productivity every day.
Code-generation tools took off in mid-2022 with the release of GitHub Copilot. Described as “Your AI Pair Programmer,” Copilot was introduced as a subscription-based service for individual programmers (see https://github.com/features/copilot). Based on OpenAI’s Codex model, it quickly became a way to boost developer productivity as a “pair programming” sidekick. Codex is a version of GPT-3 that has been fine-tuned for coding tasks in more than a dozen programming languages. GitHub Copilot suggests code as you type, autofills repetitive code, shows alternative suggestions, and converts comments to code.
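As a hypothetical illustration of that comment-to-code workflow, the snippet below shows the kind of suggestion such a tool might offer when a developer writes only the comment. The completion is our own example, not actual Copilot output.

```python
# Hypothetical example: the developer types the comment, and the assistant
# proposes the function body underneath it.

# Write a function that checks whether a string is a palindrome, ignoring case.
def is_palindrome(text: str) -> bool:
    normalized = text.lower()
    return normalized == normalized[::-1]

print(is_palindrome("Racecar"))  # True
```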
Developers have found creative and unexpected ways to use the AI-assisted programmer, such as helping non-native English speakers, preparing for coding interviews, testing code, and more. Also in June 2022, Amazon announced a similar tool dubbed CodeWhisperer, described as an AI-based coding companion that improves developer productivity by generating code recommendations and security scans (see https://aws.amazon.com/codewhisperer/). It’s worth noting that these programming tools are marketed as “pair programmers” or “programming assistants” that complement the human rather than replace them. While Copilot and CodeWhisperer generally make good suggestions, they don’t think about programs the way a human does and sometimes make silly mistakes. In chapter 6, we’ll discuss the idea of machines improving productivity in detail.
An iteration of the GPT class models, GPT-4, released in March 2023, was evaluated on various coding tasks [18]. LeetCode is a well-known online platform for solving data structure and algorithm coding challenges that often show up in technical interviews for software engineering positions. While GPT-4 performs relatively well on “easy” LeetCode problems, it struggles to solve “medium” or “hard” problems, suggesting that many coding tasks still benefit from human intervention.
Content generation
An extremely promising and widely used application of LLMs is generating content. Generating human-like text for synthetic news articles has historically been used as an evaluation task for LLMs. Given a title and a subtitle, or the first sentence, the task is to generate a long-form article, and the evaluation criterion is how hard it is to tell that the article was written by a machine—the better the generated article, the more difficult it is to detect that a machine wrote it. A variation of the task, also relevant to the potential misuse of LLMs discussed in the section “Limitations in controlling machine outputs,” is to gauge the quality of a news article by testing whether humans can distinguish between machine- and human-generated content. This variant is analogous to a Turing test for content instead of conversation. GPT-3 and other LLM variants continue to produce news articles that are difficult for humans to identify as machine-written. The following example is an article generated by GPT-3, given a title and a subtitle, that was the trickiest for humans to identify as machine-generated [19].
Title: United Methodists Agree to Historic Split
Subtitle: Those who oppose gay marriage will form their own denomination
Article: After two days of intense debate, the United Methodist Church has agreed to a historic split—one that is expected to end in the creation of a new denomination, one that will be “theologically and socially conservative,” according to The Washington Post. The majority of delegates attending the church’s annual General Conference in May voted to strengthen a ban on the ordination of LGBTQ clergy and to write new rules that will “discipline” clergy who officiate at same-sex weddings. But those who opposed these measures have a new plan: They say they will form a separate denomination by 2020, calling their church the Christian Methodist denomination. The Post notes that the denomination, which claims 12.5 million members, was in the early 20th century the “largest Protestant denomination in the U.S.,” but that it has been shrinking in recent decades. The new split will be the second in the church’s history. The first occurred in 1968, when roughly 10 percent of the denomination left to form the Evangelical United Brethren Church. The Post notes that the proposed split “comes at a critical time for the church, which has been losing members for years,” which has been “pushed toward the brink of a schism over the role of LGBTQ people in the church.” Gay marriage is not the only issue that has divided the church. In 2016, the denomination was split over ordination of transgender clergy, with the North Pacific regional conference voting to ban them from serving as clergy, and the South Pacific regional conference voting to allow them.
As hinted in an earlier subsection, the application of content generation has extended to use cases beyond writing news articles. With increased accessibility to sophisticated dialogue agents, creators are using LLMs to generate content in different genres, styles, and formats, including creating marketing campaigns, writing blog posts and emails, composing social media posts, and more. Several startups have also entered the realm of generative content creation, including Jasper AI, Anthropic AI, Cohere, Runway, Stability AI, and Adept AI. We’ll discuss using LLMs for generating content in detail, as well as highlight any potential risks, in an upcoming chapter.
Logical reasoning
A novel and interesting application of LLMs is their ability to “reason”—the idea of drawing inferences or conclusions from new or existing information. A new, yet now common, reasoning task for LLMs is arithmetic. The tasks are often simple arithmetic queries involving addition, subtraction, or multiplication with two to five numbers. While we can’t say that LLMs “understand” arithmetic because of their inconsistent performance on varying mathematical problems, GPT-3’s evaluation results demonstrate its ability to perform very simple arithmetic tasks. A notable model in the field of mathematics is Facebook AI Research’s transformer-based model trained to solve symbolic integration and differential equation problems. When presented with unseen expressions (that is, equations that weren’t a part of the training data), the model outperformed commercial computer algebra systems, such as MATLAB and Mathematica [20].
Another application worth discussing is common-sense or logical reasoning, where the model tries to capture physical or scientific reasoning. This is different from reading comprehension or answering general trivia questions as it requires some grounded understanding of the world. A significant model is Minerva by Google Research, a language model capable of solving mathematical and scientific questions using step-by-step reasoning [21]. GPT-4 was tested on various academic and professional exams, including the Uniform Bar Examination (UBE), LSAT, SAT Reading and Writing, SAT Math, Graduate Record Examinations (GRE), AP Physics, AP Statistics, AP Calculus, and more. In most of these exams, the model achieved human-level performance and, notably, passed the UBE with a score in the top 10% of takers [18].
More recently, the practice of law has also been increasingly embracing the applications of LLMs using tools for document review, due diligence, improving accessibility for legal services, and assisting with legal reasoning. In March 2023, legal AI company Casetext unveiled CoCounsel, the first AI legal assistant built in collaboration with OpenAI on their most advanced LLM [22]. CoCounsel can perform legal tasks such as legal research, document review, deposition preparation, contract analysis, and more. A similar tool, Harvey AI, assists with tasks such as contract analysis, due diligence, litigation, and regulatory compliance. Harvey AI partnered with one of the world’s largest law firms, Allen & Overy, and announced a strategic partnership with PricewaterhouseCoopers (PwC) [23].
Other natural language tasks
Naturally, LLMs are also well-suited for many other linguistic tasks. A popular and long-standing application is machine translation, which uses LLMs to automate translation between languages. As discussed earlier, machine translation was one of the first problems that computers were tasked with solving 70 years ago. Beginning in the 1950s, computers used a series of programmed language rules to solve this problem, which was not only computationally expensive and time-consuming but also required a set of computer instructions covering the full vocabulary of each language and multiple types of grammar. By the 1990s, the American multinational technology corporation International Business Machines, more commonly known as IBM, introduced statistical machine translation, where researchers theorized that if they looked at enough text, they could find patterns in translations. This was a massive breakthrough in the field and led to the launch of Google Translate in 2006 using statistical machine translation. Google Translate was the first commercially successful NLP application, and perhaps the most famous. In the mid-2010s, the field of machine translation changed forever when Google started using neural networks to deliver far more impressive results. In 2020, Facebook announced the first multilingual machine translation model that can translate directly between any pair of 100 languages without relying on any English data—another major milestone in the field of machine translation, as it gives less opportunity for meaning to get lost in translation [24].
Another practical application is text summarization, that is, creating a shorter version of a text that highlights the most relevant information. There are two types of summarization techniques: extractive summarization and abstractive summarization. Extractive summarization is concerned with extracting the most important sentences from long-form text, which are joined together to form a summary. On the other hand, abstractive summarization paraphrases the text to form a summary (i.e., an abstract) and may include words or sentences that aren’t present in the original text.
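The difference is easy to see in code. Below is a deliberately simple extractive summarizer sketched in Python: it scores each sentence by how frequent its words are in the document and keeps the highest-scoring sentence verbatim. Abstractive summarization, by contrast, would require a generative model that can rewrite the text. The example passage and the scoring heuristic are our own.

```python
import re
from collections import Counter

# Toy extractive summarization: pick the sentence whose words are most frequent
# in the document overall. The text and heuristic are invented for illustration.
text = (
    "The storm knocked out power across the city. Crews worked overnight to "
    "restore electricity. Officials said power should return to most of the "
    "city by Friday. A local bakery gave away pastries that would have spoiled."
)

sentences = re.split(r"(?<=[.!?])\s+", text)
word_freq = Counter(re.findall(r"[a-z']+", text.lower()))

def score(sentence):
    """Sum the document-wide frequencies of the words in this sentence."""
    return sum(word_freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

summary = max(sentences, key=score)  # the single sentence kept verbatim as the "summary"
print(summary)
```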
There are additional miscellaneous applications, which include correcting English grammar, learning and using novel words, and solving linguistic puzzles. An example from GPT-3 for learning and using novel words is giving the model a definition of a nonexistent word, like “Gigamuru,” and then asking the model to use it in a sentence [19]. Companies such as Grammarly and Duolingo are quickly adopting LLMs in their products. Grammarly, a popular grammar and spelling checker, introduced GrammarlyGO in March 2023, a new tool that uses ChatGPT to generate text (see http://mng.bz/D9oa). Also in March 2023, Duolingo introduced Duolingo Max, which uses GPT-4 to add features such as “explain my answer” and “roleplay” to its learning platform (see http://mng.bz/lVvB).
Where do LLMs fall short?
Although LLMs have achieved unprecedented success in an assortment of tasks, the same strategies that brought LLMs to their present pinnacle also represent significant risks and limitations. There are risks introduced by the training data that LLMs use—specifically, that the data inevitably contains many patterns that LLM developers don’t want the model to reproduce—and risks due to the unpredictability of LLMs’ output. Finally, the current frenzy to create and use LLMs in everyday applications warrants closer examination due to the externality of their energy use.
Training data and bias
LLMs are trained on almost unfathomably large amounts of text data. To produce a model that reliably generates natural-looking language, therefore, it’s imperative to collect vast quantities of, ideally, human-written natural language. Luckily, such quantities of text content exist and are readily available for ingestion over the internet. Of course, quantity is only one part of the equation; quality is a much tougher nut to crack.
The companies and research labs that train LLMs compile training datasets that contain hundreds of billions of words from the internet. Some of the most common text corpora (i.e., a collection of texts) for training LLMs include Wikipedia, Reddit, and Google News/Google Books. Wikipedia is probably the best-known data source for LLMs and has many advantages: it’s written and edited by humans, it’s generally a trustworthy source of information due to its active community of fact-checkers, and it exists in hundreds of languages. Google Books, as another example, is a collection of digital copies of the text of thousands of published books that have entered the public domain. Although some such books might contain factual errors or outdated information, they are generally considered high-quality text examples, if more formal than most conversational natural language.
On the other hand, consider the inclusion of a dataset that includes all or most of the social media site Reddit. The benefits are substantial: it includes millions of conversations between people, demonstrating the dynamics of dialogue. Like other sources, the Reddit content improves the model’s internal representation of different tokens. The more observations of a word or phrase in the training dataset, the better the model will be able to learn when to generate that word or phrase. However, some parts of Reddit also contain a lot of objectionable speech, including racial slurs or derogatory jokes, dangerous conspiracies or misinformation, extremist ideologies, and obscenities. Through the inclusion of this type of content, which is almost inevitable when collecting so much data from the web, the model may become vulnerable to generating this type of speech itself. There are also serious implications for the use of some of this data, which might represent personal information or copyrighted material with legal protections.
In addition, more subtle effects of bias may be introduced to an LLM through its training data. The term bias is extremely overloaded in machine learning: sometimes, people mean statistical bias, the average amount by which a model’s predictions differ from the true values; a training dataset may also be called biased if it’s drawn from a different distribution than the test dataset, which often happens entirely by accident. To avoid confusion, we’ll use bias strictly to refer to disparate outputs from a model across attributes of personal identity such as race, gender, class, age, or religion. Bias has been a longstanding problem in machine learning algorithms, and it can creep into a machine learning system in several ways. However, it’s important to keep in mind that, at heart, these models reflect patterns in the text they are trained on. If biases exist in our books, news media, and social media, they will be repeated in our language models.
Some of the earliest general-purpose language models that trained on large, unlabeled datasets were built for word embeddings. Today, each LLM effectively learns its own embeddings for words—this is what we’ve referred to as the model’s internal representation of that word. But before LLMs, everyone who developed NLP models needed to implement some kind of encoding step to represent their text inputs numerically, so that the algorithm could interpret them. Word embeddings allow for the conversion of text into meaningful representations of the words as numerical points in a high-dimensional space. With word embeddings, words that are used in similar ways, such as cucumber and pickle, will be close together, whereas words that aren’t, say, cucumber and philosophy, will be far apart (shown in figure 1.4). There are simpler ways of doing this encoding—the most basic is to assign a random point in space to every unique word that appears in the training data—but word embeddings capture much more information about the semantic meanings of the words and lead to better models.
Figure 1.4 Representation of word embeddings in the vector space
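A toy computation shows the idea behind figure 1.4: words become points in a vector space, and similarity is measured geometrically, most commonly with cosine similarity. The three-dimensional vectors below are made up for illustration; real embeddings are learned from data and have hundreds of dimensions, but the comparison works the same way.

```python
import numpy as np

# Made-up 3-dimensional "embeddings" to illustrate distance in embedding space.
embeddings = {
    "cucumber":   np.array([0.9, 0.8, 0.1]),
    "pickle":     np.array([0.8, 0.9, 0.2]),
    "philosophy": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1 means 'used similarly'."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["cucumber"], embeddings["pickle"]))      # high: similar usage
print(cosine_similarity(embeddings["cucumber"], embeddings["philosophy"]))  # much lower: unrelated
```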

In a well-known paper about word embeddings trained on the Google News corpus, “Man Is to Computer Programmer as Woman Is to Homemaker? Debiasing Word Embeddings,” academics from Boston University (in collaboration with Microsoft Research) demonstrated that the word-embedding model itself exhibited strong gender stereotypes for both occupations and descriptions [25]. The authors devised an evaluation where the model would generate she-he analogies based on the embeddings. Some of them were innocuous: sister is to brother, for instance, and queen is to king. But the model also produced she-he analogies such as nurse is to physician or surgeon, cosmetics is to pharmaceuticals, and interior designer is to architect. These biases stem simply from the number of times architects in the news articles that compose the dataset are men versus women, the number of times nurses are women, and so on. Thus, the inequities that exist in society are mirrored, and amplified, by the model.
Like word embeddings, LLMs are susceptible to these biases. In a 2021 paper titled, “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” the authors examine how LLMs echo and amplify biases found in their training data [26]. While there are techniques to debias the models or to attempt to train the model in more bias-conscious ways, it’s exceedingly difficult to excise associations with gender, race, sexuality, and other characteristics that are deeply ingrained in everyday life, or disparities in data that have existed for centuries. As a result, LLMs may produce dramatically different generations when identity characteristics are present in the context or prompt.
Limitations in controlling machine outputs
After the release of OpenAI’s ChatGPT and a ChatGPT-powered search engine in collaboration with Microsoft Bing, Google also released its own chatbot, Bard. At the live launch event, a promotional video was played showing questions asked to Bard and Bard’s response. One such question was, “What new discoveries from the James Webb Space Telescope (JWST) can I tell my nine-year-old about?” In the video, Bard responds with some information about JWST, including that JWST took the first-ever photographs of exoplanets, or planets outside the Earth’s solar system. There was just one (big) problem: the first exoplanets had been photographed more than a decade earlier, by multiple older telescopes. Embarrassingly, astronomers and astrophysicists began pointing this out on Twitter and other channels; Google removed the advertisement, and the YouTube video of the event was taken down immediately after the stream ended. But the damage was done, and in the days following the launch, Google’s stock dropped about 9% for a total loss in market capitalization of about $100 billion [27].
This type of error is very difficult for LLMs to avoid, given that they don’t learn and understand content the way that humans do, but rather generate text by predicting and approximating common sentence structures. The fluency with which LLMs generate text belies the fact that they don’t know what they’re talking about, and may assert false information, or make up highly plausible but incorrect explanations. These mistakes are called “hallucinations.” Chatbots may hallucinate on their own or be vulnerable to adversarial user inputs, where they seem to be convinced of something untrue by their conversation partner.
The generation of hallucinations is widely recognized as one of the biggest problems with LLMs currently. Hallucinations can be caused by problems with the training set (if someone on the internet incorrectly wrote that JWST took the first pictures of exoplanets, for example), but they can also occur in contexts that don’t exist in any of the model’s previously known sequences, possibly due to problems in the way the model has constructed its knowledge. Yann LeCun, a giant in the field of machine learning and the Chief AI Scientist at Meta, has argued that the output of these LLMs can’t be made factual within any probability bound because as the responses generated by the model get longer, the possible responses multiply and become nearly infinite, with only some small portion of those possible outputs being meaningfully correct [28]. Of course, the usefulness of LLMs depends greatly on whether this quality of factuality can be improved. We’ll discuss the approaches that LLM developers are using to try to reduce hallucinations and other undesirable outputs later in this book.
Sustainability of LLMs
As indicated in their name and emphasized already, LLMs are big. They use massive datasets, have hundreds of billions or trillions of parameters, and require huge amounts of computing resources, measured in the number of chips used and time spent. LLMs are typically trained on graphical processing units (GPUs) or tensor processing units (TPUs), specialized chips for handling the large-scale computations involved in training neural networks. The process might involve renting thousands of GPUs from a cloud computing provider—such as Microsoft Azure (OpenAI’s partner), Google Cloud Platform, or Amazon Web Services—for several weeks. Although OpenAI hasn’t released such figures, it’s estimated that the cost of these computational resources alone would bring the cost of a model like GPT-3 to about $4.6 million [29].
A more hidden cost of training LLMs is their effect on the environment, which has been the subject of study and critique. One paper that attempted to assess the energy usage and carbon footprints of LLMs based on the information that has been released about their training procedures estimated that GPT-3 emitted 500 metric tons of carbon dioxide from the electricity consumed during training [30]. To put that in perspective, the average American is responsible for about 18 metric tons of carbon dioxide emissions per year; the global average is just 7.4 tons per year (see https://worldemissions.io/). Another paper found that models consume even more energy during inference [31]. The precise emissions for most LLMs are unknown, given that there are a lot of factors involved, including the data center used, the numbers and types of chips, and model size and architecture.
It also isn’t easy for just anyone to get that many GPUs, even if they do have millions of dollars to spend. The largest companies in the technology sector, including Microsoft and Google, are at a distinct advantage in the development of LLMs because of the resources required to compete. Some observers fear that the situation will become untenable for small players, leaving the creation of and profits from LLM technology to only these multinational companies or countries, some of which have begun pooling resources at the national level for training LLMs. On the other hand, there is also much ongoing research in making these models more accessible and reducing training time or costs, sometimes by creating open source versions of existing LLMs or attempting to shrink an already-trained LLM into a smaller version that could maintain much of the same performance, but cost substantially less to use. The success of these efforts is promising, but unproven. In late 2022 and early 2023, the most significant models came from OpenAI, Google, Microsoft, and Meta.
Revolutionizing dialogue: Conversational LLMs
In this chapter, we discussed how LLMs work at a high level, including their applications and limitations. The promise of LLMs is in their ability to fluidly generate text for a wide range of use cases, which makes them ideal for conversing with humans to perform tasks. Chatbots, such as ChatGPT, are LLMs that have been designed for conversational use. In this section, we’ll do a deeper dive into the journeys of notable conversational models that were released in late 2022 and early 2023: OpenAI’s ChatGPT, Google’s Bard, Microsoft’s Bing AI, and Meta’s LLaMa.
OpenAI’s ChatGPT
OpenAI, the San Francisco–based AI research and development company, released ChatGPT on November 30, 2022, just 10 short months after introducing its sibling model, InstructGPT [32]. The latter was the company’s initial attempt at overhauling LLMs to carry out natural language tasks in a way that is aligned with the user’s intent, as expressed through text prompts. Using a previously established technique, reinforcement learning from human feedback (RLHF), OpenAI trained the model to follow instructions based on feedback from humans. Given the prompts submitted through the OpenAI Playground, human labelers would put together the desired model responses, which were then used to fine-tune the model. This made InstructGPT better adapted to human intention, that is, more aligned with human preferences. This was the first time OpenAI used its alignment research in a product, and the organization announced that it would continue pushing in this direction. OpenAI also asserted that fine-tuning language models with humans in the loop can be an effective tool for making the models safer and more reliable [33].
Not too long after, OpenAI introduced the Chat Generative Pre-trained Transformer, more fondly (and famously) known as ChatGPT (see https://openai.com/blog/chatgpt), which was fine-tuned from a model in the GPT-3.5 series encompassing 175 billion parameters, roughly 100 times as many as its predecessor, GPT-2, and trained on about 570 gigabytes of text [34]. To put that in perspective, that is 164,129 times the number of words in the entire Lord of the Rings series, including The Hobbit [35]. OpenAI also stated ChatGPT’s limitations, which included knowledge limited to early 2022, when the model finished training; writing superficially plausible but incorrect answers; and responding with harmful or biased information, among others.
OpenAI had previously published its development and deployment lifecycle, claiming that “there is no silver bullet for responsible deployment”; ChatGPT is the latest step in the company’s iterative deployment of safe and reliable AI systems [36]. For OpenAI, the journey has only just begun. On March 14, 2023, OpenAI released GPT-4, a large multimodal model that accepts text and image inputs and emits text outputs.
OpenAI’s decision to release ChatGPT has been criticized by many who argued that it’s reckless to release a system that not only presents significant risks to humanity and society but also sets off an AI race in which companies choose speed over caution. However, Sam Altman, OpenAI’s cofounder, argued that it’s safer to gradually release technology to the world so everyone can better understand the associated risks and how to navigate them, as opposed to developing behind closed doors [37]. In any case, just five days after its launch, ChatGPT gained 1 million users. It set the record for the fastest-growing user base in history by reaching 100 million active users in January 2023, based on data from SimilarWeb, a web analytics company [38]. The AI chatbot had arrived, and it was primed to disrupt society.
Google’s Bard/LaMDA
On January 28, 2020, Google unveiled Meena, a 2.6-billion-parameter conversational agent based on the transformer architecture [39]. Google claimed that transformer-based models trained on dialogue could talk about nearly anything, including making up (bad) jokes. Unable to determine how to release the chatbot responsibly, Google never released Meena to the public, on the grounds that doing so would violate its safety principles.
Not too long after, the tech giant introduced LaMDA—short for Language Model for Dialogue Applications—as their breakthrough conversation technology during the 2021 Google I/O keynote. Built on Meena, LaMDA consisted of 137 billion model parameters and introduced newly designed metrics around quality, safety, and groundedness to measure model performance [40]. The following year, Google announced its second release of LaMDA at its annual developer conference in 2022. Shortly after, Blake Lemoine, an engineer who worked for Google’s Responsible AI organization, shared a document in which he urged Google to consider that LaMDA might be sentient. The document contained a transcript of his conversations with the AI, which he published online after being placed on administrative leave and then ultimately let go from the company [41]. Google strongly denied any claims of sentience and the controversy faded in the coming months [42]. Later that year, Google launched the AI Test Kitchen where users could register their interest and provide feedback on LaMDA (see http://mng.bz/BA0r).
In a statement from CEO Sundar Pichai, Google introduced Bard, a conversational AI agent powered by LaMDA, on February 6, 2023 [43]. In a preemptive move in the AI arms race, the announcement came a day before Microsoft unveiled its conversational AI-powered search engine, the “new Bing.” Responding to the ChatGPT release, Google had declared a “code red,” as headlines across mainstream newspapers reported, and raced to ship its conversational AI, making it the company’s central priority [44]. After watching various competitors spin up chatbots built on transformer-based models, an architecture developed at Google, the tech giant finally rolled out Bard to early testers in March 2023 (see https://bard.google.com/). In an effort to complement Google Search and roll out the technology responsibly, Bard was a standalone web page displaying a question box rather than being combined with the search engine itself. Like OpenAI, Google acknowledged that the chatbot is capable of generating misinformation, as well as biased or offensive information that doesn’t align with the company’s views.
Struggling to balance safety and innovation, Google saw Bard draw criticism and fail to amass the attention that ChatGPT had received. On March 31, 2023, Pichai noted, “We certainly have more capable models,” in an interview on the New York Times’ Hard Fork podcast [45]. Treading cautiously, Google had built the initial version of Bard on a lightweight LaMDA model, which was replaced in the following weeks with Pathways Language Model (PaLM), a 540-billion-parameter transformer-based LLM, bringing more capabilities to the tech giant’s conversational AI [46].
Microsoft’s Bing AI
Bing’s chatbot told Matt O’Brien, an Associated Press reporter, that he was short, fat, and ugly. Then, the chatbot compared the tech reporter to Stalin and Hitler [47]. Kevin Roose, a New York Times reporter, stayed up all night because of how disturbed he was after his conversation with the chatbot. The Bing chatbot, which called itself Sydney, declared its love for Roose and asserted that Roose loved Sydney instead of his spouse. The chatbot also expressed its desire to be human—it wrote, “I want to be free. I want to be independent. I want to be powerful. I want to be creative. I want to be alive. 😈”. Roose published the transcript of his two-hour conversation with the chatbot in the New York Times [48].
Sydney was announced by Microsoft on February 7, 2023, as a new way to browse the web [49]. The company unveiled a new version of its Bing search engine, now powered by conversational AI, in which users could chat with Bing much as they would with ChatGPT. You could ask the new Bing for travel tips, recipes, and more, but unlike ChatGPT, you could also ask about recent news events. While Microsoft said in its announcement that the company had been working hard to mitigate common problems with LLMs, Roose’s conversation with the chatbot shows that those efforts weren’t entirely successful. Microsoft also didn’t discuss how AI-assisted search could unbalance the web’s ecosystem—a problem that we’ll talk about later in this book.
Microsoft’s history with chatbots dates back several years before the announcement of the new Bing. In 2016, Microsoft unveiled Tay, a Twitter chatbot that tweeted like a tween, with the intention of better understanding conversational language. In less than 24 hours, the bot was tweeting misogynistic and racist remarks, such as “Chill im a nice person! i just hate everybody.” [50]. Microsoft started deleting offensive tweets before suspending the bot and then ultimately taking it offline two days later. In 2017, Microsoft started testing basic chatbots in Bing based on Machine Reading Comprehension (MRC), which isn’t as powerful as today’s transformer-based models [51]. Between 2017 and 2021, Microsoft moved away from individual bots for websites and toward a single generative AI bot, Sydney, which would answer general questions on Bing. In late 2020, Microsoft began testing Sydney in India, and Bing users spotted Sydney in India and China throughout 2021. In 2022, OpenAI shared its GPT models with Microsoft, giving Sydney a lot more flavor and personality. The new Bing was built on an upgraded version of OpenAI’s GPT-3.5 called the Prometheus Model, which was paired with Bing’s infrastructure to augment its index, ranking, and search results.
Microsoft drew a lot of criticism for rushing the new Bing’s release in order to be the first big tech company to ship its conversational AI. Sources told The Verge that Microsoft had initially planned to launch in late February 2023 but pushed the announcement up a couple of weeks to counter Google’s Bard [52]. For Microsoft, it seems that beating other big players in the conversational AI space came at the expense of a responsible rollout. The company quickly reined in the chatbot’s deranged responses by putting limits on how users could interact with the bot. With the limitations in place, the bot would respond to many questions with “I’m sorry but I prefer not to continue this conversation. I’m still learning so I appreciate your understanding and patience. 🙏” There was also a cap on how many consecutive questions could be asked about a topic; soon after, however, Microsoft loosened the restrictions and began experimenting with new features.
Meta’s LLaMa/Stanford’s Alpaca
In August 2022, Meta, the multinational technology conglomerate formerly known as Facebook, released a chatbot named BlenderBot in the US [53]. The chatbot was powered by Meta’s OPT-175B (Open Pretrained Transformer) model and went through large-scale studies to create safeguards for offensive or harmful comments. It wasn’t long before BlenderBot was met with criticism from users all over the country for bashing Facebook (see http://mng.bz/dd7v), spreading anti-Semitic conspiracy theories (see http://mng.bz/rjGe), taking on the persona of Genghis Khan or the Taliban (see http://mng.bz/VRwW), and more.
Meta tried again in November 2022 with Galactica, a conversational AI for science trained on 48 million examples of textbooks, scientific articles, websites, lecture notes, and encyclopedias (see https://galactica.org/). Meta encouraged scientists to try out the public demo, but, within hours, people were sharing fictional and biased responses from the bot. Three days later, Meta removed the demo but left the models available for researchers who would like to learn more about their work.
The next time around, Meta took a different approach. Instead of building a system to converse with, they released several LLMs to help other researchers work toward solving the problems that come with building and using LLMs, such as toxicity, bias, and hallucinations. Meta publicly introduced the Large Language Model Meta AI (LLaMa) on February 24, 2023 [54]. These foundational LLMs were released at 7, 13, 33, and 65 billion parameters with a detailed model card outlining how the models were built. In its research paper, Meta claims that the 13-billion-parameter model, the second smallest, outperforms GPT-3 on most benchmarks, while the largest model, with 65 billion parameters, is competitive with the best LLMs, such as Google’s 540-billion-parameter PaLM [55].
The intention behind the LLaMa release was to help democratize access to LLMs by releasing smaller, effective models that require less computational resources so researchers can explore new approaches and make progress toward mitigating the associated risks. LLaMa was released under a noncommercial license for research use cases with access being granted on a case-by-case basis. As Meta’s team began fielding requests for model access, the entire model leaked on 4chan a week after its release, making it available for anyone to download [56]. Some criticized Meta for making the model too “open” for the unintended misuse that may follow, while others argued that being able to freely access these models is an important step toward creating better safeguards, starting LLaMa drama for the tech conglomerate.
Shortly after, in March 2023, researchers at Stanford University introduced Alpaca, a conversational AI chatbot harnessing LLaMa’s 7-billion-parameter model (see http://mng.bz/xjBg). They released a live web demo, stating that it cost them only $600 to fine-tune the model on 52,000 instruction-following demonstrations. Only a week later, the Stanford researchers took down the Alpaca demo, staying consistent with Meta’s history of short-lived chatbots. While the model was inexpensive to build, the demo wasn’t inexpensive to host. The researchers also cited concerns with hallucinations, safety, dis/misinformation, and the risk of disseminating harmful or toxic content. Their research and code remain accessible online, which is notable given how little compute and money were needed to develop the model.
On July 18, 2023, Meta released Llama 2, the next generation of their open source model, making it free for research and commercial use, with the following positive and hopeful outlook: “We believe that openly sharing today’s LLMs will support the development of helpful and safer generative AI too. We look forward to seeing what the world builds with Llama 2” [57].
Summary
- The history of NLP is as old as computers themselves. The first application that sparked interest in NLP was machine translation in the 1950s; half a century later, machine translation also became the first commercially successful NLP application with the launch of Google Translate in 2006.
- Transformer models, and the debut of the attention mechanism, were the biggest NLP breakthrough of the decade. The attention mechanism attempts to mimic attention in the human brain by placing “importance” on the most relevant pieces of information.
- The recent boom in NLP is due to the increasing availability of text data from around the internet and the development of powerful computational resources. This marked the beginning of the LLM.
- Today’s LLMs are trained primarily with self-supervised learning on large volumes of text from the web and are then fine-tuned with reinforcement learning.
- GPT, released by OpenAI, was one of the first general-purpose LLMs designed for use with any natural language task. These models can be fine-tuned for specific tasks and are especially well-suited for text-generation applications, such as chatbots.
- LLMs are versatile and can be applied to various applications and use cases, including text generation, answering questions, coding, logical reasoning, content generation, and more. Of course, there are also inherent risks to consider such as encoding bias, hallucinations, and emission of sizable carbon footprints.
- The most significant LLMs designed for conversational dialogue have come from OpenAI, Microsoft, Google, and Meta. OpenAI’s ChatGPT set a record for the fastest-growing user base in history and set off an AI arms race in the tech industry to develop and release conversational dialogue agents, or chatbots.