chapter seven

7 Controlling LLMs via the Ollama Python API

 

This chapter covers

  • Writing your first Python script that talks to a local LLM
  • Understanding message roles: system, user, and assistant
  • Streaming responses for real-time token-by-token output
  • Building a multi-turn chatbot that maintains conversation history

Everything you have done so far---installing Ollama, downloading models, setting up VS Code, creating a virtual environment---was preparation. In this chapter, you cross the line from AI user to AI programmer. You will write Python scripts that send instructions to an LLM and receive its responses, giving you programmatic control over a locally running AI model.

Make sure your virtual environment is activated (you should see (venv) in your terminal prompt) and that the Ollama service is running before you begin. This chapter uses the model gemma3:4b. If you have not pulled it yet, run ollama pull gemma3:4b in a terminal. If ollama serve reports that port 11434 is already in use, Ollama is already running and you can continue.

7.1 Your First Ollama Python Script

In this section, you will write a short Python script that sends a question to the Gemma 2 model and prints its answer. Do not worry if the code looks unfamiliar---we will break down every line afterward. The goal is to type the code exactly as shown, run it, and see the result. Understanding will follow.

7.1.1 The simplest AI program

7.1.2 Breaking down the code

7.2 Understanding the Message Format

7.2.1 The three roles

7.2.2 Adding a system message

7.2.3 Experimenting with different system messages

7.3 Streaming Responses for Real-Time Output

7.3.1 The problem with waiting

7.3.2 Implementing streaming

7.3.3 Understanding the code

7.3.4 The communication flow

7.4 Maintaining Conversation History

7.4.1 The memory problem

7.4.2 Building a conversation loop

7.4.3 Running the conversation

7.4.4 How the history list grows

7.4.5 The complete flow

7.5 Summary

7.6 Exercises