Let's be real for a second. The term "AI agent" gets thrown around a lot these days. It sounds futuristic and amazing, like something out of a sci-fi movie. But when you actually try to build one, you quickly run into a wall. How do you get it to think step-by-step? How do you connect it to real-world tools? How do you make it do more than just answer simple questions?
It can feel a bit like you've been handed a super-powered brain in a jar, but with no instructions on how to connect it to a body.
I've been there. That’s why I was so excited to get my hands on Z.AI’s GLM-5 model. It feels like it was designed from the ground up to solve these exact problems. It has built-in features that make creating these complex, "agentic" systems surprisingly straightforward.
So, in this post, I want to walk you through it. We're going to go from zero to building a functional multi-tool agent. No fluff, just practical steps and real code. Think of me as your co-pilot as we explore what this thing can really do.
Getting Started: The Setup
First things first, we need to get our environment ready. It's pretty simple. We'll be using the Z.AI SDK, and if you've ever used OpenAI's Python library, this will feel very familiar.
Let's install the libraries we need.
pip install -q zai-sdk openai rich
Now, you'll need an API key. You can grab a free one from their site. The little script below will ask for your key securely so you don't have to paste it into your code in plain text.
import os
import json
import time
from datetime import datetime
from typing import Optional
import getpass
API_KEY = os.environ.get("ZAI_API_KEY")
if not API_KEY:
    API_KEY = getpass.getpass(" Enter your Z.AI API key (hidden input): ").strip()
if not API_KEY:
    raise ValueError(
        " No API key provided! Get one free at: https://z.ai/manage-apikey/apikey-list"
    )
os.environ["ZAI_API_KEY"] = API_KEY
print(f" API key configured (ends with ...{API_KEY[-4:]})")
from zai import ZaiClient
client = ZaiClient(api_key=API_KEY)
print(" ZaiClient initialized — ready to use GLM-5!")
Once you run that and see the "ZaiClient initialized" message, you're all set. Let's make our first call.
A Simple "Hello World" Chat
Before we build a complex agent, let's just make sure we can have a basic conversation. We’re going to ask GLM-5 to explain a technical concept, "Mixture-of-Experts," and ask it to be concise.
print("\n" + "=" * 70)
print(" SECTION 2: Basic Chat Completion")
print("=" * 70)
response = client.chat.completions.create(
model="glm-5",
messages=[
{"role": "system", "content": "You are a concise, expert software architect."},
{"role": "user", "content": "Explain the Mixture-of-Experts architecture in 3 sentences."},
],
max_tokens=256,
temperature=0.7,
)
print("\n GLM-5 Response:")
print(response.choices[0].message.content)
print(f"\n Usage: {response.usage.prompt_tokens} prompt + {response.usage.completion_tokens} completion tokens")
Easy enough, right? It works just like you'd expect. We send a prompt, we get a response back. But waiting for the full response can feel slow, especially for longer answers. Let's fix that.
Making it Feel Alive: Streaming Responses
Have you ever used a chat app where you can see the other person typing? It feels more immediate, more real. We can do the same thing with our AI using streaming. Instead of waiting for the entire block of text, the model sends us the response word by word (or token by token) as it generates it.
This is a game-changer for user experience.
print("\n" + "=" * 70)
print(" SECTION 3: Streaming Responses")
print("=" * 70)
print("\n GLM-5 (streaming): ", end="", flush=True)
stream = client.chat.completions.create(
model="glm-5",
messages=[
{"role": "user", "content": "Write a Python one-liner that checks if a number is prime."},
],
stream=True,
max_tokens=512,
temperature=0.6,
)
full_response = ""
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
        full_response += delta.content
print(f"\n\n Streamed {len(full_response)} characters")
When you run this, you'll see the Python code appear on your screen piece by piece. It’s a small change in the code (stream=True), but a huge difference in how the application feels.
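If you want to play with the accumulation logic without spending tokens, here's a minimal offline sketch. It assumes chunks shaped like the ones in the loop above (chunk.choices[0].delta.content). The names fake_chunk and collect_stream are made up for this illustration, and the guard for an empty choices list is a defensive assumption on my part, not documented SDK behavior.

```python
from types import SimpleNamespace
from typing import Iterable


def fake_chunk(content):
    """Build a stand-in for one streamed chunk (hypothetical test helper)."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


def collect_stream(chunks: Iterable) -> str:
    """Join the text deltas of a stream, skipping chunks that carry no text."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # defensive: tolerate chunks without choices
            continue
        content = chunk.choices[0].delta.content
        if content:  # None or "" means there is nothing to show yet
            parts.append(content)
    return "".join(parts)


chunks = [
    fake_chunk("is_prime = "),
    fake_chunk(None),
    fake_chunk("lambda n: ..."),
    SimpleNamespace(choices=[]),
]
print(collect_stream(chunks))
```

Collecting the pieces in a list and joining once at the end avoids repeated string concatenation, which starts to matter for long responses.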
Peeking Behind the Curtain: Thinking Mode
Okay, this is where things get really interesting and, frankly, it’s one of my favorite features. Sometimes, you don't just want the answer; you want to know how the AI got the answer. This is especially true for logic puzzles, math problems, or complex reasoning.
GLM-5 has a "Thinking Mode" that exposes its internal monologue, or chain-of-thought, before it gives you the final answer. It’s like asking a student to show their work on a math test. It builds trust and helps you debug when things go wrong.
Let's try it with a classic riddle.
print("\n" + "=" * 70)
print(" SECTION 4: Thinking Mode (Chain-of-Thought)")
print("=" * 70)
print("GLM-5 can expose its internal reasoning before giving a final answer.")
print("This is especially powerful for math, logic, and complex coding tasks.\n")
print("─── Thinking Mode + Streaming ───\n")
stream = client.chat.completions.create(
model="glm-5",
messages=[
{
"role": "user",
"content": (
"A farmer has 17 sheep. All but 9 run away. "
"How many sheep does the farmer have left? "
"Think carefully before answering."
),
},
],
thinking={"type": "enabled"}, # This is the magic line
stream=True,
max_tokens=2048,
temperature=0.6,
)
reasoning_text = ""
answer_text = ""
for chunk in stream:
    delta = chunk.choices[0].delta
    if hasattr(delta, "reasoning_content") and delta.reasoning_content:
        if not reasoning_text:
            print(" Reasoning:")
        print(delta.reasoning_content, end="", flush=True)
        reasoning_text += delta.reasoning_content
    if delta.content:
        if not answer_text and reasoning_text:
            print("\n\n Final Answer:")
        print(delta.content, end="", flush=True)
        answer_text += delta.content
print(f"\n\n Reasoning: {len(reasoning_text)} chars | Answer: {len(answer_text)} chars")
First, you'll see the Reasoning: stream out, where the model breaks down the riddle. It'll point out that "all but 9" is the key phrase. Then, after it's done thinking, the Final Answer: will appear. This is incredibly powerful for building reliable systems.
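The split between reasoning and answer is easy to test offline too. This is a rough sketch that assumes each delta may carry reasoning_content, content, both, or neither, as in the loop above. The helpers split_stream and make_chunk are hypothetical names, and the sample text is hand-written, not real model output.

```python
from types import SimpleNamespace


def split_stream(chunks):
    """Return (reasoning, answer) accumulated from streamed deltas."""
    reasoning, answer = [], []
    for chunk in chunks:
        delta = chunk.choices[0].delta
        # getattr with a default covers deltas that lack the attribute entirely
        thought = getattr(delta, "reasoning_content", None)
        if thought:
            reasoning.append(thought)
        text = getattr(delta, "content", None)
        if text:
            answer.append(text)
    return "".join(reasoning), "".join(answer)


def make_chunk(**fields):
    """Hypothetical stand-in for one streamed chunk."""
    delta = SimpleNamespace(**fields)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [
    make_chunk(reasoning_content="'All but 9' means 9 stay. "),
    make_chunk(reasoning_content="So 9 remain.", content=None),
    make_chunk(content="9 sheep."),
]
reasoning, answer = split_stream(demo)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

Keeping the two streams in separate buffers makes it easy to log the reasoning for debugging while showing users only the answer.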
Keeping the Conversation Going: Multi-Turn Memory
An agent isn't very useful if it has the memory of a goldfish. For a real conversation, the AI needs to remember what you've already talked about. This is called a multi-turn conversation.
The concept is simple: we just keep a running list of the entire conversation and send it back to the model with each new message. Let's see it in action with a few follow-up questions about Python.
print("\n" + "=" * 70)
print(" SECTION 5: Multi-Turn Conversation")
print("=" * 70)
messages = [
{"role": "system", "content": "You are a senior Python developer. Be concise."},
{"role": "user", "content": "What's the difference between a list and a tuple in Python?"},
]
# First turn
r1 = client.chat.completions.create(model="glm-5", messages=messages, max_tokens=512, temperature=0.7)
assistant_reply_1 = r1.choices[0].message.content
messages.append({"role": "assistant", "content": assistant_reply_1})
print(f"\n User: {messages[1]['content']}")
print(f" GLM-5: {assistant_reply_1[:200]}...")
# Second turn (follow-up)
messages.append({"role": "user", "content": "When should I use a NamedTuple instead?"})
r2 = client.chat.completions.create(model="glm-5", messages=messages, max_tokens=512, temperature=0.7)
assistant_reply_2 = r2.choices[0].message.content
print(f"\n User: {messages[-1]['content']}")
print(f" GLM-5: {assistant_reply_2[:200]}...")
messages.append({"role": "assistant", "content": assistant_reply_2})
# Third turn (another follow-up)
messages.append({"role": "user", "content": "Show me a practical example with type hints."})
r3 = client.chat.completions.create(model="glm-5", messages=messages, max_tokens=1024, temperature=0.7)
assistant_reply_3 = r3.choices[0].message.content
print(f"\n User: {messages[-1]['content']}")
print(f" GLM-5: {assistant_reply_3[:300]}...")
print(f"\n Conversation: {len(messages)+1} messages, {r3.usage.total_tokens} total tokens in last call")
As you can see, the model remembers the context from the previous questions, allowing it to give relevant and helpful answers. We're building up the messages list with each turn, which acts as the conversation's memory.
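One thing to watch: because the whole list is resent on every call, token usage grows with each turn. Here's a minimal sketch of one way to cap it, by keeping the system prompt plus only the most recent messages. The trim_history function and its max_recent parameter are my own illustration, not part of the Z.AI SDK, and it assumes a plain chat history without tool messages.

```python
def trim_history(messages, max_recent=6):
    """Keep system messages plus the most recent non-system messages."""
    system = [m for m in messages if m["role"] == "system"]
    if max_recent <= 0:
        return system
    recent = [m for m in messages if m["role"] != "system"][-max_recent:]
    # Start the window on a user message so the model never sees
    # an assistant reply whose question was cut off.
    while recent and recent[0]["role"] != "user":
        recent.pop(0)
    return system + recent


history = [
    {"role": "system", "content": "You are a senior Python developer. Be concise."},
    {"role": "user", "content": "What's the difference between a list and a tuple?"},
    {"role": "assistant", "content": "Lists are mutable; tuples are not."},
    {"role": "user", "content": "When should I use a NamedTuple instead?"},
    {"role": "assistant", "content": "When the fields deserve names."},
    {"role": "user", "content": "Show me a practical example with type hints."},
]
trimmed = trim_history(history, max_recent=3)
print([m["role"] for m in trimmed])
```

You would pass the trimmed list as messages instead of the full history. The trade-off is that the model forgets anything outside the window, so pick the size based on how far back your follow-up questions tend to reach.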
Giving Your AI a Toolbox: The Magic of Function Calling
This is the core of what makes an agent an agent. We need to give it the ability to do things beyond just talking. We can do this by giving it a set of "tools" (which are just functions in our code) that it can choose to call.
Think of it like this: you wouldn't ask a human assistant to calculate 2^20 in their head. You'd expect them to use a calculator. We're doing the same thing for our AI.
Let's define two tools: one to get the weather and another to act as a safe calculator.
print("\n" + "=" * 70)
print(" SECTION 6: Function Calling (Tool Use)")
print("=" * 70)
print("GLM-5 can decide WHEN and HOW to call external functions you define.\n")
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a given city",
"parameters": {
"type": "object",
"properties": {
"city": { "type": "string", "description": "City name, e.g. 'San Francisco', 'Tokyo'", },
"unit": { "type": "string", "enum": ["celsius", "fahrenheit"], "description": "Temperature unit (default: celsius)", },
},
"required": ["city"],
},
},
},
{
"type": "function",
"function": {
"name": "calculate",
"description": "Evaluate a mathematical expression safely",
"parameters": {
"type": "object",
"properties": { "expression": { "type": "string", "description": "Math expression, e.g. '2**10 + 3*7'", } },
"required": ["expression"],
},
},
},
]
# --- Helper Functions (our actual tools) ---
def get_weather(city: str, unit: str = "celsius") -> dict:
    # (In a real app, this would call a real weather API)
    weather_db = {
        "san francisco": {"temp": 18, "condition": "Foggy", "humidity": 78},
        "tokyo": {"temp": 28, "condition": "Sunny", "humidity": 55},
        "london": {"temp": 14, "condition": "Rainy", "humidity": 85},
    }
    # Copy the entry so the unit conversion below never alters the lookup table
    data = dict(weather_db.get(city.lower(), {"temp": 20, "condition": "Clear"}))
    if unit == "fahrenheit":
        data["temp"] = round(data["temp"] * 9 / 5 + 32)
    return {"city": city, "unit": unit or "celsius", **data}
def calculate(expression: str) -> dict:
    try:
        # Emptying __builtins__ narrows what eval can reach, but it is not a
        # real sandbox. Don't expose this to untrusted input as-is.
        result = eval(expression, {"__builtins__": {}}, {})
        return {"expression": expression, "result": result}
    except Exception as e:
        return {"error": str(e)}
TOOL_REGISTRY = {"get_weather": get_weather, "calculate": calculate}
# --- The Tool-Calling Loop ---
def run_tool_call(user_message: str):
    print(f"\n User: {user_message}")
    messages = [{"role": "user", "content": user_message}]
    # 1. First call to see if the model wants to use a tool
    response = client.chat.completions.create(
        model="glm-5", messages=messages, tools=tools, tool_choice="auto"
    )
    assistant_msg = response.choices[0].message
    messages.append(assistant_msg.model_dump())
    # 2. If it wants to use a tool, we run it
    if assistant_msg.tool_calls:
        for tc in assistant_msg.tool_calls:
            fn_name = tc.function.name
            fn_args = json.loads(tc.function.arguments)
            print(f" Tool call: {fn_name}({fn_args})")
            result = TOOL_REGISTRY[fn_name](**fn_args)
            print(f" Result: {result}")
            messages.append({
                "role": "tool",
                "content": json.dumps(result, ensure_ascii=False),
                "tool_call_id": tc.id,
            })
        # 3. Send the result back to the model for a final answer
        final = client.chat.completions.create(model="glm-5", messages=messages, tools=tools)
        print(f" GLM-5: {final.choices[0].message.content}")
    else:
        print(f" GLM-5: {assistant_msg.content}")
run_tool_call("What's the weather like in Tokyo right now?")
run_tool_call("What is 2^20 + 3^10 - 1024?")
run_tool_call("Compare the weather in San Francisco and London, and calculate the temperature difference.")
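Look at that last one! From a single natural-language request, the model can ask for get_weather twice (once for each city) in the same response. One caveat: run_tool_call handles only a single round of tool calls. If the model then wants calculate to find the difference, that request arrives in the final response and is never executed, so the model either does the subtraction itself or the printed content comes back empty. Chaining tools across rounds needs a loop, which is exactly what we build in the agentic loop section further down. Even so, this is the foundation of a truly helpful agent.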
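A quick word on the calculate tool: eval with empty builtins is fine for a demo, but it isn't a real sandbox. If you plan to let a model feed it arbitrary strings, one option is to parse the expression with Python's ast module and allow only arithmetic. The sketch below is one possible approach, not the method the original tool uses. The name safe_calculate and the exponent cap of 1000 are assumptions I picked for illustration.

```python
import ast
import operator

# Whitelist of arithmetic operators; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
    ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval_node(node):
    """Recursively evaluate a parsed expression made of numbers and arithmetic."""
    if isinstance(node, ast.Constant):
        if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
            return node.value
    elif isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        left, right = _eval_node(node.left), _eval_node(node.right)
        if isinstance(node.op, ast.Pow) and abs(right) > 1000:
            raise ValueError("exponent too large")  # arbitrary cap for the sketch
        return _BINARY_OPS[type(node.op)](left, right)
    elif isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError(f"unsupported expression element: {type(node).__name__}")


def safe_calculate(expression: str) -> dict:
    """Evaluate plain arithmetic and report problems as an error dict."""
    try:
        tree = ast.parse(expression, mode="eval")
        return {"expression": expression, "result": _eval_node(tree.body)}
    except (ValueError, SyntaxError, ZeroDivisionError, OverflowError) as exc:
        return {"expression": expression, "error": str(exc)}


print(safe_calculate("2**20 + 3**10 - 1024"))  # result: 1106601
print(safe_calculate("__import__('os').system('ls')"))  # rejected with an error
```

Note that Python reads ^ as bitwise XOR, not exponentiation, so a prompt like "2^20" only works when the model translates it to 2**20 before calling the tool. This whitelist rejects ^ outright, which turns a silently wrong answer into a visible error the model can react to.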
Getting Predictable: Structured JSON Output
Sometimes you don't want a chatty, human-like response. You want clean, structured data that your program can easily understand. You can instruct GLM-5 to respond in a specific JSON format. This is perfect for tasks like data extraction.
Let's ask it to pull key information from a block of text and format it as JSON.
print("\n" + "=" * 70)
print(" SECTION 7: Structured JSON Output")
print("=" * 70)
print("Force GLM-5 to return well-structured JSON for downstream processing.\n")
response = client.chat.completions.create(
model="glm-5",
messages=[
{
"role": "system",
"content": (
"You are a data extraction assistant. "
"Always respond with valid JSON only — no markdown, no explanation."
),
},
{
"role": "user",
"content": (
'Extract structured data from this text:\n\n'
'"Acme Corp reported Q3 2025 revenue of $4.2B, up 18% YoY. '
'Net income was $890M. The company announced 3 new products '
'and plans to expand into 5 new markets by 2026. CEO Jane Smith '
'said she expects 25% growth next year."\n\n'
"Return JSON with keys: company, quarter, revenue, revenue_growth, "
"net_income, new_products, new_markets, ceo, growth_forecast"
),
},
],
max_tokens=512,
temperature=0.1,
# For stricter enforcement, you can add: response_format={'type': 'json_object'}
)
raw_output = response.choices[0].message.content
print(" Raw output:")
print(raw_output)
try:
    parsed = json.loads(raw_output.strip())
    print("\n Parsed JSON:")
    print(json.dumps(parsed, indent=2))
except json.JSONDecodeError as e:
    print(f"\n JSON parsing failed: {e}")
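With a strict system prompt and a low temperature, the output is usually clean JSON that you can hand straight to json.loads, with no messy string parsing. It isn't guaranteed, though: models sometimes wrap JSON in markdown fences or add a stray sentence, which is why the code above parses inside a try/except.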
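In practice I like a small parsing helper between the model and the rest of the program. The sketch below strips an optional markdown fence and checks that the keys you asked for are present. The name parse_model_json is hypothetical, and the sample input is hand-written, not real GLM-5 output.

```python
import json

FENCE = "`" * 3  # a markdown code fence, built this way to keep the snippet readable


def parse_model_json(raw: str, required_keys=()) -> dict:
    """Parse a model reply as a JSON object, tolerating a surrounding code fence."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line, including an optional language tag
        newline = text.find("\n")
        text = text[newline + 1:] if newline != -1 else ""
        # Drop the closing fence if there is one
        if text.rstrip().endswith(FENCE):
            text = text.rstrip()[: -len(FENCE)]
    data = json.loads(text)  # raises json.JSONDecodeError (a ValueError) on bad input
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in required_keys if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data


raw = FENCE + 'json\n{"company": "Acme Corp", "quarter": "Q3 2025"}\n' + FENCE
parsed = parse_model_json(raw, required_keys=["company", "quarter"])
print(parsed["company"], parsed["quarter"])
```

Since json.JSONDecodeError is a subclass of ValueError, callers can catch a single exception type and decide whether to retry the request or fall back.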
Putting It All Together: The Agentic Loop
We've seen all the individual pieces. Now, let's assemble them into a reusable agent that can use multiple tools over multiple turns to solve complex problems. We'll wrap our logic in a simple class.
We'll give our agent the weather and calculator tools from before, plus a couple of new ones: a time-checker and a unit converter.
print("\n" + "=" * 70)
print(" SECTION 8: Multi-Tool Agentic Loop")
print("=" * 70)
print("Build a complete agent that can use multiple tools across turns.\n")
# --- Define new tools ---
def get_current_time() -> dict:
    return {"datetime": datetime.now().isoformat()}
def unit_converter(value: float, from_unit: str, to_unit: str) -> dict:
    conversions = {
        ("celsius", "fahrenheit"): lambda v: v * 9 / 5 + 32,
        ("fahrenheit", "celsius"): lambda v: (v - 32) * 5 / 9,
    }
    key = (from_unit.lower(), to_unit.lower())
    if key in conversions:
        return {"result": round(conversions[key](value), 2)}
    # Name the unsupported pair so the model can see what went wrong
    return {"error": f"Conversion not supported: {from_unit} -> {to_unit}"}
# --- Combine all tools ---
extended_tools = tools + [
{"type": "function", "function": {"name": "get_current_time", "description": "Get the current date and time"}},
{"type": "function", "function": {"name": "unit_converter", "description": "Convert between units", "parameters": {
"type": "object", "properties": {
"value": {"type": "number"}, "from_unit": {"type": "string"}, "to_unit": {"type": "string"}
}, "required": ["value", "from_unit", "to_unit"]
}}},
]
extended_registry = {**TOOL_REGISTRY, "get_current_time": get_current_time, "unit_converter": unit_converter}
# --- The Agent Class ---
class GLM5Agent:
    def __init__(self, system_prompt: str, tools: list, tool_registry: dict):
        self.client = ZaiClient(api_key=API_KEY)
        self.messages = [{"role": "system", "content": system_prompt}]
        self.tools = tools
        self.registry = tool_registry
        self.max_iterations = 5
    def chat(self, user_input: str) -> str:
        self.messages.append({"role": "user", "content": user_input})
        # Keep calling the model until it answers without requesting a tool
        for iteration in range(self.max_iterations):
            response = self.client.chat.completions.create(
                model="glm-5", messages=self.messages, tools=self.tools, tool_choice="auto"
            )
            msg = response.choices[0].message
            self.messages.append(msg.model_dump())
            if not msg.tool_calls:
                return msg.content  # Final answer
            for tc in msg.tool_calls:
                fn_name = tc.function.name
                # Tools without parameters may arrive with an empty argument string
                fn_args = json.loads(tc.function.arguments or "{}")
                print(f" [{iteration+1}] Calling: {fn_name}({fn_args})")
                result = self.registry[fn_name](**fn_args)
                self.messages.append({
                    "role": "tool",
                    "content": json.dumps(result, ensure_ascii=False),
                    "tool_call_id": tc.id,
                })
        return "Agent reached max iterations."
# --- Run the Agent ---
agent = GLM5Agent(
system_prompt="You are a helpful assistant with access to weather, math, time, and unit conversion tools.",
tools=extended_tools,
tool_registry=extended_registry,
)
user_query = "What time is it? Also, if it's 28°C in Tokyo, what's that in Fahrenheit? And what's 2^16?"
print(f" User: {user_query}")
result = agent.chat(user_query)
print(f"\n Agent: {result}")
This is it! This is a real agent. When you run this, you'll see it call get_current_time, unit_converter, and calculate, either all in one round or spread across several iterations, depending on how the model plans the work. Once every tool result is in the message history, it synthesizes them into a single, coherent answer for the user.
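One gap worth closing before you rely on a loop like this: a tool name the registry doesn't know, malformed arguments, or an exception inside a tool will crash the whole agent. A common pattern is to turn those failures into an error result the model can read and react to. Here's a minimal sketch of that idea. The dispatch_tool function and the add demo tool are hypothetical, not part of the SDK.

```python
import json


def dispatch_tool(registry: dict, name: str, raw_arguments: str) -> dict:
    """Run a registered tool and convert every failure into an error dict."""
    func = registry.get(name)
    if func is None:
        return {"error": f"unknown tool: {name}"}
    try:
        # Treat an empty argument string as "no arguments"
        args = json.loads(raw_arguments) if raw_arguments else {}
    except json.JSONDecodeError as exc:
        return {"error": f"invalid JSON arguments: {exc}"}
    if not isinstance(args, dict):
        return {"error": "tool arguments must be a JSON object"}
    try:
        return func(**args)
    except Exception as exc:  # report the failure instead of crashing the loop
        return {"error": f"{type(exc).__name__}: {exc}"}


def add(a: float, b: float) -> dict:
    """Tiny demo tool (hypothetical)."""
    return {"result": a + b}


demo_registry = {"add": add}
print(dispatch_tool(demo_registry, "add", '{"a": 2, "b": 3}'))
print(dispatch_tool(demo_registry, "subtract", '{"a": 2, "b": 3}'))
print(dispatch_tool(demo_registry, "add", '{"a": 2}'))
```

Inside the agent you would call this helper where the registry lookup happens and append its return value as the tool message. Because the error is sent back as ordinary tool output, the model gets a chance to correct its arguments on the next iteration.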
A Final Touch: OpenAI SDK Compatibility
One last thing that makes trying GLM-5 so easy. If you already have code written for the OpenAI API, you can switch to GLM-5 with only a few small changes to how you create the client. It's designed to be a drop-in replacement.
You point the base_url at the Z.AI endpoint, pass your Z.AI API key, and set the model name to glm-5. That's it.
print("\n" + "=" * 70)
print(" SECTION 9: OpenAI SDK Compatibility")
print("=" * 70)
print("GLM-5 is fully compatible with the OpenAI Python SDK.")
print("Just change the base_url — your existing OpenAI code works as-is!\n")
from openai import OpenAI
openai_client = OpenAI(
api_key=API_KEY,
base_url="https://api.z.ai/api/paas/v4/",
)
completion = openai_client.chat.completions.create(
model="glm-5",
messages=[
{"role": "system", "content": "You are a writing assistant."},
{"role": "user", "content": "Write a 4-line poem about artificial intelligence discovering nature."},
],
max_tokens=256,
)
print(" GLM-5 (via OpenAI SDK):")
print(completion.choices[0].message.content)
This compatibility is a huge time-saver and makes it painless to experiment and compare models without having to rewrite everything.
So, What Did We Build?
Wow, we covered a lot of ground. We went from a simple chat to a full-fledged agent that can reason, use tools, and hold a conversation.
Just to recap, you've now seen how to use GLM-5 for:
- Basic and streaming chat
- Peeking into the AI's reasoning with Thinking Mode
- Maintaining context in multi-turn conversations
- Giving the AI a toolbox with Function Calling
- Forcing clean, structured JSON output
- Building a multi-tool agentic loop
These aren't just neat tricks; they are the fundamental building blocks for creating production-ready AI systems that can solve real-world problems. The journey from a simple chatbot to a capable agent is shorter than you might think. Now you've got the map. Happy building!




