Let’s be honest for a second. How much time have you wasted creating mock data for your tests or prototypes?
You know the drill. You create a new feature, define a Pydantic model or a dataclass, and then you have to write a test for it. So you sit there, manually typing out fake names, random UUIDs, and plausible-looking email addresses. It’s tedious, boring, and every time your data model changes, you have to go back and update all that mock data.
It’s a huge time sink, and frankly, it’s not the most exciting part of our job.
What if you could just point a tool at your data models and say, "Hey, give me 100 examples of this"? That's pretty much the magic of Polyfactory. It’s a Python library that reads your type hints and automatically generates data to match them, nested objects and all, and it gives you the hooks to make that data as realistic as you need.
I've been using it a lot lately, and it's completely changed my testing workflow. So, I wanted to walk you through how to use it, from the simple stuff to the more advanced tricks that make it so powerful. Think of this as a friendly chat where we build something cool together.
First Things First: Getting Set Up
Before we can start generating data, we need to get our environment ready. It's just a few packages to install. We'll need polyfactory itself, plus a few other libraries it works with: pydantic (along with email-validator, which Pydantic needs for email fields), faker, msgspec, and attrs.
You can just run this little script to get everything installed quietly.
import subprocess
import sys

def install_package(package):
    # Use the current interpreter's pip so packages land in the active environment
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])

packages = [
    "polyfactory",
    "pydantic",
    "email-validator",
    "faker",
    "msgspec",
    "attrs",
]

for package in packages:
    try:
        install_package(package)
        print(f"✓ Installed {package}")
    except Exception as e:
        print(f"✗ Failed to install {package}: {e}")
Once that's done, we're ready to dive in.
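If you'd like to confirm the installs actually landed, a quick check along these lines can help. It's only a sketch: installed_versions is a helper name I made up for this post, and it leans on the standard library's importlib.metadata rather than anything Polyfactory provides.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(["polyfactory", "pydantic", "faker", "attrs"]).items():
    print(f"{name}: {version or 'not installed'}")
```

Anything that shows up as "not installed" is worth fixing before moving on.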
The Simplest Magic Trick: Basic Dataclass Factories
Let's start with a classic example: a Person with an Address. In the old days, you'd have to manually create an Address object and then a Person object, filling in every single field.
Watch how Polyfactory handles it. First, we define our dataclasses.
from dataclasses import dataclass
from typing import List, Optional
from datetime import datetime, date
from uuid import UUID
from polyfactory.factories import DataclassFactory

@dataclass
class Address:
    street: str
    city: str
    country: str
    zip_code: str

@dataclass
class Person:
    id: UUID
    name: str
    email: str
    age: int
    birth_date: date
    is_active: bool
    address: Address  # Notice this is a nested dataclass
    phone_numbers: List[str]
    bio: Optional[str] = None
Now, here’s the fun part. To generate mock data, we just create a factory that inherits from DataclassFactory and tell it which model to use.
class PersonFactory(DataclassFactory[Person]):
    pass

# Generate a single person
person = PersonFactory.build()
print("Generated Person:")
print(f"  ID: {person.id}")
print(f"  Name: {person.name}")
print(f"  Address: {person.address.city}, {person.address.country}")
print()

# Generate a batch of 5 people
people = PersonFactory.batch(5)
print(f"Generated {len(people)} people:")
for i, p in enumerate(people, 1):
    print(f"  {i}. {p.name} - {p.email}")
Look at that! With nothing but pass in the body of our factory, we generated a complete Person object. Polyfactory saw the type hints (UUID, str, int, etc.) and automatically filled each field with a value of the right type. It even saw the nested Address dataclass and created a whole Address object for it.
This is the core power of Polyfactory. It takes the "thinking" out of creating basic test data.
Making Your Fake Data More Realistic
Random data is great, but sometimes you need it to look a little more... real. You don't want a name like "aXyZ" or an email like "foo@bar". You want things that look like they could actually exist.
This is where we can start customizing our factory. Let's imagine we're creating an Employee model. We want realistic names, company emails, and specific departments.
We can do this by adding methods to our factory that match the field names. Polyfactory will see a method named full_name and use it to generate the value for the full_name field. We can also integrate the amazing Faker library to generate high-quality fake data.
from faker import Faker

@dataclass
class Employee:
    employee_id: str
    full_name: str
    department: str
    salary: float
    hire_date: date
    is_manager: bool
    email: str

class EmployeeFactory(DataclassFactory[Employee]):
    # Use Faker for realistic data
    __faker__ = Faker(locale="en_US")
    # Set a random seed for reproducible results
    __random_seed__ = 42

    @classmethod
    def employee_id(cls) -> str:
        return f"EMP-{cls.__random__.randint(10000, 99999)}"

    @classmethod
    def full_name(cls) -> str:
        return cls.__faker__.name()

    @classmethod
    def department(cls) -> str:
        departments = ["Engineering", "Marketing", "Sales", "HR", "Finance"]
        return cls.__random__.choice(departments)

    @classmethod
    def salary(cls) -> float:
        return round(cls.__random__.uniform(50000, 150000), 2)

    @classmethod
    def email(cls) -> str:
        # Faker can even generate company-specific emails!
        return cls.__faker__.company_email()

employees = EmployeeFactory.batch(3)
for emp in employees:
    print(f"  {emp.employee_id}: {emp.full_name}")
    print(f"    Department: {emp.department}")
    print(f"    Salary: ${emp.salary:,.2f}")
    print()
See the difference? Now our data isn't just random; it's contextually aware. We're getting real-looking names, salaries within a specific range, and departments from a predefined list. Setting __random_seed__ is also a neat trick for making your tests reproducible: as long as the factory calls happen in the same order, you'll get the same "random" data every time you run the code.
What About Calculated Fields?
Okay, things are getting interesting. But what about fields that depend on other fields? Think about a Product model. It has a price and a discount_percentage, but what you really care about is the final_price. You don't want to set that manually; it should be calculated.
Polyfactory has a clever way to handle this. You can override the build method to add custom logic that runs after the initial object has been created.
Here’s how we can model a Product with a calculated final_price and sku.
@dataclass
class Product:
    product_id: str
    name: str
    price: float
    discount_percentage: float
    stock_quantity: int
    # These will be calculated
    final_price: Optional[float] = None
    sku: Optional[str] = None

class ProductFactory(DataclassFactory[Product]):
    # (We'll skip the individual field generators for brevity, but they're similar to the Employee example)
    @classmethod
    def price(cls) -> float:
        return round(cls.__random__.uniform(10.0, 1000.0), 2)

    @classmethod
    def discount_percentage(cls) -> float:
        return round(cls.__random__.uniform(0, 30), 2)

    # This is the magic part!
    @classmethod
    def build(cls, **kwargs):
        # First, let Polyfactory do its thing
        instance = super().build(**kwargs)
        # Now, we add our custom logic. By default Polyfactory may fill an
        # Optional field with a random value instead of None, so we recompute
        # these fields unless the caller passed them in explicitly.
        if "final_price" not in kwargs:
            instance.final_price = round(
                instance.price * (1 - instance.discount_percentage / 100), 2
            )
        if "sku" not in kwargs:
            name_part = instance.name.replace(" ", "-").upper()[:10]
            instance.sku = f"PROD-123-{name_part}"
        return instance

product = ProductFactory.build()
print("Generated Product:")
print(f"  Name: {product.name}")
print(f"  Price: ${product.price:.2f} with a {product.discount_percentage:.2f}% discount")
print(f"  Final Price: ${product.final_price:.2f}")
print(f"  SKU: {product.sku}")
This is a huge deal. It means your mock data can follow the same rules your real data does. The derived fields always agree with the fields they're computed from, so your tests don't get tripped up by fixtures that contradict themselves.
Tackling Complex, Nested Data Structures
Real-world data is rarely flat. It’s messy and nested. You have orders that contain lists of items, and maybe those orders have shipping information attached. Building this by hand is a nightmare.
Let's model an Order system. An Order has a list of OrderItem objects and optional ShippingInfo. We can create separate factories for each part and then compose them together.
from enum import Enum

class OrderStatus(str, Enum):
    PENDING = "pending"
    SHIPPED = "shipped"
    DELIVERED = "delivered"

@dataclass
class OrderItem:
    product_name: str
    quantity: int
    unit_price: float
    total_price: Optional[float] = None  # Calculated

@dataclass
class ShippingInfo:
    carrier: str
    tracking_number: str

@dataclass
class Order:
    order_id: str
    customer_name: str
    items: List[OrderItem]
    status: OrderStatus
    shipping_info: Optional[ShippingInfo] = None
    total_amount: Optional[float] = None  # Calculated
Now we create a factory for each dataclass, and the OrderFactory will use the OrderItemFactory to build its list of items.
class OrderItemFactory(DataclassFactory[OrderItem]):
    # ... (custom logic for quantity, price, etc.)
    @classmethod
    def build(cls, **kwargs):
        instance = super().build(**kwargs)
        instance.total_price = round(instance.quantity * instance.unit_price, 2)
        return instance

class ShippingInfoFactory(DataclassFactory[ShippingInfo]):
    pass

class OrderFactory(DataclassFactory[Order]):
    @classmethod
    def items(cls) -> List[OrderItem]:
        # Use another factory to generate a batch of items!
        return OrderItemFactory.batch(cls.__random__.randint(1, 5))

    @classmethod
    def build(cls, **kwargs):
        instance = super().build(**kwargs)
        # Calculate the total amount from the generated items
        instance.total_amount = round(sum(item.total_price for item in instance.items), 2)
        # Conditionally add shipping info
        if instance.status == OrderStatus.SHIPPED and instance.shipping_info is None:
            instance.shipping_info = ShippingInfoFactory.build()
        return instance

# Let's generate a complete, complex order
order = OrderFactory.build(status=OrderStatus.SHIPPED)
print(f"Generated Order {order.order_id}:")
print(f"  Status: {order.status.value}")
print(f"  Total: ${order.total_amount:.2f}")
for item in order.items:
    print(f"  - {item.quantity}x {item.product_name}")
if order.shipping_info:
    print(f"  Shipping via: {order.shipping_info.carrier}")
This is where Polyfactory really shines. We're building complex, nested objects with internal consistency and business logic, all with clean, readable factories. We can even pass in specific values when we build, like status=OrderStatus.SHIPPED, and let the factory handle the rest.
It's Not Just for Dataclasses! (Attrs and Pydantic Support)
So far we've only used dataclasses, but what if your team prefers attrs or relies heavily on Pydantic for data validation? No problem. Polyfactory has dedicated factories for them, and they work almost exactly the same way.
Here’s a quick example with an attrs-based class for a BlogPost.
import attrs
from polyfactory.factories.attrs_factory import AttrsFactory

@attrs.define
class BlogPost:
    title: str
    author: str
    content: str
    views: int = 0
    tags: List[str] = attrs.field(factory=list)

class BlogPostFactory(AttrsFactory[BlogPost]):
    # We can add custom generators just like before
    @classmethod
    def title(cls) -> str:
        return "A Very Interesting Blog Post"

    @classmethod
    def content(cls) -> str:
        return "Here is some insightful content..."

post = BlogPostFactory.build()
print(f"Generated Blog Post: '{post.title}' by {post.author}")
The same principles apply to Pydantic's ModelFactory. This flexibility means you can adopt Polyfactory without having to refactor your existing data models. It just fits right into your stack.
Sometimes You Need to Be Specific
While random data is great for general-purpose testing, sometimes you need to test a very specific edge case. You might need a user with a specific name or an order with a known total.
Polyfactory makes this incredibly easy. You can pass any values directly into the .build() or .batch() methods, and they will override the randomly generated ones.
# Use the PersonFactory from our first example
custom_person = PersonFactory.build(
    name="Alice Johnson",
    age=30,
)
print(f"Custom Person: {custom_person.name}, Age: {custom_person.age}")
print(f"ID (still random): {custom_person.id}")
print()

# You can even do it for batches
vip_customers = PersonFactory.batch(3, bio="VIP Customer")
for customer in vip_customers:
    print(f"  {customer.name} is a {customer.bio}")
This gives you the best of both worlds: you get randomness where you don't care and precise control where you do. It's perfect for creating a mix of general and specific test cases without writing a ton of extra code.
Final Thoughts
Look, at the end of the day, tools like Polyfactory are about more than just generating data. They're about removing friction from the development process. They let you spend less time on tedious boilerplate and more time solving actual problems.
By letting your type hints drive your test data generation, you create a system that's easier to maintain and less prone to breaking when you refactor. It encourages you to build realistic, complex scenarios for your tests, which ultimately leads to more reliable software.
If you're not already using a data factory library in your Python projects, I genuinely encourage you to give Polyfactory a try. It might just become one of your favorite tools.




