Aicosoft - AI & Technology News, Insights & Innovation

Let’s be honest for a second. How much time have you wasted creating mock data for your tests or prototypes?

You know the drill. You create a new feature, define a Pydantic model or a dataclass, and then you have to write a test for it. So you sit there, manually typing out fake names, random UUIDs, and plausible-looking email addresses. It’s tedious, boring, and every time your data model changes, you have to go back and update all that mock data.

It’s a huge time sink, and frankly, it’s not the most exciting part of our job.

What if you could just point a tool at your data models and say, "Hey, give me 100 realistic-looking examples of this"? That's pretty much the magic of Polyfactory. It’s a Python library that reads your type hints and automatically generates rich, complex, and realistic data for you.

I've been using it a lot lately, and it's completely changed my testing workflow. So, I wanted to walk you through how to use it, from the simple stuff to the more advanced tricks that make it so powerful. Think of this as a friendly chat where we build something cool together.

First Things First: Getting Set Up

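Before we can start generating data, we need to get our environment ready. It's just a few packages to install. We'll need polyfactory itself, plus a few other libraries it integrates with, like pydantic, faker, and attrs. The list below also includes email-validator, which Pydantic needs for validating email fields, and msgspec, another model library that Polyfactory supports.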

You can just run this little script to get everything installed quietly.

import subprocess
import sys

def install_package(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])

packages = [
    "polyfactory",
    "pydantic",
    "email-validator",
    "faker",
    "msgspec",
    "attrs"
]

for package in packages:
    try:
        install_package(package)
        print(f"✓ Installed {package}")
    except Exception as e:
        print(f"✗ Failed to install {package}: {e}")

Once that's done, we're ready to dive in.
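If you'd like to double-check the result, here's a small sketch that uses only the standard library. The helper name missing_modules is my own, and the import names are assumptions based on how each package is usually imported (email-validator, for instance, is imported as email_validator).

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in module_names if find_spec(name) is None]


# Import names, which can differ from the pip package names
expected = ["polyfactory", "pydantic", "email_validator", "faker", "msgspec", "attrs"]
missing = missing_modules(expected)
print("Missing:", ", ".join(missing) if missing else "nothing")
```

An empty result means every module can be located. Anything listed is worth reinstalling before you move on.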

The Simplest Magic Trick: Basic Dataclass Factories

Let's start with a classic example: a Person with an Address. In the old days, you'd have to manually create an Address object and then a Person object, filling in every single field.

Watch how Polyfactory handles it. First, we define our dataclasses.

from dataclasses import dataclass
from typing import List, Optional
from datetime import date
from uuid import UUID
from polyfactory.factories import DataclassFactory

@dataclass
class Address:
    street: str
    city: str
    country: str
    zip_code: str

@dataclass
class Person:
    id: UUID
    name: str
    email: str
    age: int
    birth_date: date
    is_active: bool
    address: Address  # Notice this is a nested dataclass
    phone_numbers: List[str]
    bio: Optional[str] = None

Now, here’s the fun part. To generate mock data, we just create a factory that inherits from DataclassFactory and tell it which model to use.

class PersonFactory(DataclassFactory[Person]):
    pass

# Generate a single person
person = PersonFactory.build()
print("Generated Person:")
print(f" ID: {person.id}")
print(f" Name: {person.name}")
print(f" Address: {person.address.city}, {person.address.country}")
print()

# Generate a batch of 5 people
people = PersonFactory.batch(5)
print(f"Generated {len(people)} people:")
for i, p in enumerate(people, 1):
    print(f" {i}. {p.name} - {p.email}")

Look at that! With just one line of code in our factory (pass), we generated a complete Person object. Polyfactory saw the type hints (UUID, str, int, etc.) and automatically filled them with plausible data. It even saw the nested Address dataclass and created a whole Address object for it.

This is the core power of Polyfactory. It takes the "thinking" out of creating basic test data.
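To make the idea of "type hints driving the data" a bit less magical, here's a toy sketch of the general technique using only the standard library. To be clear, this is not how Polyfactory is implemented. The toy_build function and the ToyPerson and ToyAddress classes are made up for illustration, and the sketch handles just a handful of types.

```python
import random
import string
import uuid
from dataclasses import dataclass, fields, is_dataclass
from typing import get_type_hints


def toy_build(model, rng):
    """Build `model` by picking a random value for each field based on its type hint."""
    hints = get_type_hints(model)
    values = {}
    for field in fields(model):
        hint = hints[field.name]
        if is_dataclass(hint):
            # Nested dataclass: recurse to build the inner object
            values[field.name] = toy_build(hint, rng)
        elif hint is int:
            values[field.name] = rng.randint(0, 100)
        elif hint is str:
            values[field.name] = "".join(rng.choices(string.ascii_lowercase, k=8))
        elif hint is bool:
            values[field.name] = rng.random() < 0.5
        elif hint is uuid.UUID:
            values[field.name] = uuid.UUID(int=rng.getrandbits(128))
        else:
            raise TypeError(f"no toy generator for {hint!r}")
    return model(**values)


@dataclass
class ToyAddress:
    city: str
    zip_code: str


@dataclass
class ToyPerson:
    id: uuid.UUID
    name: str
    age: int
    is_active: bool
    address: ToyAddress


toy_person = toy_build(ToyPerson, random.Random(0))
print(toy_person)
```

A real library has to cover far more ground (collections, optionals, enums, constraints), which is exactly the work you're handing off when you use a factory.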

Making Your Fake Data More Realistic

Random data is great, but sometimes you need it to look a little more... real. You don't want a name like "aXyZ" or an email like "foo@bar". You want things that look like they could actually exist.

This is where we can start customizing our factory. Let's imagine we're creating an Employee model. We want realistic names, company emails, and specific departments.

We can do this by adding class methods to our factory whose names match the field names. When Polyfactory finds a class method named full_name, it calls that method to produce the value for the full_name field instead of generating a random one. We can also plug in the Faker library to generate high-quality fake data.

from faker import Faker

@dataclass
class Employee:
    employee_id: str
    full_name: str
    department: str
    salary: float
    hire_date: date
    is_manager: bool
    email: str

class EmployeeFactory(DataclassFactory[Employee]):
    # Use Faker for realistic data
    __faker__ = Faker(locale="en_US")
    # Set a random seed for reproducible results
    __random_seed__ = 42

    @classmethod
    def employee_id(cls) -> str:
        return f"EMP-{cls.__random__.randint(10000, 99999)}"

    @classmethod
    def full_name(cls) -> str:
        return cls.__faker__.name()

    @classmethod
    def department(cls) -> str:
        departments = ["Engineering", "Marketing", "Sales", "HR", "Finance"]
        return cls.__random__.choice(departments)

    @classmethod
    def salary(cls) -> float:
        return round(cls.__random__.uniform(50000, 150000), 2)
    
    @classmethod
    def email(cls) -> str:
        # Faker can even generate company-specific emails!
        return cls.__faker__.company_email()

employees = EmployeeFactory.batch(3)
for emp in employees:
    print(f" {emp.employee_id}: {emp.full_name}")
    print(f"  Department: {emp.department}")
    print(f"  Salary: ${emp.salary:,.2f}")
    print()

See the difference? Now our data isn't just random; it's contextually aware. We're getting real-looking names, salaries within a specific range, and departments from a predefined list. Setting __random_seed__ is also a neat trick for making your tests reproducible—you'll get the same "random" data every time you run the code.

What About Calculated Fields?

Okay, things are getting interesting. But what about fields that depend on other fields? Think about a Product model. It has a price and a discount_percentage, but what you really care about is the final_price. You don't want to set that manually; it should be calculated.

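One straightforward way to handle this is to override the factory's build method and add custom logic that runs after the initial object has been created. (Polyfactory also provides a PostGenerated field for values that are derived from other fields.)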

Here’s how we can model a Product with a calculated final_price and sku.

@dataclass
class Product:
    product_id: str
    name: str
    price: float
    discount_percentage: float
    stock_quantity: int
    # These will be calculated
    final_price: Optional[float] = None
    sku: Optional[str] = None

class ProductFactory(DataclassFactory[Product]):
    # (We'll skip the individual field generators for brevity, but they're similar to the Employee example)
    @classmethod
    def price(cls) -> float:
        return round(cls.__random__.uniform(10.0, 1000.0), 2)
    
    @classmethod
    def discount_percentage(cls) -> float:
        return round(cls.__random__.uniform(0, 30), 2)

    # This is the magic part!
    @classmethod
    def build(cls, **kwargs):
        # First, let Polyfactory do its thing
        instance = super().build(**kwargs)

        # Polyfactory may fill Optional fields with a random value instead of
        # None, so recompute the derived fields unless the caller passed them in
        if "final_price" not in kwargs:
            instance.final_price = round(
                instance.price * (1 - instance.discount_percentage / 100), 2
            )

        if "sku" not in kwargs:
            name_part = instance.name.replace(" ", "-").upper()[:10]
            instance.sku = f"PROD-123-{name_part}"

        return instance

product = ProductFactory.build()
print("Generated Product:")
print(f" Name: {product.name}")
print(f" Price: ${product.price:.2f} with a {product.discount_percentage:.2f}% discount")
print(f" Final Price: ${product.final_price:.2f}")
print(f" SKU: {product.sku}")

This is a huge deal. It means your mock data can follow the same rules as your real data: the final price always matches the price and the discount. Keeping that logic in the factory gives your tests one place to get consistent objects from, which makes them much more robust.

Tackling Complex, Nested Data Structures

Real-world data is rarely flat. It’s messy and nested. You have orders that contain lists of items, and maybe those orders have shipping information attached. Building this by hand is a nightmare.

Let's model an Order system. An Order has a list of OrderItem objects and optional ShippingInfo. We can create separate factories for each part and then compose them together.

from enum import Enum

class OrderStatus(str, Enum):
    PENDING = "pending"
    SHIPPED = "shipped"
    DELIVERED = "delivered"

@dataclass
class OrderItem:
    product_name: str
    quantity: int
    unit_price: float
    total_price: Optional[float] = None # Calculated

@dataclass
class ShippingInfo:
    carrier: str
    tracking_number: str

@dataclass
class Order:
    order_id: str
    customer_name: str
    items: List[OrderItem]
    status: OrderStatus
    shipping_info: Optional[ShippingInfo] = None
    total_amount: Optional[float] = None # Calculated

Now we create a factory for each dataclass, and the OrderFactory will use the OrderItemFactory to build its list of items.

class OrderItemFactory(DataclassFactory[OrderItem]):
    # ... (custom logic for quantity, price, etc.)
    @classmethod
    def build(cls, **kwargs):
        instance = super().build(**kwargs)
        instance.total_price = round(instance.quantity * instance.unit_price, 2)
        return instance

class ShippingInfoFactory(DataclassFactory[ShippingInfo]):
    pass

class OrderFactory(DataclassFactory[Order]):
    @classmethod
    def items(cls) -> List[OrderItem]:
        # Use another factory to generate a batch of items!
        return OrderItemFactory.batch(cls.__random__.randint(1, 5))

    @classmethod
    def build(cls, **kwargs):
        instance = super().build(**kwargs)
        
        # Calculate the total amount from the generated items
        instance.total_amount = round(sum(item.total_price for item in instance.items), 2)
        
        # Conditionally add shipping info
        if instance.status == OrderStatus.SHIPPED and instance.shipping_info is None:
            instance.shipping_info = ShippingInfoFactory.build()
            
        return instance

# Let's generate a complete, complex order
order = OrderFactory.build(status=OrderStatus.SHIPPED)
print(f"Generated Order {order.order_id}:")
print(f" Status: {order.status.value}")
print(f" Total: ${order.total_amount:.2f}")
for item in order.items:
    print(f" - {item.quantity}x {item.product_name}")
if order.shipping_info:
    print(f" Shipping via: {order.shipping_info.carrier}")

This is where Polyfactory really shines. We're building complex, nested objects with internal consistency and business logic, all with clean, readable factories. We can even pass in specific values when we build, like status=OrderStatus.SHIPPED, and let the factory handle the rest.

It's Not Just for Dataclasses! (Attrs and Pydantic Support)

So far we've only used dataclasses, but what if your team prefers attrs or relies heavily on Pydantic for data validation? No problem. Polyfactory has dedicated factories for them, and they work almost exactly the same way.

Here’s a quick example with an attrs-based class for a BlogPost.

import attrs
from polyfactory.factories.attrs_factory import AttrsFactory

@attrs.define
class BlogPost:
    title: str
    author: str
    content: str
    views: int = 0
    tags: List[str] = attrs.field(factory=list)

class BlogPostFactory(AttrsFactory[BlogPost]):
    # We can add custom generators just like before
    @classmethod
    def title(cls) -> str:
        return "A Very Interesting Blog Post"

    @classmethod
    def content(cls) -> str:
        return "Here is some insightful content..."

post = BlogPostFactory.build()
print(f"Generated Blog Post: '{post.title}' by {post.author}")

The same principles apply to Pydantic's ModelFactory. This flexibility means you can adopt Polyfactory without having to refactor your existing data models. It just fits right into your stack.

Sometimes You Need to Be Specific

While random data is great for general-purpose testing, sometimes you need to test a very specific edge case. You might need a user with a specific name or an order with a known total.

Polyfactory makes this incredibly easy. You can pass any values directly into the .build() or .batch() methods, and they will override the randomly generated ones.

# Use the PersonFactory from our first example
custom_person = PersonFactory.build(
    name="Alice Johnson",
    age=30
)

print(f"Custom Person: {custom_person.name}, Age: {custom_person.age}")
print(f"ID (still random): {custom_person.id}")
print()

# You can even do it for batches
vip_customers = PersonFactory.batch(3, bio="VIP Customer")
for customer in vip_customers:
    print(f" {customer.name} is a {customer.bio}")

This gives you the best of both worlds: you get randomness where you don't care and precise control where you do. It's perfect for creating a mix of general and specific test cases without writing a ton of extra code.

Final Thoughts

Look, at the end of the day, tools like Polyfactory are about more than just generating data. They're about removing friction from the development process. They let you spend less time on tedious boilerplate and more time solving actual problems.

By letting your type hints drive your test data generation, you create a system that's easier to maintain and less prone to breaking when you refactor. It encourages you to build realistic, complex scenarios for your tests, which ultimately leads to more reliable software.

If you're not already using a data factory library in your Python projects, I genuinely encourage you to give Polyfactory a try. It might just become one of your favorite tools.

Stop Writing Mock Data By Hand: A Guide to Using Polyfactory in Python
