Ever tried to load a massive CSV into a Pandas DataFrame and watched your laptop’s fan spin up like it’s preparing for takeoff? Yeah, me too. We’ve all been there, waiting… and waiting… for gigabytes of data to load into memory, only to have the kernel die. It’s a classic data science rite of passage, but honestly, it’s a pain.

For years, the standard workflow has been: pull data from a database, load it into local memory (hello, Pandas!), do your transformations, and then push it somewhere else. But what if we could skip the whole “pull data” step? What if we could write our familiar, clean Python code but have all the heavy lifting happen inside the database itself?

That’s exactly the magic we’re going to talk about today. We're going to use two amazing tools, Ibis and DuckDB, to build a feature engineering pipeline that feels like Pandas but runs with the speed and efficiency of pure SQL. Think of it this way: you’re the head chef writing a complex recipe in a language you love (Python), and you hand it to a super-fast kitchen (the database) that has all the ingredients right there. You don't have to lug all the raw materials back to your tiny apartment kitchen.

So, What Are We Working With?

Let me quickly introduce our two key players for this little project:

Ibis: This is the star of the show. Ibis is a Python library that gives you a Pandas-like API for data manipulation. The brilliant part is that it doesn't actually run any code itself. Instead, it translates your Python commands into SQL for a specific database backend. It's a universal translator for data work.
DuckDB: If you haven't heard of DuckDB, you're in for a treat. It's an incredibly fast, in-process analytical database. Think of it as SQLite but built for data analysis and speed. It’s perfect for running on your laptop.

By combining these two, we get the best of both worlds: a friendly Python interface and a lightning-fast SQL engine running underneath. Let’s get this set up.

Step 1: Getting Our Tools Ready

First things first, we need to install the libraries. Nothing fancy here, just a quick pip install in your terminal.

pip -q install "ibis-framework[duckdb,examples]" duckdb pyarrow pandas

Once that’s done, we’ll fire up Python and import Ibis. We then connect to DuckDB. This command creates a new, temporary in-memory database for us to play in.

import ibis
from ibis import _

print("Ibis version:", ibis.__version__)

# This creates our super-fast, in-memory database connection
con = ibis.duckdb.connect()

# This tells Ibis to show us previews of our expressions
ibis.options.interactive = True

See that ibis.options.interactive = True line? That’s a nice little helper that will show us the output of our expressions as we build them, without actually executing the full query. It's great for debugging and seeing what you're doing.

Step 2: Let's Get Some Data (the Smart Way)

Now, we need some data. Ibis comes with a few example datasets, so we’ll grab the classic "penguins" dataset. But here’s the crucial part: we’re not just loading it into a Pandas DataFrame. We are going to load it and immediately register it as a table inside our DuckDB database.

# Try to fetch the data and point our backend to it
try:
    base_expr = ibis.examples.penguins.fetch(backend=con)
except TypeError:
    base_expr = ibis.examples.penguins.fetch()

# Make sure the 'penguins' table exists inside DuckDB
if "penguins" not in con.list_tables():
    try:
        con.create_table("penguins", base_expr, overwrite=True)
    except Exception:
        con.create_table("penguins", base_expr.execute(), overwrite=True)

# Now, let's create a pointer to that table
t = con.table("penguins")

print(t.schema())

This is a really important concept to grasp. The variable t is not the data itself. It doesn't contain any penguins. It's just a "lazy" expression—a pointer or a reference to the penguins table that lives inside DuckDB. Our laptop's memory is still free and clear. We can now tell Ibis what to do with t, and it will build a query plan without ever touching the raw data.

Step 3: Building Our Feature Pipeline in Pure Python

Alright, this is where the fun begins. We’re going to define a single Python function that takes our Ibis table object and performs a whole series of transformations to create new features for a potential machine learning model.

This might look like a lot of code at first, but if you’ve used Pandas or dplyr, it will feel very familiar. Let’s break it down.

def penguin_feature_pipeline(penguins):
    # --- Part 1: Basic cleaning and feature creation ---
    base = penguins.mutate(
        bill_ratio=_.bill_length_mm / _.bill_depth_mm,
        is_male=(_.sex == "male").ifelse(1, 0),
    )

    cleaned = base.filter(
        _.bill_length_mm.notnull() & _.bill_depth_mm.notnull() &
        _.body_mass_g.notnull() & _.flipper_length_mm.notnull() &
        _.species.notnull() & _.island.notnull() & _.year.notnull()
    )

    # --- Part 2: Defining our "windows" for complex calculations ---
    # A window over each species
    w_species = ibis.window(group_by=[cleaned.species])
    
    # A rolling 3-year window for each island
    w_island_year = ibis.window(
        group_by=[cleaned.island],
        order_by=[cleaned.year],
        preceding=2,
        following=0,
    )

    # --- Part 3: Creating advanced features with our windows ---
    feat = cleaned.mutate(
        # How does this penguin's mass compare to its species' average?
        species_avg_mass=cleaned.body_mass_g.mean().over(w_species),
        species_std_mass=cleaned.body_mass_g.std().over(w_species),
        mass_z=(
            (cleaned.body_mass_g - cleaned.body_mass_g.mean().over(w_species)) /
             cleaned.body_mass_g.std().over(w_species)
        ),

        # What's its mass rank on its island?
        island_mass_rank=cleaned.body_mass_g.rank().over(
            ibis.window(group_by=[cleaned.island])
        ),
        
        # What's the rolling 3-year average mass on its island?
        rolling_3yr_island_avg_mass=cleaned.body_mass_g.mean().over(
            w_island_year
        ),
    )

    # --- Part 4: The final aggregation ---
    return feat.group_by(["species", "island", "year"]).agg(
        n=feat.count(),
        avg_mass=feat.body_mass_g.mean(),
        avg_flipper=feat.flipper_length_mm.mean(),
        avg_bill_ratio=feat.bill_ratio.mean(),
        avg_mass_z=feat.mass_z.mean(),
        avg_rolling_3yr_mass=feat.rolling_3yr_island_avg_mass.mean(),
        pct_male=feat.is_male.mean(),
    ).order_by(["species", "island", "year"])

Let's quickly walk through what's happening here:

Mutate & Clean: We start by creating a couple of simple new columns (bill_ratio, is_male) and then filter out any rows with missing values. Standard stuff.
Window Functions: This is the really powerful part. A window function lets you perform calculations across a set of rows that are related to the current row. We defined one window for each species and another for a rolling 3-year period on each island. It's like telling the database, "For this penguin, look at all the other penguins of the same species" or "look at the data from the last two years on this island."
Advanced Features: Using these windows, we calculate some cool features like a Z-score for body mass (how many standard deviations from the species average is this penguin?) and a rolling average mass. This is complex stuff that would be a pain to write in SQL by hand.
Final Aggregation: Finally, we group everything by species, island, and year and calculate some summary statistics. This is our final, feature-engineered dataset.

The most important thing to remember? We haven't actually run anything yet. We’ve just defined a recipe.

Step 4: The Big Reveal - Python to SQL

Now, let's apply our pipeline function to our table pointer t.

features = penguin_feature_pipeline(t)

Again, features is still just a lazy expression. It’s a set of instructions. But now, we can ask Ibis to show us the SQL it plans to run on DuckDB.

print(con.compile(features))

Go ahead and run that. You’ll see a big, complex, but highly optimized SQL query with window functions (OVER (PARTITION BY ...)), common table expressions (WITH), and everything else. We just wrote all of that without writing a single line of SQL! This, right here, is the "aha!" moment. Ibis did all the hard translation work for us.

Now, let's actually execute the query and pull the final, aggregated results into a Pandas DataFrame.

# This is the moment the query runs and we get data back
try:
    df = features.to_pandas()
except Exception:
    df = features.execute()

display(df.head())

Notice that we’re only pulling back the small, aggregated summary table. We never loaded the entire raw penguin dataset into our Python session. All that heavy lifting, all those window function calculations, happened inside DuckDB at native speed. Our memory usage stayed tiny.

Step 5: Saving Our Work and Handing It Off

We're not done yet. Since our results are useful, let's save them. We can materialize them as a new table directly within DuckDB.

# Create a new table in DuckDB from our feature pipeline
con.create_table("penguin_features", features, overwrite=True)

feat_tbl = con.table("penguin_features")

And just like before, we can also easily export this final table to a file format like Parquet, which is perfect for downstream machine learning tasks or other analyses.

out_path = "/content/penguin_features.parquet"
con.raw_sql(f"COPY penguin_features TO '{out_path}' (FORMAT PARQUET);")

print(f"Features exported to: {out_path}")

And there you have it. We went from raw data to a clean, feature-rich Parquet file, and the entire transformation process happened inside the database.

Why This Changes Everything

This approach of keeping the computation close to the data is a game-changer, especially as datasets get bigger. You don't need a monster machine with 128GB of RAM to process a 50GB file anymore. You can use a fast database engine (even a local one like DuckDB) to do the heavy lifting and only pull the results you need into your Python environment.

Plus, here's the kicker: because we wrote our logic in Ibis, it's portable. If we decide to move our data from DuckDB to a production system like Snowflake, BigQuery, or PostgreSQL, we don't have to change our Python code. We just swap out the connection object, and Ibis will automatically generate the correct SQL dialect for the new backend.

So next time you find yourself staring at a loading bar, remember there’s a better way. Let your database do the work it was designed for, and keep your Python code clean, readable, and focused on the logic, not the logistics.

Stop Moving Your Data: Build In-Database Feature Pipelines with Ibis and DuckDB

So, What Are We Working With?

Step 1: Getting Our Tools Ready

Step 2: Let's Get Some Data (the Smart Way)

Step 3: Building Our Feature Pipeline in Pure Python

Step 4: The Big Reveal - Python to SQL

Step 5: Saving Our Work and Handing It Off

Why This Changes Everything

Tags

Source

Stay Updated

Related Articles

Pandas Choking on Big Data? Here’s How to Build a Scalable ML Pipeline with Vaex

How to Analyze the Huge TaskTrove Dataset Without Downloading It

Tired of Messy ML Experiments? Let's Tame the Chaos with Hydra

Stop Moving Your Data: Build In-Database Feature Pipelines with Ibis and DuckDB

So, What Are We Working With?

Step 1: Getting Our Tools Ready

Step 2: Let's Get Some Data (the Smart Way)

Step 3: Building Our Feature Pipeline in Pure Python

Step 4: The Big Reveal - Python to SQL

Step 5: Saving Our Work and Handing It Off

Why This Changes Everything

Tags

Source

Stay Updated

Related Articles

Pandas Choking on Big Data? Here’s How to Build a Scalable ML Pipeline with Vaex

How to Analyze the Huge TaskTrove Dataset Without Downloading It

Tired of Messy ML Experiments? Let's Tame the Chaos with Hydra

Cookie Settings