Pandas Choking on Big Data? Here’s How to Build a Scalable ML Pipeline with Vaex

Akram Chauhan

We’ve all been there. You get a new dataset, you’re excited to dig in, you fire up your Jupyter notebook, and type pd.read_csv(). And then you wait. And wait. And then… your kernel dies. Your laptop fan sounds like it’s about to achieve liftoff, all because your dataset has a few million rows and your machine is running out of memory.

It’s a frustratingly common roadblock for anyone working with data. For years, the answer was either "get a bigger machine" or "sample your data and hope for the best." But what if there was a different way?

That’s where Vaex comes in. It’s a Python library that looks and feels a lot like Pandas, but it has a superpower: lazy evaluation, backed by memory mapping. Think of it like this: instead of loading the entire multi-gigabyte file into your computer's memory, Vaex memory-maps columnar files (such as HDF5 or Arrow) and reads only the chunks a computation actually needs. You can define all sorts of complex calculations, and Vaex records them as expressions instead of running them. The heavy lifting only happens when you ask for a result, and then it runs as a pass over the data in chunks. That makes it practical to work with datasets that are bigger than your RAM.

Today, I want to walk you through how to build a complete, end-to-end machine learning pipeline using Vaex. We’re going to go from raw data to a trained model, handling millions of rows without breaking a sweat. Let’s get started.

Setting the Stage: Creating Our Playground

First things first, we need some data to play with. Instead of downloading a massive file, let's just create a realistic-looking dataset ourselves. We'll simulate a couple of million customer records with details like their city, age, income, and how long they've been a customer.

This isn't just a toy example. We're generating data that has the kind of complexity you'd see in the real world—different distributions, relationships between variables, and a target variable we want to predict.

import numpy as np
import vaex

# Let's create a 2-million row dataset
n = 2_000_000
rng = np.random.default_rng(7)

# Generate some realistic-looking features
cities = np.array(["Montreal","Toronto","Vancouver","Calgary","Ottawa","Edmonton","Quebec City","Winnipeg"])
city = rng.choice(cities, size=n, p=[0.16,0.18,0.12,0.10,0.10,0.10,0.10,0.14])
age = rng.integers(18, 75, size=n)
tenure_m = rng.integers(0, 180, size=n)
tx = rng.poisson(lam=22, size=n)
income = rng.lognormal(mean=10.6, sigma=0.45, size=n)
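# ... a bit more logic to make income and target realistic ...
# (score_latent, a per-customer latent score, is built in the elided logic above)
# Turn the score into a probability with a sigmoid centred on its 70th percentile
p = 1.0/(1.0 + np.exp(-(score_latent - np.quantile(score_latent, 0.70))))
# Draw the binary target from that probability
target = (rng.random(n) < p).astype("int8")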

# Here's the magic: create a Vaex DataFrame
df = vaex.from_arrays(
    city=city, 
    age=age, 
    tenure_m=tenure_m, 
    tx=tx, 
    income=income, 
    target=target
)
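
The snippet above skips the logic that builds score_latent, so here is a small, standard-library-only sketch of how such a target could be generated. The formula for score_latent below is purely hypothetical (it is not the one used for this article's dataset); the point is only to show what centring the sigmoid on the 70th percentile does: about 70% of customers end up with a probability below 0.5.

```python
import math
import random
import statistics

rng = random.Random(7)
n = 10_000  # small sample so the sketch runs instantly

# Hypothetical latent score: NOT the article's formula, just an illustration.
age = [rng.randint(18, 74) for _ in range(n)]
tenure_m = [rng.randint(0, 179) for _ in range(n)]
score_latent = [
    0.03 * a + 0.01 * t + rng.gauss(0.0, 1.0)
    for a, t in zip(age, tenure_m)
]

# 70th percentile of the score (7th of the 9 decile cut points).
q70 = statistics.quantiles(score_latent, n=10)[6]

# Sigmoid centred on q70: scores below q70 map to p < 0.5.
p = [1.0 / (1.0 + math.exp(-(s - q70))) for s in score_latent]

# Draw a 0/1 target from each probability.
target = [1 if rng.random() < pi else 0 for pi in p]

share_below_half = sum(pi < 0.5 for pi in p) / n
positive_rate = sum(target) / n
print(f"share with p < 0.5: {share_below_half:.2f}, positive rate: {positive_rate:.2f}")
```

Shifting the percentile up or down is a simple way to make the positive class rarer or more common.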

Now, here’s where Vaex really starts to shine. We want to create some new features—a process called feature engineering. In Pandas, every time you create a new column (df['new_col'] = ...), you're using up more memory. With Vaex, we define these new columns as "virtual columns" or expressions.

# Virtual columns: Vaex stores the expression, not a new array of values
df["income_k"] = df.income / 1000.0
df["tenure_y"] = df.tenure_m / 12.0
df["log_income"] = df.income.log1p()
df["tx_per_year"] = df.tx / (df.tenure_y + 0.25)
df["is_new"] = df.tenure_m < 6

Think of it like giving a chef a recipe. You haven't actually cooked anything yet, you've just written down the instructions. Vaex now knows how to calculate tenure_y if you ever ask for it, but it hasn't wasted any resources computing it for all 2 million rows yet. It's incredibly efficient.
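
If the recipe analogy feels abstract, here is a toy version in plain Python. It is a sketch of the idea only, not how Vaex is implemented: LazyFrame, add_virtual, and evaluate are made-up names. A virtual column is stored as a function, and nothing runs until a result is requested, at which point the data is processed one chunk at a time.

```python
from typing import Callable, Dict, Iterator, List

Chunk = Dict[str, List[float]]


class LazyFrame:
    """Toy table: real columns hold data, virtual columns hold only a recipe."""

    def __init__(self, **columns: List[float]) -> None:
        self.columns: Dict[str, List[float]] = dict(columns)
        self.virtual: Dict[str, Callable[[Chunk], List[float]]] = {}
        self.evaluations = 0  # how many chunks have actually been computed

    def add_virtual(self, name: str, recipe: Callable[[Chunk], List[float]]) -> None:
        # Store the recipe only; no values are computed here.
        self.virtual[name] = recipe

    def evaluate(self, name: str, chunk_size: int = 2) -> Iterator[List[float]]:
        # Compute the virtual column chunk by chunk, only when asked.
        n_rows = len(next(iter(self.columns.values())))
        for start in range(0, n_rows, chunk_size):
            chunk = {k: v[start:start + chunk_size] for k, v in self.columns.items()}
            self.evaluations += 1
            yield self.virtual[name](chunk)


frame = LazyFrame(tenure_m=[6, 24, 60, 120])
frame.add_virtual("tenure_y", lambda c: [m / 12.0 for m in c["tenure_m"]])
evaluations_after_define = frame.evaluations  # still 0: nothing has run

# Asking for a result triggers the computation, one chunk at a time.
total_tenure_y = sum(sum(chunk) for chunk in frame.evaluate("tenure_y"))
print(evaluations_after_define, frame.evaluations, total_tenure_y)
```

Defining the column costs nothing; the work only shows up in the evaluations counter once the sum is requested.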

Finding the Story: Aggregations and Group-Bys at Scale

Okay, we have our data and some basic features. Now let's start asking some interesting questions. For example, how does income or customer value differ by city? In Pandas, a groupby() on millions of rows can be slow. Vaex is built for this.

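Vaex computes these statistics in chunked, multithreaded passes over the data instead of building intermediate copies. That is also what lets the same code run out-of-core (meaning it doesn't need to load everything into RAM) when the DataFrame is backed by a memory-mapped file. We can quickly calculate things like the average income, 95th percentile income, and target rate for each city.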

import vaex.ml

# First, let's numerically encode our 'city' column
encoder = vaex.ml.LabelEncoder(features=["city"])
df = encoder.fit_transform(df)

# Now, let's calculate stats for each city.
# Explicit half-integer limits put every integer city code in its own bin.
n_cities = len(encoder.labels_["city"])
limits = [-0.5, n_cities - 0.5]
p95_income_k_by_city = df.percentile_approx("income_k", 95, binby="label_encoded_city", limits=limits, shape=n_cities)
avg_income_k_by_city = df.mean("income_k", binby="label_encoded_city", limits=limits, shape=n_cities)
target_rate_by_city = df.mean("target", binby="label_encoded_city", limits=limits, shape=n_cities)

# ... assemble these into a summary table ...

Each call returns an array with one value per city code, which we can assemble into a neat summary table with one row per city. This is powerful stuff: we've just summarized 2 million records with three aggregation calls.
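
To see what a binned aggregation is doing under the hood, here is a minimal pure-Python sketch. It is an illustration of the technique, not Vaex's implementation, and binned_mean is a hypothetical helper. Each integer city code is mapped to a bin, and a running sum and count are kept per bin in a single pass. The limits of (-0.5, n_cities - 0.5) are an assumption of this sketch: they place each integer code in the middle of its own bin.

```python
import math
from typing import List, Sequence, Tuple


def binned_mean(
    values: Sequence[float],
    codes: Sequence[int],
    limits: Tuple[float, float],
    shape: int,
) -> List[float]:
    """Mean of `values` per bin of `codes`, computed in one pass."""
    lo, hi = limits
    width = (hi - lo) / shape
    sums = [0.0] * shape
    counts = [0] * shape
    for value, code in zip(values, codes):
        idx = math.floor((code - lo) / width)
        if 0 <= idx < shape:  # rows outside the limits are skipped
            sums[idx] += value
            counts[idx] += 1
    # Empty bins have no mean, so report NaN for them.
    return [s / k if k else float("nan") for s, k in zip(sums, counts)]


income_k = [40.0, 60.0, 80.0, 100.0, 30.0]
city_code = [0, 0, 1, 1, 2]
n_cities = 3

avg_income_k_by_city = binned_mean(
    income_k, city_code, limits=(-0.5, n_cities - 0.5), shape=n_cities
)
print(avg_income_k_by_city)
```

Because only per-bin sums and counts are kept, memory use depends on the number of cities, not on the number of rows.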

But we can take this a step further. What if we want to create features that compare an individual to their city's average? For example, is a customer's income high for their city? This kind of contextual feature is often super predictive. We can easily join our city-level summary table back to our main DataFrame.

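# Join the city-level features back to the main dataframe
# (df_city_features is the per-city summary table assembled in the elided step above)
df = df.join(df_city_features, on="city", rsuffix="_city")

# Create new features comparing individuals to their city's stats
# (value_score and median_value_score come from feature logic not shown here)
df["income_vs_city_p95"] = df.income_k / (df.p95_income_k + 1e-9)  # epsilon avoids division by zero
df["value_vs_city_median"] = df.value_score - df.median_value_score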

Again, these are all lazy expressions. We're building up a complex, multi-layered feature engineering pipeline, and our memory usage has barely budged.
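
Here is the same "compare each customer to their city" pattern on a handful of rows, in plain Python so the mechanics are easy to follow. The rows and the feature names income_vs_city_avg and income_minus_city_median are made up for this sketch; the three steps (aggregate per group, join back on the key, derive a relative feature) are the part that carries over.

```python
from collections import defaultdict
from statistics import median

rows = [
    {"city": "Toronto", "income_k": 50.0},
    {"city": "Toronto", "income_k": 150.0},
    {"city": "Montreal", "income_k": 40.0},
    {"city": "Montreal", "income_k": 60.0},
    {"city": "Montreal", "income_k": 80.0},
]

# 1) Aggregate: one small stats record per city.
incomes_by_city = defaultdict(list)
for row in rows:
    incomes_by_city[row["city"]].append(row["income_k"])

city_stats = {
    city: {"avg_income_k": sum(v) / len(v), "median_income_k": median(v)}
    for city, v in incomes_by_city.items()
}

# 2) Join back on the key, then 3) derive features relative to the city.
for row in rows:
    stats = city_stats[row["city"]]
    # Small epsilon guards against a zero denominator.
    row["income_vs_city_avg"] = row["income_k"] / (stats["avg_income_k"] + 1e-9)
    row["income_minus_city_median"] = row["income_k"] - stats["median_income_k"]

for row in rows:
    print(row)
```

The second Toronto customer earns 1.5 times the city average, which is a very different signal from the raw income of 150.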

Getting Predictive: Training a Model with Scikit-Learn

Now for the fun part: building a machine learning model. You might be thinking, "Okay, Vaex is great for data prep, but do I have to learn a whole new modeling library?" The answer is a resounding no! Vaex is designed to play nicely with the tools you already know and love, like Scikit-learn.

First, we need to prepare our features for the model. A common step is to scale all our numeric features so they're on a similar playing field. Vaex has built-in transformers for this.

features_num = [
    "age","tenure_y","tx","income_k","log_income",
    # ... and all our other great features
]

# Scale the features (e.g., to have mean=0, std=1)
scaler = vaex.ml.StandardScaler(features=features_num, prefix="z_")
df = scaler.fit_transform(df)
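
For reference, standard scaling boils down to two numbers per feature. The sketch below uses only the standard library; fit_standard_scaler and apply_standard_scaler are hypothetical helper names, and using the population standard deviation is an assumption of this example.

```python
import statistics
from typing import List, Sequence, Tuple


def fit_standard_scaler(column: Sequence[float]) -> Tuple[float, float]:
    """Learn the mean and (population) standard deviation of one feature."""
    return statistics.fmean(column), statistics.pstdev(column)


def apply_standard_scaler(column: Sequence[float], mean: float, std: float) -> List[float]:
    """Rescale values to z-scores using previously learned parameters."""
    return [(x - mean) / std for x in column]


age_train = [20.0, 30.0, 40.0, 50.0]
mean, std = fit_standard_scaler(age_train)
z_train = apply_standard_scaler(age_train, mean, std)

# New rows are scaled with the SAME parameters; they are never refit.
age_new = [35.0, 60.0]
z_new = apply_standard_scaler(age_new, mean, std)
print(mean, round(std, 3), [round(z, 3) for z in z_new])
```

Keeping fit and apply separate is what makes it possible to store the parameters and reuse them on data the scaler has never seen.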

Next, we split our data into a training set and a testing set. Notice we’re not creating two massive new dataframes in memory. Vaex just creates two views of the original data.

# Split the data without copying it
df_train, df_test = df.split_random([0.80, 0.20], random_state=42)

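Finally, we can train our model. We'll use a simple Logistic Regression from Scikit-learn. Vaex provides a handy wrapper, Predictor, that pulls the feature columns out of the Vaex DataFrame, fits the estimator, and adds the predictions back as a lazily evaluated column. Keep in mind that fitting materializes the selected feature columns in memory. If even that is too much, vaex.ml.sklearn also offers IncrementalPredictor, which feeds the data in batches to estimators that implement partial_fit, so the model never has to see the entire dataset at once.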

from sklearn.linear_model import LogisticRegression
from vaex.ml.sklearn import Predictor

# Define our model and features
features = ["z_"+f for f in features_num] + ["label_encoded_city"]
model = LogisticRegression(max_iter=250)

# Wrap it in a Vaex Predictor
vaex_model = Predictor(model=model, features=features, target="target", prediction_name="pred")

# Train the model
vaex_model.fit(df=df_train)

# Make predictions on the test set
df_test = vaex_model.transform(df_test)

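And just like that, we've trained a model on millions of rows. We can now evaluate its performance using standard metrics like AUC and Average Precision. Both are ranking metrics, so compute them from the model's predicted probabilities rather than from hard 0/1 class labels.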
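
To make the idea of chunked training concrete, here is a tiny logistic regression trained one batch at a time in pure Python. TinyLogistic and iter_chunks are hypothetical names written for this sketch, and the data is synthetic. The partial_fit method name simply mirrors the convention scikit-learn uses for incremental learning; this is not the code that runs inside Vaex or scikit-learn.

```python
import math
import random
from typing import Iterator, List, Sequence, Tuple

Row = Sequence[float]


def iter_chunks(X: List[Row], y: List[int], size: int) -> Iterator[Tuple[List[Row], List[int]]]:
    """Yield consecutive (features, labels) slices of at most `size` rows."""
    for start in range(0, len(X), size):
        yield X[start:start + size], y[start:start + size]


class TinyLogistic:
    """Logistic regression updated by one gradient step per chunk."""

    def __init__(self, n_features: int, lr: float = 0.5) -> None:
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, row: Row) -> float:
        z = self.b + sum(w * x for w, x in zip(self.w, row))
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, X_chunk: List[Row], y_chunk: List[int]) -> None:
        # Average log-loss gradient over this chunk only.
        n = len(X_chunk)
        grad_w = [0.0] * len(self.w)
        grad_b = 0.0
        for row, label in zip(X_chunk, y_chunk):
            err = self.predict_proba(row) - label
            grad_b += err
            for j, x in enumerate(row):
                grad_w[j] += err * x
        self.b -= self.lr * grad_b / n
        self.w = [w - self.lr * g / n for w, g in zip(self.w, grad_w)]


# Synthetic data: the label depends on the single feature plus noise.
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0)] for _ in range(1000)]
y = [1 if row[0] + 0.5 * rng.gauss(0.0, 1.0) > 0 else 0 for row in X]

model = TinyLogistic(n_features=1)
for _ in range(20):  # 20 passes over the data
    for X_chunk, y_chunk in iter_chunks(X, y, size=100):
        model.partial_fit(X_chunk, y_chunk)  # only 100 rows in play at a time

accuracy = sum((model.predict_proba(r) > 0.5) == bool(t) for r, t in zip(X, y)) / len(X)
print(f"weight={model.w[0]:.2f} accuracy={accuracy:.2f}")
```

At no point does the update step need more than one chunk of rows, which is the property that matters when the full feature matrix does not fit in RAM.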

Did It Actually Work? Checking the Lift

A model's AUC score is great, but in the business world, we often want to know something more practical: "If I target the top 10% of customers ranked by your model, how much better are they than average?" This is called a lift analysis.

We can do this easily by taking our model's predictions, bucketing them into deciles (0-10%, 10-20%, etc.), and then calculating the actual target rate within each bucket.

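A good model will have a very high rate in the top buckets and a very low rate in the bottom ones, which shows it is separating the "positives" from the "negatives." Dividing each bucket's rate by the overall target rate gives the lift: a lift of 3 in the top decile means those customers are three times as likely to be positive as the average customer. Running this analysis in Vaex is straightforward, and it tells you whether the model has actually learned the patterns in the data.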
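
Here is a minimal sketch of that decile analysis in plain Python. lift_table is a hypothetical helper, and the scores and labels are toy values chosen so the result is easy to check by hand. It assumes there are at least as many rows as buckets and at least one positive label.

```python
from typing import Dict, List, Sequence


def lift_table(scores: Sequence[float], labels: Sequence[int], n_bins: int = 10) -> List[Dict[str, float]]:
    """Rank rows by score (highest first), cut into equal buckets, report rate and lift."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    base_rate = sum(labels) / n  # overall target rate
    table = []
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        rate = sum(labels[i] for i in idx) / len(idx)
        table.append({
            "decile": b + 1,          # 1 = highest-scored 10%
            "n": len(idx),
            "target_rate": rate,
            "lift": rate / base_rate,
        })
    return table


# Toy data: 20 customers, 6 positives, mostly among the high scores.
scores = [i / 20 for i in range(20)]
positives = {19, 18, 17, 15, 10, 4}
labels = [1 if i in positives else 0 for i in range(20)]

table = lift_table(scores, labels)
for row in table:
    print(row)
```

In this toy example the top decile has a target rate of 1.0 against an overall rate of 0.3, so its lift is about 3.3.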

Bottling the Magic: Saving Your Pipeline for Later

You’ve done all this incredible work. You’ve engineered features, trained a model, and validated it. The last thing you want is for it to be stuck in a notebook. You need to be able to save this entire process so you can re-apply it to new data later.

This is where productionizing your pipeline comes in. We can do two key things:

  1. Export the transformed data: We can save a sample of our fully-featured dataset to a file format like Parquet, which is super efficient for storage and future use.
  2. Save the pipeline state: This is the crucial part. We can save all the "recipes" we created (the label encoder mappings, the scaler's means and standard deviations) into a JSON file.

# Save a sample of the data to Parquet
# (export_cols is the list of column names to keep; it is not shown in this article)
df_export = df[export_cols].sample(n=500_000)
df_export.export_parquet("/path/to/customers_vaex.parquet")

# Save the entire pipeline's state to a JSON file
pipeline_state = {
    "encoder_labels": encoder.labels_,
    # np.asarray(...).tolist() yields plain Python floats whether the
    # fitted attribute is a list or a NumPy array
    "scaler_mean": np.asarray(scaler.mean_).tolist(),
    "scaler_std": np.asarray(scaler.std_).tolist(),
    # ... and other metadata
}
# ... write to json ...

With these artifacts, along with the trained model itself, you can load new raw data and replicate every transformation and prediction step: apply the exact same label encoding, use the exact same scaling parameters, and feed the data into your trained model to get predictions. You've created a reproducible, end-to-end pipeline.
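
As a rough illustration of that round trip, the sketch below writes a state dictionary to JSON, reads it back, and uses it to transform one new raw record. Every value in pipeline_state here is made up for the example, and transform_row and the features_num key are hypothetical. Mapping an unseen city to -1 is also just an assumption of this sketch.

```python
import json
import os
import tempfile
from typing import Any, Dict

# Made-up state, shaped like the dictionary saved above.
pipeline_state = {
    "encoder_labels": {"city": {"Calgary": 0, "Montreal": 1, "Toronto": 2}},
    "features_num": ["age", "income_k"],
    "scaler_mean": [46.0, 45.0],
    "scaler_std": [16.0, 20.0],
}


def transform_row(row: Dict[str, Any], state: Dict[str, Any]) -> Dict[str, float]:
    """Re-apply the saved label encoding and scaling to one raw record."""
    out: Dict[str, float] = {}
    # Assumption: cities not seen during fitting are encoded as -1.
    out["label_encoded_city"] = state["encoder_labels"]["city"].get(row["city"], -1)
    for name, mean, std in zip(state["features_num"], state["scaler_mean"], state["scaler_std"]):
        out["z_" + name] = (row[name] - mean) / std
    return out


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pipeline_state.json")

    # Save the state...
    with open(path, "w", encoding="utf-8") as f:
        json.dump(pipeline_state, f, indent=2)

    # ...and load it back, as a later scoring job would.
    with open(path, encoding="utf-8") as f:
        state = json.load(f)

new_row = {"city": "Toronto", "age": 62, "income_k": 65.0}
features = transform_row(new_row, state)
print(features)
```

Because the parameters come from the file and not from the new data, the new record is transformed exactly as the training rows were.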

So, the next time you’re faced with a dataset that makes you nervous about your machine’s memory, don’t be. With a tool like Vaex, you're not limited by your RAM anymore. You can explore, transform, and model massive datasets right from your laptop, building scalable and production-ready pipelines without the headache. It’s all about working smarter, not harder.
