If you’ve spent any time doing data analysis in a Python notebook, you know the drill. You write some code, generate a static chart, look at it, and think, "Hmm, what if I grouped by this instead?"
So you go back, tweak your code, run the cell again, and get a new chart. Then you want to filter by something else. Back to the code. Again. And again.
It’s a clunky, stop-and-start process that can kill your flow of thought. For years, I felt stuck in this cycle. I’d have to either wrestle with Matplotlib code for hours or export my data to a separate BI tool like Tableau just to get the interactive, drag-and-drop experience I wanted.
But what if you could have that Tableau-like power right inside your notebook?
That’s what I want to talk about today. I stumbled upon a library called PyGWalker, and honestly, it’s changed how I approach exploratory data analysis (EDA). It lets you turn a pandas DataFrame into a fully interactive user interface where you can drag, drop, filter, and visualize your data on the fly.
Let's walk through how to build a seriously powerful and interactive analysis workflow, using the classic Titanic dataset as our guinea pig. We're not just going to visualize raw data; we're going to do some smart prep work first to make our interactive session even more insightful.
Getting Our Tools Ready
First things first, let's get our environment set up. We just need a few key libraries. PyGWalker is the star of the show, but we'll also grab pandas for data wrangling and DuckDB to help PyGWalker run calculations super fast in the background.
Think of this as gathering your ingredients before you start cooking. We'll install everything and then load up the Titanic dataset from Seaborn.
# Let's get our libraries installed
import sys, subprocess
def pip_install(pkgs):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q"] + pkgs)
pip_install([
    "pygwalker>=0.4.9",
    "duckdb>=0.10.0",
    "pandas>=2.0.0",
    "numpy>=1.24.0",
    "seaborn>=0.13.0"
])
# Now, import them and load the data
import pandas as pd
import seaborn as sns
df_raw = sns.load_dataset("titanic").copy()
print("Raw shape:", df_raw.shape)
df_raw.head(3)
A quick look at df_raw.head() shows us the familiar columns: survived, pclass, age, sex, etc. It’s clean, but it’s not yet analysis-ready. And that’s where the real magic begins.
The Secret Sauce: Prepping Your Data for Discovery
Here’s a lesson I learned the hard way: the quality of your analysis is directly tied to the quality of your features. Just throwing raw data into a visualization tool often leads to messy charts and shallow insights.
So, before we even touch PyGWalker, we’re going to do some feature engineering. We're going to transform and create new columns that tell a clearer story. This isn't just busywork; it's about making our data more expressive so we can ask better questions later.
Let's look at the code to do this, and then I'll break down why we're doing each step.
import numpy as np
import pandas as pd
# A helper function to safely create bins for continuous data
def make_safe_bucket(series, bins=None, labels=None, q=None, prefix="bucket"):
    # ... [function code from original article] ...
def preprocess_titanic_advanced(df):
    out = df.copy()
    out.columns = [c.strip().lower().replace(" ", "_") for c in out.columns]
    # Clean up numeric columns
    for c in ["survived", "pclass", "sibsp", "parch"]:
        if c in out.columns:
            out[c] = pd.to_numeric(out[c], errors="coerce").fillna(-1).astype("int64")
    # Engineer features from 'age'
    if "age" in out.columns:
        out["age"] = pd.to_numeric(out["age"], errors="coerce").astype("float64")
        out["age_is_missing"] = out["age"].isna()
        out["age_bucket"] = make_safe_bucket(
            out["age"], bins=[0, 12, 18, 30, 45, 60, 120],
            labels=["child", "teen", "young_adult", "adult", "mid_age", "senior"],
        )
    # Engineer features from 'fare'
    if "fare" in out.columns:
        out["fare"] = pd.to_numeric(out["fare"], errors="coerce").astype("float64")
        out["fare_is_missing"] = out["fare"].isna()
        out["log_fare"] = np.log1p(out["fare"].fillna(0))
        out["fare_bucket"] = make_safe_bucket(out["fare"], q=8)
    # Clean up string columns
    for c in ["sex", "class", "who", "embarked", "alone", "adult_male"]:
        if c in out.columns:
            out[c] = out[c].astype("string").fillna("Unknown")
    # Extract 'deck' from 'cabin'
    if "cabin" in out.columns:
        out["deck"] = out["cabin"].astype("string").str.strip().str[0].fillna("Unknown")
        out["deck_is_missing"] = out["cabin"].isna()
    elif "deck" in out.columns:
        # Seaborn's Titanic data has no 'cabin' but already ships a 'deck' column, so keep it
        out["deck_is_missing"] = out["deck"].isna()
        out["deck"] = out["deck"].astype("string").fillna("Unknown")
    else:
        out["deck"] = "Unknown"
        out["deck_is_missing"] = True
    # Create a 'family_size' feature
    if "sibsp" in out.columns and "parch" in out.columns:
        out["family_size"] = (out["sibsp"] + out["parch"] + 1).astype("int64")
        out["is_alone"] = (out["family_size"] == 1)
    # Extract 'title' from 'name' (Seaborn's copy has no 'name', so title falls back to "Unknown")
    if "name" in out.columns:
        title = out["name"].astype("string").str.extract(r",\s*([^\.]+)\.", expand=False).fillna("Unknown").str.strip()
        vc = title.value_counts(dropna=False)
        keep = set(vc[vc >= 15].index.tolist())
        out["title"] = title.where(title.isin(keep), other="Rare").astype("string")
    else:
        out["title"] = "Unknown"
    # Create a combined 'segment' for easy cohort analysis
    # (cast to string before fillna so a categorical column can't reject the new "Unknown" value)
    out["segment"] = (
        out["sex"].astype("string").fillna("Unknown") + " | " +
        out["class"].astype("string").fillna("Unknown") + " | " +
        out["age_bucket"].astype("string").fillna("Unknown")
    )
    return out
df = preprocess_titanic_advanced(df_raw)
print("Prepped shape:", df.shape)
df.head(3)
Okay, that's a lot of code, but let's talk about the why.
- Age Buckets: Instead of treating age as a number from 0 to 80, we created buckets like "child," "teen," and "adult." This is huge. It lets us compare entire groups instead of getting lost in the noise of individual ages.
- Fare Buckets: Same idea for fare. We split it into 8 quantile bins (groups with roughly equal numbers of passengers). This helps us see patterns between low-fare and high-fare passengers more easily.
- Extracting Information: We pull the deck from the cabin number and the title (Mr, Mrs, Miss) from the passenger's name. These are hidden signals that can be incredibly predictive. One caveat: Seaborn's copy of the Titanic data has no cabin or name column (it ships a ready-made deck column instead), so these two steps only do real work on a fuller version of the dataset, such as Kaggle's.
- Creating Segments: The segment column is my favorite. We combined sex, class, and age bucket into one field. Now, we can instantly create cohorts like "female | First | young_adult" and compare their survival rates. This is a massive shortcut for deep-dive analysis.
By doing this prep work, we’ve turned our simple table into a rich, analysis-ready dataset. We're no longer just looking at data; we're looking at structured information.
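The make_safe_bucket helper is elided in the listing above, so here is one way it could look. Treat it as a minimal sketch under my own assumptions, not the original implementation: it returns string labels, maps anything it can't bin to "Unknown", and names quantile bins prefix_1, prefix_2, and so on.

```python
import pandas as pd


def make_safe_bucket(series, bins=None, labels=None, q=None, prefix="bucket"):
    """Bin a numeric series into string labels without raising on messy input.

    Hypothetical stand-in for the helper elided in the article.
    """
    s = pd.to_numeric(series, errors="coerce")
    unknown = pd.Series("Unknown", index=s.index, dtype="string")
    if s.notna().sum() == 0:
        return unknown
    try:
        if bins is not None:
            # Fixed edges (e.g. age bands); include_lowest keeps the left edge itself.
            cats = pd.cut(s, bins=bins, labels=labels, include_lowest=True)
        elif q is not None:
            # Quantile bins; duplicates="drop" merges edges that tie.
            codes = pd.qcut(s, q=q, labels=False, duplicates="drop")
            cats = codes.map(lambda i: f"{prefix}_{int(i) + 1}", na_action="ignore")
        else:
            return unknown
    except ValueError:
        # For example, a labels list whose length doesn't match the bins.
        return unknown
    obj = cats.astype(object)
    return obj.where(obj.notna(), "Unknown").astype(str).astype("string")


ages = pd.Series([5, 15, 25, None, 70])
print(make_safe_bucket(
    ages,
    bins=[0, 12, 18, 30, 45, 60, 120],
    labels=["child", "teen", "young_adult", "adult", "mid_age", "senior"],
).tolist())
```

Note that pd.cut closes intervals on the right by default, so a 12-year-old lands in "child" and an 18-year-old in "teen".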
One Last Check: The Data Quality Report
Before we jump into the fun part, it's always a good idea to run a quick quality check on our newly prepped data. This little utility function gives us a bird's-eye view of every column: its data type, how many missing values it has, and how many unique values it contains.
It’s like a final inspection to make sure everything is in order.
def data_quality_report(df):
    # ... [function code from original article] ...
dq = data_quality_report(df)
dq.head(10)
This report immediately tells us what we're working with. We can see which columns are clean, which have missing values, and which might be good candidates for grouping or filtering.
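Since the body of data_quality_report isn't shown, here's a rough sketch of what such a report might compute, based only on the description above (dtype, missing values, unique values). The output column names and the sort order are my own choices.

```python
import pandas as pd


def data_quality_report(df):
    """Per-column summary: dtype, missing count and share, unique count.

    Hypothetical sketch; the article's own version is elided.
    """
    cols = ["column", "dtype", "n_missing", "pct_missing", "n_unique"]
    rows = []
    for col in df.columns:
        s = df[col]
        n_missing = int(s.isna().sum())
        rows.append({
            "column": col,
            "dtype": str(s.dtype),
            "n_missing": n_missing,
            # max(..., 1) avoids dividing by zero on an empty frame
            "pct_missing": round(100 * n_missing / max(len(df), 1), 2),
            "n_unique": int(s.nunique(dropna=True)),
        })
    report = pd.DataFrame(rows, columns=cols)
    # Worst offenders first
    return report.sort_values("pct_missing", ascending=False).reset_index(drop=True)


demo = pd.DataFrame({"a": [1, None, 3, None], "b": ["x", "x", "y", "z"]})
print(data_quality_report(demo))
```

Columns with only a handful of unique values tend to work well on the Color or filter shelves, while near-unique ones (like IDs) usually don't.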
Setting the Stage: Detailed Rows vs. Aggregated Views
Here’s another pro-tip for interactive analysis: sometimes you want to see individual rows (the "street view"), and sometimes you want to see the high-level summary (the "satellite view").
To enable this, we'll create two DataFrames:
- df_for_ui: The detailed, row-by-row data. We'll sample it down if it's huge to keep the UI snappy.
- agg: A pre-aggregated table that groups our data by the cool segments we created. This will let us compare cohorts super fast.
# Create a sampled dataframe for the UI (if needed)
df_for_ui = df # ... [code to sample if len(df) > 200_000]
# Create an aggregated table for cohort analysis
agg = (
    df.groupby(["segment", "deck", "embarked"], dropna=False)
    .agg(
        n=("survived", "size"),
        survival_rate=("survived", "mean"),
        avg_fare=("fare", "mean"),
        avg_age=("age", "mean"),
    )
    .reset_index()
)
Having both of these ready is like having two different lenses to look at your data. You can zoom in on a specific passenger or zoom out to see the survival rate of entire classes of people.
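The sampling step is elided in the listing, so here's a small sketch of how it might work. The 200,000-row threshold comes from the comment in the code above; the function name sample_for_ui and the fixed seed are my own assumptions.

```python
import pandas as pd

MAX_UI_ROWS = 200_000  # threshold taken from the comment in the article's code


def sample_for_ui(df, max_rows=MAX_UI_ROWS, seed=42):
    """Return df unchanged if it is small enough, else a reproducible random sample.

    Hypothetical helper; the article elides its own sampling code.
    """
    if len(df) <= max_rows:
        return df
    # A fixed seed keeps the sample stable between notebook runs;
    # sort_index restores the original row order.
    return df.sample(n=max_rows, random_state=seed).sort_index()


big = pd.DataFrame({"x": range(10)})
print(len(sample_for_ui(big, max_rows=4)))
```

The Titanic data has fewer than a thousand rows, so this is a no-op here; it only matters once you point the same workflow at something much larger.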
Let's Get Visual: Firing Up PyGWalker!
Alright, the moment of truth. All our prep work has led to this. With a single line of code, we can launch the PyGWalker interface.
This is where you'll want to lean in. We're going to create a walker object, and the notebook will magically render an interactive UI.
import pygwalker as pyg
import json
from pathlib import Path
SPEC_PATH = Path("/content/pygwalker_spec_titanic.json")
# Helper functions to save and load our visualization layout
def load_spec(path):
    # ... [function code from original article] ...
def save_spec(path, spec_obj):
    # ... [function code from original article] ...
def launch_pygwalker(df, spec_path):
    spec = load_spec(spec_path)
    kwargs = {}
    if spec is not None:
        kwargs["spec"] = spec
    # Launch the walker!
    walker = pyg.walk(df, use_kernel_calc=True, **kwargs)
    # ... [code to capture and save the spec] ...
    # This part saves your chart layout so you don't lose it
    return walker
# Launch an interface for our detailed data
walker_rows = launch_pygwalker(df_for_ui, SPEC_PATH)
And just like that, you should see a beautiful, interactive panel appear below the code cell. On the left, you'll see all our columns—including the awesome new features we engineered. Now you can:
- Drag sex to the Rows shelf.
- Drag survived (as a measure, using "mean") to the Columns shelf.
- Change the mark type to a bar chart.
Boom. You have an instant chart showing the survival rate by gender. Want to break it down by passenger class? Just drag pclass to the Color shelf. No code needed. You can now explore, pivot, filter, and drill down into your data in real-time. It’s an incredibly fluid way to test hypotheses and follow your curiosity.
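If you ever want to sanity-check what the UI is showing, the same chart boils down to a groupby. Here's a quick sketch on a tiny made-up table (the six rows are toy values for illustration, not real Titanic records); survival_rate is a hypothetical helper name.

```python
import pandas as pd

# Toy rows purely for illustration, not actual passengers
toy = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "male", "female"],
    "pclass": [1, 3, 1, 3, 3, 1],
    "survived": [1, 0, 0, 0, 1, 1],
})


def survival_rate(df, by):
    """Mean of the 0/1 'survived' column per group, i.e. the survival rate."""
    return df.groupby(by)["survived"].mean()


# Equivalent of: sex on one shelf, mean(survived) on the other
by_sex = survival_rate(toy, "sex")

# Equivalent of adding pclass to the Color shelf
by_sex_class = survival_rate(toy, ["sex", "pclass"]).unstack("pclass")

print(by_sex)
print(by_sex_class)
```

On the real data you'd pass df instead of toy. If the numbers in the notebook and the chart disagree, check which aggregation the measure is using in the UI.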
The best part? The launch_pygwalker function is set up to save your chart configurations. So if you create the perfect dashboard, close your notebook, and come back tomorrow, it’ll load right back up where you left off.
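The load_spec and save_spec helpers are elided in the listing, so here is one plausible shape for them using only the standard library. This is a sketch under my own assumptions: the spec is stored as JSON text, load_spec returns that text (or None when there is nothing usable), and save_spec accepts either a JSON string or a JSON-serializable object.

```python
import json
from pathlib import Path


def load_spec(path):
    """Return the saved spec as JSON text, or None if it is missing or corrupt.

    Hypothetical stand-in for the helper elided in the article.
    """
    path = Path(path)
    if not path.exists():
        return None
    try:
        text = path.read_text(encoding="utf-8")
        json.loads(text)  # validate only; the text itself is what gets returned
        return text
    except (OSError, json.JSONDecodeError):
        return None


def save_spec(path, spec_obj):
    """Write a spec (JSON string or JSON-serializable object) to disk."""
    path = Path(path)
    text = spec_obj if isinstance(spec_obj, str) else json.dumps(spec_obj, indent=2)
    json.loads(text)  # raises ValueError early if a string isn't valid JSON
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file first so a crash can't leave a half-written spec behind
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(text, encoding="utf-8")
    tmp.replace(path)
    return path
```

Returning None for a missing or broken file is what lets launch_pygwalker fall back to a blank canvas on the first run instead of crashing.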
Sharing Your Masterpiece with the World
Okay, you’ve found some amazing insights. Now you want to share them with your boss or a colleague who doesn't live in a Python notebook. PyGWalker has you covered there, too.
You can export your entire interactive dashboard as a single, standalone HTML file.
HTML_PATH = Path("/content/pygwalker_titanic_dashboard.html")
def export_html_best_effort(df, spec_path, out_path):
    # ... [function code from original article] ...
export_html_best_effort(df_for_ui, SPEC_PATH, HTML_PATH)
This generates a file you can email, host online, or just open in any web browser. Anyone can interact with your charts, filter the data, and explore the insights you found, all without needing to run a single line of code.
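The export function's body isn't shown above, so here's a hedged sketch of a "best effort" version. It assumes PyGWalker's to_html function accepts the DataFrame plus a spec argument pointing at the saved layout; if that doesn't match your installed version, adjust accordingly. The render parameter is my own addition so the function can be exercised without PyGWalker installed.

```python
from pathlib import Path


def export_html_best_effort(df, spec_path, out_path, render=None):
    """Write an interactive HTML export and return its path, or None on failure.

    Hypothetical sketch; `render` is any callable(df, **kwargs) -> HTML string
    and defaults to pygwalker.to_html (assumed to accept `spec`).
    """
    if render is None:
        try:
            import pygwalker as pyg
        except ImportError:
            print("pygwalker is not installed; skipping HTML export")
            return None
        render = pyg.to_html

    spec_path, out_path = Path(spec_path), Path(out_path)
    # Only pass the saved layout if one actually exists
    kwargs = {"spec": str(spec_path)} if spec_path.exists() else {}
    try:
        html = render(df, **kwargs)
    except Exception as exc:  # best effort: report the problem, don't crash the notebook
        print(f"HTML export failed: {exc}")
        return None

    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(html, encoding="utf-8")
    return out_path
```

The function returns None instead of raising, so a failed export never takes down the rest of the notebook; check the return value if you need to know whether the file was written.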
It's More Than a Tool, It's a Better Workflow
When you step back, you realize this is more than just a cool library. It’s a fundamental shift in the EDA workflow.
We moved from a rigid, code-first process to an interactive, visual-first one. By spending a little time upfront on thoughtful feature engineering, we unlocked a much faster and more intuitive way to explore the data. We're no longer just executing commands; we're having a conversation with our data.
This approach lets you stay in that creative "flow state" where you can ask a question, get an answer instantly, and let that answer inspire your next question. For me, that’s what data analysis is all about. Give it a try on your next project—I have a feeling you won't want to go back.




