Ever tried to plot a million data points with Matplotlib and just… watched your computer freeze? Yeah, we’ve all been there. You have this massive, beautiful dataset, and you just want to see it. But traditional plotting tools choke, giving you a mess of overplotted points or simply crashing your notebook. It’s frustrating.
What if I told you there’s a way to render billions of points in seconds, creating crisp, clear images that actually reveal the patterns hidden in your data?
That’s where Datashader comes in. It’s a Python library that completely changes the game for big data visualization. Instead of trying to draw every single point (which is impossible for large datasets), it cleverly aggregates your data into a fixed-size grid first and then renders it as an image. Think of it less like a scatter plot and more like a high-resolution photograph of your data.
In this walkthrough, we’re going to get hands-on with Datashader. I’ll show you exactly how it works, from the basic pipeline to advanced tricks, so you can stop fighting with your plotting tools and start exploring your data.
The Core Idea: Datashader’s Three-Step Dance
At its heart, Datashader follows a simple but powerful three-step process. It’s like an assembly line for your data points.
- Projection: First, we define a canvas. This is basically the empty frame for our picture. We tell it the size of the final image (e.g., 600x500 pixels) and the coordinate range of our data (the x and y limits).
- Aggregation: This is the magic step. Datashader goes through every single one of your data points and figures out which pixel on the canvas it falls into. It then runs a calculation for that pixel. The simplest one is just to
counthow many points land in each pixel. - Transformation (Shading): Finally, we take that grid of numbers (our aggregated data) and turn it into a colorful image. This is called shading. We map the counts in each pixel to a color, which is how we see the density of our data.
Let’s see it in action with 2 million points. First, let's get our environment set up.
import subprocess, sys
subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "datashader", "colorcet", "numba", "scipy"])
import numpy as np
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf
from datashader import reductions as rd
import colorcet as cc
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from matplotlib.gridspec import GridSpec
from scipy.stats import multivariate_normal
import time, warnings
warnings.filterwarnings("ignore")
print("Datashader version:", ds.__version__)
def show(img, title="", ax=None, figsize=(6, 5)):
standalone = ax is None
if standalone:
fig, ax = plt.subplots(figsize=figsize)
rgba = img.to_pil()
ax.imshow(rgba, origin="upper", aspect="auto")
ax.set_title(title, fontsize=11, fontweight="bold")
ax.axis("off")
if standalone:
plt.tight_layout()
plt.show()
print("\n=== SECTION 1: Core Pipeline ===")
rng = np.random.default_rng(42)
N = 2_000_000
x = np.concatenate([rng.normal(-1, 0.5, N//3), rng.normal( 1, 0.5, N//3), rng.normal( 0, 1.5, N//3)])
y = np.concatenate([rng.normal(-1, 0.5, N//3), rng.normal( 1, 0.5, N//3), rng.normal( 0, 0.5, N//3)])
df_base = pd.DataFrame({"x": x, "y": y})
canvas = ds.Canvas(plot_width=600, plot_height=500, x_range=(-4, 4), y_range=(-4, 4))
agg = canvas.points(df_base, "x", "y", agg=rd.count())
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
combos = [
("Linear / blues", tf.shade(agg, cmap=cc.blues, how="linear")),
("Log / fire", tf.shade(agg, cmap=cc.fire, how="log" )),
("Eq-hist / bmy", tf.shade(agg, cmap=cc.bmy, how="eq_hist")),
]
for ax, (title, img) in zip(axes, combos):
show(img, title, ax=ax)
plt.suptitle("Section 1 – 2 M points: Linear vs Log vs Eq-Hist normalisation", fontsize=13, fontweight="bold")
plt.tight_layout()
plt.show()
Look at that! We just visualized 2 million points. Notice the difference between linear, log, and eq_hist shading. A linear scale is often washed out because a few very dense pixels dominate the color map. A log scale is way better, letting us see variations in both low-density and high-density areas. eq_hist (equalized histogram) is often the best of both worlds, really making the structure pop.
More Than Just Counting: Different Ways to Aggregate
Counting points per pixel is great, but what if your data has more dimensions? Maybe each point has a value, like temperature or price. Datashader can handle that, too.
Instead of just rd.count(), we can use other "reductions" like rd.sum('value'), rd.mean('value'), or even rd.std('value'). This means each pixel in your final image won't just represent the number of points, but the average value of all points that fell into it. This is incredibly powerful for finding spatial patterns.
Let's add a value column and a label column to our DataFrame and see what we can do.
print("\n=== SECTION 2: Reduction Types ===")
n_actual = len(df_base)
df_base["value"] = rng.exponential(scale=2, size=n_actual)
df_base["label"] = pd.Categorical(
rng.choice(["A", "B", "C"], size=n_actual), categories=["A", "B", "C"]
)
canvas2 = ds.Canvas(plot_width=400, plot_height=350, x_range=(-4, 4), y_range=(-4, 4))
reductions_cfg = [
("count()", rd.count(), cc.kbc),
("sum(value)", rd.sum("value"), cc.CET_L3),
("mean(value)", rd.mean("value"), cc.CET_D4),
("std(value)", rd.std("value"), cc.CET_L16),
("min(value)", rd.min("value"), cc.CET_L17),
("max(value)", rd.max("value"), cc.bgyw),
("var(value)", rd.var("value"), cc.CET_L18),
("count_cat(label)", rd.count_cat("label"), None),
]
fig, axes = plt.subplots(2, 4, figsize=(18, 9))
axes = axes.flat
for ax, (name, agg_fn, cmap) in zip(axes, reductions_cfg):
agg_r = canvas2.points(df_base, "x", "y", agg=agg_fn)
if cmap is None:
img = tf.shade(agg_r, color_key={"A":"#e41a1c","B":"#377eb8","C":"#4daf4a"})
else:
img = tf.shade(agg_r, cmap=cmap, how="eq_hist")
show(img, name, ax=ax)
plt.suptitle("Section 2 – All Reduction Types on 2 M points", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.show()
Each of these little plots tells a different story about the same data. The count_cat one is also super useful—it aggregates by category, letting us see where different groups of data are concentrated.
Working with Categories
Speaking of categories, Datashader handles them beautifully. If you have data split into different clusters or types, you can assign a unique color to each. This avoids the overplotting mess you'd get with a normal scatter plot.
Here, we'll create four distinct clusters.
print("\n=== SECTION 3: Categorical Visualisation ===")
N_cat = 500_000
categories = ["Cluster A", "Cluster B", "Cluster C", "Cluster D"]
centers = [(-2, -2), (-2, 2), (2, -2), (2, 2)]
colors = {"Cluster A":"#e41a1c","Cluster B":"#377eb8", "Cluster C":"#4daf4a","Cluster D":"#ff7f00"}
frames = []
for cat, (cx, cy) in zip(categories, centers):
n = N_cat // len(categories)
frames.append(pd.DataFrame({
"x": rng.normal(cx, 0.8, n),
"y": rng.normal(cy, 0.8, n),
"cat": pd.Categorical([cat]*n, categories=categories),
}))
df_cat = pd.concat(frames, ignore_index=True)
canvas3 = ds.Canvas(plot_width=500, plot_height=500, x_range=(-5, 5), y_range=(-5, 5))
agg_cat = canvas3.points(df_cat, "x", "y", agg=rd.count_cat("cat"))
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
img_raw = tf.shade(agg_cat, color_key=colors)
show(img_raw, "Raw (no spread)", ax=axes[0])
img_sp1 = tf.spread(tf.shade(agg_cat, color_key=colors), px=1)
show(img_sp1, "Spread px=1", ax=axes[1])
img_bg = tf.set_background(tf.shade(agg_cat, color_key=colors), color="black")
show(img_bg, "Black background", ax=axes[2])
for cat, col in colors.items():
axes[2].plot([], [], "o", color=col, label=cat, markersize=8)
axes[2].legend(loc="lower right", fontsize=8, framealpha=0.6)
plt.suptitle("Section 3 – Categorical Rendering (500 k points)", fontsize=13, fontweight="bold")
plt.tight_layout()
plt.show()
Notice that tf.spread() function? It’s a neat little trick that makes sparse points more visible by bleeding their color into neighboring empty pixels. It can really help your plots look more continuous and less spotty.
It’s Not Just for Points: Lines, Rasters, and Grids
So far we've only looked at points, but what about other data shapes?
Thousands of Lines
Imagine plotting 5,000 time series on a single chart. It would be a complete spaghetti monster. With Datashader's canvas.line(), it becomes a clear density map, showing you the most common paths.
print("\n=== SECTION 4: Line Rendering ===")
n_series, n_steps = 5_000, 500
t = np.linspace(0, 1, n_steps)
xs = np.tile(t, n_series)
walks = np.cumsum(rng.normal(0, 0.05, (n_series, n_steps)), axis=1)
ys = walks.ravel()
series_id = np.repeat(np.arange(n_series), n_steps)
df_lines = pd.DataFrame({"x": xs, "y": ys, "id": series_id})
canvas4 = ds.Canvas(plot_width=700, plot_height=450, x_range=(0, 1), y_range=(-6, 6))
agg_lines = canvas4.line(df_lines, "x", "y", agg=rd.count(), line_width=1)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
show(tf.shade(agg_lines, cmap=cc.fire, how="eq_hist"), "5 000 random walks – eq_hist / fire", ax=axes[0])
show(tf.shade(agg_lines, cmap=cc.blues, how="log"), "5 000 random walks – log / blues", ax=axes[1])
plt.suptitle("Section 4 – Line / Time-Series Rendering", fontsize=13, fontweight="bold")
plt.tight_layout()
plt.show()
Gridded and Raster Data
Datashader also plays nice with gridded data, like the kind you'd get from satellite imagery or climate models. If you have your data in an xarray DataArray, you can use canvas.raster() to plot it instantly.
print("\n=== SECTION 5: Raster / Grid Data ===")
import xarray as xr
res = 1000
lon = np.linspace(-180, 180, res)
lat = np.linspace(-90, 90, res)
LON, LAT = np.meshgrid(lon, lat)
z = (
multivariate_normal.pdf(np.stack([LON, LAT], -1), mean=[30, 30], cov=[[800,0],[0,500]]) +
multivariate_normal.pdf(np.stack([LON, LAT], -1), mean=[-60, -20], cov=[[600,0],[0,400]]) +
0.02 * rng.standard_normal((res, res))
)
da = xr.DataArray(z, dims=["y", "x"], coords={"x": lon, "y": lat})
canvas5 = ds.Canvas(plot_width=700, plot_height=400, x_range=(-180, 180), y_range=(-90, 90))
agg_raster = canvas5.raster(da)
fig, axes = plt.subplots(1, 2, figsize=(14, 4))
show(tf.shade(agg_raster, cmap=cc.CET_L18, how="eq_hist"), "Synthetic elevation – eq_hist", ax=axes[0])
show(tf.shade(agg_raster, cmap=cc.rainbow, how="linear"), "Synthetic elevation – linear", ax=axes[1])
plt.suptitle("Section 5 – Raster / Grid (xarray DataArray)", fontsize=13, fontweight="bold")
plt.tight_layout()
plt.show()
For non-uniform grids, there's canvas.quadmesh(), which is perfect for more complex scientific datasets. It lets you visualize data where the grid cells themselves have different sizes.
print("\n=== SECTION 6: QuadMesh / 2-D Grid Glyph ===")
lon6 = np.concatenate([np.linspace(-180, -60, 80), np.linspace(-60, 60, 30), np.linspace( 60, 180, 80)])
lat6 = np.concatenate([np.linspace(-90, -30, 40), np.linspace(-30, 30, 20), np.linspace( 30, 90, 40)])
LON6, LAT6 = np.meshgrid(lon6, lat6)
def vortex(lon0, lat0, amp=1.0):
return amp * np.exp(-((LON6-lon0)**2/1200 + (LAT6-lat0)**2/600))
field6 = vortex(-40, 30, 1.2) + vortex(120, -20, 0.9) + \
0.05 * rng.standard_normal(LON6.shape)
da6 = xr.DataArray(field6.astype(np.float32), dims=["y", "x"],
coords={"x": lon6, "y": lat6}, name="intensity")
canvas6 = ds.Canvas(plot_width=700, plot_height=380, x_range=(-180, 180), y_range=(-90, 90))
agg6 = canvas6.quadmesh(da6)
canvas6z = ds.Canvas(plot_width=500, plot_height=400, x_range=(-80, 0), y_range=(0, 60))
agg6z = canvas6z.quadmesh(da6)
field6_smooth = vortex(-40, 30, 1.0) + vortex(120, -20, 0.8)
da6_diff = xr.DataArray((field6 - field6_smooth).astype(np.float32), dims=["y","x"],
coords={"x": lon6, "y": lat6}, name="anomaly")
agg6d = canvas6.quadmesh(da6_diff)
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
show(tf.shade(agg6, cmap=cc.fire, how="eq_hist"), "Global field – eq_hist", ax=axes[0])
show(tf.shade(agg6z, cmap=cc.CET_L3, how="linear"), "N. Atlantic zoom – linear", ax=axes[1])
show(tf.shade(agg6d, cmap=cc.CET_D4, how="eq_hist"), "Residual (anomaly) – eq_hist",ax=axes[2])
plt.suptitle("Section 6 – canvas.quadmesh(): non-uniform 2-D grids", fontsize=13, fontweight="bold")
plt.tight_layout()
plt.show()
Layering and Performance: The Fun Stuff
Okay, now let's get into a couple of my favorite features: layering and performance.
Stacking Layers
You can create multiple Datashader images and stack them on top of each other. This is fantastic for adding context, like plotting a dense background layer with a sparser foreground layer on top.
print("\n=== SECTION 7: Spreading, Stack & Composite ===")
N7 = 300_000
x7 = rng.normal(0, 3, N7)
y7 = rng.normal(0, 3, N7)
df7 = pd.DataFrame({"x": x7, "y": y7})
canvas7 = ds.Canvas(plot_width=500, plot_height=500, x_range=(-10, 10), y_range=(-10, 10))
agg7 = canvas7.points(df7, "x", "y", agg=rd.count())
# First, let's just see spreading again
fig, axes = plt.subplots(1, 4, figsize=(20, 5))
for ax, (label, px) in zip(axes, [("No spread", 0), ("spread px=1", 1), ("spread px=2", 2), ("spread px=4", 4)]):
img = tf.shade(agg7, cmap=cc.fire, how="eq_hist")
if px > 0:
img = tf.spread(img, px=px)
show(img, label, ax=ax)
plt.suptitle("Section 7 – Effect of spread() on sparse data", fontsize=13, fontweight="bold")
plt.tight_layout()
plt.show()
N_fg = 50_000
x_fg = rng.normal(0, 0.5, N_fg)
y_fg = rng.normal(0, 0.5, N_fg)
df_fg = pd.DataFrame({"x": x_fg, "y": y_fg})
agg_bg = canvas7.points(df7, "x", "y", agg=rd.count())
agg_fg = canvas7.points(df_fg,"x", "y", agg=rd.count())
img_bg_shade = tf.shade(agg_bg, cmap=cc.blues, how="log", alpha=200)
img_fg_shade = tf.shade(agg_fg, cmap=cc.fire, how="eq_hist")
stacked = tf.stack(img_bg_shade, img_fg_shade)
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
show(img_bg_shade, "Background (blue / log)", ax=axes[0])
show(img_fg_shade, "Foreground (fire / eq_hist)", ax=axes[1])
show(stacked, "Stacked composite", ax=axes[2])
plt.suptitle("Section 7 – Stack / Composite two layers", fontsize=13, fontweight="bold")
plt.tight_layout()
plt.show()
How Fast Is It?
This is the part that always blows my mind. Datashader is fast. Because the aggregation step is written in Numba (which compiles Python to speedy machine code), it can chew through millions of points in milliseconds. Let's benchmark it.
print("\n=== SECTION 8: Performance Benchmark ===")
sizes = [10_000, 100_000, 1_000_000, 5_000_000, 20_000_000]
timings = []
for n in sizes:
xb = rng.normal(0, 1, n).astype(np.float32)
yb = rng.normal(0, 1, n).astype(np.float32)
dfb = pd.DataFrame({"x": xb, "y": yb})
cvb = ds.Canvas(plot_width=800, plot_height=700)
# Warm-up run
cvb.points(dfb, "x", "y", agg=rd.count())
t0 = time.perf_counter()
cvb.points(dfb, "x", "y", agg=rd.count())
elapsed = time.perf_counter() - t0
timings.append(elapsed)
print(f" {n:>12,} points → {elapsed*1000:6.1f} ms")
fig, ax = plt.subplots(figsize=(8, 4))
ax.loglog([s/1e6 for s in sizes], [t*1000 for t in timings], "o-", linewidth=2, markersize=8, color="#e41a1c")
ax.set_xlabel("Dataset size (millions of points)", fontsize=12)
ax.set_ylabel("Render time (ms)", fontsize=12)
ax.set_title("Section 8 – Datashader Render Performance (800×700 canvas)", fontsize=12, fontweight="bold")
ax.grid(True, which="both", alpha=0.4)
plt.tight_layout()
plt.show()
Rendering 20 million points took less than 200 milliseconds on my machine. That is absolutely wild. This linear scaling means you can confidently throw even bigger datasets at it.
Putting It All Together: Dashboards and Overlays
The real power of a tool like this comes from using it to build real analytical views.
Multi-Panel Dashboards
Because it's so fast, you can easily create complex dashboards with multiple views of your data. Here, we'll simulate some financial data and create a 6-panel dashboard to explore relationships between price, volume, and returns.
print("\n=== SECTION 10: Multi-Panel Dashboard ===")
N10 = 1_500_000
price = np.cumsum(rng.normal(0, 0.01, N10)) + 100
vol = np.abs(rng.normal(0, 1, N10)) * (1 + 0.5 * rng.exponential(1, N10))
ret = np.diff(price, prepend=price[0])
hour = (np.arange(N10) % 390) / 390
df10 = pd.DataFrame({
"price": price.astype(np.float32),
"vol": vol.astype(np.float32),
"ret": ret.astype(np.float32),
"hour": hour.astype(np.float32),
})
fig = plt.figure(figsize=(16, 12))
gs = GridSpec(2, 3, figure=fig, hspace=0.35, wspace=0.3)
panels = [
(gs[0,0], "price", "vol", "Price vs Volume", cc.fire),
(gs[0,1], "ret", "vol", "Return vs Volume", cc.bmy),
(gs[0,2], "hour", "price","Intraday Price Distribution",cc.CET_L3),
(gs[1,0], "ret", "price","Return vs Price", cc.CET_D4),
(gs[1,1], "hour", "ret", "Intraday Return Profile", cc.CET_L16),
(gs[1,2], "hour", "vol", "Intraday Volume Profile", cc.CET_L17),
]
for spec, xcol, ycol, title, cmap in panels:
ax = fig.add_subplot(spec)
xr_ = (float(df10[xcol].quantile(0.001)), float(df10[xcol].quantile(0.999)))
yr_ = (float(df10[ycol].quantile(0.001)), float(df10[ycol].quantile(0.999)))
cv = ds.Canvas(plot_width=300, plot_height=250, x_range=xr_, y_range=yr_)
ag = cv.points(df10, xcol, ycol, agg=rd.count())
img = tf.shade(ag, cmap=cmap, how="eq_hist")
show(img, title, ax=ax)
ax.set_axis_off()
fig.suptitle("Section 10 – Multi-Panel Dashboard: 1.5 M Synthetic Trades", fontsize=15, fontweight="bold", y=1.01)
plt.show()
The Magic of Zooming
One of the most important features is that Datashader never throws away data. When you want to zoom into a region, you just define a new canvas with smaller x/y ranges. It re-aggregates the points for that specific view, revealing details that were invisible before.
print("\n=== SECTION 11: Zoom / Sub-Region Magnification ===")
zoom_ranges = [
((-5, 5), (-5, 5), "Full extent"),
((-3, 0), (-3, 0), "Quadrant III"),
((-1.5, 0.5),(-1.5, 0.5),"Zoomed into cluster A"),
]
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, (xr, yr, title) in zip(axes, zoom_ranges):
cv = ds.Canvas(plot_width=400, plot_height=400, x_range=xr, y_range=yr)
ag = cv.points(df_cat, "x", "y", agg=rd.count_cat("cat"))
img = tf.shade(ag, color_key=colors)
show(img, title, ax=ax)
plt.suptitle("Section 11 – Zoom: No Data Loss at Any Scale", fontsize=13, fontweight="bold")
plt.tight_layout()
plt.show()
Best of Both Worlds: Overlaying with Matplotlib
Finally, you don't have to choose between Datashader and Matplotlib. You can use Datashader for the heavy lifting of rendering the dense data, and then use Matplotlib to add annotations, contour lines, or axes on top.
Here, we'll render our 2 million points with Datashader and then overlay contour lines calculated on a smaller sample of the data.
print("\n=== SECTION 12: Overlay with Matplotlib ===")
canvas12 = ds.Canvas(plot_width=600, plot_height=600, x_range=(-4, 4), y_range=(-4, 4))
agg12 = canvas12.points(df_base, "x", "y", agg=rd.count())
img12 = tf.shade(agg12, cmap=cc.fire, how="eq_hist")
from scipy.stats import gaussian_kde
sample_idx = rng.integers(0, len(df_base), 20_000)
kde = gaussian_kde(df_base.iloc[sample_idx][["x","y"]].values.T, bw_method=0.15)
gx = np.linspace(-4, 4, 80)
gy = np.linspace(-4, 4, 80)
GX, GY = np.meshgrid(gx, gy)
Z = kde(np.vstack([GX.ravel(), GY.ravel()])).reshape(GX.shape)
fig, ax = plt.subplots(figsize=(7, 6))
ax.imshow(img12.to_pil(), origin="upper", aspect="auto", extent=[-4, 4, -4, 4])
ax.contour(GX, GY, Z, levels=8, colors="white", linewidths=0.8, alpha=0.7)
ax.set_title("Section 12 – Datashader + Matplotlib Contour Overlay", fontsize=12, fontweight="bold")
ax.set_xlabel("x"); ax.set_ylabel("y")
plt.tight_layout()
plt.show()
print("\n All sections complete!")
And there you have it. You get the raw rendering power of Datashader with the fine-grained control of Matplotlib. It's a perfect combination.
So next time you're faced with a dataset that feels too big to plot, don't give up. With Datashader in your toolkit, you're ready to tackle it. You can finally stop waiting for plots to render and start actually exploring the stories hidden inside your data.




