Aicosoft - AI & Technology News, Insights & Innovation

So, you’ve got your hands on a single-cell RNA sequencing dataset. It’s exciting, right? It’s like being handed a snapshot of a bustling city, where every single cell is a person with a unique job and story. But looking at the raw data file can feel… well, a little intimidating. It’s just a massive table of numbers. How do you go from that to a beautiful map that tells you, "These are T-cells, those are B-cells, and over here are the monocytes"?

That’s where we come in. And our tool of choice for this adventure is Scanpy, a fantastic Python library that makes single-cell analysis feel way more intuitive than it has any right to be.

In this walkthrough, we’re going to build a complete analysis pipeline together, from start to finish. I’m not just going to throw code at you; I’ll explain the why behind each step. Think of me as your co-pilot. By the end, you’ll have a clear, annotated map of your cell populations and a solid foundation for your own projects. Let’s get started.

First Things First: Setting Up Our Workshop

Before we can analyze anything, we need to set up our environment. It’s like getting all your ingredients and tools ready before you start cooking. We need to install Scanpy and a few of its friends that help with things like clustering and plotting.

Don't worry, we can handle this right from our notebook.

import sys
import subprocess
import importlib

def pip_install(*packages):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", *packages])

required = [
    "scanpy",
    "anndata",
    "leidenalg",
    "igraph",
    "harmonypy",
    "seaborn"
]
pip_install(*required)

import os
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scanpy as sc
import anndata as ad

sc.settings.verbosity = 2
sc.settings.set_figure_params(dpi=110, facecolor="white", frameon=False)
np.random.seed(42)

print("Scanpy version:", sc.__version__)

With our libraries imported, let’s load up our data. We'll use a classic dataset that Scanpy provides for us: 3,000 peripheral blood mononuclear cells (PBMCs). It’s a great starting point because it contains a nice mix of common immune cells.

adata = sc.datasets.pbmc3k()
adata.var_names_make_unique()

print("\nInitial AnnData:")
print(adata)

You'll see something called an AnnData object. This is Scanpy's magic container. It holds not just our main data (the gene expression matrix) but also all the metadata about our cells (observations, obs) and genes (variables, var) that we'll generate along the way. It keeps everything neat and tidy.

The Quality Control Step: Weeding the Garden

Not every cell we sequence is a healthy, happy cell. Some might be broken or dying, and they can mess up our analysis. So, our first real job is to do some quality control (QC) to filter out the low-quality cells. Think of it as weeding the garden before you start planting.

We’ll look at a few key metrics:

n_genes_by_counts: The number of genes detected in each cell. Too few, and the cell might be dead. Too many, and it might actually be two cells stuck together (a "doublet").
total_counts: The total number of gene molecules (UMIs) in a cell. This is related to sequencing depth.
pct_counts_mt: The percentage of counts from mitochondrial genes. A high percentage often means the cell is stressed or dying.

Let’s calculate and look at these metrics.

adata.layers["counts"] = adata.X.copy()
adata.var["mt"] = adata.var_names.str.upper().str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True)

print("\nQC summary:")
display(
    adata.obs[["n_genes_by_counts", "total_counts", "pct_counts_mt"]].describe().T
)

Numbers are great, but pictures are better. Let’s visualize these QC metrics to get a feel for the distributions.

fig, axs = plt.subplots(1, 3, figsize=(15, 4))
sc.pl.violin(adata, ["n_genes_by_counts"], jitter=0.4, ax=axs[0], show=False)
sc.pl.violin(adata, ["total_counts"], jitter=0.4, ax=axs[1], show=False)
sc.pl.violin(adata, ["pct_counts_mt"], jitter=0.4, ax=axs[2], show=False)
plt.tight_layout()
plt.show()

sc.pl.scatter(adata, x="total_counts", y="n_genes_by_counts", color="pct_counts_mt")

Based on these plots, we can set some reasonable cutoffs. We'll remove cells with very few genes, cells with an unusually high number of genes, and cells with too much mitochondrial expression. We'll also filter out genes that are only expressed in a tiny number of cells.

adata = adata[adata.obs["n_genes_by_counts"] >= 200].copy()
adata = adata[adata.obs["n_genes_by_counts"] <= 5000].copy()
adata = adata[adata.obs["pct_counts_mt"] < 10].copy()
sc.pp.filter_genes(adata, min_cells=3)

print("\nAfter filtering:")
print(adata)

Much better! Our dataset is now cleaner and ready for the next steps.

Normalization and Finding the Key Players

Right now, the gene counts are all over the place. A cell with more total RNA molecules will naturally have higher counts for every gene, which can hide the real biological differences. To fix this, we normalize the data, essentially making it so every cell has the same total count. Then we apply a log transformation to stabilize the variance.

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata.copy() # Save the raw data for later

Now, not all genes are equally interesting. Some are just "housekeeping" genes that are always on. The ones that really tell us about cell identity are the ones whose expression varies a lot across different cells. These are our "highly variable genes." Let's find them.

sc.pp.highly_variable_genes(
    adata,
    flavor="seurat",
    min_mean=0.0125,
    max_mean=3,
    min_disp=0.5
)

print("\nHighly variable genes selected:", int(adata.var["highly_variable"].sum()))
sc.pl.highly_variable_genes(adata)

adata = adata[:, adata.var["highly_variable"]].copy()

By focusing on these highly variable genes, we reduce the noise and focus our analysis on the signals that actually matter.

Making Sense of the Chaos: Finding Patterns with PCA and UMAP

Even with just the highly variable genes, we're still dealing with a lot of dimensions (one for each gene). It's impossible for our human brains to visualize. This is where dimensionality reduction comes in.

First, we'll do a bit more data tidying. We'll scale the data so that each gene has a mean of 0 and a standard deviation of 1. We'll also "regress out" technical noise from things like total counts and mitochondrial percentage.

sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"])
sc.pp.scale(adata, max_value=10)

Now for the main event: Principal Component Analysis (PCA). Think of PCA as finding the main "themes" or "axes of variation" in our dataset. It condenses thousands of gene dimensions into just a handful of principal components that capture most of the important information.

sc.tl.pca(adata, svd_solver="arpack")
sc.pl.pca_variance_ratio(adata, log=True)
sc.pl.pca(adata, color=None)

The variance ratio plot helps us decide how many principal components to use. We want enough to capture the biology, but not so many that we start capturing noise. Looks like the first 30 or so are doing most of the heavy lifting.

With our PCA results, we can now build a "neighborhood graph." This is where we figure out which cells are most similar to each other in this reduced PCA space. Based on this graph, we can finally do two amazing things: clustering and visualization.

Leiden Clustering: This algorithm looks at the neighborhood graph and groups cells into tight-knit communities, or clusters.
UMAP: This is a fantastic visualization technique that takes the high-dimensional neighborhood graph and squishes it down into a 2D plot we can actually look at, while trying its best to preserve the local relationships between cells.

sc.pp.neighbors(adata, n_neighbors=12, n_pcs=30, metric="euclidean")
sc.tl.umap(adata, min_dist=0.35, spread=1.0)
sc.tl.leiden(adata, resolution=0.6, key_added="leiden")

print("\nCluster counts:")
display(adata.obs["leiden"].value_counts().sort_index().rename("cells_per_cluster").to_frame())

sc.pl.umap(adata, color=["leiden"], legend_loc="on data", title="PBMC 3k - Leiden clusters")

Look at that! We have a beautiful UMAP plot with distinct islands of cells. These are our clusters. Each color represents a group of cells that are transcriptionally similar to each other. But what are they?

Giving Our Cells an Identity: Who's Who in the Zoo?

To figure out what each cluster represents, we need to find "marker genes." These are genes that are highly expressed in one cluster compared to all the others. Scanpy makes this super easy.

sc.tl.rank_genes_groups(adata, groupby="leiden", method="wilcoxon")
sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False)

marker_table = sc.get.rank_genes_groups_df(adata, group=None)
print("\nTop marker rows:")
display(marker_table.head(20))

This gives us a ranked list of genes for each cluster. Now we can play detective. We can take these marker genes and compare them to known markers for different immune cell types.

For example, if a cluster's top markers include CD79A and MS4A1, that's a huge clue that it's a population of B-cells. If we see NKG7 and GNLY, we're likely looking at Natural Killer (NK) cells.

Let's visualize some of these known markers to see how their expression patterns line up with our clusters.

candidate_markers = [
    "IL7R", "LTB", "MALAT1", "CCR7", "NKG7", "GNLY", "PRF1", "MS4A1",
    "CD79A", "CD79B", "LYZ", "S100A8", "FCER1A", "CST3", "PPBP",
    "FCGR3A", "LGALS3", "CTSS", "CD3D", "TRBC1", "TRAC"
]

candidate_markers = [g for g in candidate_markers if g in adata.var_names]

if candidate_markers:
    sc.pl.dotplot(
        adata,
        var_names=candidate_markers,
        groupby="leiden",
        standard_scale="var",
        dendrogram=True
    )

The dot plot is fantastic. The size of the dot tells you the percentage of cells in the cluster expressing the gene, and the color tells you the average expression level. You can already start to see clear patterns matching known cell types.

A Simple Strategy for Annotation

We can automate this "detective work" with a simple scoring method. We’ll create a list of known markers for major cell types and then score each cell based on how strongly it expresses those genes.

cluster_marker_reference = {
    "T_cells": ["IL7R", "LTB", "CCR7", "CD3D", "TRBC1", "TRAC"],
    "NK_cells": ["NKG7", "GNLY", "PRF1"],
    "B_cells": ["MS4A1", "CD79A", "CD79B"],
    "Monocytes": ["LYZ", "FCGR3A", "LGALS3", "CTSS", "S100A8", "CST3"],
    "Dendritic_cells": ["FCER1A", "CST3"],
    "Platelets": ["PPBP"]
}

# Score each cell for each cell type's gene list
for celltype, genes in cluster_marker_reference.items():
    # Make sure the genes are actually in our filtered data
    valid_genes = [g for g in genes if g in adata.var_names]
    if valid_genes:
        sc.tl.score_genes(adata, gene_list=valid_genes, score_name=f"{celltype}_score", use_raw=False)

# Get the average score for each cluster
score_cols = [f"{ct}_score" for ct in cluster_marker_reference.keys()]
cluster_scores = adata.obs.groupby("leiden")[score_cols].mean()
display(cluster_scores)

Now we have a neat table showing the average "T-cell score," "B-cell score," etc., for each cluster. To annotate, we can just assign each cluster to the cell type with the highest score.

cluster_to_celltype = {}
for cluster in cluster_scores.index:
    best_match = cluster_scores.loc[cluster].idxmax().replace("_score", "")
    cluster_to_celltype[cluster] = best_match

adata.obs["cell_type"] = adata.obs["leiden"].map(cluster_to_celltype).astype("category")

print("\nCluster to cell-type mapping:")
display(pd.DataFrame.from_dict(cluster_to_celltype, orient="index", columns=["assigned_cell_type"]))

sc.pl.umap(
    adata,
    color=["leiden", "cell_type"],
    legend_loc="on data",
    wspace=0.45
)

And there we have it! Our UMAP is now labeled with cell type annotations. We've successfully turned that giant table of numbers into a meaningful biological map.

Wrapping Up and Saving Our Work

The last step is to see what we've got and save everything for future use. Let's check the proportions of each cell type we found.

cluster_prop = (
    adata.obs["cell_type"]
    .value_counts(normalize=True)
    .mul(100)
    .round(2)
    .rename("percent")
    .to_frame()
)

print("\nCell-type proportions (%):")
display(cluster_prop)

plt.figure(figsize=(7, 4))
cluster_prop["percent"].sort_values().plot(kind="barh")
plt.xlabel("Percent of cells")
plt.ylabel("Cell type")
plt.title("Estimated cell-type composition")
plt.tight_layout()
plt.show()

Finally, let’s save our annotated AnnData object and the marker gene tables. This way, we can easily come back to our analysis later without having to rerun everything.

output_dir = "scanpy_pbmc3k_outputs"
os.makedirs(output_dir, exist_ok=True)

adata.write(os.path.join(output_dir, "pbmc3k_scanpy_advanced.h5ad"))
marker_table.to_csv(os.path.join(output_dir, "cluster_markers.csv"), index=False)
cluster_scores.to_csv(os.path.join(output_dir, "cluster_score_matrix.csv"))

print(f"\nSaved outputs to: {output_dir}")

We went from a raw count matrix to a fully processed, clustered, and annotated single-cell dataset. We cleaned the data, found the most important patterns, grouped the cells into meaningful populations, and gave them biological labels.

This pipeline is a powerful and flexible starting point. You can now take this framework, apply it to your own data, and start asking all sorts of interesting biological questions. Pretty cool, right? Happy analyzing

Your Step-by-Step Guide to Building a Single-Cell RNA-Seq Pipeline in Scanpy

First Things First: Setting Up Our Workshop

The Quality Control Step: Weeding the Garden

Normalization and Finding the Key Players

Making Sense of the Chaos: Finding Patterns with PCA and UMAP

Giving Our Cells an Identity: Who's Who in the Zoo?

A Simple Strategy for Annotation

Wrapping Up and Saving Our Work

Tags

Source

Stay Updated

Related Articles

How to Analyze the Huge TaskTrove Dataset Without Downloading It

Stop Moving Your Data: Build In-Database Feature Pipelines with Ibis and DuckDB

Supercharge Your Python Simulations with NVIDIA Warp Kernels

Your Step-by-Step Guide to Building a Single-Cell RNA-Seq Pipeline in Scanpy

First Things First: Setting Up Our Workshop

The Quality Control Step: Weeding the Garden

Normalization and Finding the Key Players

Making Sense of the Chaos: Finding Patterns with PCA and UMAP

Giving Our Cells an Identity: Who's Who in the Zoo?

A Simple Strategy for Annotation

Wrapping Up and Saving Our Work

Tags

Source

Stay Updated

Related Articles

How to Analyze the Huge TaskTrove Dataset Without Downloading It

Stop Moving Your Data: Build In-Database Feature Pipelines with Ibis and DuckDB

Supercharge Your Python Simulations with NVIDIA Warp Kernels

Cookie Settings