Breast Reference Projection

This page describes the breast Xenium WTA + Perturb-seq reference projection workflow implemented in SpatialPerturb.

Goal

The workflow upgrades the analysis from “spatial data already has perturbation labels” to:

unperturbed Xenium WTA tissue + Perturb-seq reference atlas -> spatial perturbation-like program scores

The main biological question is not whether a tissue cell was perturbed, but whether spatial cell states resemble transcriptional programs learned from Perturb-seq references.

Reference Datasets

Primary reference: gse241115_breast_cropseq

  • GEO accession: GSE241115

  • Biological role: breast cancer CROP-seq reference.

  • Prepared output: standard AnnData.

  • Perturbation metadata:

    • guide_id: original sgRNA/protospacer call.

    • perturbation: guide name after removing _sgRNA\d+ or _sg\d+.

    • perturbation_status: single, multiple, or unassigned.

    • Control guides are normalized to the single label control.

  • Guide/intergenic features are recorded as uns["spatialperturb"]["barcode_columns"] and excluded from expression DE and program gene selection.
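The guide-to-perturbation mapping described above can be illustrated with a short sketch. It is not the SpatialPerturb implementation: the helper name guide_to_perturbation, the end-of-string anchoring of the suffix pattern, and the set of control labels are all assumptions made for illustration.

```python
import re

# Suffix patterns from the description above: _sgRNA<digits> or _sg<digits>.
# Anchoring at the end of the string is an assumption.
_GUIDE_SUFFIX = re.compile(r"_(?:sgRNA|sg)\d+$")

# Hypothetical control labels; the real list used by SpatialPerturb may differ.
_CONTROL_NAMES = {"control", "ctrl", "ntc", "non-targeting", "nontargeting"}


def guide_to_perturbation(guide_id: str) -> str:
    """Strip the guide suffix and collapse control guides to "control"."""
    name = _GUIDE_SUFFIX.sub("", guide_id.strip())
    if name.lower() in _CONTROL_NAMES:
        return "control"
    return name


examples = ["ESR1_sgRNA2", "GATA3_sg10", "NTC_sg1", "TP53"]
print([guide_to_perturbation(g) for g in examples])
# ['ESR1', 'GATA3', 'control', 'TP53']
```

Names without a recognized suffix pass through unchanged, so the mapping is safe to apply to already-cleaned labels.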

Optional reference: gse281048_pathway_atlas

  • GEO accession: GSE281048

  • Biological role: pathway Perturb-seq atlas, with MCF7 used as the default breast cancer subset.

  • Preparation requires Rscript and Seurat because the raw files are Seurat .rds.gz objects.

  • If R/Seurat is unavailable, A100 scripts record GSE281048_BLOCKED_RSCRIPT_MISSING and continue with the primary reference.
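The fallback for a missing R toolchain can be sketched as follows. This is an illustration only: plan_references is a hypothetical helper, and it checks only that an Rscript executable is on PATH, not that Seurat is installed. The actual A100 scripts may detect availability differently.

```python
import shutil

# Marker string taken from the text above.
BLOCKED_MARKER = "GSE281048_BLOCKED_RSCRIPT_MISSING"


def plan_references(rscript_name: str = "Rscript") -> dict:
    """Decide which references to prepare, based on whether Rscript is on PATH."""
    plan = {"references": ["gse241115_breast_cropseq"], "blocked": []}
    if shutil.which(rscript_name) is None:
        # Record the block and continue with the primary reference only.
        plan["blocked"].append(BLOCKED_MARKER)
    else:
        plan["references"].append("gse281048_pathway_atlas")
    return plan


print(plan_references())
```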

Xenium Input

read_xenium() reads real 10x Xenium output directories (the outs folder):

import spatialperturb as sp

adata = sp.read_xenium(
    "/data/taobo.hu/SpatialPerturb/inputs/xenium_wta_breast",
    cell_group_path="/data/taobo.hu/SpatialPerturb/inputs/xenium_wta_breast/WTA_Preview_FFPE_Breast_Cancer_cell_groups.csv",
    roi_geojson_path="/data/taobo.hu/SpatialPerturb/inputs/xenium_wta_breast/xenium_explorer_annotations.geojson",
    sample_name="xenium_wta_breast",
)

The reader uses:

  • cell_feature_matrix.h5 for cell-by-gene counts.

  • cells.csv.gz for centroids and cell metadata.

  • optional cell-group CSV for obs["cell_type"].

  • optional Xenium Explorer GeoJSON ROI polygons for obs["roi"].

ROI assignment uses matplotlib.path.Path.contains_points, so no shapely or geopandas dependency is required.
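A minimal sketch of polygon-based ROI assignment with matplotlib.path.Path.contains_points is shown below. The function name assign_roi, the "unassigned" fallback label, and the first-match rule for overlapping ROIs are assumptions for illustration, not the reader's documented behavior.

```python
import numpy as np
from matplotlib.path import Path


def assign_roi(centroids: np.ndarray, rois: dict) -> np.ndarray:
    """Label each (x, y) centroid with the first ROI polygon that contains it."""
    labels = np.full(len(centroids), "unassigned", dtype=object)
    for name, vertices in rois.items():
        # Path treats the polygon as closed even if the last vertex is not repeated.
        inside = Path(np.asarray(vertices, dtype=float)).contains_points(centroids)
        # Keep the earliest match so overlapping ROIs resolve deterministically.
        labels[inside & (labels == "unassigned")] = name
    return labels


# Toy polygons and centroids (made-up coordinates).
rois = {
    "tumor": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "stroma": [(20, 0), (30, 0), (30, 10), (20, 10)],
}
centroids = np.array([[5.0, 5.0], [25.0, 5.0], [15.0, 5.0]])
roi_labels = assign_roi(centroids, rois)
print(roi_labels.tolist())
# ['tumor', 'stroma', 'unassigned']
```

In practice the vertex lists would come from the coordinates of each GeoJSON polygon feature, and the centroids from the cell metadata table.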

Python API

results = sp.run_reference_projection_benchmark(
    adata,
    reference_datasets=["gse241115_breast_cropseq"],
    config={
        "cache_dir": "/data/taobo.hu/SpatialPerturb/cache",
        "k": 15,
        "groupby": ["cell_type", "roi"],
        "reference_effect_size_only": True,
    },
    output_dir="/data/taobo.hu/SpatialPerturb/reports/breast_reference_projection",
)

Default strategy:

  • Xenium graph: mode="knn", k=15.

  • Spatial aggregation: cell_type and roi.

  • Primary reference: gse241115_breast_cropseq.

  • Optional pathway reference: gse281048_pathway_atlas, filtered to cell_line == "MCF7" when available.

  • Reference DE: pseudobulk when enough samples exist; otherwise the simple mode.

  • Full-scale A100 run: reference_effect_size_only=True to rank programs by log2 fold-change without expensive per-gene significance tests.
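To make the kNN graph setting concrete, the sketch below averages a per-cell program score over each cell's k nearest spatial neighbors. It is a toy illustration under stated assumptions: the function knn_neighbor_mean is hypothetical, the cell itself is excluded from its own neighborhood, and a plain mean is used. SpatialPerturb's neighbor score may be defined differently.

```python
import numpy as np


def knn_neighbor_mean(coords: np.ndarray, scores: np.ndarray, k: int) -> np.ndarray:
    """Average the scores of each cell's k nearest neighbors (self excluded)."""
    # Brute-force pairwise distances: fine for a toy example, too slow for a full slide.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a cell is not its own neighbor
    neighbor_idx = np.argsort(dist, axis=1)[:, :k]
    return scores[neighbor_idx].mean(axis=1)


# Four cells on a line with made-up program scores.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
scores = np.array([1.0, 3.0, 5.0, 7.0])
neighbor_scores = knn_neighbor_mean(coords, scores, k=2)
print(neighbor_scores.tolist())
# [4.0, 3.0, 2.0, 4.0]
```

For a whole-slide dataset a spatial index (for example a KD-tree) would replace the dense distance matrix.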

CLI

spatialperturb prepare-xenium \
  /data/taobo.hu/SpatialPerturb/inputs/xenium_wta_breast \
  /data/taobo.hu/SpatialPerturb/prepared/xenium_wta_breast.h5ad \
  --cell-group-path /data/taobo.hu/SpatialPerturb/inputs/xenium_wta_breast/WTA_Preview_FFPE_Breast_Cancer_cell_groups.csv \
  --roi-geojson-path /data/taobo.hu/SpatialPerturb/inputs/xenium_wta_breast/xenium_explorer_annotations.geojson \
  --sample-name xenium_wta_breast

spatialperturb run-reference-benchmark \
  /data/taobo.hu/SpatialPerturb/prepared/xenium_wta_breast.h5ad \
  /data/taobo.hu/SpatialPerturb/reports/breast_reference_projection \
  --cache-dir /data/taobo.hu/SpatialPerturb/cache \
  --reference-datasets gse241115_breast_cropseq

A100 Run

The A100 run uses:

/data/taobo.hu/SpatialPerturb/cache
/data/taobo.hu/SpatialPerturb/inputs/xenium_wta_breast
/data/taobo.hu/SpatialPerturb/prepared
/data/taobo.hu/SpatialPerturb/reports/breast_reference_projection

Launch:

tmux new -d -s sp_breast_ref \
  "bash /data/taobo.hu/SpatialPerturb/code/SpatialPerturb/scripts/a100_run_breast_reference_projection.sh 2>&1 | tee /data/taobo.hu/SpatialPerturb/reports/breast_reference_projection/run.log"

bash /data/taobo.hu/SpatialPerturb/code/SpatialPerturb/scripts/a100_monitor_breast_reference_projection.sh --watch

The monitor writes:

  • status.json

  • status.md

Expected final states:

  • COMPLETE

  • COMPLETE_WITH_BLOCKED_OPTIONAL_REFERENCES

  • FAILED
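A wrapper script could translate the final state into a shell exit code, as in the sketch below. The status.json schema is not documented on this page, so the "state" key is an assumption, and both the exit-code mapping and the helper name are hypothetical.

```python
import json
from pathlib import Path

# Final states listed above, mapped to hypothetical exit codes.
EXIT_CODES = {
    "COMPLETE": 0,
    "COMPLETE_WITH_BLOCKED_OPTIONAL_REFERENCES": 0,
    "FAILED": 1,
}


def exit_code_from_status(status_path) -> int:
    """Return an exit code for a final state, or 2 if the run has not finished."""
    status = json.loads(Path(status_path).read_text())
    # "state" is an assumed key name; check the real status.json before relying on it.
    return EXIT_CODES.get(status.get("state"), 2)
```

Treating COMPLETE_WITH_BLOCKED_OPTIONAL_REFERENCES as success reflects the fact that only the optional reference was skipped; a stricter pipeline could map it to a distinct code.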

Outputs

The report directory contains:

  • input_spatial.h5ad

  • references/gse241115_breast_cropseq.h5ad

  • tables/program_scores_cell_level.tsv.gz

  • tables/program_scores_by_group.tsv

  • tables/neighbor_program_scores_cell_level.tsv.gz

  • tables/neighbor_program_scores_by_group.tsv

  • tables/reference_de.tsv

  • tables/reference_program_membership.tsv

  • tables/top_programs_by_roi_cell_type.tsv

  • tables/top_neighbor_programs.tsv

  • figures/program_scores_heatmap.png

  • figures/neighbor_program_scores_heatmap.png

  • manifest.json

  • biological_interpretation.md
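After a run, the list above can be checked mechanically. The sketch below is a convenience helper, not part of SpatialPerturb; missing_outputs is a hypothetical name and the relative paths are copied from the list.

```python
from pathlib import Path

# Relative paths copied from the output list above.
EXPECTED_OUTPUTS = [
    "input_spatial.h5ad",
    "references/gse241115_breast_cropseq.h5ad",
    "tables/program_scores_cell_level.tsv.gz",
    "tables/program_scores_by_group.tsv",
    "tables/neighbor_program_scores_cell_level.tsv.gz",
    "tables/neighbor_program_scores_by_group.tsv",
    "tables/reference_de.tsv",
    "tables/reference_program_membership.tsv",
    "tables/top_programs_by_roi_cell_type.tsv",
    "tables/top_neighbor_programs.tsv",
    "figures/program_scores_heatmap.png",
    "figures/neighbor_program_scores_heatmap.png",
    "manifest.json",
    "biological_interpretation.md",
]


def missing_outputs(report_dir) -> list:
    """Return the expected files that are absent from the report directory."""
    root = Path(report_dir)
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).is_file()]
```

An empty return value means every listed file is present; it says nothing about whether the contents are valid.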

Biological Interpretation

Interpretation defaults:

  • Cell-level score: transcriptional similarity to a Perturb-seq-derived program.

  • ROI/cell-type aggregation: where a reference-like state localizes in the tissue.

  • Neighbor score: whether adjacent cells show similar program activity.

For the breast Xenium WTA run, the strongest cleaned programs localized to 11q13 invasive tumor cell states and were enriched for luminal/secretory breast epithelial genes such as TFF1, TFF3, AGR2, MUCL1, IGFBP5, DSCAM-AS1, TIMP3, and SPTSSB.

Important caveats:

  • Projection is not causal perturbation evidence.

  • Perturb-seq references are cell-line-derived, while Xenium is FFPE tissue.

  • ROI and cell-group annotation quality directly affects interpretation.

  • Full-scale effect-size-only runs rank genes by log2 fold-change only; any p-values/FDR reported in this mode are placeholders and must not be used for statistical claims.
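The last caveat can be made concrete with a toy effect-size-only ranking. The sketch assumes a pseudocount of 1 and ranking by absolute log2 fold-change; both choices, the function names, and the gene names and values are illustrative and not taken from the SpatialPerturb code or results.

```python
import math


def log2_fold_change(perturbed_mean: float, control_mean: float, pseudocount: float = 1.0) -> float:
    """log2 ratio of mean expression, with a pseudocount to avoid division by zero."""
    return math.log2((perturbed_mean + pseudocount) / (control_mean + pseudocount))


def rank_by_effect_size(perturbed: dict, control: dict, top_n: int = 3) -> list:
    """Rank genes by absolute log2 fold-change; no p-values are computed."""
    lfc = {g: log2_fold_change(perturbed[g], control[g]) for g in perturbed if g in control}
    return sorted(lfc.items(), key=lambda item: abs(item[1]), reverse=True)[:top_n]


# Made-up mean expression values for three placeholder genes.
perturbed = {"GENE_A": 7.0, "GENE_B": 3.0, "GENE_C": 0.0}
control = {"GENE_A": 1.0, "GENE_B": 1.0, "GENE_C": 7.0}
ranked = rank_by_effect_size(perturbed, control)
print(ranked)
# [('GENE_C', -3.0), ('GENE_A', 2.0), ('GENE_B', 1.0)]
```

The ranking orders genes by magnitude of change but carries no measure of uncertainty, which is why it cannot support significance claims.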