Breast Reference Projection
This page describes the breast Xenium WTA + Perturb-seq reference projection workflow implemented in SpatialPerturb.
Goal
The workflow upgrades the analysis from “spatial data already has perturbation labels” to:
unperturbed Xenium WTA tissue + Perturb-seq reference atlas -> spatial perturbation-like program scores
The main biological question is not whether a tissue cell was perturbed, but whether spatial cell states resemble transcriptional programs learned from Perturb-seq references.
Reference Datasets
Primary reference: gse241115_breast_cropseq
GEO accession: GSE241115.
Biological role: breast cancer CROP-seq reference.
Prepared output: standard AnnData.
Perturbation metadata:
  guide_id: original sgRNA/protospacer call.
  perturbation: guide name after removing _sgRNA\d+ or _sg\d+.
  perturbation_status: single, multiple, or unassigned.
  Controls are normalized to control.
Guide/intergenic features are recorded as uns["spatialperturb"]["barcode_columns"] and excluded from expression DE and program gene selection.
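The metadata rules above can be sketched as a small helper. This is an illustration only, not SpatialPerturb's implementation: the function names, the ";" separator for multi-guide calls, the "+" join for multiple targets, the set of control-like names, and the example gene names are all assumptions made for the example. Only the suffix patterns and the three status values come from the text.

```python
import re

# Suffix patterns named in the text: "_sgRNA<digits>" or "_sg<digits>" at the end of a guide name.
GUIDE_SUFFIX = re.compile(r"_(?:sgRNA|sg)\d+$")

# Hypothetical set of control-like names; the real list is not given in this page.
CONTROL_NAMES = {"control", "ntc", "non-targeting", "nontargeting"}


def strip_guide_suffix(guide):
    """Remove the guide-number suffix and map control-like names to 'control'."""
    name = GUIDE_SUFFIX.sub("", guide)
    return "control" if name.lower() in CONTROL_NAMES else name


def annotate_guide_call(guide_id, sep=";"):
    """Derive perturbation and perturbation_status from one cell's guide call.

    `sep` is an assumed separator for cells carrying more than one guide.
    """
    guides = [g.strip() for g in (guide_id or "").split(sep) if g.strip()]
    if not guides:
        return {
            "guide_id": guide_id,
            "perturbation": "unassigned",
            "perturbation_status": "unassigned",
        }
    names = [strip_guide_suffix(g) for g in guides]
    status = "single" if len(guides) == 1 else "multiple"
    return {
        "guide_id": guide_id,
        "perturbation": "+".join(names),
        "perturbation_status": status,
    }


if __name__ == "__main__":
    for call in ["TP53_sgRNA2", "ESR1_sg1", "NTC_sg3", "TP53_sg1;ESR1_sgRNA2", ""]:
        print(annotate_guide_call(call))
```

Anchoring the pattern at the end of the string keeps gene names that happen to contain "_sg" elsewhere intact.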
Optional reference: gse281048_pathway_atlas
GEO accession: GSE281048.
Biological role: pathway Perturb-seq atlas, with MCF7 used as the default breast cancer subset.
Preparation requires Rscript and Seurat because the raw files are Seurat .rds.gz objects.
If R/Seurat is unavailable, A100 scripts record GSE281048_BLOCKED_RSCRIPT_MISSING and continue with the primary reference.
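The fallback behavior for the optional reference could look roughly like the sketch below. It is a hedged illustration, not the A100 script itself: the function name and return shape are hypothetical, and the check only tests whether Rscript is on PATH, not whether Seurat is installed in that R environment.

```python
import shutil

# Dataset keys and the blocked flag are taken from the text above.
PRIMARY_REFERENCE = "gse241115_breast_cropseq"
OPTIONAL_REFERENCE = "gse281048_pathway_atlas"
BLOCKED_FLAG = "GSE281048_BLOCKED_RSCRIPT_MISSING"


def select_references(which=shutil.which):
    """Return (references_to_run, blocked_flags).

    `which` is injectable so the decision can be exercised without
    depending on the local machine's PATH.
    """
    references = [PRIMARY_REFERENCE]
    blocked = []
    if which("Rscript") is None:
        # R is unavailable: record the flag and continue with the primary reference only.
        blocked.append(BLOCKED_FLAG)
    else:
        references.append(OPTIONAL_REFERENCE)
    return references, blocked


if __name__ == "__main__":
    refs, blocked = select_references()
    print("references:", refs)
    print("blocked:", blocked)
```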
Xenium Input
read_xenium() supports real 10x Xenium outs:
import spatialperturb as sp

adata = sp.read_xenium(
    "/data/taobo.hu/SpatialPerturb/inputs/xenium_wta_breast",
    cell_group_path="/data/taobo.hu/SpatialPerturb/inputs/xenium_wta_breast/WTA_Preview_FFPE_Breast_Cancer_cell_groups.csv",
    roi_geojson_path="/data/taobo.hu/SpatialPerturb/inputs/xenium_wta_breast/xenium_explorer_annotations.geojson",
    sample_name="xenium_wta_breast",
)
The reader uses:
cell_feature_matrix.h5 for cell-by-gene counts.
cells.csv.gz for centroids and cell metadata.
optional cell-group CSV for obs["cell_type"].
optional Xenium Explorer GeoJSON ROI polygons for obs["roi"].
ROI assignment uses matplotlib.path.Path.contains_points, so no shapely or geopandas dependency is required.
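A minimal sketch of point-in-polygon ROI assignment with matplotlib.path.Path.contains_points follows. It assumes the polygons have already been parsed from GeoJSON into vertex arrays keyed by ROI name. The function name, the "unassigned" default label, and the first-match-wins rule for overlapping ROIs are assumptions for illustration, not documented behavior of read_xenium().

```python
import numpy as np
from matplotlib.path import Path


def assign_rois(centroids, rois, default="unassigned"):
    """Label each centroid with the first ROI polygon that contains it.

    centroids: (n_cells, 2) array-like of x, y coordinates.
    rois: dict mapping ROI name -> (n_vertices, 2) array-like of polygon vertices.
    """
    centroids = np.asarray(centroids, dtype=float)
    labels = np.full(len(centroids), default, dtype=object)
    for name, vertices in rois.items():
        polygon = Path(np.asarray(vertices, dtype=float))
        inside = polygon.contains_points(centroids)
        # Only label cells that have not been claimed by an earlier ROI.
        labels[inside & (labels == default)] = name
    return labels


if __name__ == "__main__":
    rois = {
        "tumor": [(0, 0), (10, 0), (10, 10), (0, 10)],
        "stroma": [(20, 0), (30, 0), (30, 10), (20, 10)],
    }
    cells = [(5, 5), (25, 5), (50, 50)]
    print(assign_rois(cells, rois))
```

Points that lie exactly on a polygon edge may fall on either side, so boundary cells are best checked separately if they matter for the analysis.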
Python API
results = sp.run_reference_projection_benchmark(
    adata,
    reference_datasets=["gse241115_breast_cropseq"],
    config={
        "cache_dir": "/data/taobo.hu/SpatialPerturb/cache",
        "k": 15,
        "groupby": ["cell_type", "roi"],
        "reference_effect_size_only": True,
    },
    output_dir="/data/taobo.hu/SpatialPerturb/reports/breast_reference_projection",
)
Default strategy:
Xenium graph: mode="knn", k=15.
Spatial aggregation: cell_type and roi.
Primary reference: gse241115_breast_cropseq.
Optional pathway reference: gse281048_pathway_atlas, filtered to cell_line == "MCF7" when available.
Reference DE: pseudobulk when sufficient samples exist; otherwise simple.
Full-scale A100 run: reference_effect_size_only=True to rank programs by log2 fold-change without expensive per-gene significance tests.
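To make the effect-size-only idea concrete, here is a toy sketch that ranks genes by log2 fold-change between perturbed and control cells without computing any test statistic. It is not the package's DE routine: the function names, the pseudocount of 1, the use of raw means without normalization, and the gene labels are all assumptions for illustration.

```python
import numpy as np


def log2_fold_change(perturbed, control, pseudocount=1.0):
    """Per-gene log2 ratio of mean expression (perturbed over control).

    perturbed, control: (n_cells, n_genes) array-likes of expression values.
    The pseudocount avoids division by zero for unexpressed genes.
    """
    mean_perturbed = np.asarray(perturbed, dtype=float).mean(axis=0)
    mean_control = np.asarray(control, dtype=float).mean(axis=0)
    return np.log2((mean_perturbed + pseudocount) / (mean_control + pseudocount))


def top_genes(lfc, gene_names, n=2):
    """Return the n gene names with the largest log2 fold-change."""
    order = np.argsort(-np.asarray(lfc))
    return [gene_names[i] for i in order[:n]]


if __name__ == "__main__":
    genes = ["geneA", "geneB", "geneC"]  # placeholder labels
    perturbed = [[7, 1, 3], [7, 1, 3]]
    control = [[1, 1, 7], [1, 1, 7]]
    lfc = log2_fold_change(perturbed, control)
    print(dict(zip(genes, lfc.round(2))))
    print(top_genes(lfc, genes))
```

Because no variance is estimated, a ranking like this says nothing about significance, which is why any p-value or FDR columns from such a run should be treated as placeholders.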
CLI
spatialperturb prepare-xenium \
    /data/taobo.hu/SpatialPerturb/inputs/xenium_wta_breast \
    /data/taobo.hu/SpatialPerturb/prepared/xenium_wta_breast.h5ad \
    --cell-group-path /data/taobo.hu/SpatialPerturb/inputs/xenium_wta_breast/WTA_Preview_FFPE_Breast_Cancer_cell_groups.csv \
    --roi-geojson-path /data/taobo.hu/SpatialPerturb/inputs/xenium_wta_breast/xenium_explorer_annotations.geojson \
    --sample-name xenium_wta_breast

spatialperturb run-reference-benchmark \
    /data/taobo.hu/SpatialPerturb/prepared/xenium_wta_breast.h5ad \
    /data/taobo.hu/SpatialPerturb/reports/breast_reference_projection \
    --cache-dir /data/taobo.hu/SpatialPerturb/cache \
    --reference-datasets gse241115_breast_cropseq
A100 Run
The A100 run uses:
/data/taobo.hu/SpatialPerturb/cache
/data/taobo.hu/SpatialPerturb/inputs/xenium_wta_breast
/data/taobo.hu/SpatialPerturb/prepared
/data/taobo.hu/SpatialPerturb/reports/breast_reference_projection
Launch:
tmux new -d -s sp_breast_ref \
    "bash /data/taobo.hu/SpatialPerturb/code/SpatialPerturb/scripts/a100_run_breast_reference_projection.sh 2>&1 | tee /data/taobo.hu/SpatialPerturb/reports/breast_reference_projection/run.log"

bash /data/taobo.hu/SpatialPerturb/code/SpatialPerturb/scripts/a100_monitor_breast_reference_projection.sh --watch
The monitor writes:
status.json
status.md
Expected final states:
COMPLETE
COMPLETE_WITH_BLOCKED_OPTIONAL_REFERENCES
FAILED
Outputs
The report directory contains:
input_spatial.h5ad
references/gse241115_breast_cropseq.h5ad
tables/program_scores_cell_level.tsv.gz
tables/program_scores_by_group.tsv
tables/neighbor_program_scores_cell_level.tsv.gz
tables/neighbor_program_scores_by_group.tsv
tables/reference_de.tsv
tables/reference_program_membership.tsv
tables/top_programs_by_roi_cell_type.tsv
tables/top_neighbor_programs.tsv
figures/program_scores_heatmap.png
figures/neighbor_program_scores_heatmap.png
manifest.json
biological_interpretation.md
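After a run, a quick completeness check of the report directory can be sketched as below. The file list is copied from this page. The helper name is hypothetical and the check only tests that each file exists, not that its contents are valid.

```python
from pathlib import Path

# Relative paths listed in the Outputs section of this page.
EXPECTED_OUTPUTS = [
    "input_spatial.h5ad",
    "references/gse241115_breast_cropseq.h5ad",
    "tables/program_scores_cell_level.tsv.gz",
    "tables/program_scores_by_group.tsv",
    "tables/neighbor_program_scores_cell_level.tsv.gz",
    "tables/neighbor_program_scores_by_group.tsv",
    "tables/reference_de.tsv",
    "tables/reference_program_membership.tsv",
    "tables/top_programs_by_roi_cell_type.tsv",
    "tables/top_neighbor_programs.tsv",
    "figures/program_scores_heatmap.png",
    "figures/neighbor_program_scores_heatmap.png",
    "manifest.json",
    "biological_interpretation.md",
]


def missing_outputs(report_dir):
    """Return the expected report files that are absent under report_dir."""
    root = Path(report_dir)
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).is_file()]


if __name__ == "__main__":
    report_dir = "/data/taobo.hu/SpatialPerturb/reports/breast_reference_projection"
    missing = missing_outputs(report_dir)
    print("all outputs present" if not missing else f"missing: {missing}")
```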
Biological Interpretation
Interpretation defaults:
Cell-level score: transcriptional similarity to a Perturb-seq-derived program.
ROI/cell-type aggregation: where a reference-like state localizes in the tissue.
Neighbor score: whether adjacent cells show similar program activity.
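The three score levels above can be illustrated with a toy example that takes per-cell program scores as given. This is a sketch under stated assumptions, not the SpatialPerturb implementation: the function names are hypothetical, the neighbor score is taken to be the unweighted mean over each cell's k nearest spatial neighbors (excluding the cell itself), and the brute-force distance matrix is only suitable for small inputs.

```python
import numpy as np


def knn_indices(coords, k):
    """Indices of each cell's k nearest neighbors by Euclidean distance."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a cell is not its own neighbor
    return np.argsort(dist, axis=1)[:, :k]


def neighbor_scores(scores, coords, k):
    """Mean program score over each cell's k nearest neighbors."""
    idx = knn_indices(coords, k)
    return np.asarray(scores, dtype=float)[idx].mean(axis=1)


def mean_by_group(scores, groups):
    """Average a per-cell score within each group label (e.g. cell type or ROI)."""
    scores = np.asarray(scores, dtype=float)
    groups = np.asarray(groups)
    return {str(g): float(scores[groups == g].mean()) for g in np.unique(groups)}


if __name__ == "__main__":
    coords = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
    scores = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
    groups = ["a", "a", "a", "b", "b", "b"]
    print(neighbor_scores(scores, coords, k=2))
    print(mean_by_group(scores, groups))
```

For tissue-scale data a spatial index (for example a k-d tree) would replace the dense distance matrix; the default k=15 mentioned earlier applies to the real graph, while k=2 here just keeps the toy example readable.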
For the breast Xenium WTA run, the strongest cleaned programs localized to 11q13 invasive tumor cell states and were enriched for luminal/secretory breast epithelial genes such as TFF1, TFF3, AGR2, MUCL1, IGFBP5, DSCAM-AS1, TIMP3, and SPTSSB.
Important caveats:
Projection is not causal perturbation evidence.
Perturb-seq references are cell-line-derived, while Xenium is FFPE tissue.
ROI and cell-group annotation quality directly affects interpretation.
Full-scale effect-size-only runs rank genes by log2 fold-change; do not use placeholder p-values/FDR for statistical claims.