stGPT Upgrade Plan

This note captures how spatho can evolve from an AI-assisted Xenium spatial pathology workflow into an evidence-generating agentic workbench built on stGPT, a spatial transcriptomics foundation-model evidence engine. stGPT can enrich structure discovery, cluster annotation, missing-gene reasoning, and pathology review.

The concise platform narrative is:

stGPT learns reusable contour/region morpho-molecular representations; spatho plans, validates, and turns them into auditable spatial pathology evidence.

Source Read

Primary sources reviewed:

  • scGPT official repository: https://github.com/bowang-lab/scGPT

  • scGPT paper: https://www.nature.com/articles/s41592-024-02201-0

  • scGPT-spatial official repository: https://github.com/bowang-lab/scGPT-spatial

  • scGPT-spatial preprint record: https://sciety.org/articles/activity/10.1101/2025.02.05.636714

  • Hugging Face paper search for adjacent ST foundation models, including ST-Align, SToFM, HESCAPE, and SEAL.

The strongest practical path is not to rewrite scGPT from scratch. It is to treat scGPT-spatial as the reference upgrade path from single-cell GPT to spatial transcriptomics GPT, then wrap it behind a spatho-owned interface.

What scGPT Gives Us

scGPT models one cell as a gene-token sequence with expression values. Its core value for spatho is a pretrained gene and cell representation space that can support:

  • cell or spot embeddings

  • cell type annotation

  • batch and multi-omic integration

  • perturbation and gene network tasks

  • transfer learning from a large single-cell corpus

The official scGPT repository provides pretrained checkpoint links and a Python package, but its model zoo is still primarily distributed through external downloads rather than as ordinary Hugging Face model repositories.

What scGPT-spatial Adds

scGPT-spatial is the more direct stGPT substrate. It continues pretraining from scGPT on spatial transcriptomics profiles and adds:

  • spatial corpus pretraining across Visium, Visium HD, Xenium, and MERFISH

  • a Mixture-of-Experts expression decoder for protocol-aware prediction

  • spatially aware sampling

  • neighborhood-based reconstruction for local tissue context

  • zero-shot embeddings for multi-slide and cross-modality integration

  • downstream support for deconvolution and contextualized missing-gene imputation

The public scGPT-spatial code currently exposes an inference path around scgpt_spatial.tasks.embed_data(), expecting a checkpoint folder with vocab.json, args.json, best_model.pt, and gene statistics.
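The checkpoint layout described above can be checked before any model code is imported. The following is a minimal sketch, assuming the three named files sit at the top level of the folder. The function name is hypothetical, and the gene statistics file is not checked because its filename is not stated in this note.

```python
from pathlib import Path

# File names taken from the note above. The gene-statistics file is omitted
# because its exact name is not specified in this document.
REQUIRED_CHECKPOINT_FILES = ("vocab.json", "args.json", "best_model.pt")


def missing_checkpoint_files(model_dir):
    """Return the required file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_CHECKPOINT_FILES if not (root / name).is_file()]
```

An empty return value means the folder has the expected shape; it says nothing about whether the weights are compatible with a given panel.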

Current spatho Gap

spatho currently owns workflow UX, organ packs, artifact manifests, H&E contour evidence, local pathology AI review, and Xenium alignment notes. It does not yet own a transcriptomic foundation-model encoder.

Today the molecular evidence flow is mostly:

  1. read Xenium-derived cluster and differential-expression artifacts

  2. build marker and spatial structure evidence

  3. ask a heuristic, OpenAI, or local pathology-AI backend to review

  4. produce reports and manifests

The missing stGPT layer is a learned embedding and reconstruction layer between raw Xenium expression data and the existing cluster/structure/review surfaces.

Proposed Architecture

    flowchart TD
        A["Xenium / pyXenium outputs"] --> B["AnnData adapter"]
        B --> C["stGPT embedding backend"]
        C --> D["Contour / region embeddings"]
        C --> E["Missing-gene / neighborhood predictions"]
        D --> F["Evidence graph and structure summaries"]
        E --> F
        F --> G["Existing spatho evidence bundles"]
        G --> H["Heuristic / OpenAI / pathology-ai review"]
        H --> I["Reports and artifact manifest"]

The adapter should keep spatho stable even if the upstream scGPT-spatial package changes.
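One way to keep that boundary stable is to have spatho depend on a small interface it owns rather than on upstream modules. This is a sketch under assumptions: `EmbeddingBackend`, `ZeroBackend`, and `embed_with` are hypothetical names for illustration, not part of scGPT-spatial or the current spatho code.

```python
from typing import Protocol, Sequence


class EmbeddingBackend(Protocol):
    """Hypothetical spatho-owned interface; not an upstream scGPT-spatial API."""

    def embed(self, cell_ids: Sequence[str]) -> dict[str, list[float]]:
        """Return one embedding vector per cell ID."""
        ...


class ZeroBackend:
    """Stand-in backend for tests: returns fixed-size zero vectors."""

    def __init__(self, dim: int = 4) -> None:
        self.dim = dim

    def embed(self, cell_ids: Sequence[str]) -> dict[str, list[float]]:
        return {cell_id: [0.0] * self.dim for cell_id in cell_ids}


def embed_with(backend: EmbeddingBackend, cell_ids: Sequence[str]) -> dict[str, list[float]]:
    """Caller code sees only the protocol, never the upstream package."""
    return backend.embed(cell_ids)
```

A real backend would wrap the upstream inference call behind the same `embed` method, so an upstream change touches one class instead of every caller.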

The repo-level architecture should be described as:

  • stGPT Foundation: training, model architecture, checkpoint loading, embedding, and model packaging.

  • stGPT Evidence Suite: QC, deterministic splits, benchmarks, ablations, failure analysis, and domain-shift checks.

  • stGPT Runtime / Tool API: callable tools such as embed_cells, evaluate_checkpoint, package_model, and export_spatho_artifacts.

  • spatho Agentic Workbench: guardrailed workflow orchestration and human-review handoff.

  • spatho Reports: reproducible evidence reports that distinguish measured data from model-derived evidence.

The agentic fusion loop should be:

Plan -> Tool Calls -> QC/Critic -> Evidence Graph -> Report -> Human Review -> Model Improvement

In this loop, stGPT is a callable, schema-first evidence toolchain. spatho plans valid analysis routes, runs readiness checks, calls the toolchain or consumes precomputed artifacts, evaluates QC, and only then turns compact summaries into report language. The LLM or local pathology reviewer should receive structured evidence bundles, not raw vectors.
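The "QC first, then report language" ordering can be expressed as a small gate in front of the reviewer. This sketch assumes evidence bundles are plain dictionaries, that a passing bundle carries `qc_status == "pass"`, and that raw vectors (if present) live under an `embedding` key; all three are illustrative assumptions.

```python
def reviewable_evidence(bundles):
    """Keep QC-passing bundles and strip raw vector payloads before review."""
    kept = []
    for bundle in bundles:
        # Assumed convention: only bundles marked "pass" reach the reviewer.
        if bundle.get("qc_status") != "pass":
            continue
        # Assumed key: raw vectors, if any, are stored under "embedding".
        kept.append({key: value for key, value in bundle.items() if key != "embedding"})
    return kept
```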

Implementation Phases

Phase 0: Keep Boundaries Clean

Keep the package name spatho. Add stGPT support as an optional namespace that places no new requirements on normal installs:

  • spatho.stgpt

  • optional extra: spatho[stgpt]

  • no hard dependency on CUDA, flash-attn, scanpy, or scGPT-spatial for normal installs

  • user-supplied checkpoint folder rather than bundled weights
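The "no hard dependency" rule above usually comes down to probing for the optional module before importing it. A minimal sketch, assuming the upstream top-level module is named `scgpt_spatial` (taken from the `scgpt_spatial.tasks.embed_data()` path mentioned earlier); the function name is hypothetical.

```python
import importlib.util


def stgpt_available(module_name="scgpt_spatial"):
    """Return True if the optional top-level backend module can be found."""
    # find_spec only locates a top-level module; it does not import it, so
    # heavy dependencies such as CUDA libraries are not loaded by this check.
    return importlib.util.find_spec(module_name) is not None
```

Code in `spatho.stgpt` could call this once and fall back to a clear "stGPT disabled" message instead of raising an import error during a normal run.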

Initial config fields can be added once behavior exists:

  • stgpt_enabled

  • stgpt_model_dir

  • stgpt_device

  • stgpt_gene_column

  • stgpt_batch_size

  • stgpt_max_length
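The fields above could be grouped into one immutable config object. In this sketch the field names come from the list, while every default and the `problems` helper are assumptions: `stgpt_batch_size` and `stgpt_device` mirror the handshake defaults given later in this note, and `stgpt_gene_column` mirrors the `feature_name` column in the AnnData contract.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StgptConfig:
    # Field names follow the list above; all defaults are assumptions.
    stgpt_enabled: bool = False
    stgpt_model_dir: Optional[str] = None
    stgpt_device: str = "auto"
    stgpt_gene_column: str = "feature_name"
    stgpt_batch_size: int = 32
    stgpt_max_length: Optional[int] = None  # None: defer to the checkpoint

    def problems(self):
        """Return human-readable config problems (empty list means usable)."""
        found = []
        if self.stgpt_enabled and not self.stgpt_model_dir:
            found.append("stgpt_model_dir is required when stgpt_enabled is true")
        if self.stgpt_batch_size < 1:
            found.append("stgpt_batch_size must be at least 1")
        return found
```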

Phase 1: Xenium to AnnData Contract

Create a deterministic conversion layer from the current Xenium inputs to AnnData.

Minimum contract:

  • adata.X: cell-by-gene expression matrix

  • adata.obs["cell_id"]

  • adata.obs["cluster_id"]

  • adata.obs["x_um"] and adata.obs["y_um"]

  • adata.obsm["spatial"]

  • adata.var["feature_name"]

  • optional adata.obs["structure_id"] after structure assignment

Protein features should remain traceable as a separate modality and should not be flattened silently into gene names.
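The minimum contract can be enforced with a small validator that runs before any embedding call. This sketch is duck-typed: it only needs an object with `X`, `obs`, `obsm`, and `var` attributes that support `in`, which holds for AnnData and also for simple test doubles. The function name is hypothetical, and the optional `structure_id` column is deliberately not checked.

```python
REQUIRED_OBS = ("cell_id", "cluster_id", "x_um", "y_um")
REQUIRED_OBSM = ("spatial",)
REQUIRED_VAR = ("feature_name",)


def contract_violations(adata):
    """List fields missing from the minimum contract (empty list means valid)."""
    problems = []
    if getattr(adata, "X", None) is None:
        problems.append("X")
    checks = (("obs", REQUIRED_OBS), ("obsm", REQUIRED_OBSM), ("var", REQUIRED_VAR))
    for attr, keys in checks:
        container = getattr(adata, attr, None)
        for key in keys:
            # `in` tests column names on a DataFrame and keys on a mapping.
            if container is None or key not in container:
                problems.append(f'{attr}["{key}"]')
    return problems
```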

Phase 2: Region-First stGPT Embeddings MVP

Build the first useful integration around region- and structure-level stGPT artifacts. Cell embeddings remain useful as a compatibility and provenance layer, but pathology review should primarily consume contour, region, and structure summaries.

Current stable handshake:

  • stgpt.runtime.export_spatho_artifacts(config, checkpoint, output_dir, batch_size=32, device="auto")

Preferred target artifacts:

  • region_embeddings.parquet

  • region_cell_membership.parquet

  • region_molecular_summary.parquet

  • region_image_manifest.json

  • region_qc_report.json

  • evidence_manifest.json

Compatibility artifacts:

  • stgpt/cell_embeddings.parquet

  • stgpt/structure_embedding_summary.csv

  • stgpt/qc_report.json

MVP acceptance criteria:

  • runs without changing existing spatho run behavior

  • works on a small Xenium fixture or generated AnnData fixture

  • writes deterministic artifact manifests and evidence IDs

  • can be disabled cleanly when optional dependencies are absent

  • preserves clear measured-vs-model-derived labels for reconstruction or imputation
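Deterministic manifests and evidence IDs generally mean deriving IDs from content rather than from time or random state, and serializing in a fixed order. The scheme below is a hypothetical sketch: the `ev-` prefix, the choice of hashed fields, and the 16-character truncation are all assumptions.

```python
import hashlib
import json


def evidence_id(unit, unit_id, evidence_type, source):
    """Derive a stable ID from identifying fields (hypothetical scheme)."""
    payload = json.dumps(
        {"unit": unit, "unit_id": unit_id, "evidence_type": evidence_type, "source": source},
        sort_keys=True,
        separators=(",", ":"),
    )
    return "ev-" + hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def manifest_json(entries):
    """Serialize manifest entries so equal content yields identical text."""
    ordered = sorted(entries, key=lambda entry: entry["evidence_id"])
    return json.dumps(ordered, sort_keys=True, indent=2)
```

Because nothing here depends on run time or input order, two runs over the same inputs should produce byte-identical manifests, which makes regressions easy to diff.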

Phase 3: Feed stGPT Evidence Into Reviews

After embeddings exist, inject summaries into the existing evidence bundles:

  • region and structure centroid embeddings

  • nearest-neighbor relationships across structures, contours, and slides

  • embedding coherence per cluster, region, and structure

  • outlier regions, ambiguous boundary cells, or weakly supported contours

  • spatial-neighborhood agreement score

  • optional missing-gene predictions for markers absent from targeted panels
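As one illustration of the summaries listed above, embedding coherence for a cluster, region, or structure could be defined as the mean cosine similarity between each member embedding and the group centroid. This definition is an assumption chosen for simplicity, not a metric prescribed by scGPT-spatial.

```python
import math


def _cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embedding_coherence(vectors):
    """Mean cosine similarity between each member vector and the centroid."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return sum(_cosine(v, centroid) for v in vectors) / len(vectors)
```

Values near 1 indicate a tight group; low values flag groups whose members point in different directions and may deserve an outlier or boundary check.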

The pathology reviewer should receive compact structured evidence, not raw vectors. A preferred evidence object should include evidence_id, unit, unit_id, source, evidence_type, measured, model_derived, qc_status, summary, and supporting_artifacts.
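The evidence object described above maps naturally onto a frozen dataclass. Field names follow the text; the types, the example values in comments, and the rule that at least one of `measured` or `model_derived` must be true are assumptions made for this sketch.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Evidence:
    evidence_id: str
    unit: str            # e.g. "region" or "structure" (assumed values)
    unit_id: str
    source: str          # e.g. "xenium" or "stgpt" (assumed values)
    evidence_type: str
    measured: bool
    model_derived: bool
    qc_status: str
    summary: str
    supporting_artifacts: tuple = ()

    def __post_init__(self):
        # Assumed rule: evidence must declare where it comes from.
        if not (self.measured or self.model_derived):
            raise ValueError("evidence must be measured, model-derived, or both")

    def to_dict(self):
        """Plain dictionary suitable for a structured evidence bundle."""
        return asdict(self)
```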

Phase 4: Spatial Fine-Tuning on Local Data

Only after the zero-shot path is useful, add a PDC/HPC fine-tuning recipe:

  • initialize from scGPT-spatial V1 checkpoint

  • curate local Xenium, Visium, MERFISH, or WTA spatial data into a shared AnnData layout

  • keep protocol labels for MoE routing or batch embeddings

  • use spatial-neighborhood reconstruction for context

  • evaluate held-out slides and organs
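Held-out evaluation is only meaningful if whole slides are assigned to one side of the split and the assignment is reproducible. A sketch of a hash-based slide split follows; the salt string, the default fraction, and the function names are assumptions.

```python
import hashlib


def is_held_out(slide_id, holdout_fraction=0.2, salt="stgpt-split-v1"):
    """Assign a slide to the held-out set from a hash of its ID."""
    digest = hashlib.sha256(f"{salt}:{slide_id}".encode("utf-8")).digest()
    # Map the first 8 bytes to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < holdout_fraction


def split_slides(slide_ids, holdout_fraction=0.2):
    """Return (train, held_out) lists of unique slide IDs, each sorted."""
    train, held_out = [], []
    for slide_id in sorted(set(slide_ids)):
        target = held_out if is_held_out(slide_id, holdout_fraction) else train
        target.append(slide_id)
    return train, held_out
```

Because assignment depends only on the slide ID and the salt, adding new slides later never moves an existing slide across the split. Holding out entire organs would need the same idea applied to an organ label instead.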

Candidate metrics:

  • cell-type ARI/NMI against curated labels

  • batch mixing and biological conservation from scIB-style metrics

  • masked-gene imputation Pearson/Spearman and MSE

  • neighborhood label agreement

  • downstream report stability and reviewer confidence changes
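Two of the masked-gene imputation metrics above, MSE and Pearson correlation, are simple enough to state directly. This is a dependency-free sketch for one gene across cells; a real evaluation would more likely use NumPy or SciPy, and returning NaN for constant inputs is a choice made here rather than a fixed convention.

```python
import math


def _check(observed, predicted):
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")


def mse(observed, predicted):
    """Mean squared error between held-out and imputed expression values."""
    _check(observed, predicted)
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def pearson(observed, predicted):
    """Pearson correlation; NaN when either input has zero variance."""
    _check(observed, predicted)
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        return float("nan")
    return cov / math.sqrt(var_o * var_p)
```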

Phase 5: True stGPT Product Layer

The long-term stGPT layer should become a product capability, not just a model wrapper.

Target capabilities:

  • foundation-model enhanced structure discovery

  • cross-slide and cross-modality alignment

  • targeted-panel missing-gene support

  • spatial neighborhood and niche summarization

  • optional H&E and transcriptome alignment, likely through a later multimodal model rather than scGPT-spatial alone

Risks

  • Upstream scGPT-spatial has no packaged release at the time of review.

  • Model weights are on figshare rather than ordinary Hugging Face model repos.

  • Processed SpatialHuman30M data availability depends on original dataset licenses.

  • flash-attn and CUDA constraints can make installation fragile.

  • Xenium targeted panels may have limited overlap with the pretrained vocabulary.

  • Raw cell-level embeddings are large; summaries need to stay compact.

  • Medical/pathology review must treat stGPT outputs as evidence, not diagnosis.
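The panel-overlap risk listed above can be quantified before any embedding run. This sketch assumes the checkpoint vocabulary has been loaded into a mapping or set keyed by gene symbol, and that panel and vocabulary use the same naming and casing; both are assumptions to verify against the actual checkpoint. The function name is hypothetical.

```python
def vocab_overlap(panel_genes, vocab):
    """Report how much of a targeted panel the pretrained vocabulary covers."""
    panel = set(panel_genes)
    if not panel:
        raise ValueError("panel_genes must not be empty")
    # Exact-match lookup: no alias or case normalization is attempted here.
    covered = {gene for gene in panel if gene in vocab}
    return {
        "fraction": len(covered) / len(panel),
        "missing": sorted(panel - covered),
    }
```

A low fraction is a signal to treat embeddings for that panel with caution, and the `missing` list is a natural entry for the QC report.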