stGPT Upgrade Plan#

This note describes how spatho can evolve from an AI-assisted Xenium spatial pathology workflow into an evidence-generating agentic workbench built on stGPT. Here stGPT means a spatial transcriptomics foundation-model evidence engine that can enrich structure discovery, cluster annotation, missing-gene reasoning, and pathology review.

The concise platform narrative is:

stGPT learns reusable contour/region morpho-molecular representations; spatho plans, validates, and turns them into auditable spatial pathology evidence.

Source Read#

Primary sources reviewed:

scGPT official repository: https://github.com/bowang-lab/scGPT
scGPT paper: https://www.nature.com/articles/s41592-024-02201-0
scGPT-spatial official repository: https://github.com/bowang-lab/scGPT-spatial
scGPT-spatial preprint record: https://sciety.org/articles/activity/10.1101/2025.02.05.636714
Hugging Face paper search for adjacent ST foundation models, including ST-Align, SToFM, HESCAPE, and SEAL.

The strongest practical path is not to rewrite scGPT from scratch. It is to treat scGPT-spatial as the reference upgrade path from single-cell GPT to spatial transcriptomics GPT, then wrap it behind a spatho-owned interface.

What scGPT Gives Us#

scGPT models one cell as a gene-token sequence with expression values. Its core value for spatho is a pretrained gene and cell representation space that can support:

cell or spot embeddings
cell type annotation
batch and multi-omic integration
perturbation and gene network tasks
transfer learning from a large single-cell corpus

The official scGPT repository provides pretrained checkpoint links and a Python package, but its model zoo is still primarily distributed through external downloads rather than as ordinary Hugging Face model repositories.

What scGPT-spatial Adds#

scGPT-spatial is the more direct stGPT substrate. It continues pretraining from scGPT on spatial transcriptomics profiles and adds:

spatial corpus pretraining across Visium, Visium HD, Xenium, and MERFISH
a Mixture-of-Experts expression decoder for protocol-aware prediction
spatially aware sampling
neighborhood-based reconstruction for local tissue context
zero-shot embeddings for multi-slide and cross-modality integration
downstream support for deconvolution and contextualized missing-gene imputation

The public scGPT-spatial code currently exposes an inference path around scgpt_spatial.tasks.embed_data(), expecting a checkpoint folder with vocab.json, args.json, best_model.pt, and gene statistics.
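A pre-flight check on the checkpoint folder can catch a bad path before any model code is imported. The sketch below is a minimal illustration that assumes only the three file names listed above; the gene-statistics file is not named in this note, so it is not checked. The helper name `missing_checkpoint_files` is hypothetical and not part of scGPT-spatial.

```python
from pathlib import Path

# File names taken from the paragraph above. The gene-statistics file is not
# named in this note, so it is deliberately left out of the required set.
REQUIRED_CHECKPOINT_FILES = ("vocab.json", "args.json", "best_model.pt")


def missing_checkpoint_files(model_dir):
    """Return the required checkpoint files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_CHECKPOINT_FILES if not (root / name).is_file()]
```

An empty list means the folder has the expected layout; a non-empty list can be reported to the user verbatim.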

Current `spatho` Gap#

spatho currently owns workflow UX, organ packs, artifact manifests, H&E contour evidence, local pathology AI review, and Xenium alignment notes. It does not yet own a transcriptomic foundation-model encoder.

Today the molecular evidence flow is mostly:

read Xenium-derived cluster and differential-expression artifacts
build marker and spatial structure evidence
ask a heuristic, OpenAI, or local pathology-AI backend to review
produce reports and manifests

The missing stGPT layer is a learned embedding and reconstruction layer between raw Xenium expression data and the existing cluster/structure/review surfaces.

Proposed Architecture#

flowchart TD
    A["Xenium / pyXenium outputs"] --> B["AnnData adapter"]
    B --> C["stGPT embedding backend"]
    C --> D["Contour / region embeddings"]
    C --> E["Missing-gene / neighborhood predictions"]
    D --> F["Evidence graph and structure summaries"]
    E --> F
    F --> G["Existing spatho evidence bundles"]
    G --> H["Heuristic / OpenAI / pathology-ai review"]
    H --> I["Reports and artifact manifest"]

The adapter should keep spatho stable even if the upstream scGPT-spatial package changes.
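One way to get that stability is to have spatho code depend on a small interface of its own and keep every upstream call inside one implementation of it. The sketch below is illustrative only: `EmbeddingBackend`, `ZeroBackend`, and `embed_with` are hypothetical names, and the zero-vector backend is a stand-in for tests rather than a wrapper around scGPT-spatial.

```python
from typing import Protocol, Sequence


class EmbeddingBackend(Protocol):
    """Hypothetical spatho-owned interface; not part of scGPT-spatial."""

    def embed(self, cell_ids: Sequence[str]) -> dict[str, list[float]]:
        ...


class ZeroBackend:
    """Stand-in backend for tests: returns a fixed-size zero vector per cell."""

    def __init__(self, dim: int = 4) -> None:
        self.dim = dim

    def embed(self, cell_ids):
        return {cell_id: [0.0] * self.dim for cell_id in cell_ids}


def embed_with(backend: EmbeddingBackend, cell_ids):
    """Call the backend through the interface only, never the upstream package."""
    return backend.embed(list(cell_ids))
```

A real backend would implement the same `embed` method around the upstream inference path, so an upstream API change touches one class instead of every caller.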

The repo-level architecture should be described as:

stGPT Foundation: training, model architecture, checkpoint loading, embedding, and model packaging.
stGPT Evidence Suite: QC, deterministic splits, benchmarks, ablations, failure analysis, and domain-shift checks.
stGPT Runtime / Tool API: callable tools such as embed_cells, evaluate_checkpoint, package_model, and export_spatho_artifacts.
spatho Agentic Workbench: guardrailed workflow orchestration and human-review handoff.
spatho Reports: reproducible evidence reports that distinguish measured data from model-derived evidence.

The agentic fusion loop should be:

Plan -> Tool Calls -> QC/Critic -> Evidence Graph -> Report -> Human Review -> Model Improvement

In this loop, stGPT is a callable, schema-first evidence toolchain. spatho plans valid analysis routes, runs readiness checks, calls the toolchain or consumes precomputed artifacts, evaluates QC, and only then turns compact summaries into report language. The LLM or local pathology reviewer should receive structured evidence bundles, not raw vectors.

Implementation Phases#

Phase 0: Keep Boundaries Clean#

Keep the package name spatho. Add stGPT support as an optional module namespace with these constraints:

spatho.stgpt
optional extra: spatho[stgpt]
no hard dependency on CUDA, flash-attn, scanpy, or scGPT-spatial for normal installs
user-supplied checkpoint folder rather than bundled weights
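The "no hard dependency" constraint can be enforced with an import probe that never imports the optional package at module load time. This is a minimal sketch: the module name `scgpt_spatial` is taken from the inference path mentioned earlier in this note, while the helper names `stgpt_available` and `require_stgpt` are hypothetical.

```python
import importlib.util


def stgpt_available(module_name="scgpt_spatial"):
    """Return True when the optional top-level module can be located."""
    return importlib.util.find_spec(module_name) is not None


def require_stgpt(module_name="scgpt_spatial"):
    """Raise a readable error instead of an ImportError deep in a run."""
    if not stgpt_available(module_name):
        raise RuntimeError(
            "stGPT support is optional; install the spatho[stgpt] extra to enable it."
        )
```

Normal installs can call `stgpt_available()` to skip the stGPT path silently, while code that was explicitly asked to run it calls `require_stgpt()`.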

Initial config fields can be added once behavior exists:

stgpt_enabled
stgpt_model_dir
stgpt_device
stgpt_gene_column
stgpt_batch_size
stgpt_max_length
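As a rough illustration, these fields could live in one frozen dataclass. The field names come from the list above; every type and default below is an assumption (the `"auto"` device and batch size of 32 mirror the handshake signature in Phase 2, and `feature_name` mirrors the Phase 1 contract). The class name `StgptConfig` and the `problems` method are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StgptConfig:
    # Field names come from the list above; types and defaults are assumptions.
    stgpt_enabled: bool = False
    stgpt_model_dir: Optional[str] = None
    stgpt_device: str = "auto"
    stgpt_gene_column: str = "feature_name"
    stgpt_batch_size: int = 32
    stgpt_max_length: Optional[int] = None  # left unset; no value is given in this note

    def problems(self):
        """Return human-readable configuration problems, empty when valid."""
        found = []
        if self.stgpt_enabled and not self.stgpt_model_dir:
            found.append("stgpt_model_dir is required when stgpt_enabled is true")
        if self.stgpt_batch_size <= 0:
            found.append("stgpt_batch_size must be positive")
        return found
```

Keeping `stgpt_enabled` false by default means an existing configuration file continues to validate unchanged.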

Phase 1: Xenium to AnnData Contract#

Create a deterministic conversion layer from the current Xenium inputs to AnnData.

Minimum contract:

adata.X: cell-by-gene expression matrix
adata.obs["cell_id"]
adata.obs["cluster_id"]
adata.obs["x_um"] and adata.obs["y_um"]
adata.obsm["spatial"]
adata.var["feature_name"]
optional adata.obs["structure_id"] after structure assignment

Protein features should remain traceable as a separate modality and should not be flattened silently into gene names.
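The minimum contract can be checked with a small validator that runs before any embedding call. The sketch below is duck-typed so it needs no extra dependency: it only assumes that `obs`, `obsm`, and `var` support the `in` operator for key lookup, which should hold for an AnnData object as well as for plain dictionaries. The function name `contract_violations` is hypothetical, and the optional `structure_id` column is intentionally not required.

```python
# Keys taken from the minimum contract above.
REQUIRED_OBS = ("cell_id", "cluster_id", "x_um", "y_um")
REQUIRED_OBSM = ("spatial",)
REQUIRED_VAR = ("feature_name",)


def contract_violations(adata):
    """List the contract entries missing from an AnnData-like object."""
    missing = []
    groups = (("obs", REQUIRED_OBS), ("obsm", REQUIRED_OBSM), ("var", REQUIRED_VAR))
    for attr, keys in groups:
        container = getattr(adata, attr)
        missing += [f'{attr}["{key}"]' for key in keys if key not in container]
    # The expression matrix itself must be present.
    if getattr(adata, "X", None) is None:
        missing.append("X")
    return missing
```

Returning the full list of violations, rather than failing on the first one, lets a readiness check report everything a user has to fix in a single pass.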

Phase 2: Region-First stGPT Embeddings MVP#

Build the first useful integration around region- and structure-level stGPT artifacts. Cell embeddings remain useful as a compatibility and provenance layer, but pathology review should primarily consume contour, region, and structure summaries.

Current stable handshake:

stgpt.runtime.export_spatho_artifacts(config, checkpoint, output_dir, batch_size=32, device="auto")

Preferred target artifacts:

region_embeddings.parquet
region_cell_membership.parquet
region_molecular_summary.parquet
region_image_manifest.json
region_qc_report.json
evidence_manifest.json

Compatibility artifacts:

stgpt/cell_embeddings.parquet
stgpt/structure_embedding_summary.csv
stgpt/qc_report.json

MVP acceptance criteria:

runs without changing the behavior of the existing spatho run
works on a small Xenium fixture or generated AnnData fixture
writes deterministic artifact manifests and evidence IDs
can be disabled cleanly when optional dependencies are absent
preserves clear measured-vs-model-derived labels for reconstruction or imputation
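Deterministic evidence IDs are easiest to guarantee when the ID is derived from content rather than from timestamps or run order. The following is one possible scheme, not a settled format: the `ev-` prefix, the 16-character truncation, and the function name `evidence_id` are all assumptions made for illustration.

```python
import hashlib
import json


def evidence_id(unit, unit_id, evidence_type, payload):
    """Derive a stable ID from the evidence content, independent of run order."""
    canonical = json.dumps(
        {
            "unit": unit,
            "unit_id": unit_id,
            "evidence_type": evidence_type,
            "payload": payload,
        },
        sort_keys=True,  # key order must not change the hash
        separators=(",", ":"),  # fixed separators avoid whitespace differences
    )
    return "ev-" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
```

Because the JSON is serialized with sorted keys, two runs that produce the same summary for the same region yield the same ID, which makes manifests diffable across reruns.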

Phase 3: Feed stGPT Evidence Into Reviews#

After embeddings exist, inject summaries into the existing evidence bundles:

region and structure centroid embeddings
nearest-neighbor relationships across structures, contours, and slides
embedding coherence per cluster, region, and structure
outlier regions, ambiguous boundary cells, or weakly supported contours
spatial-neighborhood agreement score
optional missing-gene predictions for markers absent from targeted panels
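To make the first and third items concrete, the sketch below computes a centroid and one possible coherence score for a group of embeddings. The definition used here, mean cosine similarity of each member to the group centroid, is an assumption chosen for illustration; this note does not fix a formula. It uses plain Python lists so it runs without optional dependencies and assumes a non-empty group.

```python
import math


def centroid(vectors):
    """Component-wise mean of equally sized embedding vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def coherence(vectors):
    """Mean cosine similarity of each member embedding to the group centroid."""
    center = centroid(vectors)
    return sum(cosine(vector, center) for vector in vectors) / len(vectors)
```

A score near 1 suggests a tight group, while a low score flags a cluster, region, or structure whose members point in different directions in embedding space. Only the scalar, not the vectors, would be passed on to review.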

The pathology reviewer should receive compact structured evidence, not raw vectors. A preferred evidence object should include evidence_id, unit, unit_id, source, evidence_type, measured, model_derived, qc_status, summary, and supporting_artifacts.
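A minimal sketch of such an evidence object as a frozen dataclass follows. The field names are the ones listed above; the types are assumptions, and the class name `Evidence` is hypothetical.

```python
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class Evidence:
    # Field names follow the list above; the types are assumptions.
    evidence_id: str
    unit: str
    unit_id: str
    source: str
    evidence_type: str
    measured: bool
    model_derived: bool
    qc_status: str
    summary: str
    supporting_artifacts: tuple = field(default_factory=tuple)

    def to_bundle(self):
        """Return a plain dict suitable for JSON serialization into a bundle."""
        bundle = asdict(self)
        bundle["supporting_artifacts"] = list(self.supporting_artifacts)
        return bundle
```

The object deliberately has no field for raw embedding vectors, so the reviewer can only ever see the compact summary and pointers to artifacts on disk.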

Phase 4: Spatial Fine-Tuning on Local Data#

Only after the zero-shot path is useful, add a PDC/HPC fine-tuning recipe:

initialize from scGPT-spatial V1 checkpoint
curate local Xenium, Visium, MERFISH, or WTA spatial data into a shared AnnData layout
keep protocol labels for MoE routing or batch embeddings
use spatial-neighborhood reconstruction for context
evaluate on held-out slides and organs

Candidate metrics:

cell-type ARI/NMI against curated labels
batch mixing and biological conservation from scIB-style metrics
masked-gene imputation Pearson/Spearman and MSE
neighborhood label agreement
downstream report stability and reviewer confidence changes
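For the masked-gene imputation metrics, the sketch below shows MSE and Pearson correlation between held-out observed values and model predictions for one gene. It is written in plain Python for clarity and assumes two equal-length, non-empty sequences; a production evaluation would more likely use NumPy or SciPy.

```python
import math


def mse(observed, predicted):
    """Mean squared error between observed and predicted expression values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def pearson(observed, predicted):
    """Pearson correlation; returns NaN when either input is constant."""
    count = len(observed)
    mean_o = sum(observed) / count
    mean_p = sum(predicted) / count
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    sd_o = math.sqrt(sum((o - mean_o) ** 2 for o in observed))
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    return cov / (sd_o * sd_p) if sd_o and sd_p else float("nan")
```

Reporting both is useful because a prediction can correlate well with the observed values while still being badly scaled, which MSE exposes and Pearson does not.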

Phase 5: True stGPT Product Layer#

The long-term stGPT layer should become a product capability, not just a model wrapper.

Target capabilities:

foundation-model enhanced structure discovery
cross-slide and cross-modality alignment
targeted-panel missing-gene support
spatial neighborhood and niche summarization
optional H&E and transcriptome alignment, likely through a later multimodal model rather than scGPT-spatial alone

Risks#

Upstream scGPT-spatial has no packaged release at the time of review.
Model weights are on figshare rather than ordinary Hugging Face model repos.
Processed SpatialHuman30M data availability depends on original dataset licenses.
flash-attn and CUDA constraints can make installation fragile.
Xenium targeted panels may have limited overlap with the pretrained vocabulary.
Raw cell-level embeddings are large; summaries need to stay compact.
Medical/pathology review must treat stGPT outputs as evidence, not diagnosis.

Recommended Next Step#

Implement Phase 1 and Phase 2 as an optional, read-only evidence path. That gives spatho immediate value from scGPT-spatial without disturbing current workflows, and it creates the artifact contract needed for later fine-tuning or full stGPT productization.