AI-Driven Spatial Pathologist Development Guide#
This document is the detailed handoff for spatho as it exists today.
It is written for future development work: new contributors should be able to read this file and understand what the project currently owns, what is still delegated to histoseg, how the major interfaces fit together, and where the next sensible development seams are.
Current snapshot:
package: spatho
current public version: 0.1.0
role: public-facing product layer for the AI-driven spatial pathologist workflow
current engine dependency: histoseg>=0.1.9.1
current primary input assumption: Xenium-style output directories plus a base pipeline config
1. Product Intent#
spatho is not the low-level segmentation engine.
It is the product layer above the engine.
The intended user experience is:
initialize a workflow for a case
validate environment and inputs
run a full analysis
inspect human-readable reports and machine-readable artifacts
reuse the same workflow contract across organ-specific presets
Today that product contract already exists, but a meaningful part of execution still happens in histoseg.
2. Repository Boundaries#
The current system spans multiple repositories and services. Understanding those boundaries is the most important prerequisite for future work.
spatho repo#
Repo:
D:\GitHub\AI-Driven-Spatial-Pathologist
Owns:
public package name and packaging
CLI and Python API
workflow schema
workflow templates
organ packs
artifact manifest
public docs, roadmap, commercialization plan
release and PyPI publishing setup
Does not yet own:
low-level segmentation logic
full cluster annotation implementation
pathology review engine internals
H&E overlay generation
histoseg repo#
Repo:
D:\GitHub\HistoSeg
Currently owns the actual execution path behind spatho run, including:
annotation pipeline orchestration
evidence-pack building
OpenAI-driven annotation
base pipeline execution
structure discovery and report generation
pathology review backends
Key modules:
D:\GitHub\HistoSeg\src\histoseg\annotation
D:\GitHub\HistoSeg\src\histoseg\spatial_pathologist
segmentation_methods project repo#
Repo:
D:\GitHub\sfplot\segmentation_methods
This still provides case configs, references, and some pipeline code/assets used during real runs. It is still part of the effective runtime surface even though it is not the public product repo.
pathology-ai service#
Repo/service:
C:\Users\taobo.hu\Projects\pathology-ai
This is an optional pathology review backend used through HTTP when pathology_review_backend = "pathology_ai_api".
It is not bundled into spatho; it is treated as an external local service.
3. Current High-Level Architecture#
flowchart TD
A["User / customer"] --> B["spatho CLI or Python API"]
B --> C["Workflow schema validation"]
C --> D["Organ pack defaults and workflow template"]
D --> E["histoseg full auto workflow"]
E --> F["Cluster annotation pipeline"]
E --> G["Base structure pipeline"]
E --> H["Pathology review"]
H --> I["heuristic backend"]
H --> J["OpenAI backend"]
H --> K["pathology-ai API backend"]
F --> L["Annotation artifacts"]
G --> M["Pipeline artifacts"]
H --> N["Pathology review artifacts"]
L --> O["artifact_manifest.json"]
M --> O
N --> O
The most important architectural reality today is:
spatho defines the public contract
histoseg executes that contract
That is acceptable for v0.1.x, but the long-term direction is to invert that relationship so histoseg becomes a narrower engine dependency.
4. Current Public Interfaces#
CLI#
Entry point:
D:\GitHub\AI-Driven-Spatial-Pathologist\src\spatho\cli.py
Commands currently exposed:
spatho run
spatho init-workflow
spatho doctor
spatho list-organ-packs
spatho config-schema
spatho build-manifest
Current CLI philosophy:
keep the user contract simple
accept a workflow JSON as the main unit of execution
print JSON results for composability
Python API#
Primary surface:
D:\GitHub\AI-Driven-Spatial-Pathologist\src\spatho\api.py
Exported functions:
run_workflow
workflow_doctor_report
list_available_organ_packs
init_workflow
write_schema
build_manifest
Current behavior:
validate config with Pydantic
delegate workflow execution to histoseg.spatial_pathologist.full_auto
write an artifact manifest after workflow completion
Legacy deployment surface#
Legacy app:
D:\GitHub\AI-Driven-Spatial-Pathologist\main.py
This should be treated as a deployment surface, not the product definition.
The core public interface should continue to be the package in src/spatho.
5. Workflow Contract#
Canonical config model#
Schema source:
D:\GitHub\AI-Driven-Spatial-Pathologist\src\spatho\schema.py
Formal exported schema:
D:\GitHub\AI-Driven-Spatial-Pathologist\schemas\workflow.schema.json
The WorkflowConfig model currently defines:
case identity
study context
base pipeline config path
output root
organ taxonomy
pathology review backend settings
OpenAI settings
annotation thresholds
structure review thresholds
Important design choices:
extra="forbid" keeps the config strict
relative paths are resolved relative to the workflow JSON location
annotation_taxonomy is validated against registered organ packs
pathology_review_backend is explicitly constrained to: heuristic, openai, pathology_ai_api
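The strictness and path rules above can be illustrated without the real model. What follows is a minimal standard-library sketch, not the actual WorkflowConfig (which is a Pydantic model in src/spatho/schema.py). The field names `case_id` and `base_config_path`, the helper `load_workflow`, and the taxonomy values are hypothetical stand-ins.

```python
import json
from pathlib import Path

# Hypothetical stand-ins for illustration only; the real schema is a Pydantic model.
ALLOWED_FIELDS = {
    "case_id",
    "base_config_path",
    "annotation_taxonomy",
    "pathology_review_backend",
}
ALLOWED_BACKENDS = {"heuristic", "openai", "pathology_ai_api"}
REGISTERED_TAXONOMIES = {"lung", "breast"}  # assumed to mirror the bundled pack ids


def load_workflow(workflow_json: Path) -> dict:
    """Load a workflow file, rejecting unknown keys and unsupported values."""
    raw = json.loads(workflow_json.read_text(encoding="utf-8"))

    # Same spirit as extra="forbid": unknown keys are errors, never silently ignored.
    unknown = set(raw) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"unknown workflow fields: {sorted(unknown)}")

    if raw.get("annotation_taxonomy") not in REGISTERED_TAXONOMIES:
        raise ValueError(f"unregistered taxonomy: {raw.get('annotation_taxonomy')!r}")
    if raw.get("pathology_review_backend") not in ALLOWED_BACKENDS:
        raise ValueError(f"unsupported backend: {raw.get('pathology_review_backend')!r}")
    if "base_config_path" not in raw:
        raise ValueError("base_config_path is required")

    # Relative paths resolve against the workflow JSON's directory, not the CWD.
    base = Path(raw["base_config_path"])
    if not base.is_absolute():
        base = workflow_json.parent / base
    raw["base_config_path"] = base.resolve()
    return raw
```

Resolving against the workflow file's own directory means a workflow JSON can be moved together with its sibling configs without editing paths.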
Workflow templates#
Template builder:
D:\GitHub\AI-Driven-Spatial-Pathologist\src\spatho\templates.py
Current template behavior:
auto-fills Xenium differential expression CSV path
auto-fills Xenium UMAP projection path
applies organ-pack workflow defaults
defaults pathology review backend to openai
enables recomputation by default in generated templates
This is intentionally opinionated and currently optimized for internal/project use rather than maximum generality.
6. Organ Packs#
Registry:
D:\GitHub\AI-Driven-Spatial-Pathologist\src\spatho\organ_packs\registry.py
Current bundled packs:
lung
breast
Backing data:
D:\GitHub\AI-Driven-Spatial-Pathologist\src\spatho\organ_packs\data\lung.json
D:\GitHub\AI-Driven-Spatial-Pathologist\src\spatho\organ_packs\data\breast.json
Each organ pack currently carries:
id
display_name
annotation_taxonomy
description
default_study_context
supported_input_layout
workflow_defaults
artifact_contract
Why this matters:
organ packs are the current public abstraction for organ-specific defaults
they are the natural place to keep public-safe rules and metadata
they are also the natural future extension point for additional organs or disease programs
Important current limitation:
organ packs are metadata packs, not full plugin modules
the actual downstream biology/pathology logic still lives mostly in histoseg and the sibling project assets
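To make the "metadata pack" idea concrete, the fields each organ pack carries could be modeled roughly as below. This is a sketch under assumptions: the field names come from the list above, but the field types, the `OrganPack` class, and the `pack_from_dict` and `apply_defaults` helpers are hypothetical and do not describe the actual registry code.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class OrganPack:
    """Metadata-only description of an organ pack (types are assumed)."""

    id: str
    display_name: str
    annotation_taxonomy: str
    description: str
    default_study_context: str
    supported_input_layout: str
    workflow_defaults: dict[str, Any] = field(default_factory=dict)
    artifact_contract: dict[str, Any] = field(default_factory=dict)


def pack_from_dict(payload: dict[str, Any]) -> OrganPack:
    # The dataclass constructor raises TypeError on unknown or missing keys,
    # so a malformed pack JSON fails at load time rather than mid-run.
    return OrganPack(**payload)


def apply_defaults(pack: OrganPack, overrides: dict[str, Any]) -> dict[str, Any]:
    # Pack defaults first, then user-supplied values win on conflict.
    return {**pack.workflow_defaults, **overrides}
```

Because the pack is pure data, adding a new organ under this model would mean adding a JSON file rather than new code, which matches the "metadata packs, not full plugin modules" limitation noted above.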
7. End-to-End Execution Path#
Public entry#
spatho run and run_workflow() both end up here:
D:\GitHub\AI-Driven-Spatial-Pathologist\src\spatho\api.py
Execution delegation#
run_workflow() currently does three high-level things:
validate workflow JSON using WorkflowConfig
optionally disable OpenAI if heuristic_only=True
call histoseg.spatial_pathologist.full_auto.run_full_auto_spatial_pathologist
histoseg full-auto runner#
Current implementation:
D:\GitHub\HistoSeg\src\histoseg\spatial_pathologist\full_auto.py
The current full-auto sequence is:
load base pipeline config
infer or resolve differential_expression.csv
infer or resolve projection.csv
run cluster annotation pipeline
write a generated runtime base config
ensure base pipeline outputs exist
run structure-level pathology review
write workflow_summary.json
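The "infer or resolve" steps in the sequence above can be pictured as: use an explicitly configured path when one is given, otherwise search the case directory. The sketch below is one possible reading, not the histoseg implementation. The helper name `resolve_or_infer` and the `preferred_dir` parameter are hypothetical, and the directory names used in the usage note are assumptions about a Xenium-style layout.

```python
from pathlib import Path
from typing import Optional


def resolve_or_infer(
    explicit: Optional[str],
    search_root: Path,
    filename: str,
    preferred_dir: str,
) -> Path:
    """Return the configured file if given, else find it under search_root."""
    if explicit:
        path = Path(explicit)
        if not path.is_file():
            raise FileNotFoundError(f"configured path does not exist: {path}")
        return path

    # Several analyses may emit a file with the same name, so keep only the
    # copies whose parent directory matches the preferred analysis.
    matches = sorted(
        p for p in search_root.rglob(filename) if p.parent.name == preferred_dir
    )
    if len(matches) != 1:
        raise FileNotFoundError(
            f"expected exactly one {filename} in a {preferred_dir!r} directory "
            f"under {search_root}, found {len(matches)}"
        )
    return matches[0]
```

A call such as `resolve_or_infer(None, case_dir, "differential_expression.csv", "gene_expression_graphclust")` would then select the graph-based clustering result and fail loudly instead of guessing when the layout is ambiguous.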
Annotation step#
Delegated to:
histoseg.annotation.run_cluster_annotation_pipeline
Inputs include:
cluster CSV
differential expression CSV
projection CSV
taxonomy choice
OpenAI settings
Outputs include:
cluster evidence JSON
cluster annotations JSON/CSV
compatibility annotation CSV
annotation case review JSON
annotation HTML report
Base spatial pipeline step#
Still driven through the generated runtime base config and current pipeline assets. This is where structure assignment, clustermap generation, and validation overlays happen.
Pathology review step#
Delegated to:
D:\GitHub\HistoSeg\src\histoseg\spatial_pathologist\runner.py
This builds a case bundle and then performs:
cluster review
structure review
case summary
HTML report writing
8. Pathology Review Backends#
This is currently the most important configurable branch in the product.
heuristic#
Behavior:
fully local
no LLM dependency
deterministic baseline summaries and review priorities
Use case:
smoke tests
offline operation
fallback mode when model/API access is unavailable
openai#
Behavior:
uses structured JSON outputs
cluster review, structure review, and case summary all go through the OpenAI backend
falls back to heuristics if calls fail
Current implementation path:
D:\GitHub\HistoSeg\src\histoseg\spatial_pathologist\openai_client.py
D:\GitHub\HistoSeg\src\histoseg\spatial_pathologist\runner.py
Important notes:
spatho itself does not directly call the OpenAI API
it passes provider settings through to histoseg
openai_store is already part of the contract and defaults to false
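The "falls back to heuristics if calls fail" behavior can be sketched as a small wrapper. This is an illustrative pattern only: `review_with_fallback`, the `review_source` key, and the callable signatures are hypothetical and are not taken from histoseg.

```python
import logging
from typing import Callable

logger = logging.getLogger("review_fallback_sketch")

# A review function takes an evidence bundle and returns a result dict.
ReviewFn = Callable[[dict], dict]


def review_with_fallback(primary: ReviewFn, fallback: ReviewFn, evidence: dict) -> dict:
    """Try the model-backed review first; use the local one if it fails."""
    try:
        result = primary(evidence)
        source = "primary"
    except Exception as exc:  # broad on purpose: any provider failure triggers fallback
        logger.warning("primary review failed (%s); using fallback", exc)
        result = fallback(evidence)
        source = "fallback"
    # Record which path produced the result so a report can disclose it.
    return {**result, "review_source": source}
```

Tagging the result with its source keeps a silent fallback from being mistaken for a model-backed review when the report is read later.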
pathology_ai_api#
Behavior:
uses the local pathology-ai service for structure-level and case-level pathology interpretation
cluster cell-type annotation can also use local LLM refinement when cluster_annotation_backend = "pathology_ai_api"
without that explicit annotation backend, cluster review remains the existing heuristic/OpenAI path
merges textbook-grounded answers and citations into the structure and case results
Current implementation path:
D:\GitHub\AI-Driven-Spatial-Pathologist\src\spatho\local_annotation.py
D:\GitHub\AI-Driven-Spatial-Pathologist\src\pathology_ai_service\server.py
D:\GitHub\HistoSeg\src\histoseg\spatial_pathologist\pathology_ai_api.py
D:\GitHub\HistoSeg\src\histoseg\spatial_pathologist\runner.py
Current assumptions:
service base URL defaults to http://127.0.0.1:8000
service exposes /health
cluster annotation requests use /annotations/cluster
review requests are phrased as pathology questions over a structure or whole-case evidence bundle
This backend is the newest review path and should be considered an active development area.
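Because the service is external and local, a pre-flight reachability check is a natural companion to the assumptions listed above. The sketch below uses only the standard library and only the documented base URL and /health path. The function name `service_is_healthy` and the injectable `opener` parameter are hypothetical, and since the response body of /health is not documented here, only the HTTP status is inspected.

```python
import urllib.error
import urllib.request

DEFAULT_BASE_URL = "http://127.0.0.1:8000"


def service_is_healthy(
    base_url: str = DEFAULT_BASE_URL,
    timeout: float = 2.0,
    opener=urllib.request.urlopen,
) -> bool:
    """Return True when GET <base_url>/health answers with HTTP 200."""
    url = base_url.rstrip("/") + "/health"
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as unhealthy.
        return False
```

The `opener` argument exists so that tests can substitute a fake transport instead of requiring a running service.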
9. Artifact Contract#
Artifact manifest implementation:
D:\GitHub\AI-Driven-Spatial-Pathologist\src\spatho\manifest.py
Primary output:
artifact_manifest.json
Current required artifact categories:
workflow
annotation
pathology
pipeline
Representative tracked artifacts:
workflow_summary.json
generated runtime config
cluster evidence JSON
cluster annotations JSON/CSV
compatibility annotation CSV
annotation HTML report
pathology HTML report
structure reviews JSON
case summary JSON
structure clustermap PDF
cluster-structure lookup CSV
structure assignment summary JSON
Xenium explorer annotation summary CSV
Why the manifest matters:
it makes the workflow outputs inspectable programmatically
it is the current best foundation for future service-mode delivery
it is the natural place to attach future metadata such as billing, provenance, and schema versions
Current limitation:
manifest versioning is still minimal
report schemas and workflow compatibility rules are not yet formally versioned end-to-end
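To show how a manifest makes outputs programmatically inspectable, here is a minimal sketch that records artifacts by category and notes whether each file exists. It is not the code in src/spatho/manifest.py: the function name `build_manifest_sketch`, the entry keys, and the `manifest_version` field are hypothetical. Only the four required categories and the output file name come from this guide.

```python
import json
from pathlib import Path

REQUIRED_CATEGORIES = ("workflow", "annotation", "pathology", "pipeline")


def build_manifest_sketch(output_root: Path, artifacts: dict[str, dict[str, str]]) -> dict:
    """Write artifact_manifest.json from {category: {name: relative_path}}."""
    missing = [c for c in REQUIRED_CATEGORIES if c not in artifacts]
    if missing:
        raise ValueError(f"missing artifact categories: {missing}")

    entries = []
    for category, named_paths in artifacts.items():
        for name, relative_path in named_paths.items():
            entries.append(
                {
                    "category": category,
                    "name": name,
                    "path": relative_path,
                    # Existence is recorded, not enforced, so partial runs stay inspectable.
                    "exists": (output_root / relative_path).is_file(),
                }
            )

    manifest = {"manifest_version": 1, "artifacts": entries}
    target = output_root / "artifact_manifest.json"
    target.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest
```

A consumer can then filter on `exists` to learn which outputs a run actually produced without walking the output tree.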
10. stGPT Agentic Evidence Workbench#
The stGPT integration should be developed as an agentic evidence workbench, not as a direct raw-embedding interpreter inside spatho.
Unified platform statement:
stGPT learns reusable contour/region morpho-molecular representations; spatho plans, validates, and turns them into auditable spatial pathology evidence.
Workbench responsibility#
spatho owns the orchestration and review surface:
planner: decide which evidence routes are valid for a case, such as H&E contour evidence, RNA foundation evidence, pathway evidence, pyXenium topology, or stGPT artifacts
executor: call deterministic tools and optional model backends, then collect their output artifacts
critic: run readiness checks, QC guardrails, coverage checks, and warning-to-report language
reporter: convert compact evidence bundles into manifest entries and human-readable report sections
human handoff: mark low-confidence, conflicting, novel, or QC-flagged outputs for expert review
spatho should not make biological claims from raw embedding vectors. It should consume stGPT-exported evidence artifacts and summaries, then attach provenance, QC status, and review state to every downstream claim.
Current stGPT backend contract#
The current workflow config supports two stGPT evidence modes:
stgpt_backend="precomputed_artifacts": the default read-only mode. spatho validates and consumes an existing artifact directory without importing stgpt.
stgpt_backend="local_stgpt": the local runtime mode. spatho calls stgpt.runtime.export_spatho_artifacts(config, checkpoint, output_dir, batch_size=32, device="auto") when stgpt_model_path and stgpt_config_path are configured.
The first stable handshake between the repositories is export_spatho_artifacts. Its output should be treated as a signed evidence package rather than a free-form model dump.
Preferred stGPT artifacts for the next upgrade:
region_embeddings.parquet
region_cell_membership.parquet
region_molecular_summary.parquet
region_image_manifest.json
region_qc_report.json
evidence_manifest.json
Compatibility artifacts should remain supported:
cell_embeddings.parquet
structure_embedding_summary.csv
qc_report.json
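In the read-only precomputed_artifacts mode, validation can start with a simple presence check over the two artifact sets listed above. The sketch below is an assumption about how such a check might look. The function `classify_artifact_dir` and the layout labels `region_first`, `cell_level_compat`, and `incomplete` are hypothetical, and only the file names come from this guide.

```python
from pathlib import Path

PREFERRED_ARTIFACTS = (
    "region_embeddings.parquet",
    "region_cell_membership.parquet",
    "region_molecular_summary.parquet",
    "region_image_manifest.json",
    "region_qc_report.json",
    "evidence_manifest.json",
)
COMPAT_ARTIFACTS = (
    "cell_embeddings.parquet",
    "structure_embedding_summary.csv",
    "qc_report.json",
)


def classify_artifact_dir(artifact_dir: Path) -> dict:
    """Report which artifact layout a directory satisfies, without importing stgpt."""
    present = {p.name for p in artifact_dir.iterdir() if p.is_file()}
    missing_preferred = [n for n in PREFERRED_ARTIFACTS if n not in present]
    missing_compat = [n for n in COMPAT_ARTIFACTS if n not in present]

    if not missing_preferred:
        layout = "region_first"
    elif not missing_compat:
        layout = "cell_level_compat"
    else:
        layout = "incomplete"
    return {
        "layout": layout,
        "missing_preferred": missing_preferred,
        "missing_compat": missing_compat,
    }
```

Only file names are inspected here; checking file contents and QC verdicts would be a separate, later step.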
Evidence chain rule#
No biological conclusion should be emitted without a traceable evidence chain. Every stGPT-derived statement in a spatho report should link to:
input artifact paths and evidence IDs
checkpoint and config references
QC verdicts and warnings
tool-call provenance or runtime metadata
imputation/reconstruction flags when present
human review status when escalation is required
The intended loop is:
Plan -> Tool Calls -> QC/Critic -> Evidence Graph -> Report -> Human Review -> Model Improvement
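The evidence chain rule lends itself to a mechanical gate: a statement is emitted only when every required link is present. The sketch below is one hypothetical shape for that gate. The class `EvidenceChain` and all field names are illustrative, chosen to mirror the list of required links, and do not describe an existing spatho type.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvidenceChain:
    """Provenance attached to one stGPT-derived statement (hypothetical shape)."""

    evidence_ids: list[str]
    artifact_paths: list[str]
    checkpoint_ref: str
    config_ref: str
    qc_verdict: str
    qc_warnings: list[str] = field(default_factory=list)
    runtime_metadata: dict = field(default_factory=dict)
    imputed: bool = False
    needs_human_review: bool = False
    human_review_status: Optional[str] = None

    def missing_links(self) -> list[str]:
        """Name every required link that is absent; empty means emit is allowed."""
        required = {
            "evidence_ids": self.evidence_ids,
            "artifact_paths": self.artifact_paths,
            "checkpoint_ref": self.checkpoint_ref,
            "config_ref": self.config_ref,
            "qc_verdict": self.qc_verdict,
            "runtime_metadata": self.runtime_metadata,
        }
        missing = [name for name, value in required.items() if not value]
        # Review status is only mandatory once escalation has been requested.
        if self.needs_human_review and not self.human_review_status:
            missing.append("human_review_status")
        return missing
```

A reporter could call `missing_links()` before rendering each claim and route any non-empty result to the human handoff step instead of the report.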
11. Testing and CI#
Current tests:
D:\GitHub\AI-Driven-Spatial-Pathologist\tests\test_api.py
What is currently covered:
schema validation behavior
doctor checks for missing inputs
workflow template generation
organ-pack exposure
schema export
artifact manifest generation
What is not yet sufficiently covered:
real CLI subprocess smoke tests
cross-repo integration runs
pathology backend mocking
regression tests for generated report shape/content
fixture-based tiny cases
Current CI:
package/test workflow under .github/workflows
PyPI publish workflow under .github/workflows/publish-pypi.yml
12. Packaging and Release State#
Packaging config:
D:\GitHub\AI-Driven-Spatial-Pathologist\pyproject.toml
Important current facts:
package name on PyPI: spatho
current version source: src/spatho/__init__.py
release workflow uses GitHub Actions Trusted Publishing
the current license marker is oriented toward non-commercial research use
Current production reality:
public install works
the package is still alpha
runtime behavior still depends heavily on histoseg and existing project configuration patterns
13. Current Known Technical Debt#
These are the most important current debt items.
Cross-repo coupling#
spatho is public-facing, but large parts of real execution still depend on:
histoseg
sfplot/segmentation_methods
This makes public onboarding harder and complicates reproducibility for outside users.
Public contract vs execution ownership mismatch#
spatho owns the public interface, but not enough of the implementation.
That is acceptable temporarily, but over time it increases maintenance cost and confusion.
Legacy deployment surface ambiguity#
main.py still exists and can confuse future contributors about what the primary product surface is.
Limited fixture coverage#
The repo currently lacks a small, stable public fixture set for repeatable workflow tests.
Provider abstraction is incomplete#
The workflow contract already expresses different review backends, but provider logic is not yet abstracted at the spatho layer in a clean interface.
Config compatibility policy is not fully formalized#
There is a schema, but not yet a clearly documented migration/versioning policy for workflow files.
14. Recommended Next Development Priorities#
These are the highest-leverage next steps.
Priority 1: document and stabilize compatibility boundaries#
Do next:
define which fields in WorkflowConfig are public and stable
define compatibility rules for new config versions
add a workflow_schema_version
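One possible compatibility rule for a future workflow_schema_version field is "same major version loads, anything else is rejected". The sketch below illustrates that policy only as a starting point for discussion. The `major.minor` format, the constant `SUPPORTED_MAJOR`, and the function `check_schema_version` are all hypothetical, since no versioning policy has been decided yet.

```python
SUPPORTED_MAJOR = 1  # hypothetical: the major version this reader understands


def check_schema_version(config: dict) -> tuple[int, int]:
    """Accept 'major.minor' strings whose major matches SUPPORTED_MAJOR."""
    raw = config.get("workflow_schema_version")
    if raw is None:
        raise ValueError("workflow_schema_version is required")

    # Unpacking raises ValueError for malformed values such as '1' or '1.2.3'.
    major, minor = (int(part) for part in str(raw).split("."))

    if major != SUPPORTED_MAJOR:
        raise ValueError(
            f"workflow schema {raw} is incompatible with supported major {SUPPORTED_MAJOR}"
        )
    return major, minor
```

Under this rule, minor bumps would be reserved for additive, optional fields, while a major bump would signal that workflow files need migration.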
Priority 2: reduce runtime dependence on sibling project structure#
Do next:
move public-safe pipeline configuration helpers into spatho
progressively reduce assumptions that configs live under internal project layouts
Priority 3: formalize provider abstraction#
Do next:
define a stable review-provider interface
make openai, heuristic, and pathology_ai_api first-class interchangeable providers at the product layer
prepare for future providers such as Anthropic or local models
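A stable review-provider interface could be expressed as a structural protocol plus a name-keyed registry. The following is a sketch of that idea, not a description of existing code. `ReviewProvider`, its two methods, the toy `HeuristicProvider`, and the evidence keys `structure_id` and `qc_flag` are all hypothetical.

```python
from typing import Callable, Protocol


class ReviewProvider(Protocol):
    """Minimal contract a review backend would need to satisfy."""

    name: str

    def review_structure(self, evidence: dict) -> dict: ...

    def summarize_case(self, structure_reviews: list[dict]) -> dict: ...


class HeuristicProvider:
    """Toy local provider: flags QC-marked structures for priority review."""

    name = "heuristic"

    def review_structure(self, evidence: dict) -> dict:
        priority = "high" if evidence.get("qc_flag") else "routine"
        return {
            "structure_id": evidence["structure_id"],
            "priority": priority,
            "provider": self.name,
        }

    def summarize_case(self, structure_reviews: list[dict]) -> dict:
        high = sum(1 for r in structure_reviews if r["priority"] == "high")
        return {
            "n_structures": len(structure_reviews),
            "n_high_priority": high,
            "provider": self.name,
        }


# Factories keyed by the backend names used in the workflow config.
PROVIDERS: dict[str, Callable[[], ReviewProvider]] = {"heuristic": HeuristicProvider}


def get_provider(name: str) -> ReviewProvider:
    try:
        return PROVIDERS[name]()
    except KeyError:
        raise ValueError(f"unknown review provider: {name!r}") from None
```

With this shape, adding a provider would mean registering one more factory, and the calling code would not need to know which backend it is talking to.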
Priority 4: add fixture-backed integration tests#
Do next:
create tiny public-safe test fixtures
add smoke tests for spatho init-workflow, spatho doctor, and spatho run
mock OpenAI and pathology-ai responses
Priority 5: improve artifact and report versioning#
Do next:
add explicit manifest schema version evolution rules
add report schema/version metadata
distinguish required vs optional artifacts more formally
Priority 6: formalize stGPT evidence graph integration#
Do next:
make stGPT evidence bundles explicit in the manifest contract
preserve region-first artifacts while keeping cell-level compatibility outputs
add evidence IDs, QC status, checkpoint references, and human-review state to stGPT-derived report entries
keep precomputed_artifacts and local_stgpt as the only supported stGPT backend modes until the runtime API is stable
Priority 7: separate community and commercial layers cleanly#
Do next:
keep the local CLI and organ packs public
move organization, billing, hosted inference, and deployment surfaces into a separate service layer
15. Suggested Contributor Workflow#
When changing spatho, future contributors should follow this order:
decide whether the change belongs in spatho or histoseg
if it changes public behavior, update WorkflowConfig and docs first
if it changes outputs, update the artifact manifest logic or contract notes
add or extend tests in tests/test_api.py
keep the CLI and Python API aligned
Useful heuristic:
if the change is about public workflow UX, packaging, config, organ packs, manifests, or docs, it probably belongs in spatho
if the change is about segmentation, evidence extraction, report generation internals, or provider execution details, it may still belong in histoseg today
16. Practical Commands#
Local editable install:
pip install -e D:\GitHub\HistoSeg
pip install -e D:\GitHub\AI-Driven-Spatial-Pathologist
Run tests:
python -m pytest D:\GitHub\AI-Driven-Spatial-Pathologist\tests
Check workflow readiness:
spatho doctor --config D:\GitHub\HistoSeg\workflows\breast_s1_top_graphclust_full_auto_openai.json
Run a workflow:
spatho run --config D:\GitHub\HistoSeg\workflows\breast_s1_top_graphclust_full_auto_openai.json
Export schema:
spatho config-schema --output D:\GitHub\AI-Driven-Spatial-Pathologist\schemas\workflow.schema.json
17. Short Summary#
The current spatho project is already a real public package, but it is still a product-layer wrapper over histoseg.
That is the central fact future development needs to respect.
The right near-term strategy is not to rewrite everything at once.
It is to keep strengthening spatho as the public contract while gradually migrating product-owned logic out of histoseg and reducing dependence on internal project layouts.