Spatial Coordinates as a Cell Language: A Multi-Sentence Framework for Imaging Mass Cytometry Analysis
Chi-Jane Chen*, Yuhang Chen*, Sukwon Yun*, Natalie Stanley, Tianlong Chen
The University of North Carolina at Chapel Hill
*Equal contribution
Imaging mass cytometry (IMC) enables high-dimensional spatial profiling by combining mass cytometry's analytical power with spatial distributions of cell phenotypes. Recent studies leverage large language models (LLMs) to extract cell states by translating gene or protein expression into biological context. However, existing single-cell LLMs face two major challenges: (1) Integration of spatial information: they struggle to generalize spatial coordinates and effectively encode spatial context as text, and (2) Treating each cell independently: they overlook cell-cell interactions, limiting their ability to capture biological relationships. To address these limitations, we propose Spatial2Sentence, a novel framework that integrates single-cell expression and spatial information into natural language using a multi-sentence approach. Spatial2Sentence constructs expression similarity and distance matrices, pairing spatially adjacent and expressionally similar cells as positive pairs while using distant and dissimilar cells as negatives. These multi-sentence representations enable LLMs to learn cellular interactions in both expression and spatial contexts. Equipped with multi-task learning, Spatial2Sentence outperforms existing single-cell LLMs on preprocessed IMC datasets, improving cell-type classification by 5.98% and clinical status prediction by 4.18% on the diabetes dataset while enhancing interpretability.
Repository layout:

- ours/: preprocessing, training, and inference scripts for the paper
- src/cell2sentence/: core library code (data conversion, prompt formatting, model wrapper)
- src/cell2sentence/prompts/: prompt templates for cell-type, status, and multi-task settings
- data/: released datasets (CSV adjacency and processed h5ad)
- docs/, tutorials/: legacy documentation/examples from the base code
We keep the original adjacency CSVs and regenerate processed h5ad files via the preprocessing scripts:
- Diabetes IMC CSVs: data/diabete_csv_adjacency_v2/train, data/diabete_csv_adjacency_v2/test
- Brain IMC CSVs: data/brain_csv_adjacency_v2/train, data/brain_csv_adjacency_v2/test
Create a Python environment (3.8+ recommended), then install dependencies:
pip install -e .

Convert CSV adjacency files into h5ad (if you want to regenerate):
python ours/diabete_pre.py
python ours/brain_pre.py
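For orientation, the conversion these scripts perform can be sketched roughly as below. This is an illustrative sketch only, not the actual logic in ours/diabete_pre.py; the column names (x, y, cell_type) and single-directory layout are assumptions, so check the released CSVs for the real schema.

```python
# Illustrative sketch only; see ours/diabete_pre.py / ours/brain_pre.py for the real logic.
# Column names (x, y, cell_type) are assumptions, not the actual CSV schema.
import glob

import anndata as ad
import numpy as np
import pandas as pd

csv_paths = sorted(glob.glob("data/diabete_csv_adjacency_v2/train/*.csv"))
cells = pd.concat((pd.read_csv(p) for p in csv_paths), ignore_index=True)

# Treat every non-coordinate, non-label column as a protein marker channel.
marker_cols = [c for c in cells.columns if c not in ("x", "y", "cell_type")]

adata = ad.AnnData(X=cells[marker_cols].to_numpy(dtype=np.float32))
adata.obs["cell_type"] = cells["cell_type"].astype(str).to_numpy()
adata.obsm["spatial"] = cells[["x", "y"]].to_numpy()  # per-cell spatial coordinates
adata.write_h5ad("data/diabetes_train.h5ad")
```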
Fine-tune a model with multi-sentence prompts. Required arguments are --task_name, --model_name, --method, --bs, and --dataset. Example:
python ours/finetune.py \
--task_name both_pred \
--model_name <hf_model_or_local_path> \
--method s2s \
--bs 4 \
--dataset diabetes \
--model_from pretrained

--method options (an illustrative prompt sketch follows the list):
- c2s: single-sentence baseline
- s2swos: Spatial2Sentence w/o spatial pairing
- s2s: Spatial2Sentence (spatial + expression pairing)
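To make the c2s/s2s distinction concrete, here is a hypothetical sketch of how a single-sentence prompt differs from a multi-sentence one. The wording and marker values are invented for illustration and do not reproduce the actual templates in src/cell2sentence/prompts/.

```python
# Hypothetical illustration of the single- vs multi-sentence prompt idea; the real
# templates live in src/cell2sentence/prompts/ and differ in wording and structure.
def cell_sentence(markers):
    """Rank markers by expression (highest first) and join them into one 'cell sentence'."""
    return " ".join(sorted(markers, key=markers.get, reverse=True))

target = {"CD45": 9.1, "CD3": 7.4, "INS": 0.2}      # invented expression values
neighbor = {"INS": 8.8, "PDX1": 6.0, "CD45": 0.3}   # a spatially adjacent cell

# c2s-style: one sentence, the target cell in isolation.
single_sentence = f"Cell: {cell_sentence(target)}. What is its cell type?"

# s2s-style: multiple sentences pairing the target with a spatial/expression partner.
multi_sentence = (
    f"Cell A: {cell_sentence(target)}. "
    f"Cell B (spatially adjacent to Cell A): {cell_sentence(neighbor)}. "
    "Given both cells, what is the cell type of Cell A?"
)
```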
Run inference with a fine-tuned checkpoint:
python ours/inference.py \
--task_name both_pred \
--dataset diabetes \
--method s2s \
--model_path <path_to_finetuned_model>

The script prints cell-type and status accuracy and writes predictions to predictions.txt.
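If you want to post-process the output, a minimal sketch is below. It assumes predictions.txt holds one tab-separated predicted/true pair per line, which may not match the actual format written by ours/inference.py, so inspect the file first.

```python
# Assumes one "predicted\ttrue" pair per line in predictions.txt; the real format
# written by ours/inference.py may differ -- check the file before relying on this.
correct = total = 0
with open("predictions.txt") as fh:
    for line in fh:
        pred, true = line.rstrip("\n").split("\t")
        correct += pred == true
        total += 1
print(f"accuracy: {correct / total:.4f}")
```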
- All prompts live in src/cell2sentence/prompts/.
- The multi-sentence spatial pairing is implemented in src/cell2sentence/prompt_formatter.py (a conceptual sketch follows below).
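Conceptually, the pairing selects positive partners that are both spatially close and expression-similar, and negatives that are distant and dissimilar, as described in the paper. The sketch below illustrates that idea only; it is not the code in prompt_formatter.py, and the use of cosine similarity and the thresholds are assumptions.

```python
# Conceptual sketch of the positive/negative pairing idea; not the implementation in
# src/cell2sentence/prompt_formatter.py. Cosine similarity and thresholds are assumptions.
import numpy as np
from scipy.spatial.distance import cdist

def pair_cells(expr, coords, sim_thresh=0.8, near_thresh=30.0, far_thresh=100.0):
    """Return (positive, negative) partner indices per cell; -1 where none qualifies."""
    # Expression similarity matrix (cosine) and pairwise spatial distance matrix.
    normed = expr / (np.linalg.norm(expr, axis=1, keepdims=True) + 1e-8)
    sim = normed @ normed.T
    dist = cdist(coords, coords)
    np.fill_diagonal(sim, -np.inf)  # never pair a cell with itself

    positives, negatives = [], []
    for i in range(expr.shape[0]):
        pos = np.where((sim[i] >= sim_thresh) & (dist[i] <= near_thresh))[0]
        neg = np.where((sim[i] <= 1 - sim_thresh) & (dist[i] >= far_thresh))[0]
        positives.append(pos[np.argmax(sim[i][pos])] if pos.size else -1)
        negatives.append(neg[np.argmin(sim[i][neg])] if neg.size else -1)
    return np.array(positives), np.array(negatives)
```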
See LICENSE.
This project builds on the Cell2Sentence codebase (https://github.com/vandijklab/cell2sentence). We thank the authors for releasing their work and open-source tools that enabled this research.