Pre-built models

transmorph provides three pre-built models that carry out data integration in different scenarios. For each model, this section describes how it works and the use cases it is recommended for.

Low dimensional joint embedding with EmbedMNN

../_images/embedmnn_model.png

EmbedMNN is a data integration pipeline that outputs a 2D or 3D joint embedding of all batches. It requires input batches to be in overlapping spaces. The embedding can then be used for plotting, to support interpretation and hypothesis generation, as well as for clustering and cell type inference. EmbedMNN first projects all batches into their largest common gene space. It then performs dimensionality reduction with PCA on the set of batches, and uses this representation to run a matching algorithm between all pairs of batches (MNN, inspired by [HLMM18], or batch KNN, inspired by [PolanskiYM+20]). Next, it builds a joint graph of batches by combining matching edges with internal KNN edges, weighting them following UMAP [BMH+19] so that, for every cell in batch A, its strongest match in batch B has weight 1. Finally, this joint graph is embedded in a 2D or 3D space using the UMAP or MDE [AAB+21] optimizer.
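The matching step can be illustrated with a minimal sketch of mutual nearest neighbors between two batches. This is not transmorph's implementation or API: the function names, the brute-force search and the toy coordinates (standing in for PCA coordinates) are assumptions made for illustration only.

```python
from math import dist

def nearest_neighbors(queries, targets, k):
    """For each query point, return the set of indices of its k closest targets."""
    result = []
    for q in queries:
        order = sorted(range(len(targets)), key=lambda j: dist(q, targets[j]))
        result.append(set(order[:k]))
    return result

def mutual_nearest_neighbors(batch_a, batch_b, k=1):
    """Return sorted (i, j) pairs such that a_i and b_j are in each other's k-NN."""
    a_to_b = nearest_neighbors(batch_a, batch_b, k)
    b_to_a = nearest_neighbors(batch_b, batch_a, k)
    return sorted(
        (i, j)
        for i, neighbors in enumerate(a_to_b)
        for j in neighbors
        if i in b_to_a[j]
    )

# Hypothetical 2D coordinates for two small batches.
batch_a = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
batch_b = [(0.1, 0.1), (1.1, 0.0), (9.0, 9.0)]

matches = mutual_nearest_neighbors(batch_a, batch_b, k=1)
print(matches)
```

In this toy example the third cell of each batch stays unmatched: the closest cell to `batch_a[2]` is `batch_b[1]`, but `batch_b[1]` prefers `batch_a[1]`, so the relation is not mutual. A brute-force search like this one is quadratic in the number of cells; real datasets would typically call for an approximate nearest neighbor index.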

Counts correction with MNNCorrection

../_images/mnncorrection_model.png

MNNCorrection is a data integration pipeline that outputs a corrected counts matrix or a corrected PCA matrix of all batches with respect to a reference batch. It requires input batches to be in overlapping spaces. This model is inspired by Seurat [SBH+19], and works as follows. MNNCorrection first projects all batches into their largest common gene space. It then performs PCA dimensionality reduction on the set of batches, and uses this representation to run a matching algorithm between each batch and the reference batch (MNN, inspired by [HLMM18], or batch KNN, inspired by [PolanskiYM+20]). Next, it computes a correction vector for every matched cell, pointing from the cell to its estimated position in the reference batch, which is taken to be the barycenter of its matches. Each unmatched cell is assigned the correction vector of the closest corrected cell, in the geodesic sense along the nearest neighbor graph. Once all correction vectors are computed, cells are moved accordingly.
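The correction step can be sketched as follows, assuming matches have already been computed. This is an illustrative simplification and not transmorph's code: the names are hypothetical, and unmatched cells borrow the vector of the closest matched cell in Euclidean distance, whereas the model described above uses geodesic distance along the nearest neighbor graph.

```python
from math import dist

def barycenter(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def correct_batch(batch, reference, matches):
    """Move the cells of `batch` toward `reference`.

    `matches` maps a cell index in `batch` to the list of indices of its
    matched cells in `reference`. At least one cell must be matched.
    """
    # Matched cells: vector from the cell to the barycenter of its matches.
    vectors = {}
    for i, ref_ids in matches.items():
        target = barycenter([reference[j] for j in ref_ids])
        vectors[i] = tuple(t - x for t, x in zip(target, batch[i]))

    # Unmatched cells: borrow the vector of the closest matched cell.
    # Simplification: Euclidean distance instead of graph geodesics.
    matched = sorted(vectors)
    for i in range(len(batch)):
        if i not in vectors:
            closest = min(matched, key=lambda m: dist(batch[i], batch[m]))
            vectors[i] = vectors[closest]

    return [
        tuple(x + v for x, v in zip(batch[i], vectors[i]))
        for i in range(len(batch))
    ]

# Hypothetical toy data: the third cell of the batch has no match.
reference = [(10.0, 0.0), (12.0, 0.0)]
batch = [(0.0, 0.0), (1.0, 0.0), (1.5, 1.0)]
matches = {0: [0], 1: [1]}

corrected = correct_batch(batch, reference, matches)
print(corrected)
```

Here the unmatched cell is closest to the second cell of its batch, so it is shifted by the same vector, which preserves its position relative to that neighbor.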

Optimal transport-based integration with TransportCorrection

../_images/transportcorrection_model.png

TransportCorrection is a data integration pipeline that outputs a corrected counts matrix or a corrected PCA matrix of all batches with respect to a reference batch. It is inspired by SCOT [DSS+20], and uses transportation theory to assess cell-cell similarity between batches, as introduced in [SST+19] to disentangle cell fate. Transportation theory is an optimization topic concerned with finding the cheapest way to transport mass from a set of sources to a set of targets, where the cost is proportional to the mass moved and the distance traveled [PeyreC+19]. TransportCorrection requires a reference dataset but does not, in theory, need batches to be in overlapping spaces, provided that a cost matrix is available (for instance, for non-intersecting gene expression spaces, or for vertical integration between technologies such as ATAC-seq and RNA-seq). If no explicit cost matrix is provided and batches can be expressed in a common gene space, a dissimilarity metric can be used as the cost. Otherwise, if every batch representation can be endowed with a dissimilarity metric, the Gromov-Wasserstein problem can be solved instead of optimal transport to infer inter-batch matchings. The final step projects every cell to the barycenter of its matches in the reference space.
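The case where batches share a common space can be illustrated with a small entropy-regularized optimal transport sketch followed by a barycentric projection. The solver choice (Sinkhorn iterations), the squared Euclidean cost, the regularization value and all names below are assumptions made for illustration; the text above does not state which solver transmorph uses, and this sketch does not cover the Gromov-Wasserstein case.

```python
from math import dist, exp

def sinkhorn(cost, a, b, reg=1.0, n_iter=200):
    """Entropy-regularized transport plan between weight vectors a and b."""
    n, m = len(a), len(b)
    kernel = [[exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def barycentric_projection(plan, reference):
    """Send each source cell to the plan-weighted average of reference cells."""
    n_dims = len(reference[0])
    projected = []
    for row in plan:
        mass = sum(row)
        projected.append(tuple(
            sum(w * ref[d] for w, ref in zip(row, reference)) / mass
            for d in range(n_dims)
        ))
    return projected

# Hypothetical toy batches expressed in a common 2D space.
source = [(0.0, 0.0), (1.0, 0.0)]
reference = [(5.0, 0.0), (6.0, 0.0)]

# Squared Euclidean distance used as transport cost (an assumption).
cost = [[dist(s, r) ** 2 for r in reference] for s in source]
weights_src = [0.5, 0.5]
weights_ref = [0.5, 0.5]

plan = sinkhorn(cost, weights_src, weights_ref)
projected = barycentric_projection(plan, reference)
print(projected)
```

The regularization strength trades sharpness for speed: larger `reg` values blur the plan so that each cell is projected closer to the mean of the reference, while very small values approach the unregularized solution but converge slowly and can underflow in this naive formulation.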

Bibliography

[AAB+21]

Akshay Agrawal, Alnur Ali, Stephen Boyd, and others. Minimum-distortion embedding. Foundations and Trends® in Machine Learning, 14(3):211–378, 2021.

[BMH+19]

Etienne Becht, Leland McInnes, John Healy, Charles-Antoine Dutertre, Immanuel WH Kwok, Lai Guan Ng, Florent Ginhoux, and Evan W Newell. Dimensionality reduction for visualizing single-cell data using UMAP. Nature Biotechnology, 37(1):38–44, 2019.

[DSS+20]

Pinar Demetci, Rebecca Santorella, Bjorn Sandstede, William Stafford Noble, and Ritambhara Singh. Gromov-Wasserstein optimal transport to align single-cell multi-omics data. bioRxiv, 2020.

[HLMM18]

Laleh Haghverdi, Aaron TL Lun, Michael D Morgan, and John C Marioni. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nature Biotechnology, 36(5):421–427, 2018.

[PeyreC+19]

Gabriel Peyré, Marco Cuturi, and others. Computational optimal transport with applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.

[PolanskiYM+20]

Krzysztof Polański, Matthew D Young, Zhichao Miao, Kerstin B Meyer, Sarah A Teichmann, and Jong-Eun Park. BBKNN: fast batch alignment of single cell transcriptomes. Bioinformatics, 36(3):964–965, 2020.

[SST+19]

Geoffrey Schiebinger, Jian Shu, Marcin Tabaka, Brian Cleary, Vidya Subramanian, Aryeh Solomon, Joshua Gould, Siyan Liu, Stacie Lin, Peter Berube, and others. Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell, 176(4):928–943, 2019.

[SBH+19]

Tim Stuart, Andrew Butler, Paul Hoffman, Christoph Hafemeister, Efthymia Papalexi, William M Mauck III, Yuhan Hao, Marlon Stoeckius, Peter Smibert, and Rahul Satija. Comprehensive integration of single-cell data. Cell, 177(7):1888–1902, 2019.