A Quantitative Computational Framework for Allopolyploid 
Single-Cell Data Integration and Core Gene Ranking in 
Development
Meiyue Wang  ,1,2,3,† Zijuan Li  ,4,† Haoyu Wang  ,5,† Junwei  Zhao,1 Yuyun Zhang  ,4

Kande Lin  ,6 Shusong Zheng  ,7 Yilong Feng  ,6 Yu’e Zhang  ,7 Wan Teng  ,8 Yiping Tong  ,8

Wenli Zhang  ,6 Yongbiao Xue  ,7 Hude Mao  ,9 Hao Li  ,5 Bo Zhang  ,10 Awais Rasheed,11,12

Sridhar Bhavani,13 Chenghong Liu  ,3 Hong-Qing Ling  ,7,14,* Yue-Qing Hu  ,2,* Yijing Zhang  2,*

1Beijing Life Science Academy, Beijing, China
2State Key Laboratory of Genetic Engineering, Department of Biochemistry, Collaborative Innovation Center of Genetics and 

Development, School of Life Sciences, Institute of Plant Biology, Fudan University, Shanghai 200438, China
3Biotechnology Research Institute, Shanghai Academy of Agricultural Sciences\Shanghai Key Laboratory of Agricultural 

Genetics and Breeding, Shanghai 201106, China
4National Key Laboratory of Plant Molecular Genetics, CAS Center for Excellence in Molecular Plant Sciences, Shanghai 

Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 
200032, China

5State Key Laboratory of Crop Stress Adaptation and Improvement, School of Life Sciences, College of Agriculture, Henan 
University, Kaifeng, Henan 457004, China

6National Key Laboratory for Crop Genetics and Germplasm Enhancement and Utilization, CIC-MCP, Nanjing Agricultural 
University, Nanjing, Jiangsu 210095, China

7State Key Laboratory of Plant Cell and Chromosome Engineering, Institute of Genetics and Developmental Biology, Chinese 
Academy of Sciences, Beijing 100101, China

8Key Lab of Seed Innovation, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing 100101, 
China

9State Key Laboratory for Crop Stress Resistance and High-Efficiency Production, College of Agronomy, Northwest A&F 
University, Yangling, Shaanxi 712100, China
10Key Laboratory of Adaptation and Evolution of Plateau Biota, Northwest Institute of Plateau Biology, Chinese Academy of 
Sciences, Xining, Qinghai 81008, China
11Department of Plant Sciences, Quaid-i-Azam University, Islamabad 45320, Pakistan
12International Maize and Wheat Improvement Center (CIMMYT), China Office, c/o CAAS, Beijing, 100081, China
13International Maize and Wheat Improvement Center (CIMMYT), Km. 45, Carretera, México-Veracruz, El Batán, Texcoco CP 
56237E do. de México, Mexico
14Yazhouwan National Laboratory, Sanya, Hainan 572025, China
†These authors contributed equally to this work.

*Corresponding authors: E-mails: hqling@genetics.ac.cn; yuehu@fudan.edu.cn; zhangyijing@fudan.edu.cn.
Associate editor: Emily Josephs

Abstract
Polyploidization drives regulatory and phenotypic innovation. How the merger of different genomes contributes to 
polyploid development is a fundamental issue in evolutionary developmental biology and breeding research. 
Clarifying this issue is challenging because of genome complexity and the difficulty in tracking stochastic subgenome 
divergence during development. Recent single-cell sequencing techniques enabled probing subgenome-divergent regu-
lation in the context of cellular differentiation. However, analyzing single-cell data suffers from high error rates due to 
high dimensionality, noise, and sparsity, and the errors stack up in polyploid analysis due to the increased dimension-
ality of comparisons between subgenomes of each cell, hindering deeper mechanistic understandings. In this study, we 
develop a quantitative computational framework, called “pseudo-genome divergence quantification” (pgDQ), for 
quantifying and tracking subgenome divergence directly at the cellular level. Further comparing with cellular 

Received: April 28, 2024. Revised: July 20, 2024. Accepted: August 13, 2024
© The Author(s) 2024. Published by Oxford University Press on behalf of Society for Molecular Biology and Evolution. 
This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https:// 
creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, 
provided the original work is properly cited. For commercial re-use, please contact reprints@oup.com for reprints and translation rights 
for reprints. All other permissions can be obtained through our RightsLink service via the Permissions link on the article page on our site 
—for further information please contact journals.permissions@oup.com. Open Access

M
ethods 

Mol. Biol. Evol. 41(9):msae178 https://doi.org/10.1093/molbev/msae178 Advance Access publication August 30, 2024 1

D
ow

nloaded from
 https://academ

ic.oup.com
/m

be/article/41/9/m
sae178/7746018 by C

entro Internacional de M
ejoram

iento de M
aiz y Trigo user on 14 N

ovem
ber 2024

https://orcid.org/0000-0003-4999-4432
https://orcid.org/0000-0003-1694-1019
https://orcid.org/0009-0000-3511-1454
https://orcid.org/0000-0003-1341-8310
https://orcid.org/0000-0003-2013-7269
https://orcid.org/0000-0001-8074-1009
https://orcid.org/0009-0004-7460-2925
https://orcid.org/0000-0003-3618-8706
https://orcid.org/0000-0001-6874-7854
https://orcid.org/0000-0002-0586-6853
https://orcid.org/0000-0003-0710-1966
https://orcid.org/0000-0002-6895-8472
https://orcid.org/0000-0002-2484-9952
https://orcid.org/0000-0001-8271-9163
https://orcid.org/0000-0003-0007-5312
https://orcid.org/0000-0002-3556-5648
https://orcid.org/0000-0001-9988-2282
https://orcid.org/0000-0002-5730-0317
https://orcid.org/0000-0001-9568-9389
mailto:hqling@genetics.ac.cn
mailto:yuehu@fudan.edu.cn
mailto:zhangyijing@fudan.edu.cn
https://creativecommons.org/licenses/by-nc/4.0/
https://creativecommons.org/licenses/by-nc/4.0/
https://doi.org/10.1093/molbev/msae178


differentiation trajectories derived from single-cell RNA sequencing data allows for an examination of the relationship 
between subgenome divergence and the progression of development. pgDQ produces robust results and is insensitive 
to data dropout and noise, avoiding high error rates due to multiple comparisons of genes, cells, and subgenomes. A 
statistical diagnostic approach is proposed to identify genes that are central to subgenome divergence during develop-
ment, which facilitates the integration of different data modalities, enabling the identification of factors and pathways 
that mediate subgenome-divergent activity during development. Case studies have demonstrated that applying pgDQ 
to single-cell and bulk tissue transcriptomic data promotes a systematic and deeper understanding of how dynamic 
subgenome divergence contributes to developmental trajectories in polyploid evolution.

Key words: pgDQ, scRNA-seq, allopolyploid, subgenome diversity, evo-devo.

Introduction
Allopolyploidization, a process involving hybridization and 
subsequent genome doubling, has the potential to fix hy-
brid vigor and drive adaptive evolution (Doyle et al. 2008; 
Schranz et al. 2012; Van de Peer et al. 2017; Wendel et al. 
2018). Unraveling the effects of allopolyploidization on 
transcriptional and developmental innovation is essential 
to elucidate the causal chain from the merger of genomes 
to phenotypic evolution (Roulin et al. 2013). Many studies 
have shown subgenome-divergent transcription across 
various tissues (Pfeifer et al. 2014; Ramírez-González 
et al. 2018; Li et al. 2019; Wang et al. 2021). However, the 
bulk populations of millions of cells meant that the expres-
sion value for each gene was an average of its expression 
across a population of input cells, which is insufficient 
for providing the necessary resolution to conclusively de-
cipher molecular or cellular behavior. Moreover, ensemble 
measures are inappropriate for investigating the stochastic 
regulation of subgenome divergence during development, 
which is the prerequisite for elucidating how genome di-
vergence contributes to allopolyploid development, a fun-
damental question in the evolutionary developmental 
biology (evo-devo) of polyploidy (De Storme and Mason 
2014; Doyle and Coate 2019).

Recent advancements in single-cell RNA sequencing 
(scRNA-seq) have emerged as a powerful approach for recon-
structing cellular differentiation trajectories (Marioni and 
Arendt 2017; Zhai and Xu 2021; Zhou and Troyanskaya 
2021; Yang et al. 2023), offering great potential for elucidating 
complex regulatory mechanisms underlying developmental 
dynamics of divergent subgenomes in allopolyploids. 
Unfortunately, to fully harness the rich information of single- 
cell data in polyploidy studies, several important technical 
challenges must be addressed. Single-cell data analysis entails 
high-dimensional comparisons of genes and cells, resulting in 
elevated error rates (Argelaguet et al. 2021). These errors are 
further compounded in polyploids due to an additional 
layer of subgenome comparisons (Hu et al. 2021). 
Coupled with data sparsity, fully utilizing single-cell data 
in polyploids poses significant challenges. Furthermore, 
while the integration of single-cell data with existing 
information-rich datasets—such as gene functional annota-
tions including Gene Ontology (GO) terms, Pfam protein 
domain annotations, and the transcription factor (TF) tar-
get gene sets—offers valuable insights for elucidating under-
lying biological mechanisms, effective methods of data 

integration are still lacking. Addressing these challenges 
could further enhance our comprehension of the intricate 
relationship between genomic diversity and the process of 
cellular differentiation in allopolyploid species.

A critical requirement for allopolyploid scRNA-seq data 
analysis is the robust quantification of subgenome transcrip-
tional divergence during cellular differentiation. In this study, 
we establish a quantitative computational framework de-
signed to analyze dynamic subgenome divergence of tran-
scriptome regulation using both single-cell and bulk tissue 
transcriptomic data. Our framework demonstrates robust-
ness and effectiveness in identifying key genes contributing 
to subgenome divergence during development, which facil-
itates further integration with different data modalities and 
efficiently connects genome divergence with developmental 
processes in allopolyploid wheat.

Results
Challenges of Single-Cell Data Analysis in Polyploids
We highlight the limitations of single-cell data analysis based 
on conventional methods using real data. scRNA-seq data 
from the root tip of the common wheat cultivar “Chinese 
Spring” were prefiltered at the cellular and gene levels, yield-
ing a matrix of gene expression levels across 4,935 cells clus-
tered into 20 groups (Fig. 1a and b, supplementary figs. S1 
and S2, Supplementary Material online). To detect subge-
nome divergence, conventional methods first identify differ-
entially expressed triad genes (with 1:1:1 correspondence 
across the A, B, and D subgenomes) between subgenomes 
in each sample (tissue, cell cluster, or cell), followed by com-
paring the divergence across genes or samples as previously 
described (Pfeifer et al. 2014; Ramírez-González et al. 2018) 
(Fig. 1c and d). Within the dataspace, the relative expression 
of a homoeolog in one subgenome against the overall three- 
subgenome expression determines a triad’s position in the 
ternary plot. The relative distance of each point (a triad 
gene) to the balance point [1/3, 1/3, 1/3] representing a sub-
genome dominance level is used to construct a distance ma-
trix for triad genes across cells (Fig. 1e). For qualitative 
analysis, seven categories of homoeolog expression bias are 
defined based on this distance matrix, as shown by the 
bulk root RNA-seq data (Fig. 1f), which are further compared 
across cells. For quantitative analysis, cellular subgenome di-
vergence is determined by calculating the average for all ex-
pressed genes in each cell (Fig. 1g). Consequently, scRNA-seq 

Wang et al. · https://doi.org/10.1093/molbev/msae178                                                                                        MBE

2

D
ow

nloaded from
 https://academ

ic.oup.com
/m

be/article/41/9/m
sae178/7746018 by C

entro Internacional de M
ejoram

iento de M
aiz y Trigo user on 14 N

ovem
ber 2024

http://academic.oup.com/mbe/article-lookup/doi/10.1093/molbev/msae178#supplementary-data
http://academic.oup.com/mbe/article-lookup/doi/10.1093/molbev/msae178#supplementary-data


data analysis based on conventional methods necessitates 
multifaceted comparisons across genes, cells, and subge-
nomes. However, such an analytical process, even within a 
single sample, can result in a high false-positive rate when 
scrutinizing the subgenome divergence across a multitude 
of genes (Fig. 1h; Kvam et al. 2012; Hu et al. 2021). In the situ-
ation of n independent tests for n respective triad genes, if 
the chance of observing a false positive is 0.05 given that 
there is no subgenome bias for a triad gene, then a family- 
wise error rate, the probability of at least one false rejection 
among n tests, is 1−(1−0.05)n. Specifically, this probability 
escalates to roughly 9.75% upon the independent evaluation 
of two genes and further ascends to ∼14.26% with the as-
sessment of three genes. The implementation of multiple 
test correction techniques, such as the Bonferroni correction 
method, leads to a marked reduction in the number of genes 
identified as exhibiting subgenome bias. Moreover, single- 
cell data inherently provide limited detectable signals 
(Fig. 1i), resulting in the detection of fewer genes associated 
with subgenome divergence (Fig. 1j, on average, 985 triad 
genes, accounting for 6% of all triad genes, exhibit subge-
nome bias per cell) compared with bulk tissue transcrip-
tomic data (Fig. 1f, 6,702 triad genes, representing 40% of 
all triad genes). Consequently, the data sparsity and exten-
sive comparisons between subgenomes and across cells 
lead to a considerable decrease in statistical power.

Framework for the Quantification of Stochastic 
Cell-wise and Gene-wise Subgenome Divergence 
During Development
To address the above-mentioned challenges, we propose a 
computational workflow, called “pseudo-genome diver-
gence quantification” (pgDQ), for quantifying dynamic sub-
genome divergence of transcriptome regulation in the 
context of cell differentiation, followed by the identification 
of essential genes on the basis of the contribution to subge-
nome divergence during development (Fig. 2). We demon-
strate the framework using the wheat root tip single-cell 
data. The original data matrix is partitioned by subgenomes 
and reshaped into a 4,935 × 16,427 × 3 expression tensor, 
corresponding to the expression profiles of 16,427 triad 
genes across three subgenomes within 4,935 cells. For each 
cell, we construct three pseudo-genome expression vectors 
Q
A , Q

B , Q
D of all 16,427 triad genes according to the subge-

nome origin. In the dataspace, the Euclidean distance among 
these three high-dimensional vectors preserves the subge-
nome divergence in the original data. In this way, the com-
parison of subgenome transcriptome divergence per cell 
transforms to a comparison of the Euclidean distance be-
tween pseudo-samples in a high-dimensional space. This 
method directly reflects the overall subgenome divergence 
within cells, circumventing the high error rates associated 
with averaging subgenome divergence across thousands of 
genes. The quantitative measure of subgenome divergence 
remains consistent irrespective of the mapping strategy se-
lected (supplementary fig. S3, Supplementary Material on-
line). Moreover, by scoring the subgenome distance as a 

whole rather than relying on the expression of individual 
genes, this approach is expected to be robust against drop-
outs and noise. Further mapping the quantitative 
subgenome-divergent transcription within each cell onto 
the pseudotime developmental trajectory of single-cell 
data helps elucidate subgenome divergence during cellular 
differentiation, providing deeper insights into polyploid evo-
lution and polyploid development.

As part of pgDQ, we developed the divGene algorithm 
(details in Fig. 5a) to score the contribution of each gene to 
subgenome divergence during cell differentiation using a 
statistical diagnostics approach. The resulting quantitative 
measurement of the contribution of each gene to subge-
nome diversity is useful for the subsequent integration 
of other data types. For example, ranking genes according 
to the contribution score and comparing the enrichment 
of GO terms or the binding of TFs may lead to rapid iden-
tification of the upstream regulators and downstream 
pathways associated with the top-ranked genes.

Evaluation of the Utility of pgDQ for Quantifying 
Subgenome Divergence at the Cellular Level
To evaluate the capability of pgDQ in handling the inher-
ent sparsity and noise in scRNA-seq data (Fig. 3a), we com-
pared its performance with the conventional method 
using the full dataset and simulated datasets to mimic low- 
coverage or high-noise conditions. The subgenome diver-
gence measured using each dataset was compared with 
the result for the full dataset.

Using simulated missing values (10% to 90%), we calcu-
lated Pearson’s correlation coefficient to clarify the associ-
ation between the resulting divergence and the divergence 
revealed by the original scRNA-seq data. As shown in 
Fig. 3b, pgDQ consistently performed better than the con-
ventional method. Notably, Pearson’s correlation coefficient 
was still >0.85 when the simulated missing value was 50%. 
We next added simulated technical noise (Poisson’s distribu-
tion, λ = 1) to the scRNA-seq data (Fig. 3a, right panel). The 
comparison with the original dataset indicated pgDQ was in-
sensitive to the added noise (Fig. 3c). Thus, in terms of the 
robustness against data dropout and noise, which are typical 
problems associated with single-cell data analyses, pgDQ 
outperforms the conventional approach.

pgDQ Quantifies Subgenome Divergence at 
Single-Cell Resolution Along the Evolutionary 
Trajectory
We are now able to robustly quantify and compare subge-
nome divergence at the single-cell level. Understanding how 
this divergence dynamically changes during cellular differen-
tiation is central to understanding the linkage between sub-
genome divergence and polyploid development. There are 
already well-established algorithms, such as pseudotime tra-
jectory (Trapnell et al. 2014), that utilize single-cell data to 
infer the quantitative trajectory of cellular differentiation 
during development. Based on this approach, the initial 
cell cluster, hypothesized to be the meristem, was identified 

Allopolyploid Single-cell Data Integration and gene mining · https://doi.org/10.1093/molbev/msae178                      MBE

3

D
ow

nloaded from
 https://academ

ic.oup.com
/m

be/article/41/9/m
sae178/7746018 by C

entro Internacional de M
ejoram

iento de M
aiz y Trigo user on 14 N

ovem
ber 2024

http://academic.oup.com/mbe/article-lookup/doi/10.1093/molbev/msae178#supplementary-data


(a) (b)

(c) (d) (e)

(f) (g)

(h) (i) (j)

Fig. 1. Limitations of the conventional method for detecting subgenome divergence using allopolyploid wheat single-cell data. a) An UMAP plot 
depicting 20 cell clusters from wheat root tips, with each dot representing an individual cell and colors distinguishing the clusters. b) A single-cell 
expression matrix where each row corresponds to a single cell and each column corresponds to a triad gene from one subgenome. c) A heatmap 
displaying the expression levels of triad genes across the three subgenomes in a single cell, with gray indicating genes not detected in a given cell, 
highlighting the data sparsity inherent to scRNA-seq. d) An enlarged heatmap displaying the subgenome-biased expression groups of triad genes. 
e) A ternary plot illustrating quantifying subgenome dominance based on the relative expression abundance of 16,427 triads (49,281 genes) in 
hexaploid wheat. Each dot representing a triad gene and the relative distance to three vertices represents its relative expression level in one 
subgenome in comparison with the total expression across the three subgenomes. The axes represent the relative expression levels across 
the three subgenomes. f) A ternary plot illustrating the classification of subgenome-biased transcription groups based on the bulk root tran-
scriptomic data. A total of 6,702 triad genes, representing 40% of all triad genes, show subgenome bias in bulk data. g) The matrix of distance 
of each triad gene expression to the balance point, derived from panel e, illustrates subgenome divergence across individual genes within all cells. 
Cross-cell comparisons are based on averaged gene expression to evaluate collective subgenome divergence. h) A graphical representation il-
lustrating the exponential rise in the family-wise error rate as the number of comparisons increases. i) A ternary plot showing the subgenome- 
biased transcription of triad genes across subgenomes in a single cell. On average, 985 traid genes, accounting for 6% of all triad genes, exhibit 
subgenome bias per cell. The plot reveals a limited number of detectable genes per cell due to the sparsity of single-cell data. j) A comparative 
chart of the proportion of divergent triads detected in root tip single cells and bulk root tissue.

Wang et al. · https://doi.org/10.1093/molbev/msae178                                                                                        MBE

4

D
ow

nloaded from
 https://academ

ic.oup.com
/m

be/article/41/9/m
sae178/7746018 by C

entro Internacional de M
ejoram

iento de M
aiz y Trigo user on 14 N

ovem
ber 2024


from previously published wheat root tip single-cell data 
(Zhang et al. 2023; Fig. 4a). For each cell, there were two 
quantitative metrics, namely the “pseudotime ordering 

score,” representing the cellular differentiation during devel-
opment (Fig. 4b), and the “subgenome divergence score” as 
determined by pgDQ (Fig. 4c). Clear association was 

Fig. 2. A pgDQ workflow. Separation by subgenome: the original matrix is reshaped into an expression tensor with expression values separated 
by subgenome, resulting in three distinct pseudo-genomes. For each cell (take cell 1 for example), these pseudo-genomes are represented as 
points in a multidimensional space, where subgenome divergence is directly indicated by the Euclidean distance between these points. 
Quantification of subgenome divergence in development: for each cell, the Euclidean distance between the three pseudo-genomes was calcu-
lated, providing a direct measure of subgenome divergence for that cell. The ⇀ symbol represents a vector of gene expression values in each 
pseudo-genome. This transformation converts the complex statistical issue of detecting subgenome divergence into a simple calculation of spa-
tial distances, minimizing errors from extensive multiple comparisons. By mapping the quantitative subgenome divergence onto cell differen-
tiation trajectories ordered by pseudo-time, the stochastic subgenome divergence could be linked to cell differentiation during development. 
Key gene detection and data integration: the quantitative cell-level subgenome divergence could be further applied for quantifying the contri-
bution of genes to subgenome divergence during cell differentiation (the divGene algorithm, detailed description in Fig. 5a), which serves as the 
basis for integrating different data types, facilitating mechanism insights into evo-devo.

Allopolyploid Single-cell Data Integration and gene mining · https://doi.org/10.1093/molbev/msae178                      MBE

5

D
ow

nloaded from
 https://academ

ic.oup.com
/m

be/article/41/9/m
sae178/7746018 by C

entro Internacional de M
ejoram

iento de M
aiz y Trigo user on 14 N

ovem
ber 2024


detected between these two quantitative features (Fig. 4d), 
which aligns with the evo-devo theory that certain develop-
mental stages echo the evolutionary trajectory (Irie and 
Kuratani 2011) at the cellular transcriptional level. The asso-
ciation between subgenome divergence and differentiation 
was also detected within a single-cell cluster (meristem) dis-
tinguished by Uniform Manifold Approximation and 
Projection (UMAP) (Fig. 4e). These results highlight the sig-
nificance of quantifying subgenome divergence at the cellu-
lar level, offering a detailed understanding of how evolution 
and development are interconnected.

divGene Detects Genes Indispensable for Subgenome 
Divergence During Differentiation
Genes that contribute to progressive subgenome diver-
gence may drive or be closely related to cellular differenti-
ation. However, comparing gene expression across cells is 
challenging because of data sparsity; only 2% to 35% of 
genes can be detected in a single cell (Fig. 1i and j). This 
coupled with large-scale multiple comparisons across sub-
genomes and cells leads to unreliable gene-wise results.

The divGene algorithm utilizes cellular divergence and dif-
ferentiation information to guide gene identification and is 
insensitive to data dropout. The basis is that if leaving out 
one gene significantly affects the correlation between subge-
nome divergence and cell differentiation as reflected by 
pseudotime ordering, then the gene is potentially indispens-
able for divergence-mediated differentiation. Briefly, the cor-
relation between subgenome divergence and cell 

differentiation across all 4,935 cells was determined before 
and after leaving out one triad gene (Fig. 5a). Because of 
the nonlinear correlation between the divergence score 
and the pseudotime ordering score (Fig. 4d), we applied 
Spearman’s rank correlation coefficient (ρ). The ρ value 
was calculated for each of the 16,427 triads. An indispensabil-
ity score (IS) was calculated as ln 2/(1 + r)

( 􏼁
(Fig. 5b). 

Figure 5c illustrates the subgenome-divergent expression of 
TaTIP1 (a top-ranked gene) whose overexpression can pro-
mote development (Rodrigues et al. 2016). If the top 2.5% 
of the triad genes were left out, the subgenome divergence 
trend during differentiation was no longer apparent 
(Fig. 5d). In summary, divGene effectively detected key genes 
associated with subgenome divergence during development.

Ranking Indispensable Genes to Integrate 
Multimodal Data and Facilitate Mechanism Insights
The major challenge in single-cell data analyses lies in inter-
preting the results and integrating multimodal data to gain 
insights into biological mechanisms. A gene list-centric 
analysis which involves comparing lists of biologically signifi-
cant genes curated from different data modalities is a 
powerful strategy broadly applicable for integrating data 
(Subramanian et al. 2005; Yi et al. 2013). This threshold-free 
approach can overcome the limitations of conventional ana-
lyses that rely on a single arbitrary P-value threshold that 
does not preserve the weight of different genes and may dis-
card relevant biological information.

(a)

(b) (c)

Fig. 3. Robustness of pgDQ against data dropout and noise. a) A scheme with simulated data dropout and noise. Distribution of the correlation 
between the original data and the datasets with simulated sparsity (b) and noise (c). The x axis represents the fraction of genes with adjusted 
expression values due to introduced sparsity or noise in the simulation.

Wang et al. · https://doi.org/10.1093/molbev/msae178                                                                                        MBE

6

D
ow

nloaded from
 https://academ

ic.oup.com
/m

be/article/41/9/m
sae178/7746018 by C

entro Internacional de M
ejoram

iento de M
aiz y Trigo user on 14 N

ovem
ber 2024


In this study, genes were ranked according to the IS, re-
flecting their contribution to genome divergence during 
development and serving as a promising anchor for inte-
grating data. By comparing the ranked genes with lists of 
genes characterized according to the associated pathways 
and functions in previously annotated databases, we deter-
mined that genes with a high IS were mainly related to cell 
wall biogenesis, membrane dynamics, and cell tip growth 
(Fig. 6a and b), which are well-characterized processes me-
diating differentiation (Barberon et al. 2016; Somssich et al. 
2016). While the overall trend shows a positive correlation 

between subgenome divergence and pseudotime ordering 
(Fig. 4), it is observed that a few functional gene sets 
display decreased subgenome divergence during the devel-
opmental process (supplementary fig. S4, Supplementary 
Material online). Genes in the top enriched pathways 
displayed increased subgenome divergence during differ-
entiation (Fig. 6b). Their candidate function in linking sub-
genome divergence and polyploid development is an 
interesting issue for further study.

TFs play key roles in differentiation (Vaquerizas et al. 
2009). The hundreds of TF-binding target lists in 

(a)

(b)

(c)

(d)

(e)

Fig. 4. Superimposing quantitative subgenome divergence onto the pseudotime developmental trajectory. a) An expression pattern of the meri-
stem marker gene TaH2AX in UMAP. b) Cells plotted along the pseudotime trajectory, with colors indicating the developmental hierarchy based 
on the pseudotime ordering score. The progression from less differentiated to more differentiated states is visualized, originating from the meri-
stem. c) The cellular subgenome divergence scores, as calculated by the pgDQ method, are overlaid on the UMAP plot, with colors representing 
the degree of subgenome divergence within individual cells. d) The distribution of subgenome divergence scores across all 20 cell clusters is 
depicted, with clusters ranked according to their average pseudotime ordering score. This ranking illustrates the relationship between cellular 
differentiation and subgenome divergence. e) The Spearman’s rank correlation coefficient is calculated to assess the association between the 
pseudotime ordering score and the subgenome divergence score for each cell within the meristem cluster. The scatter plot displays the cells 
as dots, with the strength and direction of the correlation evident through the distribution and clustering patterns.

Allopolyploid Single-cell Data Integration and gene mining · https://doi.org/10.1093/molbev/msae178                      MBE

7

D
ow

nloaded from
 https://academ

ic.oup.com
/m

be/article/41/9/m
sae178/7746018 by C

entro Internacional de M
ejoram

iento de M
aiz y Trigo user on 14 N

ovem
ber 2024

http://academic.oup.com/mbe/article-lookup/doi/10.1093/molbev/msae178#supplementary-data
http://academic.oup.com/mbe/article-lookup/doi/10.1093/molbev/msae178#supplementary-data


Wheat-RegNet profiled using DNA affinity purification 
sequencing (DAP-seq) (Tang et al. 2023) were com-
pared with a gene list ranked by the IS. A rank-based 
statistical test suggested that bZIP and NAC TF families 
are mostly likely involved in regulating subgenome 
divergence during differentiation (Fig. 6c and d). 
Members of these TF families are responsible for 
phloem, xylem, and root hair development reported 
previously (Yin et al. 1997; Xie et al. 2000; Yamaguchi 
et al. 2011). Together, ranking indispensable genes facil-
itates the integration of other data modalities and the 
interpretation of biological significance, allowing a dee-
per understanding of the mechanisms linking polyploid 
evolution and development.

Applying pgDQ to High-Dimensional Tissue 
Developmental Data
In addition to single-cell data, pgDQ is readily applicable to 
quantifying, integrating, and interpreting other types of 
high-dimensional data in polyploids. Subgenome- 
divergent transcription differs to varying degrees in 
different tissues. How divergent expression relates and 
contributes to development has long been a research focus 
of polyploid research (Pfeifer et al. 2014; Wicker et al. 
2018). However, comparing subgenome divergence across 
a large amount of tissues is statistically difficult, limiting 
further data integration and mechanism elucidation. 
Here, we applied pgDQ to study the effects of wheat 

(a)

(b) (c)

(d)

Fig. 5. divGene identifies essential genes potentially driving subgenome divergence during development. a) A scheme depicting the divGene 
algorithm for calculation of the IS. The Spearman’s rank correlation coefficient (ρ) for the subgenome divergence in cells after leaving out 
one triad gene was calculated. The IS was then calculated as ln 2/(1 + ρ)

( 􏼁
, followed by Z-score normalization. b) Distribution of the IS for 

all triad genes. The top-ranked functional genes were annotated. c) An expression pattern of the top-ranked triad gene TaTIP1. d) Cellular sub-
genome divergence in an UMAP before and after leaving out the top 2.5% of genes.

Wang et al. · https://doi.org/10.1093/molbev/msae178                                                                                        MBE

8

D
ow

nloaded from
 https://academ

ic.oup.com
/m

be/article/41/9/m
sae178/7746018 by C

entro Internacional de M
ejoram

iento de M
aiz y Trigo user on 14 N

ovem
ber 2024


subgenome divergence on tissue developmental processes. 
RNA-seq data from 209 wheat developmental stages of 
roots, leaves, shoots, spikes, and grains were downloaded 
from the Toronto International Data Release (Toronto 
International Data Release Workshop Authors et al. 2009). 
Samples with different tissues and developmental stages 
could be distinguished by UMAP (Fig. 7a). A pseudotime 
analysis was performed and the developmental trajec-
tory was projected onto UMAP (Fig. 7b). pgDQ was ap-
plied to quantify subgenome-divergent scores in each 
tissue, which were also superimposed upon UMAP. 
The dynamic subgenome-divergent transcription coin-
cided well with the progression of tissue development 
(Fig. 7b and c), consistent with the finding at the cellular 
level (Fig. 4).

To detect genes contributing to subgenome divergence 
during development, ISs for all triad genes were calculated 
based on the divGene algorithm. The top-ranked genes in-
clude well-characterized factors that control various devel-
opmental processes (Fig. 7d). For example, BEI and GluA 
regulate grain development (Akasaka et al. 2009; Yang 
et al. 2019), FTIP7 regulates flower development (Song 
et al. 2018), and HTD1 and HL6 control plant vegetative 

growth (Zou et al. 2005; Sun et al. 2017). Further integration 
with functional pathways and TF-binding profiles detected 
similar results with single-cell analysis, with cell wall organ-
ization and NAC families among the top enriched pathways 
and TF-mediating subgenome-divergent development 
(Fig. 7e and f). Overall, pgDQ uncovers a significant correl-
ation between subgenome divergence and cell differenti-
ation in common wheat, facilitating the identification of 
essential genes contributing to subgenome divergence dur-
ing development and enhancing the understanding of de-
velopmental mechanisms in polyploids.

Discussion
As the cost of high-throughput data continues to decrease, 
single-cell and large-scale bulk tissue transcriptomic data 
are gradually becoming prevalent for evo-devo studies, 
posing significant challenges to data comparison, integra-
tion, and interpretation. The analysis of polyploids is far 
more difficult because of the additional layer of compari-
son across subgenomes in each individual cell. The lack 
of available computational tools has impeded the wide-
spread use of single-cell and large-scale bulk sequencing 

(a)

(c)

(b)

(d)

Fig. 6. Leveraging IS for multimodal data integration. a) Top enriched functional terms for genes ranked by their ISs using GSEA (Subramanian 
et al. 2005). The x axis represents the genes ordered by the normalized IS, with the distribution of IS shown in the bottom panel. The middle 
heatmap visualizes the ranked list position for genes sharing one GO term. The top panel depicts the enrichment score, assessing the degree of 
correlation between gene set membership and gene ranking based on IS. The enrichment plot for the top five enriched terms is shown, which are 
detailed in b. b) Analysis of enriched functional terms and cellular subgenome divergence. Left, list of top ten enriched functional terms as iden-
tified in a. The normalized enrichment score (NES), P-value (P), and false discovery rate (FDR) are determined by GSEA package. Right, a heatmap 
depicting cellular subgenome divergence, quantified by equation (1) in the “Materials and Methods” section, with n denoting the number of 
genes associated with each specified GO term. The enriched functional processes contributed to subgenome divergence during cellular devel-
opment. c) and d) Analogous to a) and b), these panels show the enrichment analysis for gene sets based on TF target genes identified by 
DAP-seq (Zhang, et al. 2022) and curated in the Wheat-RegNet database (Tang et al. 2023).

Allopolyploid Single-cell Data Integration and gene mining · https://doi.org/10.1093/molbev/msae178                      MBE

9

D
ow

nloaded from
 https://academ

ic.oup.com
/m

be/article/41/9/m
sae178/7746018 by C

entro Internacional de M
ejoram

iento de M
aiz y Trigo user on 14 N

ovem
ber 2024


data in polyploid research. In this study, we designed a 
quantitative computational framework to overcome these 
limitations and address key issues related to polyploid 
research.

First, to robustly detect subgenome divergence at the 
cellular level, we restructured the data by segregating the 
expression values of genes from each subgenome, effect-
ively transforming the complex, multidimensional diver-
gence analysis into a more tractable comparison of 
Euclidean distances between pseudo-genomes within a 
high-dimensional space. This transformation preserved 
the original data information, while avoiding the high error 
rate caused by the high-dimensional comparison, noise, 
and sparsity, as illustrated in Fig. 2. Second, to robustly de-
tect the contribution of genes to subgenome divergence 

during development, we adopted a top-down strategy 
coupled with statistical diagnostics to overcome specific 
issues (i.e. data noise and sparsity) presented by single-cell 
data. Third, the results of the above quantitative compar-
isons were easily integrated with other data modalities, 
avoiding the loss of biologically significant genes due to 
the use of an arbitrary P-value threshold, which is a serious 
issue as the number of comparisons increases for integrat-
ing data.

It should be noted that quantifying subgenome diver-
gence by pgDQ relies on the clear distinction between in-
dividual subgenomes. It is not suitable for processing 
polyploids without a phased subgenome annotation. We 
also expanded the application of the pgDQ method be-
yond hexaploid wheat to include tetraploids. By simulating 

(a) (d)

(b)

(c)

(e)

(f)

Fig. 7. The application of pgDQ in tissue-wise subgenome-divergent detection and data integration. a) An UMAP plot illustrating the develop-
mental stages of various wheat tissues. b) An UMAP plot with samples colored by pseudotime, indicating progression from immature to mature 
states. c) An UMAP plot with samples colored by subgenome divergence scores, as determined by pgDQ, with color intensity reflecting the 
divergence magnitude. d) Distribution of the divGene-calculated ISs for triad genes. The top enriched functional genes are labeled. The 
most enriched functional terms (e) and TFs (f) linked to genes ordered by their ISs.

Wang et al. · https://doi.org/10.1093/molbev/msae178                                                                                        MBE

10

D
ow

nloaded from
 https://academ

ic.oup.com
/m

be/article/41/9/m
sae178/7746018 by C

entro Internacional de M
ejoram

iento de M
aiz y Trigo user on 14 N

ovem
ber 2024


three pseudo-tetraploid genomes from the hexaploid 
wheat, we obtained results that were consistently in line 
with those from the hexaploid wheat (supplementary fig. 
S5, Supplementary Material online). The method’s effect-
iveness grows with the increase in ploidy, which is pertin-
ent considering the prevalence of polyploidy among 
cultivated crops.

As the scale and availability of polyploids rapidly in-
crease, we anticipate that pgDQ, which has diverse appli-
cations and is simple to use, will become a valuable tool 
that can be rapidly adopted by researchers focused on 
polyploids, potentially leading to insightful evo-devo 
discoveries.

Materials and Methods
Plant Materials and Growth Conditions
Common wheat (Triticum aestivum cultivar “Chinese 
Spring”) seeds were surface-sterilized by a 10-min incuba-
tion in 30% H2O2 and then thoroughly washed five times 
with distilled water. The seeds were germinated in water 
for 3 d at 22 °C. The root tips (0.5 cm) of the seedlings 
were harvested for scRNA-seq experiment.

Single-Cell RNA-seq Library Construction and 
Sequencing
The root tips from 80 to 100 wheat plants were harvested 
and digested in digestion buffer at room temperature for 
2 h (1.5% cellulase R10, 1.5% macerozyme R10, 0.3 M man-
nitol, 3 mM β-mercaptoethanol, 10 mM CaCl2, and 0.3% 
Bovine Serum Albumin (BSA)). Single-cell suspensions 
were loaded onto 10× Genomics Chromium to capture 
∼6,000 single cells following the manufacturer’s instructions 
for the 10× Genomics Chromium Single-Cell 3′ kit (V3). 
cDNA amplification, library construction, and sequencing 
on the Illumina NovaSeq 6000 platform (paired-end multi-
plexing run, 150 bp) were performed by LC-Bio Technology 
(Hangzhou, China).

Processing of Single-Cell RNA-seq Data
Data from 10× Genomics were analyzed by Cell Ranger 
(v7.0.0) (https://www.10xgenomics.com/support/software/ 
cell–ranger/latest) to perform alignment, barcode count-
ing, and unique molecular identifier (UMI) counting. 
The reference genome (v1.0) and GTF annotation files 
(v1.1) of T. aestivum were downloaded from the 
International Wheat Genome Sequencing Consortium 
(IWGSC). Uniquely mapped reads were used for UMI 
counting, which account for 95% of total aligned reads 
(supplementary fig. S6, Supplementary Material online). 
To assess whether the inclusion of the remaining 5% of 
multimapped reads affects the quantification of subge-
nome divergence, we compared various algorithms de-
signed for handling multimapped reads, including the 
Uniform, PropUnique, EM, and Rescue algorithms 
available in STARsolo (v2.7.11b) (Dobin et al. 2013). 
The Uniform algorithm allocates multigene UMIs 

evenly across all genes within the multigene set. The 
PropUnique algorithm assigns multigene UMIs propor-
tionally based on the number of unique UMIs per gene. 
The EM algorithm utilizes maximum likelihood estima-
tion to distribute multigene UMIs among their respective 
genes (Jin et al. 2015). The Rescue algorithm distributes 
multigene UMIs to their gene set proportionally to the 
sum of the number of unique-gene UMIs and uniformly 
distributed multigene UMIs in each gene (Mortazavi 
et al. 2008).

Cell Clustering and Reconstruction of Cell 
Trajectories
The initial steps involved preprocessing the gene–cell ma-
trices obtained from Cell Ranger to eliminate ambient or 
leaked RNA by “autoEstCont” and “adjustCounts” func-
tions in SoupX (v1.6.2) package (Young and Behjati 
2020). The processed data were imported into the 
Seurat (v4.0) R package (Stuart et al. 2019) for further qual-
ity control and analysis. The low-quality cells and genes 
were filtered according to the following criteria: 

1) Cells with an expressed gene count between 500 and 
10,000 were kept (“subset” function with parameters 
“nFeature_RNA <10,000 and nFeature_RNA 
>500”).

2) Cells with UMIs between 500 and 20,000 were kept 
(“subset” function with “nCount_RNA <20,000 and 
nCount_RNA >500”).

3) Genes expressed in three or fewer cells were 
removed.

4) Cells with a mitochondrial gene exceeding 2% 
were excluded (“subset” function with “percent.mt 
<2”).

5) DoubletFinder (v2.0.4) (McGinnis et al. 2019) was 
employed for the detection of doublets, retaining 
only those cells identified as singlets.

“NormalizeData” function with the method 
“LogNormalize” and a scaling factor 10,000 was used for 
normalization. This process involved dividing the feature 
counts for each cell by the total counts for that cell and 
then multiplying by the scale factor. Harmony (v0.1.0) 
(Korsunsky et al. 2019) was used to integrate replicates. 
Highly variable genes were identified using the 
“FindVariableFeatures” function with the “method” set 
to “vst” and “nfeatures” specified as 2,000.

For principal component analysis, the scaled data were 
reduced to 100 principal components (set npcs = 100). 
Clustering of cells was identified using “FindClusters” func-
tion with a resolution of 0.5. The data structures were vi-
sualized by UMAP (“RunUMAP” function with “dims set 
to 1:50”).

The R package Monocle (v2.22.0) was used to analyze 
the process of cell differentiation (pseudo-time ordering) 
and the determination of cell fate. First, marker genes 
with an adjusted P-value <0.05 were defined as develop-
mental progress specific genes. Second, “reduceDimension” 

Allopolyploid Single-cell Data Integration and gene mining · https://doi.org/10.1093/molbev/msae178                      MBE

11

D
ow

nloaded from
 https://academ

ic.oup.com
/m

be/article/41/9/m
sae178/7746018 by C

entro Internacional de M
ejoram

iento de M
aiz y Trigo user on 14 N

ovem
ber 2024

http://academic.oup.com/mbe/article-lookup/doi/10.1093/molbev/msae178#supplementary-data
http://academic.oup.com/mbe/article-lookup/doi/10.1093/molbev/msae178#supplementary-data
https://www.10xgenomics.com/support/software/cell-ranger/latest
https://www.10xgenomics.com/support/software/cell-ranger/latest
http://academic.oup.com/mbe/article-lookup/doi/10.1093/molbev/msae178#supplementary-data


function with parameters “max_components = 2” and 
“method = ‘DDRTree’” was used to reduce the dimensional-
ity of the data. Last, “orderCells” function was then used to 
order cells in pseudotime along a trajectory. The state with 
clusters expressing marker genes of meristem cells was set as 
the root state.

Processing of RNA-seq Data
We downloaded root RNA-seq data from the NCBI GEO 
database (accession number GSE198284). The sequen-
cing reads were trimmed with fastp (v0.22.0) (Chen 
et al. 2018), which eliminated bases with low quality 
scores (<20) as well as short reads (<20 bp) and sequen-
cing adapters. The remaining clean reads were aligned to 
the IWGSC RefSeq genome assembly version 1.0 using 
HISAT2 program (v2.1.0) (Kim et al. 2015). SortMeRNA 
(Kopylova et al. 2012) was used to remove reads origin-
ating from chloroplast, mitochondria, and rRNA. The 
featureCount program of the Subread package (v1.5.3) 
(Liao et al. 2014) was used to determine the RNA-seq 
read density of the high-confidence genes from version 
1.1.

Definition of Subgenome Expression Divergence 
Using Conventional Method in Common Wheat
OrthoFinder (Emms and Kelly 2015) was used to identify 
the orthologous genes between subgenomes. The 
orthogroups with only one copy in each subgenome 
(1:1:1) were defined as triad genes. A total of 16,427 triad 
genes were used for downstream analysis. The relative ex-
pression was determined by the relative proportion of read 
counts in one subgenome to the overall counts from three 
subgenomes. The subgenome-biased and unbiased expres-
sions were defined by the relative position to the seven 
predefined standard coordinates in the ternary plot. 
Specifically, the proportion triad [1/3, 1/3, 1/3] represents 
a balanced expression or binding. The proportion triads 
[0.5, 0.5, 0], [0, 0, 1], [0.5, 0, 0.5], [0, 1, 0], [0, 0.5, 0.5], and 
[1, 0, 0] represent divergent patterns (D suppressed, D 
dominant, B suppressed, B dominant, A suppressed, and 
A dominant, respectively). Subgenome gene expression di-
vergence pattern was determined by the nearest standard 
proportion pattern.

Quantification of Subgenome Divergence at Cellular 
Level Using pgDQ
Each subgenome was treated as an individual pseudo- 
genome. For each cell, subgenome divergence (distance, 
Dist0) across pseudo-genomes in dataspace was calculated 
using the following formula:

Dist0 =
�����������������
􏽘n

i=1
(Ai − Bi)

2
􏽲

+
�����������������
􏽘n

i=1
(Ai − Di)

2
􏽲

+
�����������������
􏽘n

i=1
(Bi − Di)

2
􏽲 (1) 

where Ai, Bi, and Di are the expression values of triad gene i 
and n is the total number of triad genes. The Euclidean dis-
tance is implemented in R package rdist (v0.0.5).

Simulated Datasets and Evaluation
To compare the robustness between the pgDQ and the 
conventional methods, we simulated scRNA-seq datasets 
with 10% to 90% missing values using “sample” and “re-
place” function in R. The Poisson distributed noise was 
also randomly added to the original data. Each simulation 
was performed 150 times, and the distribution of the simu-
lated variables is shown in supplementary fig. S7, 
Supplementary Material online. For pgDQ, the Pearson’s 
correlation coefficient between the Dist0 of all cells of si-
mulated data and original data was calculated. For the 
conventional method, the distance of relative expression 
of each triad gene to balance point [0.33, 0.33, 0.33] was 
calculated for each cell. Then, the correlations between 
the distances of all cells of simulated data and original 
data for all triad genes were compared with correlations 
generated by pgDQ.

Calculating Gene ISs Using pgDQ
For each fixed triad gene j, the expression values Aj, Bj, and 
Dj were set to zero, and then, subgenome divergence (dis-
tance, Distj) for each cell was calculated using formula (1). 
Pseudotime for all cells (T0) was extracted from the cell dif-
ferentiation trajectory that had been constructed earlier. 
Then, the Spearman’s rank correlation coefficient (ρ) be-
tween the T0 and the Distj of all cells was calculated using 
cor function in R package stats (v3.6.2). The IS of the triad 
gene was calculated using the following formula:

IS = ln
2

1 + ρ

􏼒 􏼓

(2) 

Next, the IS was Z-score-normalized, and we ranked all 
genes according to the normalized IS for further analysis.

Gene Set Enrichment Analysis of Ranked 
Indispensable Genes
Genes ranked by the normalized IS were used to calculate 
the enrichment score of GO terms or the targets of TFs 
using gene set enrichment analysis (GSEA) (v4.3.2) 
(Subramanian et al. 2005). The GO terms were collected 
from GOMAP (Wimalanathan and Lawrence-Dill 2021), 
and those with fewer than 15 triad genes were filtered out.

Quantification of Subgenome Divergence at Cellular 
Level Using pgDQ in Simulated Pseudo-tetraploids
We simulated three pseudo-tetraploid datasets by ex-
tracting data from any two subgenomes of hexaploid 
wheat dataset. For each cell in each pseudo-tetraploid, 
subgenome divergence (distance, Dist) across pseudo- 
genomes in dataspace was calculated using the following 
formula: 

Wang et al. · https://doi.org/10.1093/molbev/msae178                                                                                        MBE

12

D
ow

nloaded from
 https://academ

ic.oup.com
/m

be/article/41/9/m
sae178/7746018 by C

entro Internacional de M
ejoram

iento de M
aiz y Trigo user on 14 N

ovem
ber 2024

http://academic.oup.com/mbe/article-lookup/doi/10.1093/molbev/msae178#supplementary-data
http://academic.oup.com/mbe/article-lookup/doi/10.1093/molbev/msae178#supplementary-data


DistAB =
�����������������
􏽘n

i=1
(Ai − Bi)

2
􏽲

(3) 

DistAD =
�����������������
􏽘n

i=1
(Ai − Di)

2
􏽲

(4) 

DistBD =
�����������������
􏽘n

i=1
(Bi − Di)

2
􏽲

(5) 

where Ai, Bi, and Di are the expression values of homoeo-
logous gene i and n is the total number of homoeologous 
genes. The Euclidean distance is implemented in R package 
rdist (v0.0.5).

Supplementary Material
Supplementary material is available at Molecular Biology 
and Evolution online.

Funding
This study was supported by the Open Project Program of the 
State Key Laboratory for Crop Stress Resistance and High- 
Efficiency Production (Y.J.Z), CAS “Light of West China” for 
the Interdisciplinary Innovation Team (No. xbzg-zdsys- 
202109)(B.Z), National Natural Science Foundation of 
China (32270628) (Y.J.Z), National Key R&D Program of 
China (2023YFF1205101) (Y.Q.H.), State Key Laboratory of 
Crop Stress Adaptation and Improvement (2024KF03) 
(Y.J.Z), Yangfan Project of Shanghai Science and Technology 
Commission (23YF1403100) (M.Y.W), State Key Laboratory 
of Plant Cell and Chromosome Engineering, Institute of 
Genetics and Developmental Biology (PCCE-KF-2023-03) 
(Y.J.Z), and grant from Shanghai Key Laboratory of 
Agricultural Genetics and Breeding (21DZ2271900) 
(M.Y.W), Beijing Life Science Academy (BLSA), No: 
2024400CD0100 (Y.J.Z), State Key Laboratory of Genetic 
Engineering (SKLGE-2312) (W.L.Z), National Natural Science 
Foundation of China (31290210) (H.-Q.L).

Author Contributions
Y.J.Z., Y.Q.H., H.Q.L., C.H.L., B.Z., H.L., H.D.M., Y.B.X., and 
W.L.Z. conceived and designed the experiments. Z.J.L., 
K.D.L., S.S.Z., Y.L.F., Y.Z., and W.T. performed the experiments. 
M.Y.W., H.Y.W., Y.Y.Z, Y.P.T., A.R., S.B., and Y.J.Z. analyzed the 
data. Y.J.Z. wrote the manuscript with input from all authors.

Conflict of Interest
The authors declare no conflict of interest.

Data Availability
The source code for pgDQ is released under an MIT open 
source license and is available from GitHub (https:// 
github.com/MeiyueWang/pgDQ). The sequencing data 
are available in the Gene Expression Omnibus repository 
https://www.ncbi.nlm.nih.gov/geo/ under accession num-
ber GSE224228.

References
Akasaka T, Vu NT, Chaen K, Nishi A, Satoh H, Ida H, Omori T, Kimura 

M. The action of rice branching enzyme I (BEI) on starches. Biosci 
Biotechnol Biochem. 2009:73(11):2516–2518. https://doi.org/10. 
1271/bbb.90352.

Argelaguet R, Cuomo ASE, Stegle O, Marioni JC. Computational prin-
ciples and challenges in single-cell data integration. Nat 
Biotechnol. 2021:39(10):1202–1215. https://doi.org/10.1038/ 
s41587-021-00895-7.

Barberon M, Vermeer JE, De Bellis D, Wang P, Naseer S, Andersen TG, 
Humbel BM, Nawrath C, Takano J, Salt DE, et al. Adaptation of 
root function by nutrient-induced plasticity of endodermal dif-
ferentiation. Cell. 2016:164(3):447–459. https://doi.org/10.1016/ 
j.cell.2015.12.021.

Chen SF, Zhou YQ, Chen YR, Gu J. Fastp: an ultra-fast all-in-one 
FASTQ preprocessor. Bioinformatics. 2018:34(17):i884–i890. 
https://doi.org/10.1093/bioinformatics/bty560.

De Storme N, Mason A. Plant speciation through chromosome in-
stability and ploidy change: cellular mechanisms, molecular fac-
tors and evolutionary relevance. Curr Plant Biol. 2014:1:10–33. 
https://doi.org/10.1016/j.cpb.2014.09.002.

Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, 
Chaisson M, Gingeras TR. STAR: ultrafast universal RNA-seq 
aligner. Bioinformatics. 2013:29(1):15–21. https://doi.org/10. 
1093/bioinformatics/bts635.

Doyle JJ, Coate JE. Polyploidy, the nucleotype, and novelty: the im-
pact of genome doubling on the biology of the cell. Int J Plant 
Sci. 2019:180(1):1–52. https://doi.org/10.1086/700636.

Doyle JJ, Flagel LE, Paterson AH, Rapp RA, Soltis DE, Soltis PS, Wendel 
JF. Evolutionary genetics of genome merger and doubling in 
plants. Annu Rev Genet. 2008:42(1):443–461. https://doi.org/ 
10.1146/annurev.genet.42.110807.091524.

Emms DM, Kelly S. OrthoFinder: solving fundamental biases in whole 
genome comparisons dramatically improves orthogroup infer-
ence accuracy. Genome Biol. 2015:16(1):157. https://doi.org/10. 
1186/s13059-015-0721-2.

Hu G, Grover CE, Arick MA, Liu M, Peterson DG, Wendel JF. 
Homoeologous gene expression and co-expression network ana-
lyses and evolutionary inference in allopolyploids. Brief Bioinform. 
2021:22(2):1819–1835. https://doi.org/10.1093/bib/bbaa035.

Irie N, Kuratani S. Comparative transcriptome analysis reveals verte-
brate phylotypic period during organogenesis. Nat Commun. 
2011:2(1):248. https://doi.org/10.1038/ncomms1248.

Jin Y, Tam OH, Paniagua E, Hammell M. TEtranscripts: a package for 
including transposable elements in differential expression ana-
lysis of RNA-seq datasets. Bioinformatics. 2015:31(22): 
3593–3599. https://doi.org/10.1093/bioinformatics/btv422.

Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with 
low memory requirements. Nat Methods. 2015:12(4):357–360. 
https://doi.org/10.1038/nmeth.3317.

Kopylova E, Noé L, Touzet H. SortMeRNA: fast and accurate filtering 
of ribosomal RNAs in metatranscriptomic data. Bioinformatics. 
2012:28(24):3211–3217. https://doi.org/10.1093/bioinformatics/ 
bts611.

Korsunsky I, Millard N, Fan J, Slowikowski K, Zhang F, Wei K, Baglaenko 
Y, Brenner M, Loh PR, Raychaudhuri S. Fast, sensitive and accurate 
integration of single-cell data with Harmony. Nat Methods. 2019: 
16(12):1289–1296. https://doi.org/10.1038/s41592-019-0619-0.

Kvam VM, Liu P, Si Y. A comparison of statistical methods for detect-
ing differentially expressed genes from RNA-seq data. Am J Bot. 
2012:99(2):248–256. https://doi.org/10.3732/ajb.1100340.

Li Z, Wang M, Lin K, Xie Y, Guo J, Ye L, Zhuang Y, Teng W, Ran X, 
Tong Y, et al. The bread wheat epigenomic map reveals distinct 
chromatin architectural and evolutionary features of functional 
genetic elements. Genome Biol. 2019:20(1):139. https://doi.org/ 
10.1186/s13059-019-1746-8.

Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose 
program for assigning sequence reads to genomic features. 

Allopolyploid Single-cell Data Integration and gene mining · https://doi.org/10.1093/molbev/msae178                      MBE

13

D
ow

nloaded from
 https://academ

ic.oup.com
/m

be/article/41/9/m
sae178/7746018 by C

entro Internacional de M
ejoram

iento de M
aiz y Trigo user on 14 N

ovem
ber 2024

http://academic.oup.com/mbe/article-lookup/doi/10.1093/molbev/msae178#supplementary-data
https://github.com/MeiyueWang/pgDQ
https://github.com/MeiyueWang/pgDQ
https://www.ncbi.nlm.nih.gov/geo/
https://doi.org/10.1271/bbb.90352
https://doi.org/10.1271/bbb.90352
https://doi.org/10.1038/s41587-021-00895-7
https://doi.org/10.1038/s41587-021-00895-7
https://doi.org/10.1016/j.cell.2015.12.021
https://doi.org/10.1016/j.cell.2015.12.021
https://doi.org/10.1093/bioinformatics/bty560
https://doi.org/10.1016/j.cpb.2014.09.002
https://doi.org/10.1093/bioinformatics/bts635
https://doi.org/10.1093/bioinformatics/bts635
https://doi.org/10.1086/700636
https://doi.org/10.1146/annurev.genet.42.110807.091524
https://doi.org/10.1146/annurev.genet.42.110807.091524
https://doi.org/10.1186/s13059-015-0721-2
https://doi.org/10.1186/s13059-015-0721-2
https://doi.org/10.1093/bib/bbaa035
https://doi.org/10.1038/ncomms1248
https://doi.org/10.1093/bioinformatics/btv422
https://doi.org/10.1038/nmeth.3317
https://doi.org/10.1093/bioinformatics/bts611
https://doi.org/10.1093/bioinformatics/bts611
https://doi.org/10.1038/s41592-019-0619-0
https://doi.org/10.3732/ajb.1100340
https://doi.org/10.1186/s13059-019-1746-8
https://doi.org/10.1186/s13059-019-1746-8


Bioinformatics. 2014:30(7):923–930. https://doi.org/10.1093/ 
bioinformatics/btt656.

Marioni JC, Arendt D. How single-cell genomics is changing evolu-
tionary and developmental biology. Annu Rev Cell Dev Biol. 
2017:33(1):537–553. https://doi.org/10.1146/annurev-cellbio- 
100616-060818.

McGinnis CS, Murrow LM, Gartner ZJ. DoubletFinder: doublet de-
tection in single-cell RNA sequencing data using artificial nearest 
neighbors. Cell Syst. 2019:8(4):329–337.e4. https://doi.org/10. 
1016/j.cels.2019.03.003.

Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and 
quantifying mammalian transcriptomes by RNA-Seq. Nat 
Methods. 2008:5(7):621–628. https://doi.org/10.1038/nmeth.1226.

Pfeifer M, Kugler KG, Sandve SR, Zhan BJ, Rudi H, Hvidsten TR, Mayer 
KFX, Olsen OA, Rogers J, Dolezel J, et al. Genome interplay in the 
grain transcriptome of hexaploid bread wheat. Science. 2014: 
345(6194):1250091. https://doi.org/10.1126/science.1250091.

Ramírez-González RH, Borrill P, Lang D, Harrington SA, Brinton J, 
Venturini L, Davey M, Jacobs J, van Ex F, Pasha A, et al. The tran-
scriptional landscape of polyploid wheat. Science. 2018: 
361(6403):eaar6089. https://doi.org/10.1126/science.aar6089.

Rodrigues MI, Takeda AA, Bravo JP, Maia IG. The eucalyptus tono-
plast intrinsic protein (TIP) gene subfamily: genomic organiza-
tion, structural features, and expression profiles. Front Plant 
Sci. 2016:7:1810. https://doi.org/10.3389/fpls.2016.01810.

Roulin A, Auer PL, Libault M, Schlueter J, Farmer A, May G, Stacey G, 
Doerge RW, Jackson SA. The fate of duplicated genes in a poly-
ploid plant genome. Plant J. 2013:73(1):143–153. https://doi. 
org/10.1111/tpj.12026.

Schranz ME, Mohammadin S, Edger PP. Ancient whole genome du-
plications, novelty and diversification: the WGD radiation lag- 
time model. Curr Opin Plant Biol. 2012:15(2):147–153. https:// 
doi.org/10.1016/j.pbi.2012.03.011.

Somssich M, Khan GA, Persson S. Cell wall heterogeneity in root de-
velopment of Arabidopsis. Front Plant Sci. 2016:7:1242. https:// 
doi.org/10.3389/fpls.2016.01242.

Song S, Chen Y, Liu L, See YHB, Mao C, Gan Y, Yu H. OsFTIP7 deter-
mines auxin-mediated anther dehiscence in rice. Nat Plants. 
2018:4(7):495–504. https://doi.org/10.1038/s41477-018-0175-0.

Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM 
3rd, Hao Y, Stoeckius M, Smibert P, Satija R. 2019. 
Comprehensive integration of single-cell data. Cell. 177(7): 
1888–1902.e21. https://doi.org/10.1016/j.cell.2019.05.031.

Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, 
Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, 
et al. Gene set enrichment analysis: a knowledge-based approach 
for interpreting genome-wide expression profiles. Proc Natl Acad 
Sci U S A. 2005:102(43):15545–15550. https://doi.org/10.1073/ 
pnas.0506580102.

Sun W, Gao D, Xiong Y, Tang X, Xiao X, Wang C, Yu S. Hairy leaf 6, an 
AP2/ERF transcription factor, interacts with OsWOX3B and reg-
ulates trichome formation in rice. Mol Plant. 2017:10(11): 
1417–1433. https://doi.org/10.1016/j.molp.2017.09.015.

Tang T, Tian S, Wang H, Lv X, Xie Y, Liu J, Wang M, Zhao F, Zhang W, 
Li H, et al. Wheat-RegNet: an encyclopedia of common wheat 
hierarchical regulatory networks. Mol Plant. 2023:16(2):318. 
https://doi.org/10.1016/j.molp.2022.12.018.

Toronto International Data Release Workshop Authors; Birney E, 
Hudson TJ, Green ED, Gunter C, Eddy S, Rogers J, Harris JR, 
Ehrlich SD, Apweiler R, et al. Prepublication data sharing. Nature. 
2009:461(7261):168–170. https://doi.org/10.1038/461168a.

Trapnell C, Cacchiarelli D, Grimsby J, Pokharel P, Li S, Morse M, 
Lennon NJ, Livak KJ, Mikkelsen TS, Rinn JL. The dynamics and reg-
ulators of cell fate decisions are revealed by pseudotemporal or-
dering of single cells. Nat Biotechnol. 2014:32(4):381–386. 
https://doi.org/10.1038/nbt.2859.

Van de Peer Y, Mizrachi E, Marchal K. The evolutionary significance 
of polyploidy. Nat Rev Genet. 2017:18(7):411–424. https://doi. 
org/10.1038/nrg.2017.26.

Vaquerizas JM, Kummerfeld SK, Teichmann SA, Luscombe NM. A 
census of human transcription factors: function, expression 
and evolution. Nat Rev Genet. 2009:10(4):252–263. https://doi. 
org/10.1038/nrg2538.

Wang M, Li Z, Zhang Y, Zhang Y, Xie Y, Ye L, Zhuang Y, Lin K, Zhao F, 
Guo J, et al. An atlas of wheat epigenetic regulatory elements re-
veals subgenome divergence in the regulation of development 
and stress responses. Plant Cell. 2021:33(4):865–881. https:// 
doi.org/10.1093/plcell/koab028.

Wendel JF, Lisch D, Hu G, Mason AS. The long and short of doubling 
down: polyploidy, epigenetics, and the temporal dynamics of 
genome fractionation. Curr Opin Genet Dev. 2018:49:1–7. 
https://doi.org/10.1016/j.gde.2018.01.004.

Wicker T, Gundlach H, Spannagl M, Uauy C, Borrill P, 
Ramírez-González RH, De Oliveira R; International Wheat 
Genome Sequencing Consortium; Mayer KFX, Paux E, et al. 
Impact of transposable elements on genome structure and evo-
lution in bread wheat. Genome Biol. 2018:19(1):103. https://doi. 
org/10.1186/s13059-018-1479-0.

Wimalanathan K, Lawrence-Dill CJ. Gene ontology meta annotator 
for plants (GOMAP). Plant Methods. 2021:17(1):54. https://doi. 
org/10.1186/s13007-021-00754-1.

Xie Q, Frugis G, Colgan D, Chua NH. Arabidopsis NAC1 transduces 
auxin signal downstream of TIR1 to promote lateral root devel-
opment. Genes Dev. 2000:14(23):3024–3036. https://doi.org/10. 
1101/gad.852200.

Yamaguchi M, Mitsuda N, Ohtani M, Ohme-Takagi M, Kato K, 
Demura T. VASCULAR-RELATED NAC-DOMAIN7 directly regu-
lates the expression of a broad range of genes for xylem vessel 
formation. Plant J. 2011:66(4):579–590. https://doi.org/10.1111/ 
j.1365-313X.2011.04514.x.

Yang JW, Chen YM, Jing Y, Green MR, Han L. Advancing CAR T cell 
therapy through the use of multidimensional omics data. Nat 
Rev Clin Oncol. 2023:20(4):211–228. https://doi.org/10.1038/ 
s41571-023-00729-2.

Yang YH, Guo M, Sun SY, Zou YL, Yin SY, Liu YN, Tang SZ, Gu MH, 
Yang ZF, Yan CJ. Natural variation of OsGluA2 is involved in 
grain protein content regulation in rice. Nat Commun. 
2019:10(1):1949. https://doi.org/10.1038/s41467-019-09919-y.

Yi X, Du Z, Su Z. PlantGSEA: a gene set enrichment analysis toolkit 
for plant community. Nucleic Acids Res. 2013:41(W1): 
W98–W103. https://doi.org/10.1093/nar/gkt281.

Yin Y, Zhu Q, Dai S, Lamb C, Beachy RN. RF2a, a bZIP transcriptional 
activator of the phloem-specific rice tungro bacilliform virus pro-
moter, functions in vascular development. EMBO J. 1997:16(17): 
5247–5259. https://doi.org/10.1093/emboj/16.17.5247.

Young MD, Behjati S. Soupx removes ambient RNA contamination 
from droplet-based single-cell RNA sequencing data. Gigascience. 
2020:9(12):giaa151. https://doi.org/10.1093/gigascience/giaa151.

Zhai N, Xu L. Pluripotency acquisition in the middle cell layer of cal-
lus is required for organ regeneration. Nat Plants. 2021:7(11): 
1453–1460. https://doi.org/10.1038/s41477-021-01015-8.

Zhang L, He C, Lai Y, Wang Y, Kang L, Liu A, Lan C, Su H, Gao Y, Li Z, 
et al. Asymmetric gene expression and cell-type-specific regula-
tory networks in the root of bread wheat revealed by single-cell 
multiomics analysis. Genome Biol. 2023:24(1):65. https://doi.org/ 
10.1186/s13059-023-02908-x.

Zhang Y, Li Z, Liu J, Zhang Y, Ye L, Peng Y, Wang H, Diao H, Ma Y, 
Wang M, et al. Transposable elements orchestrate subgenome- 
convergent and -divergent transcription in common wheat. 
Nat Commun. 2022:13(1):6940. https://doi.org/10.1038/s41467- 
022-34290-w.

Zhou J, Troyanskaya OG. An analytical framework for interpretable 
and generalizable single-cell data analysis. Nat Methods. 2021: 
18(11):1317–1321. https://doi.org/10.1038/s41592-021-01286-1.

Zou J, Chen Z, Zhang S, Zhang W, Jiang G, Zhao X, Zhai W, Pan X, Zhu 
L. Characterizations and fine mapping of a mutant gene for high 
tillering and dwarf in rice (Oryza sativa L). Planta. 2005:222(4): 
604–612. https://doi.org/10.1007/s00425-005-0007-0.

Wang et al. · https://doi.org/10.1093/molbev/msae178                                                                                        MBE

14

D
ow

nloaded from
 https://academ

ic.oup.com
/m

be/article/41/9/m
sae178/7746018 by C

entro Internacional de M
ejoram

iento de M
aiz y Trigo user on 14 N

ovem
ber 2024

https://doi.org/10.1093/bioinformatics/btt656
https://doi.org/10.1093/bioinformatics/btt656
https://doi.org/10.1146/annurev-cellbio-100616-060818
https://doi.org/10.1146/annurev-cellbio-100616-060818
https://doi.org/10.1016/j.cels.2019.03.003
https://doi.org/10.1016/j.cels.2019.03.003
https://doi.org/10.1038/nmeth.1226
https://doi.org/10.1126/science.1250091
https://doi.org/10.1126/science.aar6089
https://doi.org/10.3389/fpls.2016.01810
https://doi.org/10.1111/tpj.12026
https://doi.org/10.1111/tpj.12026
https://doi.org/10.1016/j.pbi.2012.03.011
https://doi.org/10.1016/j.pbi.2012.03.011
https://doi.org/10.3389/fpls.2016.01242
https://doi.org/10.3389/fpls.2016.01242
https://doi.org/10.1038/s41477-018-0175-0
https://doi.org/10.1016/j.cell.2019.05.031
https://doi.org/10.1073/pnas.0506580102
https://doi.org/10.1073/pnas.0506580102
https://doi.org/10.1016/j.molp.2017.09.015
https://doi.org/10.1016/j.molp.2022.12.018
https://doi.org/10.1038/461168a
https://doi.org/10.1038/nbt.2859
https://doi.org/10.1038/nrg.2017.26
https://doi.org/10.1038/nrg.2017.26
https://doi.org/10.1038/nrg2538
https://doi.org/10.1038/nrg2538
https://doi.org/10.1093/plcell/koab028
https://doi.org/10.1093/plcell/koab028
https://doi.org/10.1016/j.gde.2018.01.004
https://doi.org/10.1186/s13059-018-1479-0
https://doi.org/10.1186/s13059-018-1479-0
https://doi.org/10.1186/s13007-021-00754-1
https://doi.org/10.1186/s13007-021-00754-1
https://doi.org/10.1101/gad.852200
https://doi.org/10.1101/gad.852200
https://doi.org/10.1111/j.1365-313X.2011.04514.x
https://doi.org/10.1111/j.1365-313X.2011.04514.x
https://doi.org/10.1038/s41571-023-00729-2
https://doi.org/10.1038/s41571-023-00729-2
https://doi.org/10.1038/s41467-019-09919-y
https://doi.org/10.1093/nar/gkt281
https://doi.org/10.1093/emboj/16.17.5247
https://doi.org/10.1093/gigascience/giaa151
https://doi.org/10.1038/s41477-021-01015-8
https://doi.org/10.1186/s13059-023-02908-x
https://doi.org/10.1186/s13059-023-02908-x
https://doi.org/10.1038/s41467-022-34290-w
https://doi.org/10.1038/s41467-022-34290-w
https://doi.org/10.1038/s41592-021-01286-1
https://doi.org/10.1007/s00425-005-0007-0

	A Quantitative Computational Framework for Allopolyploid Single-Cell Data Integration and Core Gene Ranking in Development
	Introduction
	Results
	Challenges of Single-Cell Data Analysis in Polyploids
	Framework for the Quantification of Stochastic Cell-wise and Gene-wise Subgenome Divergence During Development
	Evaluation of the Utility of pgDQ for Quantifying Subgenome Divergence at the Cellular Level
	pgDQ Quantifies Subgenome Divergence at Single-Cell Resolution Along the Evolutionary Trajectory
	divGene Detects Genes Indispensable for Subgenome Divergence During Differentiation
	Ranking Indispensable Genes to Integrate Multimodal Data and Facilitate Mechanism Insights
	Applying pgDQ to High-Dimensional Tissue Developmental Data

	Discussion
	Materials and Methods
	Plant Materials and Growth Conditions
	Single-Cell RNA-seq Library Construction and Sequencing
	Processing of Single-Cell RNA-seq Data
	Cell Clustering and Reconstruction of Cell Trajectories
	Processing of RNA-seq Data
	Definition of Subgenome Expression Divergence Using Conventional Method in Common Wheat
	Quantification of Subgenome Divergence at Cellular Level Using pgDQ
	Simulated Datasets and Evaluation
	Calculating Gene ISs Using pgDQ
	Gene Set Enrichment Analysis of Ranked Indispensable Genes
	Quantification of Subgenome Divergence at Cellular Level Using pgDQ in Simulated Pseudo-tetraploids

	Supplementary Material
	Funding
	Author Contributions
	Conflict of Interest
	Data Availability
	References