A Quantitative Computational Framework for Allopolyploid Single-Cell Data Integration and Core Gene Ranking in Development Meiyue Wang ,1,2,3,† Zijuan Li ,4,† Haoyu Wang ,5,† Junwei Zhao,1 Yuyun Zhang ,4 Kande Lin ,6 Shusong Zheng ,7 Yilong Feng ,6 Yu’e Zhang ,7 Wan Teng ,8 Yiping Tong ,8 Wenli Zhang ,6 Yongbiao Xue ,7 Hude Mao ,9 Hao Li ,5 Bo Zhang ,10 Awais Rasheed,11,12 Sridhar Bhavani,13 Chenghong Liu ,3 Hong-Qing Ling ,7,14,* Yue-Qing Hu ,2,* Yijing Zhang 2,* 1Beijing Life Science Academy, Beijing, China 2State Key Laboratory of Genetic Engineering, Department of Biochemistry, Collaborative Innovation Center of Genetics and Development, School of Life Sciences, Institute of Plant Biology, Fudan University, Shanghai 200438, China 3Biotechnology Research Institute, Shanghai Academy of Agricultural Sciences\Shanghai Key Laboratory of Agricultural Genetics and Breeding, Shanghai 201106, China 4National Key Laboratory of Plant Molecular Genetics, CAS Center for Excellence in Molecular Plant Sciences, Shanghai Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200032, China 5State Key Laboratory of Crop Stress Adaptation and Improvement, School of Life Sciences, College of Agriculture, Henan University, Kaifeng, Henan 457004, China 6National Key Laboratory for Crop Genetics and Germplasm Enhancement and Utilization, CIC-MCP, Nanjing Agricultural University, Nanjing, Jiangsu 210095, China 7State Key Laboratory of Plant Cell and Chromosome Engineering, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing 100101, China 8Key Lab of Seed Innovation, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing 100101, China 9State Key Laboratory for Crop Stress Resistance and High-Efficiency Production, College of Agronomy, Northwest A&F University, Yangling, Shaanxi 712100, China 10Key Laboratory of Adaptation and Evolution of Plateau Biota, Northwest Institute of Plateau Biology, Chinese Academy of Sciences, Xining, Qinghai 81008, China 11Department of Plant Sciences, Quaid-i-Azam University, Islamabad 45320, Pakistan 12International Maize and Wheat Improvement Center (CIMMYT), China Office, c/o CAAS, Beijing, 100081, China 13International Maize and Wheat Improvement Center (CIMMYT), Km. 45, Carretera, México-Veracruz, El Batán, Texcoco CP 56237E do. de México, Mexico 14Yazhouwan National Laboratory, Sanya, Hainan 572025, China †These authors contributed equally to this work. *Corresponding authors: E-mails: hqling@genetics.ac.cn; yuehu@fudan.edu.cn; zhangyijing@fudan.edu.cn. Associate editor: Emily Josephs Abstract Polyploidization drives regulatory and phenotypic innovation. How the merger of different genomes contributes to polyploid development is a fundamental issue in evolutionary developmental biology and breeding research. Clarifying this issue is challenging because of genome complexity and the difficulty in tracking stochastic subgenome divergence during development. Recent single-cell sequencing techniques enabled probing subgenome-divergent regu- lation in the context of cellular differentiation. However, analyzing single-cell data suffers from high error rates due to high dimensionality, noise, and sparsity, and the errors stack up in polyploid analysis due to the increased dimension- ality of comparisons between subgenomes of each cell, hindering deeper mechanistic understandings. In this study, we develop a quantitative computational framework, called “pseudo-genome divergence quantification” (pgDQ), for quantifying and tracking subgenome divergence directly at the cellular level. Further comparing with cellular Received: April 28, 2024. Revised: July 20, 2024. Accepted: August 13, 2024 © The Author(s) 2024. Published by Oxford University Press on behalf of Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https:// creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact reprints@oup.com for reprints and translation rights for reprints. All other permissions can be obtained through our RightsLink service via the Permissions link on the article page on our site —for further information please contact journals.permissions@oup.com. Open Access M ethods Mol. Biol. Evol. 41(9):msae178 https://doi.org/10.1093/molbev/msae178 Advance Access publication August 30, 2024 1 D ow nloaded from https://academ ic.oup.com /m be/article/41/9/m sae178/7746018 by C entro Internacional de M ejoram iento de M aiz y Trigo user on 14 N ovem ber 2024 https://orcid.org/0000-0003-4999-4432 https://orcid.org/0000-0003-1694-1019 https://orcid.org/0009-0000-3511-1454 https://orcid.org/0000-0003-1341-8310 https://orcid.org/0000-0003-2013-7269 https://orcid.org/0000-0001-8074-1009 https://orcid.org/0009-0004-7460-2925 https://orcid.org/0000-0003-3618-8706 https://orcid.org/0000-0001-6874-7854 https://orcid.org/0000-0002-0586-6853 https://orcid.org/0000-0003-0710-1966 https://orcid.org/0000-0002-6895-8472 https://orcid.org/0000-0002-2484-9952 https://orcid.org/0000-0001-8271-9163 https://orcid.org/0000-0003-0007-5312 https://orcid.org/0000-0002-3556-5648 https://orcid.org/0000-0001-9988-2282 https://orcid.org/0000-0002-5730-0317 https://orcid.org/0000-0001-9568-9389 mailto:hqling@genetics.ac.cn mailto:yuehu@fudan.edu.cn mailto:zhangyijing@fudan.edu.cn https://creativecommons.org/licenses/by-nc/4.0/ https://creativecommons.org/licenses/by-nc/4.0/ https://doi.org/10.1093/molbev/msae178 differentiation trajectories derived from single-cell RNA sequencing data allows for an examination of the relationship between subgenome divergence and the progression of development. pgDQ produces robust results and is insensitive to data dropout and noise, avoiding high error rates due to multiple comparisons of genes, cells, and subgenomes. A statistical diagnostic approach is proposed to identify genes that are central to subgenome divergence during develop- ment, which facilitates the integration of different data modalities, enabling the identification of factors and pathways that mediate subgenome-divergent activity during development. Case studies have demonstrated that applying pgDQ to single-cell and bulk tissue transcriptomic data promotes a systematic and deeper understanding of how dynamic subgenome divergence contributes to developmental trajectories in polyploid evolution. Key words: pgDQ, scRNA-seq, allopolyploid, subgenome diversity, evo-devo. Introduction Allopolyploidization, a process involving hybridization and subsequent genome doubling, has the potential to fix hy- brid vigor and drive adaptive evolution (Doyle et al. 2008; Schranz et al. 2012; Van de Peer et al. 2017; Wendel et al. 2018). Unraveling the effects of allopolyploidization on transcriptional and developmental innovation is essential to elucidate the causal chain from the merger of genomes to phenotypic evolution (Roulin et al. 2013). Many studies have shown subgenome-divergent transcription across various tissues (Pfeifer et al. 2014; Ramírez-González et al. 2018; Li et al. 2019; Wang et al. 2021). However, the bulk populations of millions of cells meant that the expres- sion value for each gene was an average of its expression across a population of input cells, which is insufficient for providing the necessary resolution to conclusively de- cipher molecular or cellular behavior. Moreover, ensemble measures are inappropriate for investigating the stochastic regulation of subgenome divergence during development, which is the prerequisite for elucidating how genome di- vergence contributes to allopolyploid development, a fun- damental question in the evolutionary developmental biology (evo-devo) of polyploidy (De Storme and Mason 2014; Doyle and Coate 2019). Recent advancements in single-cell RNA sequencing (scRNA-seq) have emerged as a powerful approach for recon- structing cellular differentiation trajectories (Marioni and Arendt 2017; Zhai and Xu 2021; Zhou and Troyanskaya 2021; Yang et al. 2023), offering great potential for elucidating complex regulatory mechanisms underlying developmental dynamics of divergent subgenomes in allopolyploids. Unfortunately, to fully harness the rich information of single- cell data in polyploidy studies, several important technical challenges must be addressed. Single-cell data analysis entails high-dimensional comparisons of genes and cells, resulting in elevated error rates (Argelaguet et al. 2021). These errors are further compounded in polyploids due to an additional layer of subgenome comparisons (Hu et al. 2021). Coupled with data sparsity, fully utilizing single-cell data in polyploids poses significant challenges. Furthermore, while the integration of single-cell data with existing information-rich datasets—such as gene functional annota- tions including Gene Ontology (GO) terms, Pfam protein domain annotations, and the transcription factor (TF) tar- get gene sets—offers valuable insights for elucidating under- lying biological mechanisms, effective methods of data integration are still lacking. Addressing these challenges could further enhance our comprehension of the intricate relationship between genomic diversity and the process of cellular differentiation in allopolyploid species. A critical requirement for allopolyploid scRNA-seq data analysis is the robust quantification of subgenome transcrip- tional divergence during cellular differentiation. In this study, we establish a quantitative computational framework de- signed to analyze dynamic subgenome divergence of tran- scriptome regulation using both single-cell and bulk tissue transcriptomic data. Our framework demonstrates robust- ness and effectiveness in identifying key genes contributing to subgenome divergence during development, which facil- itates further integration with different data modalities and efficiently connects genome divergence with developmental processes in allopolyploid wheat. Results Challenges of Single-Cell Data Analysis in Polyploids We highlight the limitations of single-cell data analysis based on conventional methods using real data. scRNA-seq data from the root tip of the common wheat cultivar “Chinese Spring” were prefiltered at the cellular and gene levels, yield- ing a matrix of gene expression levels across 4,935 cells clus- tered into 20 groups (Fig. 1a and b, supplementary figs. S1 and S2, Supplementary Material online). To detect subge- nome divergence, conventional methods first identify differ- entially expressed triad genes (with 1:1:1 correspondence across the A, B, and D subgenomes) between subgenomes in each sample (tissue, cell cluster, or cell), followed by com- paring the divergence across genes or samples as previously described (Pfeifer et al. 2014; Ramírez-González et al. 2018) (Fig. 1c and d). Within the dataspace, the relative expression of a homoeolog in one subgenome against the overall three- subgenome expression determines a triad’s position in the ternary plot. The relative distance of each point (a triad gene) to the balance point [1/3, 1/3, 1/3] representing a sub- genome dominance level is used to construct a distance ma- trix for triad genes across cells (Fig. 1e). For qualitative analysis, seven categories of homoeolog expression bias are defined based on this distance matrix, as shown by the bulk root RNA-seq data (Fig. 1f), which are further compared across cells. For quantitative analysis, cellular subgenome di- vergence is determined by calculating the average for all ex- pressed genes in each cell (Fig. 1g). Consequently, scRNA-seq Wang et al. · https://doi.org/10.1093/molbev/msae178 MBE 2 D ow nloaded from https://academ ic.oup.com /m be/article/41/9/m sae178/7746018 by C entro Internacional de M ejoram iento de M aiz y Trigo user on 14 N ovem ber 2024 http://academic.oup.com/mbe/article-lookup/doi/10.1093/molbev/msae178#supplementary-data http://academic.oup.com/mbe/article-lookup/doi/10.1093/molbev/msae178#supplementary-data data analysis based on conventional methods necessitates multifaceted comparisons across genes, cells, and subge- nomes. However, such an analytical process, even within a single sample, can result in a high false-positive rate when scrutinizing the subgenome divergence across a multitude of genes (Fig. 1h; Kvam et al. 2012; Hu et al. 2021). In the situ- ation of n independent tests for n respective triad genes, if the chance of observing a false positive is 0.05 given that there is no subgenome bias for a triad gene, then a family- wise error rate, the probability of at least one false rejection among n tests, is 1−(1−0.05)n. Specifically, this probability escalates to roughly 9.75% upon the independent evaluation of two genes and further ascends to ∼14.26% with the as- sessment of three genes. The implementation of multiple test correction techniques, such as the Bonferroni correction method, leads to a marked reduction in the number of genes identified as exhibiting subgenome bias. Moreover, single- cell data inherently provide limited detectable signals (Fig. 1i), resulting in the detection of fewer genes associated with subgenome divergence (Fig. 1j, on average, 985 triad genes, accounting for 6% of all triad genes, exhibit subge- nome bias per cell) compared with bulk tissue transcrip- tomic data (Fig. 1f, 6,702 triad genes, representing 40% of all triad genes). Consequently, the data sparsity and exten- sive comparisons between subgenomes and across cells lead to a considerable decrease in statistical power. Framework for the Quantification of Stochastic Cell-wise and Gene-wise Subgenome Divergence During Development To address the above-mentioned challenges, we propose a computational workflow, called “pseudo-genome diver- gence quantification” (pgDQ), for quantifying dynamic sub- genome divergence of transcriptome regulation in the context of cell differentiation, followed by the identification of essential genes on the basis of the contribution to subge- nome divergence during development (Fig. 2). We demon- strate the framework using the wheat root tip single-cell data. The original data matrix is partitioned by subgenomes and reshaped into a 4,935 × 16,427 × 3 expression tensor, corresponding to the expression profiles of 16,427 triad genes across three subgenomes within 4,935 cells. For each cell, we construct three pseudo-genome expression vectors Q A , Q B , Q D of all 16,427 triad genes according to the subge- nome origin. In the dataspace, the Euclidean distance among these three high-dimensional vectors preserves the subge- nome divergence in the original data. In this way, the com- parison of subgenome transcriptome divergence per cell transforms to a comparison of the Euclidean distance be- tween pseudo-samples in a high-dimensional space. This method directly reflects the overall subgenome divergence within cells, circumventing the high error rates associated with averaging subgenome divergence across thousands of genes. The quantitative measure of subgenome divergence remains consistent irrespective of the mapping strategy se- lected (supplementary fig. S3, Supplementary Material on- line). Moreover, by scoring the subgenome distance as a whole rather than relying on the expression of individual genes, this approach is expected to be robust against drop- outs and noise. Further mapping the quantitative subgenome-divergent transcription within each cell onto the pseudotime developmental trajectory of single-cell data helps elucidate subgenome divergence during cellular differentiation, providing deeper insights into polyploid evo- lution and polyploid development. As part of pgDQ, we developed the divGene algorithm (details in Fig. 5a) to score the contribution of each gene to subgenome divergence during cell differentiation using a statistical diagnostics approach. The resulting quantitative measurement of the contribution of each gene to subge- nome diversity is useful for the subsequent integration of other data types. For example, ranking genes according to the contribution score and comparing the enrichment of GO terms or the binding of TFs may lead to rapid iden- tification of the upstream regulators and downstream pathways associated with the top-ranked genes. Evaluation of the Utility of pgDQ for Quantifying Subgenome Divergence at the Cellular Level To evaluate the capability of pgDQ in handling the inher- ent sparsity and noise in scRNA-seq data (Fig. 3a), we com- pared its performance with the conventional method using the full dataset and simulated datasets to mimic low- coverage or high-noise conditions. The subgenome diver- gence measured using each dataset was compared with the result for the full dataset. Using simulated missing values (10% to 90%), we calcu- lated Pearson’s correlation coefficient to clarify the associ- ation between the resulting divergence and the divergence revealed by the original scRNA-seq data. As shown in Fig. 3b, pgDQ consistently performed better than the con- ventional method. Notably, Pearson’s correlation coefficient was still >0.85 when the simulated missing value was 50%. We next added simulated technical noise (Poisson’s distribu- tion, λ = 1) to the scRNA-seq data (Fig. 3a, right panel). The comparison with the original dataset indicated pgDQ was in- sensitive to the added noise (Fig. 3c). Thus, in terms of the robustness against data dropout and noise, which are typical problems associated with single-cell data analyses, pgDQ outperforms the conventional approach. pgDQ Quantifies Subgenome Divergence at Single-Cell Resolution Along the Evolutionary Trajectory We are now able to robustly quantify and compare subge- nome divergence at the single-cell level. Understanding how this divergence dynamically changes during cellular differen- tiation is central to understanding the linkage between sub- genome divergence and polyploid development. There are already well-established algorithms, such as pseudotime tra- jectory (Trapnell et al. 2014), that utilize single-cell data to infer the quantitative trajectory of cellular differentiation during development. Based on this approach, the initial cell cluster, hypothesized to be the meristem, was identified Allopolyploid Single-cell Data Integration and gene mining · https://doi.org/10.1093/molbev/msae178 MBE 3 D ow nloaded from https://academ ic.oup.com /m be/article/41/9/m sae178/7746018 by C entro Internacional de M ejoram iento de M aiz y Trigo user on 14 N ovem ber 2024 http://academic.oup.com/mbe/article-lookup/doi/10.1093/molbev/msae178#supplementary-data (a) (b) (c) (d) (e) (f) (g) (h) (i) (j) Fig. 1. Limitations of the conventional method for detecting subgenome divergence using allopolyploid wheat single-cell data. a) An UMAP plot depicting 20 cell clusters from wheat root tips, with each dot representing an individual cell and colors distinguishing the clusters. b) A single-cell expression matrix where each row corresponds to a single cell and each column corresponds to a triad gene from one subgenome. c) A heatmap displaying the expression levels of triad genes across the three subgenomes in a single cell, with gray indicating genes not detected in a given cell, highlighting the data sparsity inherent to scRNA-seq. d) An enlarged heatmap displaying the subgenome-biased expression groups of triad genes. e) A ternary plot illustrating quantifying subgenome dominance based on the relative expression abundance of 16,427 triads (49,281 genes) in hexaploid wheat. Each dot representing a triad gene and the relative distance to three vertices represents its relative expression level in one subgenome in comparison with the total expression across the three subgenomes. The axes represent the relative expression levels across the three subgenomes. f) A ternary plot illustrating the classification of subgenome-biased transcription groups based on the bulk root tran- scriptomic data. A total of 6,702 triad genes, representing 40% of all triad genes, show subgenome bias in bulk data. g) The matrix of distance of each triad gene expression to the balance point, derived from panel e, illustrates subgenome divergence across individual genes within all cells. Cross-cell comparisons are based on averaged gene expression to evaluate collective subgenome divergence. h) A graphical representation il- lustrating the exponential rise in the family-wise error rate as the number of comparisons increases. i) A ternary plot showing the subgenome- biased transcription of triad genes across subgenomes in a single cell. On average, 985 traid genes, accounting for 6% of all triad genes, exhibit subgenome bias per cell. The plot reveals a limited number of detectable genes per cell due to the sparsity of single-cell data. j) A comparative chart of the proportion of divergent triads detected in root tip single cells and bulk root tissue. Wang et al. · https://doi.org/10.1093/molbev/msae178 MBE 4 D ow nloaded from https://academ ic.oup.com /m be/article/41/9/m sae178/7746018 by C entro Internacional de M ejoram iento de M aiz y Trigo user on 14 N ovem ber 2024 from previously published wheat root tip single-cell data (Zhang et al. 2023; Fig. 4a). For each cell, there were two quantitative metrics, namely the “pseudotime ordering score,” representing the cellular differentiation during devel- opment (Fig. 4b), and the “subgenome divergence score” as determined by pgDQ (Fig. 4c). Clear association was Fig. 2. A pgDQ workflow. Separation by subgenome: the original matrix is reshaped into an expression tensor with expression values separated by subgenome, resulting in three distinct pseudo-genomes. For each cell (take cell 1 for example), these pseudo-genomes are represented as points in a multidimensional space, where subgenome divergence is directly indicated by the Euclidean distance between these points. Quantification of subgenome divergence in development: for each cell, the Euclidean distance between the three pseudo-genomes was calcu- lated, providing a direct measure of subgenome divergence for that cell. The ⇀ symbol represents a vector of gene expression values in each pseudo-genome. This transformation converts the complex statistical issue of detecting subgenome divergence into a simple calculation of spa- tial distances, minimizing errors from extensive multiple comparisons. By mapping the quantitative subgenome divergence onto cell differen- tiation trajectories ordered by pseudo-time, the stochastic subgenome divergence could be linked to cell differentiation during development. Key gene detection and data integration: the quantitative cell-level subgenome divergence could be further applied for quantifying the contri- bution of genes to subgenome divergence during cell differentiation (the divGene algorithm, detailed description in Fig. 5a), which serves as the basis for integrating different data types, facilitating mechanism insights into evo-devo. Allopolyploid Single-cell Data Integration and gene mining · https://doi.org/10.1093/molbev/msae178 MBE 5 D ow nloaded from https://academ ic.oup.com /m be/article/41/9/m sae178/7746018 by C entro Internacional de M ejoram iento de M aiz y Trigo user on 14 N ovem ber 2024 detected between these two quantitative features (Fig. 4d), which aligns with the evo-devo theory that certain develop- mental stages echo the evolutionary trajectory (Irie and Kuratani 2011) at the cellular transcriptional level. The asso- ciation between subgenome divergence and differentiation was also detected within a single-cell cluster (meristem) dis- tinguished by Uniform Manifold Approximation and Projection (UMAP) (Fig. 4e). These results highlight the sig- nificance of quantifying subgenome divergence at the cellu- lar level, offering a detailed understanding of how evolution and development are interconnected. divGene Detects Genes Indispensable for Subgenome Divergence During Differentiation Genes that contribute to progressive subgenome diver- gence may drive or be closely related to cellular differenti- ation. However, comparing gene expression across cells is challenging because of data sparsity; only 2% to 35% of genes can be detected in a single cell (Fig. 1i and j). This coupled with large-scale multiple comparisons across sub- genomes and cells leads to unreliable gene-wise results. The divGene algorithm utilizes cellular divergence and dif- ferentiation information to guide gene identification and is insensitive to data dropout. The basis is that if leaving out one gene significantly affects the correlation between subge- nome divergence and cell differentiation as reflected by pseudotime ordering, then the gene is potentially indispens- able for divergence-mediated differentiation. Briefly, the cor- relation between subgenome divergence and cell differentiation across all 4,935 cells was determined before and after leaving out one triad gene (Fig. 5a). Because of the nonlinear correlation between the divergence score and the pseudotime ordering score (Fig. 4d), we applied Spearman’s rank correlation coefficient (ρ). The ρ value was calculated for each of the 16,427 triads. An indispensabil- ity score (IS) was calculated as ln 2/(1 + r) ( 􏼁 (Fig. 5b). Figure 5c illustrates the subgenome-divergent expression of TaTIP1 (a top-ranked gene) whose overexpression can pro- mote development (Rodrigues et al. 2016). If the top 2.5% of the triad genes were left out, the subgenome divergence trend during differentiation was no longer apparent (Fig. 5d). In summary, divGene effectively detected key genes associated with subgenome divergence during development. Ranking Indispensable Genes to Integrate Multimodal Data and Facilitate Mechanism Insights The major challenge in single-cell data analyses lies in inter- preting the results and integrating multimodal data to gain insights into biological mechanisms. A gene list-centric analysis which involves comparing lists of biologically signifi- cant genes curated from different data modalities is a powerful strategy broadly applicable for integrating data (Subramanian et al. 2005; Yi et al. 2013). This threshold-free approach can overcome the limitations of conventional ana- lyses that rely on a single arbitrary P-value threshold that does not preserve the weight of different genes and may dis- card relevant biological information. (a) (b) (c) Fig. 3. Robustness of pgDQ against data dropout and noise. a) A scheme with simulated data dropout and noise. Distribution of the correlation between the original data and the datasets with simulated sparsity (b) and noise (c). The x axis represents the fraction of genes with adjusted expression values due to introduced sparsity or noise in the simulation. Wang et al. · https://doi.org/10.1093/molbev/msae178 MBE 6 D ow nloaded from https://academ ic.oup.com /m be/article/41/9/m sae178/7746018 by C entro Internacional de M ejoram iento de M aiz y Trigo user on 14 N ovem ber 2024 In this study, genes were ranked according to the IS, re- flecting their contribution to genome divergence during development and serving as a promising anchor for inte- grating data. By comparing the ranked genes with lists of genes characterized according to the associated pathways and functions in previously annotated databases, we deter- mined that genes with a high IS were mainly related to cell wall biogenesis, membrane dynamics, and cell tip growth (Fig. 6a and b), which are well-characterized processes me- diating differentiation (Barberon et al. 2016; Somssich et al. 2016). While the overall trend shows a positive correlation between subgenome divergence and pseudotime ordering (Fig. 4), it is observed that a few functional gene sets display decreased subgenome divergence during the devel- opmental process (supplementary fig. S4, Supplementary Material online). Genes in the top enriched pathways displayed increased subgenome divergence during differ- entiation (Fig. 6b). Their candidate function in linking sub- genome divergence and polyploid development is an interesting issue for further study. TFs play key roles in differentiation (Vaquerizas et al. 2009). The hundreds of TF-binding target lists in (a) (b) (c) (d) (e) Fig. 4. Superimposing quantitative subgenome divergence onto the pseudotime developmental trajectory. a) An expression pattern of the meri- stem marker gene TaH2AX in UMAP. b) Cells plotted along the pseudotime trajectory, with colors indicating the developmental hierarchy based on the pseudotime ordering score. The progression from less differentiated to more differentiated states is visualized, originating from the meri- stem. c) The cellular subgenome divergence scores, as calculated by the pgDQ method, are overlaid on the UMAP plot, with colors representing the degree of subgenome divergence within individual cells. d) The distribution of subgenome divergence scores across all 20 cell clusters is depicted, with clusters ranked according to their average pseudotime ordering score. This ranking illustrates the relationship between cellular differentiation and subgenome divergence. e) The Spearman’s rank correlation coefficient is calculated to assess the association between the pseudotime ordering score and the subgenome divergence score for each cell within the meristem cluster. The scatter plot displays the cells as dots, with the strength and direction of the correlation evident through the distribution and clustering patterns. Allopolyploid Single-cell Data Integration and gene mining · https://doi.org/10.1093/molbev/msae178 MBE 7 D ow nloaded from https://academ ic.oup.com /m be/article/41/9/m sae178/7746018 by C entro Internacional de M ejoram iento de M aiz y Trigo user on 14 N ovem ber 2024 http://academic.oup.com/mbe/article-lookup/doi/10.1093/molbev/msae178#supplementary-data http://academic.oup.com/mbe/article-lookup/doi/10.1093/molbev/msae178#supplementary-data Wheat-RegNet profiled using DNA affinity purification sequencing (DAP-seq) (Tang et al. 2023) were com- pared with a gene list ranked by the IS. A rank-based statistical test suggested that bZIP and NAC TF families are mostly likely involved in regulating subgenome divergence during differentiation (Fig. 6c and d). Members of these TF families are responsible for phloem, xylem, and root hair development reported previously (Yin et al. 1997; Xie et al. 2000; Yamaguchi et al. 2011). Together, ranking indispensable genes facil- itates the integration of other data modalities and the interpretation of biological significance, allowing a dee- per understanding of the mechanisms linking polyploid evolution and development. Applying pgDQ to High-Dimensional Tissue Developmental Data In addition to single-cell data, pgDQ is readily applicable to quantifying, integrating, and interpreting other types of high-dimensional data in polyploids. Subgenome- divergent transcription differs to varying degrees in different tissues. How divergent expression relates and contributes to development has long been a research focus of polyploid research (Pfeifer et al. 2014; Wicker et al. 2018). However, comparing subgenome divergence across a large amount of tissues is statistically difficult, limiting further data integration and mechanism elucidation. Here, we applied pgDQ to study the effects of wheat (a) (b) (c) (d) Fig. 5. divGene identifies essential genes potentially driving subgenome divergence during development. a) A scheme depicting the divGene algorithm for calculation of the IS. The Spearman’s rank correlation coefficient (ρ) for the subgenome divergence in cells after leaving out one triad gene was calculated. The IS was then calculated as ln 2/(1 + ρ) ( 􏼁 , followed by Z-score normalization. b) Distribution of the IS for all triad genes. The top-ranked functional genes were annotated. c) An expression pattern of the top-ranked triad gene TaTIP1. d) Cellular sub- genome divergence in an UMAP before and after leaving out the top 2.5% of genes. Wang et al. · https://doi.org/10.1093/molbev/msae178 MBE 8 D ow nloaded from https://academ ic.oup.com /m be/article/41/9/m sae178/7746018 by C entro Internacional de M ejoram iento de M aiz y Trigo user on 14 N ovem ber 2024 subgenome divergence on tissue developmental processes. RNA-seq data from 209 wheat developmental stages of roots, leaves, shoots, spikes, and grains were downloaded from the Toronto International Data Release (Toronto International Data Release Workshop Authors et al. 2009). Samples with different tissues and developmental stages could be distinguished by UMAP (Fig. 7a). A pseudotime analysis was performed and the developmental trajec- tory was projected onto UMAP (Fig. 7b). pgDQ was ap- plied to quantify subgenome-divergent scores in each tissue, which were also superimposed upon UMAP. The dynamic subgenome-divergent transcription coin- cided well with the progression of tissue development (Fig. 7b and c), consistent with the finding at the cellular level (Fig. 4). To detect genes contributing to subgenome divergence during development, ISs for all triad genes were calculated based on the divGene algorithm. The top-ranked genes in- clude well-characterized factors that control various devel- opmental processes (Fig. 7d). For example, BEI and GluA regulate grain development (Akasaka et al. 2009; Yang et al. 2019), FTIP7 regulates flower development (Song et al. 2018), and HTD1 and HL6 control plant vegetative growth (Zou et al. 2005; Sun et al. 2017). Further integration with functional pathways and TF-binding profiles detected similar results with single-cell analysis, with cell wall organ- ization and NAC families among the top enriched pathways and TF-mediating subgenome-divergent development (Fig. 7e and f). Overall, pgDQ uncovers a significant correl- ation between subgenome divergence and cell differenti- ation in common wheat, facilitating the identification of essential genes contributing to subgenome divergence dur- ing development and enhancing the understanding of de- velopmental mechanisms in polyploids. Discussion As the cost of high-throughput data continues to decrease, single-cell and large-scale bulk tissue transcriptomic data are gradually becoming prevalent for evo-devo studies, posing significant challenges to data comparison, integra- tion, and interpretation. The analysis of polyploids is far more difficult because of the additional layer of compari- son across subgenomes in each individual cell. The lack of available computational tools has impeded the wide- spread use of single-cell and large-scale bulk sequencing (a) (c) (b) (d) Fig. 6. Leveraging IS for multimodal data integration. a) Top enriched functional terms for genes ranked by their ISs using GSEA (Subramanian et al. 2005). The x axis represents the genes ordered by the normalized IS, with the distribution of IS shown in the bottom panel. The middle heatmap visualizes the ranked list position for genes sharing one GO term. The top panel depicts the enrichment score, assessing the degree of correlation between gene set membership and gene ranking based on IS. The enrichment plot for the top five enriched terms is shown, which are detailed in b. b) Analysis of enriched functional terms and cellular subgenome divergence. Left, list of top ten enriched functional terms as iden- tified in a. The normalized enrichment score (NES), P-value (P), and false discovery rate (FDR) are determined by GSEA package. Right, a heatmap depicting cellular subgenome divergence, quantified by equation (1) in the “Materials and Methods” section, with n denoting the number of genes associated with each specified GO term. The enriched functional processes contributed to subgenome divergence during cellular devel- opment. c) and d) Analogous to a) and b), these panels show the enrichment analysis for gene sets based on TF target genes identified by DAP-seq (Zhang, et al. 2022) and curated in the Wheat-RegNet database (Tang et al. 2023). Allopolyploid Single-cell Data Integration and gene mining · https://doi.org/10.1093/molbev/msae178 MBE 9 D ow nloaded from https://academ ic.oup.com /m be/article/41/9/m sae178/7746018 by C entro Internacional de M ejoram iento de M aiz y Trigo user on 14 N ovem ber 2024 data in polyploid research. In this study, we designed a quantitative computational framework to overcome these limitations and address key issues related to polyploid research. First, to robustly detect subgenome divergence at the cellular level, we restructured the data by segregating the expression values of genes from each subgenome, effect- ively transforming the complex, multidimensional diver- gence analysis into a more tractable comparison of Euclidean distances between pseudo-genomes within a high-dimensional space. This transformation preserved the original data information, while avoiding the high error rate caused by the high-dimensional comparison, noise, and sparsity, as illustrated in Fig. 2. Second, to robustly de- tect the contribution of genes to subgenome divergence during development, we adopted a top-down strategy coupled with statistical diagnostics to overcome specific issues (i.e. data noise and sparsity) presented by single-cell data. Third, the results of the above quantitative compar- isons were easily integrated with other data modalities, avoiding the loss of biologically significant genes due to the use of an arbitrary P-value threshold, which is a serious issue as the number of comparisons increases for integrat- ing data. It should be noted that quantifying subgenome diver- gence by pgDQ relies on the clear distinction between in- dividual subgenomes. It is not suitable for processing polyploids without a phased subgenome annotation. We also expanded the application of the pgDQ method be- yond hexaploid wheat to include tetraploids. By simulating (a) (d) (b) (c) (e) (f) Fig. 7. The application of pgDQ in tissue-wise subgenome-divergent detection and data integration. a) An UMAP plot illustrating the develop- mental stages of various wheat tissues. b) An UMAP plot with samples colored by pseudotime, indicating progression from immature to mature states. c) An UMAP plot with samples colored by subgenome divergence scores, as determined by pgDQ, with color intensity reflecting the divergence magnitude. d) Distribution of the divGene-calculated ISs for triad genes. The top enriched functional genes are labeled. The most enriched functional terms (e) and TFs (f) linked to genes ordered by their ISs. Wang et al. · https://doi.org/10.1093/molbev/msae178 MBE 10 D ow nloaded from https://academ ic.oup.com /m be/article/41/9/m sae178/7746018 by C entro Internacional de M ejoram iento de M aiz y Trigo user on 14 N ovem ber 2024 three pseudo-tetraploid genomes from the hexaploid wheat, we obtained results that were consistently in line with those from the hexaploid wheat (supplementary fig. S5, Supplementary Material online). The method’s effect- iveness grows with the increase in ploidy, which is pertin- ent considering the prevalence of polyploidy among cultivated crops. As the scale and availability of polyploids rapidly in- crease, we anticipate that pgDQ, which has diverse appli- cations and is simple to use, will become a valuable tool that can be rapidly adopted by researchers focused on polyploids, potentially leading to insightful evo-devo discoveries. Materials and Methods Plant Materials and Growth Conditions Common wheat (Triticum aestivum cultivar “Chinese Spring”) seeds were surface-sterilized by a 10-min incuba- tion in 30% H2O2 and then thoroughly washed five times with distilled water. The seeds were germinated in water for 3 d at 22 °C. The root tips (0.5 cm) of the seedlings were harvested for scRNA-seq experiment. Single-Cell RNA-seq Library Construction and Sequencing The root tips from 80 to 100 wheat plants were harvested and digested in digestion buffer at room temperature for 2 h (1.5% cellulase R10, 1.5% macerozyme R10, 0.3 M man- nitol, 3 mM β-mercaptoethanol, 10 mM CaCl2, and 0.3% Bovine Serum Albumin (BSA)). Single-cell suspensions were loaded onto 10× Genomics Chromium to capture ∼6,000 single cells following the manufacturer’s instructions for the 10× Genomics Chromium Single-Cell 3′ kit (V3). cDNA amplification, library construction, and sequencing on the Illumina NovaSeq 6000 platform (paired-end multi- plexing run, 150 bp) were performed by LC-Bio Technology (Hangzhou, China). Processing of Single-Cell RNA-seq Data Data from 10× Genomics were analyzed by Cell Ranger (v7.0.0) (https://www.10xgenomics.com/support/software/ cell–ranger/latest) to perform alignment, barcode count- ing, and unique molecular identifier (UMI) counting. The reference genome (v1.0) and GTF annotation files (v1.1) of T. aestivum were downloaded from the International Wheat Genome Sequencing Consortium (IWGSC). Uniquely mapped reads were used for UMI counting, which account for 95% of total aligned reads (supplementary fig. S6, Supplementary Material online). To assess whether the inclusion of the remaining 5% of multimapped reads affects the quantification of subge- nome divergence, we compared various algorithms de- signed for handling multimapped reads, including the Uniform, PropUnique, EM, and Rescue algorithms available in STARsolo (v2.7.11b) (Dobin et al. 2013). The Uniform algorithm allocates multigene UMIs evenly across all genes within the multigene set. The PropUnique algorithm assigns multigene UMIs propor- tionally based on the number of unique UMIs per gene. The EM algorithm utilizes maximum likelihood estima- tion to distribute multigene UMIs among their respective genes (Jin et al. 2015). The Rescue algorithm distributes multigene UMIs to their gene set proportionally to the sum of the number of unique-gene UMIs and uniformly distributed multigene UMIs in each gene (Mortazavi et al. 2008). Cell Clustering and Reconstruction of Cell Trajectories The initial steps involved preprocessing the gene–cell ma- trices obtained from Cell Ranger to eliminate ambient or leaked RNA by “autoEstCont” and “adjustCounts” func- tions in SoupX (v1.6.2) package (Young and Behjati 2020). The processed data were imported into the Seurat (v4.0) R package (Stuart et al. 2019) for further qual- ity control and analysis. The low-quality cells and genes were filtered according to the following criteria: 1) Cells with an expressed gene count between 500 and 10,000 were kept (“subset” function with parameters “nFeature_RNA <10,000 and nFeature_RNA >500”). 2) Cells with UMIs between 500 and 20,000 were kept (“subset” function with “nCount_RNA <20,000 and nCount_RNA >500”). 3) Genes expressed in three or fewer cells were removed. 4) Cells with a mitochondrial gene exceeding 2% were excluded (“subset” function with “percent.mt <2”). 5) DoubletFinder (v2.0.4) (McGinnis et al. 2019) was employed for the detection of doublets, retaining only those cells identified as singlets. “NormalizeData” function with the method “LogNormalize” and a scaling factor 10,000 was used for normalization. This process involved dividing the feature counts for each cell by the total counts for that cell and then multiplying by the scale factor. Harmony (v0.1.0) (Korsunsky et al. 2019) was used to integrate replicates. Highly variable genes were identified using the “FindVariableFeatures” function with the “method” set to “vst” and “nfeatures” specified as 2,000. For principal component analysis, the scaled data were reduced to 100 principal components (set npcs = 100). Clustering of cells was identified using “FindClusters” func- tion with a resolution of 0.5. The data structures were vi- sualized by UMAP (“RunUMAP” function with “dims set to 1:50”). The R package Monocle (v2.22.0) was used to analyze the process of cell differentiation (pseudo-time ordering) and the determination of cell fate. First, marker genes with an adjusted P-value <0.05 were defined as develop- mental progress specific genes. Second, “reduceDimension” Allopolyploid Single-cell Data Integration and gene mining · https://doi.org/10.1093/molbev/msae178 MBE 11 D ow nloaded from https://academ ic.oup.com /m be/article/41/9/m sae178/7746018 by C entro Internacional de M ejoram iento de M aiz y Trigo user on 14 N ovem ber 2024 http://academic.oup.com/mbe/article-lookup/doi/10.1093/molbev/msae178#supplementary-data http://academic.oup.com/mbe/article-lookup/doi/10.1093/molbev/msae178#supplementary-data https://www.10xgenomics.com/support/software/cell-ranger/latest https://www.10xgenomics.com/support/software/cell-ranger/latest http://academic.oup.com/mbe/article-lookup/doi/10.1093/molbev/msae178#supplementary-data function with parameters “max_components = 2” and “method = ‘DDRTree’” was used to reduce the dimensional- ity of the data. Last, “orderCells” function was then used to order cells in pseudotime along a trajectory. The state with clusters expressing marker genes of meristem cells was set as the root state. Processing of RNA-seq Data We downloaded root RNA-seq data from the NCBI GEO database (accession number GSE198284). The sequen- cing reads were trimmed with fastp (v0.22.0) (Chen et al. 2018), which eliminated bases with low quality scores (<20) as well as short reads (<20 bp) and sequen- cing adapters. The remaining clean reads were aligned to the IWGSC RefSeq genome assembly version 1.0 using HISAT2 program (v2.1.0) (Kim et al. 2015). SortMeRNA (Kopylova et al. 2012) was used to remove reads origin- ating from chloroplast, mitochondria, and rRNA. The featureCount program of the Subread package (v1.5.3) (Liao et al. 2014) was used to determine the RNA-seq read density of the high-confidence genes from version 1.1. Definition of Subgenome Expression Divergence Using Conventional Method in Common Wheat OrthoFinder (Emms and Kelly 2015) was used to identify the orthologous genes between subgenomes. The orthogroups with only one copy in each subgenome (1:1:1) were defined as triad genes. A total of 16,427 triad genes were used for downstream analysis. The relative ex- pression was determined by the relative proportion of read counts in one subgenome to the overall counts from three subgenomes. The subgenome-biased and unbiased expres- sions were defined by the relative position to the seven predefined standard coordinates in the ternary plot. Specifically, the proportion triad [1/3, 1/3, 1/3] represents a balanced expression or binding. The proportion triads [0.5, 0.5, 0], [0, 0, 1], [0.5, 0, 0.5], [0, 1, 0], [0, 0.5, 0.5], and [1, 0, 0] represent divergent patterns (D suppressed, D dominant, B suppressed, B dominant, A suppressed, and A dominant, respectively). Subgenome gene expression di- vergence pattern was determined by the nearest standard proportion pattern. Quantification of Subgenome Divergence at Cellular Level Using pgDQ Each subgenome was treated as an individual pseudo- genome. For each cell, subgenome divergence (distance, Dist0) across pseudo-genomes in dataspace was calculated using the following formula: Dist0 = ����������������� 􏽘n i=1 (Ai − Bi) 2 􏽲 + ����������������� 􏽘n i=1 (Ai − Di) 2 􏽲 + ����������������� 􏽘n i=1 (Bi − Di) 2 􏽲 (1) where Ai, Bi, and Di are the expression values of triad gene i and n is the total number of triad genes. The Euclidean dis- tance is implemented in R package rdist (v0.0.5). Simulated Datasets and Evaluation To compare the robustness between the pgDQ and the conventional methods, we simulated scRNA-seq datasets with 10% to 90% missing values using “sample” and “re- place” function in R. The Poisson distributed noise was also randomly added to the original data. Each simulation was performed 150 times, and the distribution of the simu- lated variables is shown in supplementary fig. S7, Supplementary Material online. For pgDQ, the Pearson’s correlation coefficient between the Dist0 of all cells of si- mulated data and original data was calculated. For the conventional method, the distance of relative expression of each triad gene to balance point [0.33, 0.33, 0.33] was calculated for each cell. Then, the correlations between the distances of all cells of simulated data and original data for all triad genes were compared with correlations generated by pgDQ. Calculating Gene ISs Using pgDQ For each fixed triad gene j, the expression values Aj, Bj, and Dj were set to zero, and then, subgenome divergence (dis- tance, Distj) for each cell was calculated using formula (1). Pseudotime for all cells (T0) was extracted from the cell dif- ferentiation trajectory that had been constructed earlier. Then, the Spearman’s rank correlation coefficient (ρ) be- tween the T0 and the Distj of all cells was calculated using cor function in R package stats (v3.6.2). The IS of the triad gene was calculated using the following formula: IS = ln 2 1 + ρ 􏼒 􏼓 (2) Next, the IS was Z-score-normalized, and we ranked all genes according to the normalized IS for further analysis. Gene Set Enrichment Analysis of Ranked Indispensable Genes Genes ranked by the normalized IS were used to calculate the enrichment score of GO terms or the targets of TFs using gene set enrichment analysis (GSEA) (v4.3.2) (Subramanian et al. 2005). The GO terms were collected from GOMAP (Wimalanathan and Lawrence-Dill 2021), and those with fewer than 15 triad genes were filtered out. Quantification of Subgenome Divergence at Cellular Level Using pgDQ in Simulated Pseudo-tetraploids We simulated three pseudo-tetraploid datasets by ex- tracting data from any two subgenomes of hexaploid wheat dataset. For each cell in each pseudo-tetraploid, subgenome divergence (distance, Dist) across pseudo- genomes in dataspace was calculated using the following formula: Wang et al. · https://doi.org/10.1093/molbev/msae178 MBE 12 D ow nloaded from https://academ ic.oup.com /m be/article/41/9/m sae178/7746018 by C entro Internacional de M ejoram iento de M aiz y Trigo user on 14 N ovem ber 2024 http://academic.oup.com/mbe/article-lookup/doi/10.1093/molbev/msae178#supplementary-data http://academic.oup.com/mbe/article-lookup/doi/10.1093/molbev/msae178#supplementary-data DistAB = ����������������� 􏽘n i=1 (Ai − Bi) 2 􏽲 (3) DistAD = ����������������� 􏽘n i=1 (Ai − Di) 2 􏽲 (4) DistBD = ����������������� 􏽘n i=1 (Bi − Di) 2 􏽲 (5) where Ai, Bi, and Di are the expression values of homoeo- logous gene i and n is the total number of homoeologous genes. The Euclidean distance is implemented in R package rdist (v0.0.5). Supplementary Material Supplementary material is available at Molecular Biology and Evolution online. Funding This study was supported by the Open Project Program of the State Key Laboratory for Crop Stress Resistance and High- Efficiency Production (Y.J.Z), CAS “Light of West China” for the Interdisciplinary Innovation Team (No. xbzg-zdsys- 202109)(B.Z), National Natural Science Foundation of China (32270628) (Y.J.Z), National Key R&D Program of China (2023YFF1205101) (Y.Q.H.), State Key Laboratory of Crop Stress Adaptation and Improvement (2024KF03) (Y.J.Z), Yangfan Project of Shanghai Science and Technology Commission (23YF1403100) (M.Y.W), State Key Laboratory of Plant Cell and Chromosome Engineering, Institute of Genetics and Developmental Biology (PCCE-KF-2023-03) (Y.J.Z), and grant from Shanghai Key Laboratory of Agricultural Genetics and Breeding (21DZ2271900) (M.Y.W), Beijing Life Science Academy (BLSA), No: 2024400CD0100 (Y.J.Z), State Key Laboratory of Genetic Engineering (SKLGE-2312) (W.L.Z), National Natural Science Foundation of China (31290210) (H.-Q.L). Author Contributions Y.J.Z., Y.Q.H., H.Q.L., C.H.L., B.Z., H.L., H.D.M., Y.B.X., and W.L.Z. conceived and designed the experiments. Z.J.L., K.D.L., S.S.Z., Y.L.F., Y.Z., and W.T. performed the experiments. M.Y.W., H.Y.W., Y.Y.Z, Y.P.T., A.R., S.B., and Y.J.Z. analyzed the data. Y.J.Z. wrote the manuscript with input from all authors. Conflict of Interest The authors declare no conflict of interest. Data Availability The source code for pgDQ is released under an MIT open source license and is available from GitHub (https:// github.com/MeiyueWang/pgDQ). The sequencing data are available in the Gene Expression Omnibus repository https://www.ncbi.nlm.nih.gov/geo/ under accession num- ber GSE224228. References Akasaka T, Vu NT, Chaen K, Nishi A, Satoh H, Ida H, Omori T, Kimura M. The action of rice branching enzyme I (BEI) on starches. Biosci Biotechnol Biochem. 2009:73(11):2516–2518. https://doi.org/10. 1271/bbb.90352. Argelaguet R, Cuomo ASE, Stegle O, Marioni JC. Computational prin- ciples and challenges in single-cell data integration. Nat Biotechnol. 2021:39(10):1202–1215. https://doi.org/10.1038/ s41587-021-00895-7. Barberon M, Vermeer JE, De Bellis D, Wang P, Naseer S, Andersen TG, Humbel BM, Nawrath C, Takano J, Salt DE, et al. Adaptation of root function by nutrient-induced plasticity of endodermal dif- ferentiation. Cell. 2016:164(3):447–459. https://doi.org/10.1016/ j.cell.2015.12.021. Chen SF, Zhou YQ, Chen YR, Gu J. Fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018:34(17):i884–i890. https://doi.org/10.1093/bioinformatics/bty560. De Storme N, Mason A. Plant speciation through chromosome in- stability and ploidy change: cellular mechanisms, molecular fac- tors and evolutionary relevance. Curr Plant Biol. 2014:1:10–33. https://doi.org/10.1016/j.cpb.2014.09.002. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013:29(1):15–21. https://doi.org/10. 1093/bioinformatics/bts635. Doyle JJ, Coate JE. Polyploidy, the nucleotype, and novelty: the im- pact of genome doubling on the biology of the cell. Int J Plant Sci. 2019:180(1):1–52. https://doi.org/10.1086/700636. Doyle JJ, Flagel LE, Paterson AH, Rapp RA, Soltis DE, Soltis PS, Wendel JF. Evolutionary genetics of genome merger and doubling in plants. Annu Rev Genet. 2008:42(1):443–461. https://doi.org/ 10.1146/annurev.genet.42.110807.091524. Emms DM, Kelly S. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup infer- ence accuracy. Genome Biol. 2015:16(1):157. https://doi.org/10. 1186/s13059-015-0721-2. Hu G, Grover CE, Arick MA, Liu M, Peterson DG, Wendel JF. Homoeologous gene expression and co-expression network ana- lyses and evolutionary inference in allopolyploids. Brief Bioinform. 2021:22(2):1819–1835. https://doi.org/10.1093/bib/bbaa035. Irie N, Kuratani S. Comparative transcriptome analysis reveals verte- brate phylotypic period during organogenesis. Nat Commun. 2011:2(1):248. https://doi.org/10.1038/ncomms1248. Jin Y, Tam OH, Paniagua E, Hammell M. TEtranscripts: a package for including transposable elements in differential expression ana- lysis of RNA-seq datasets. Bioinformatics. 2015:31(22): 3593–3599. https://doi.org/10.1093/bioinformatics/btv422. Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nat Methods. 2015:12(4):357–360. https://doi.org/10.1038/nmeth.3317. Kopylova E, Noé L, Touzet H. SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data. Bioinformatics. 2012:28(24):3211–3217. https://doi.org/10.1093/bioinformatics/ bts611. Korsunsky I, Millard N, Fan J, Slowikowski K, Zhang F, Wei K, Baglaenko Y, Brenner M, Loh PR, Raychaudhuri S. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat Methods. 2019: 16(12):1289–1296. https://doi.org/10.1038/s41592-019-0619-0. Kvam VM, Liu P, Si Y. A comparison of statistical methods for detect- ing differentially expressed genes from RNA-seq data. Am J Bot. 2012:99(2):248–256. https://doi.org/10.3732/ajb.1100340. Li Z, Wang M, Lin K, Xie Y, Guo J, Ye L, Zhuang Y, Teng W, Ran X, Tong Y, et al. The bread wheat epigenomic map reveals distinct chromatin architectural and evolutionary features of functional genetic elements. Genome Biol. 2019:20(1):139. https://doi.org/ 10.1186/s13059-019-1746-8. Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Allopolyploid Single-cell Data Integration and gene mining · https://doi.org/10.1093/molbev/msae178 MBE 13 D ow nloaded from https://academ ic.oup.com /m be/article/41/9/m sae178/7746018 by C entro Internacional de M ejoram iento de M aiz y Trigo user on 14 N ovem ber 2024 http://academic.oup.com/mbe/article-lookup/doi/10.1093/molbev/msae178#supplementary-data https://github.com/MeiyueWang/pgDQ https://github.com/MeiyueWang/pgDQ https://www.ncbi.nlm.nih.gov/geo/ https://doi.org/10.1271/bbb.90352 https://doi.org/10.1271/bbb.90352 https://doi.org/10.1038/s41587-021-00895-7 https://doi.org/10.1038/s41587-021-00895-7 https://doi.org/10.1016/j.cell.2015.12.021 https://doi.org/10.1016/j.cell.2015.12.021 https://doi.org/10.1093/bioinformatics/bty560 https://doi.org/10.1016/j.cpb.2014.09.002 https://doi.org/10.1093/bioinformatics/bts635 https://doi.org/10.1093/bioinformatics/bts635 https://doi.org/10.1086/700636 https://doi.org/10.1146/annurev.genet.42.110807.091524 https://doi.org/10.1146/annurev.genet.42.110807.091524 https://doi.org/10.1186/s13059-015-0721-2 https://doi.org/10.1186/s13059-015-0721-2 https://doi.org/10.1093/bib/bbaa035 https://doi.org/10.1038/ncomms1248 https://doi.org/10.1093/bioinformatics/btv422 https://doi.org/10.1038/nmeth.3317 https://doi.org/10.1093/bioinformatics/bts611 https://doi.org/10.1093/bioinformatics/bts611 https://doi.org/10.1038/s41592-019-0619-0 https://doi.org/10.3732/ajb.1100340 https://doi.org/10.1186/s13059-019-1746-8 https://doi.org/10.1186/s13059-019-1746-8 Bioinformatics. 2014:30(7):923–930. https://doi.org/10.1093/ bioinformatics/btt656. Marioni JC, Arendt D. How single-cell genomics is changing evolu- tionary and developmental biology. Annu Rev Cell Dev Biol. 2017:33(1):537–553. https://doi.org/10.1146/annurev-cellbio- 100616-060818. McGinnis CS, Murrow LM, Gartner ZJ. DoubletFinder: doublet de- tection in single-cell RNA sequencing data using artificial nearest neighbors. Cell Syst. 2019:8(4):329–337.e4. https://doi.org/10. 1016/j.cels.2019.03.003. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008:5(7):621–628. https://doi.org/10.1038/nmeth.1226. Pfeifer M, Kugler KG, Sandve SR, Zhan BJ, Rudi H, Hvidsten TR, Mayer KFX, Olsen OA, Rogers J, Dolezel J, et al. Genome interplay in the grain transcriptome of hexaploid bread wheat. Science. 2014: 345(6194):1250091. https://doi.org/10.1126/science.1250091. Ramírez-González RH, Borrill P, Lang D, Harrington SA, Brinton J, Venturini L, Davey M, Jacobs J, van Ex F, Pasha A, et al. The tran- scriptional landscape of polyploid wheat. Science. 2018: 361(6403):eaar6089. https://doi.org/10.1126/science.aar6089. Rodrigues MI, Takeda AA, Bravo JP, Maia IG. The eucalyptus tono- plast intrinsic protein (TIP) gene subfamily: genomic organiza- tion, structural features, and expression profiles. Front Plant Sci. 2016:7:1810. https://doi.org/10.3389/fpls.2016.01810. Roulin A, Auer PL, Libault M, Schlueter J, Farmer A, May G, Stacey G, Doerge RW, Jackson SA. The fate of duplicated genes in a poly- ploid plant genome. Plant J. 2013:73(1):143–153. https://doi. org/10.1111/tpj.12026. Schranz ME, Mohammadin S, Edger PP. Ancient whole genome du- plications, novelty and diversification: the WGD radiation lag- time model. Curr Opin Plant Biol. 2012:15(2):147–153. https:// doi.org/10.1016/j.pbi.2012.03.011. Somssich M, Khan GA, Persson S. Cell wall heterogeneity in root de- velopment of Arabidopsis. Front Plant Sci. 2016:7:1242. https:// doi.org/10.3389/fpls.2016.01242. Song S, Chen Y, Liu L, See YHB, Mao C, Gan Y, Yu H. OsFTIP7 deter- mines auxin-mediated anther dehiscence in rice. Nat Plants. 2018:4(7):495–504. https://doi.org/10.1038/s41477-018-0175-0. Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM 3rd, Hao Y, Stoeckius M, Smibert P, Satija R. 2019. Comprehensive integration of single-cell data. Cell. 177(7): 1888–1902.e21. https://doi.org/10.1016/j.cell.2019.05.031. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005:102(43):15545–15550. https://doi.org/10.1073/ pnas.0506580102. Sun W, Gao D, Xiong Y, Tang X, Xiao X, Wang C, Yu S. Hairy leaf 6, an AP2/ERF transcription factor, interacts with OsWOX3B and reg- ulates trichome formation in rice. Mol Plant. 2017:10(11): 1417–1433. https://doi.org/10.1016/j.molp.2017.09.015. Tang T, Tian S, Wang H, Lv X, Xie Y, Liu J, Wang M, Zhao F, Zhang W, Li H, et al. Wheat-RegNet: an encyclopedia of common wheat hierarchical regulatory networks. Mol Plant. 2023:16(2):318. https://doi.org/10.1016/j.molp.2022.12.018. Toronto International Data Release Workshop Authors; Birney E, Hudson TJ, Green ED, Gunter C, Eddy S, Rogers J, Harris JR, Ehrlich SD, Apweiler R, et al. Prepublication data sharing. Nature. 2009:461(7261):168–170. https://doi.org/10.1038/461168a. Trapnell C, Cacchiarelli D, Grimsby J, Pokharel P, Li S, Morse M, Lennon NJ, Livak KJ, Mikkelsen TS, Rinn JL. The dynamics and reg- ulators of cell fate decisions are revealed by pseudotemporal or- dering of single cells. Nat Biotechnol. 2014:32(4):381–386. https://doi.org/10.1038/nbt.2859. Van de Peer Y, Mizrachi E, Marchal K. The evolutionary significance of polyploidy. Nat Rev Genet. 2017:18(7):411–424. https://doi. org/10.1038/nrg.2017.26. Vaquerizas JM, Kummerfeld SK, Teichmann SA, Luscombe NM. A census of human transcription factors: function, expression and evolution. Nat Rev Genet. 2009:10(4):252–263. https://doi. org/10.1038/nrg2538. Wang M, Li Z, Zhang Y, Zhang Y, Xie Y, Ye L, Zhuang Y, Lin K, Zhao F, Guo J, et al. An atlas of wheat epigenetic regulatory elements re- veals subgenome divergence in the regulation of development and stress responses. Plant Cell. 2021:33(4):865–881. https:// doi.org/10.1093/plcell/koab028. Wendel JF, Lisch D, Hu G, Mason AS. The long and short of doubling down: polyploidy, epigenetics, and the temporal dynamics of genome fractionation. Curr Opin Genet Dev. 2018:49:1–7. https://doi.org/10.1016/j.gde.2018.01.004. Wicker T, Gundlach H, Spannagl M, Uauy C, Borrill P, Ramírez-González RH, De Oliveira R; International Wheat Genome Sequencing Consortium; Mayer KFX, Paux E, et al. Impact of transposable elements on genome structure and evo- lution in bread wheat. Genome Biol. 2018:19(1):103. https://doi. org/10.1186/s13059-018-1479-0. Wimalanathan K, Lawrence-Dill CJ. Gene ontology meta annotator for plants (GOMAP). Plant Methods. 2021:17(1):54. https://doi. org/10.1186/s13007-021-00754-1. Xie Q, Frugis G, Colgan D, Chua NH. Arabidopsis NAC1 transduces auxin signal downstream of TIR1 to promote lateral root devel- opment. Genes Dev. 2000:14(23):3024–3036. https://doi.org/10. 1101/gad.852200. Yamaguchi M, Mitsuda N, Ohtani M, Ohme-Takagi M, Kato K, Demura T. VASCULAR-RELATED NAC-DOMAIN7 directly regu- lates the expression of a broad range of genes for xylem vessel formation. Plant J. 2011:66(4):579–590. https://doi.org/10.1111/ j.1365-313X.2011.04514.x. Yang JW, Chen YM, Jing Y, Green MR, Han L. Advancing CAR T cell therapy through the use of multidimensional omics data. Nat Rev Clin Oncol. 2023:20(4):211–228. https://doi.org/10.1038/ s41571-023-00729-2. Yang YH, Guo M, Sun SY, Zou YL, Yin SY, Liu YN, Tang SZ, Gu MH, Yang ZF, Yan CJ. Natural variation of OsGluA2 is involved in grain protein content regulation in rice. Nat Commun. 2019:10(1):1949. https://doi.org/10.1038/s41467-019-09919-y. Yi X, Du Z, Su Z. PlantGSEA: a gene set enrichment analysis toolkit for plant community. Nucleic Acids Res. 2013:41(W1): W98–W103. https://doi.org/10.1093/nar/gkt281. Yin Y, Zhu Q, Dai S, Lamb C, Beachy RN. RF2a, a bZIP transcriptional activator of the phloem-specific rice tungro bacilliform virus pro- moter, functions in vascular development. EMBO J. 1997:16(17): 5247–5259. https://doi.org/10.1093/emboj/16.17.5247. Young MD, Behjati S. Soupx removes ambient RNA contamination from droplet-based single-cell RNA sequencing data. Gigascience. 2020:9(12):giaa151. https://doi.org/10.1093/gigascience/giaa151. Zhai N, Xu L. Pluripotency acquisition in the middle cell layer of cal- lus is required for organ regeneration. Nat Plants. 2021:7(11): 1453–1460. https://doi.org/10.1038/s41477-021-01015-8. Zhang L, He C, Lai Y, Wang Y, Kang L, Liu A, Lan C, Su H, Gao Y, Li Z, et al. Asymmetric gene expression and cell-type-specific regula- tory networks in the root of bread wheat revealed by single-cell multiomics analysis. Genome Biol. 2023:24(1):65. https://doi.org/ 10.1186/s13059-023-02908-x. Zhang Y, Li Z, Liu J, Zhang Y, Ye L, Peng Y, Wang H, Diao H, Ma Y, Wang M, et al. Transposable elements orchestrate subgenome- convergent and -divergent transcription in common wheat. Nat Commun. 2022:13(1):6940. https://doi.org/10.1038/s41467- 022-34290-w. Zhou J, Troyanskaya OG. An analytical framework for interpretable and generalizable single-cell data analysis. Nat Methods. 2021: 18(11):1317–1321. https://doi.org/10.1038/s41592-021-01286-1. Zou J, Chen Z, Zhang S, Zhang W, Jiang G, Zhao X, Zhai W, Pan X, Zhu L. Characterizations and fine mapping of a mutant gene for high tillering and dwarf in rice (Oryza sativa L). Planta. 2005:222(4): 604–612. https://doi.org/10.1007/s00425-005-0007-0. Wang et al. · https://doi.org/10.1093/molbev/msae178 MBE 14 D ow nloaded from https://academ ic.oup.com /m be/article/41/9/m sae178/7746018 by C entro Internacional de M ejoram iento de M aiz y Trigo user on 14 N ovem ber 2024 https://doi.org/10.1093/bioinformatics/btt656 https://doi.org/10.1093/bioinformatics/btt656 https://doi.org/10.1146/annurev-cellbio-100616-060818 https://doi.org/10.1146/annurev-cellbio-100616-060818 https://doi.org/10.1016/j.cels.2019.03.003 https://doi.org/10.1016/j.cels.2019.03.003 https://doi.org/10.1038/nmeth.1226 https://doi.org/10.1126/science.1250091 https://doi.org/10.1126/science.aar6089 https://doi.org/10.3389/fpls.2016.01810 https://doi.org/10.1111/tpj.12026 https://doi.org/10.1111/tpj.12026 https://doi.org/10.1016/j.pbi.2012.03.011 https://doi.org/10.1016/j.pbi.2012.03.011 https://doi.org/10.3389/fpls.2016.01242 https://doi.org/10.3389/fpls.2016.01242 https://doi.org/10.1038/s41477-018-0175-0 https://doi.org/10.1016/j.cell.2019.05.031 https://doi.org/10.1073/pnas.0506580102 https://doi.org/10.1073/pnas.0506580102 https://doi.org/10.1016/j.molp.2017.09.015 https://doi.org/10.1016/j.molp.2022.12.018 https://doi.org/10.1038/461168a https://doi.org/10.1038/nbt.2859 https://doi.org/10.1038/nrg.2017.26 https://doi.org/10.1038/nrg.2017.26 https://doi.org/10.1038/nrg2538 https://doi.org/10.1038/nrg2538 https://doi.org/10.1093/plcell/koab028 https://doi.org/10.1093/plcell/koab028 https://doi.org/10.1016/j.gde.2018.01.004 https://doi.org/10.1186/s13059-018-1479-0 https://doi.org/10.1186/s13059-018-1479-0 https://doi.org/10.1186/s13007-021-00754-1 https://doi.org/10.1186/s13007-021-00754-1 https://doi.org/10.1101/gad.852200 https://doi.org/10.1101/gad.852200 https://doi.org/10.1111/j.1365-313X.2011.04514.x https://doi.org/10.1111/j.1365-313X.2011.04514.x https://doi.org/10.1038/s41571-023-00729-2 https://doi.org/10.1038/s41571-023-00729-2 https://doi.org/10.1038/s41467-019-09919-y https://doi.org/10.1093/nar/gkt281 https://doi.org/10.1093/emboj/16.17.5247 https://doi.org/10.1093/gigascience/giaa151 https://doi.org/10.1038/s41477-021-01015-8 https://doi.org/10.1186/s13059-023-02908-x https://doi.org/10.1186/s13059-023-02908-x https://doi.org/10.1038/s41467-022-34290-w https://doi.org/10.1038/s41467-022-34290-w https://doi.org/10.1038/s41592-021-01286-1 https://doi.org/10.1007/s00425-005-0007-0 A Quantitative Computational Framework for Allopolyploid Single-Cell Data Integration and Core Gene Ranking in Development Introduction Results Challenges of Single-Cell Data Analysis in Polyploids Framework for the Quantification of Stochastic Cell-wise and Gene-wise Subgenome Divergence During Development Evaluation of the Utility of pgDQ for Quantifying Subgenome Divergence at the Cellular Level pgDQ Quantifies Subgenome Divergence at Single-Cell Resolution Along the Evolutionary Trajectory divGene Detects Genes Indispensable for Subgenome Divergence During Differentiation Ranking Indispensable Genes to Integrate Multimodal Data and Facilitate Mechanism Insights Applying pgDQ to High-Dimensional Tissue Developmental Data Discussion Materials and Methods Plant Materials and Growth Conditions Single-Cell RNA-seq Library Construction and Sequencing Processing of Single-Cell RNA-seq Data Cell Clustering and Reconstruction of Cell Trajectories Processing of RNA-seq Data Definition of Subgenome Expression Divergence Using Conventional Method in Common Wheat Quantification of Subgenome Divergence at Cellular Level Using pgDQ Simulated Datasets and Evaluation Calculating Gene ISs Using pgDQ Gene Set Enrichment Analysis of Ranked Indispensable Genes Quantification of Subgenome Divergence at Cellular Level Using pgDQ in Simulated Pseudo-tetraploids Supplementary Material Funding Author Contributions Conflict of Interest Data Availability References