Euphytica (2022) 218:154 https://doi.org/10.1007/s10681-022-03103-y Genomic prediction of drought tolerance during seedling stage in maize using low‑cost molecular markers Ao Zhang · Shan Chen · Zhenhai Cui · Yubo Liu · Yuan Guan · Shuang Yang · Jingtao Qu · Juchao Nie · Dongdong Dang · Cong Li · Xiaomei Dong · Jinjuan Fan · Yanshu Zhu · Xuecai Zhang · Jose Crossa · Huiying Cao · Yanye Ruan · Hongjian Zheng Received: 15 December 2021 / Accepted: 8 September 2022 / Published online: 8 October 2022 © The Author(s) 2022 Abstract Drought tolerance in maize is a com- of 379 inbred lines were genotyped with diver- plex and polygenic trait, especially in the seedling sity arrays technology (DArT) and genotyping-by- stage. In plant breeding, complex genetic traits can sequencing (GBS) platforms. Target traits of seedling be improved by genomic selection (GS), which has emergence rate (ER), seedling plant height (SPH), become a practical and effective breeding tool. In and grain yield (GY) were evaluated under two natu- the present study, a natural maize population named ral drought stress environments in northeast China. Northeast China core population (NCCP) consisting Adequate genetic variations were observed for all the target traits, but they were divergent across environ- ments. Similarly, the heritability of the target trait Ao Zhang and Shan Chen have contributed equally to the research presented in this article. also varied across years and environments, the herit- abilities in 2019 (0.88, 0.82, 0.85 for ER, SPH, GY) Supplementary Information The online version were higher than those in 2020 (0.65, 0.53, 0.33) and contains supplementary material available at https:// doi. cross-2-years (0.32, 0.26, 0.33). In total, three marker org/ 10. 1007/ s10681- 022- 03103-y. A. Zhang · Y. Liu · Y. Guan · D. Dang · X. Zhang · Z. Cui  H. Zheng (*)  Key Laboratory of Soybean Molecular Design Breeding, CIMMYT-China Specialty Maize Research Center, Crop Northeast Institute of Geography and Agroecology, Breeding and Cultivation Research Institute, Shanghai Chinese Academy of Sciences,  Changchun,  Jilin, China Academy of Agricultural Sciences, Shanghai, China e-mail: hjzh6188@163.com S. Yang  Shenyang Academy of Agricultural Sciences, Shenyang, A. Zhang · S. Chen · D. Dang · C. Li · X. Dong · J. Fan · Liaoning, China Y. Zhu · H. Cao · Y. Ruan (*)  College of Bioscience and Biotechnology, Shenyang J. Qu  Agricultural University, Shenyang, Liaoning, China Maize Research Institute, Sichuan Agricultural University, e-mail: yanyeruan@syau.edu.cn Wenjiang, Sichuan, China A. Zhang · S. Chen · D. Dang · C. Li · X. Dong · J. Fan · J. Nie  Y. Zhu · H. Cao · Y. Ruan  Fumeng County Modern Agricultural Development Shenyang City Key Laboratory of Maize Genomic Service Center, Fuxin, Liaoning, China Selection Breeding, Shenyang, Liaoning, China A. Zhang · Y. Liu · X. Zhang · J. Crossa  International Maize and Wheat Improvement Center (CIMMYT), El Batan, Texcoco, México Vol.: (0123456789) 1 154 Page 2 of 21 Euphytica (2022) 218:154 datasets, 11,865 SilicoDArT markers obtained from LD Linkage disequilibrium the DArT-seq platform, 7837 SNPs obtained from MAF Minor allele frequency the DArT-seq platform, and 91,003 SNPs obtained MAS Marker-assisted selection from the GBS platform, were used for GS analy- MRD Modified Rogers’ distance sis after quality control. The results of phylogenetic NCCP Northeast China core population trees showed that broad genetic diversity existed in NJ Neighbor-Joining the NCCP population. Genomic prediction results PCA Principal component analysis showed that the average prediction accuracies esti- QTL Quantitative trait loci mated using the DArT SNP dataset under the two- rrBLUP Ridge regression BLUP fold cross-validation scheme were 0.27, 0.19, and SNP Single nucleotide polymorphism 0.33, for ER, SPH, and GY, respectively. The result SPH Seedling plant height of SilicoDArT is close to the SNPs from DArT-seq, TP Training population those were 0.26, 0.22, and 0.33. For the trait with TPS Training population size lower heritability, the prediction accuracy can be improved using the dataset filtered by linkage dis- equilibrium. For the same trait, the prediction accura- Introduction cies estimated with two DArT marker datasets were consistently higher than that estimated with the GBS Maize (Zea mays L.) is an important source of food, SNP dataset under the same genotyping cost. The animal feed, and energy because of its high yield prediction accuracy was improved by controlling potential (Jiang et  al. 2018). Consequently, poor population structure and marker quality, even though maize production caused by abiotic stresses has a tre- the marker density was reduced. The prediction accu- mendously detrimental impact on global food secu- racies were improved by more than 30% using the rity. Seed germination and seedling establishments significant-associated SNPs. Due to the complexity are the initial stages of maize growth, which affect the of drought tolerance under the natural stress environ- seedling emergence rate, uniformity, and robustness, ments, multiple years of data need to be accumulated and then determine maize yield potential (Tian et al. to improve prediction accuracy by reducing genotype- 2014). However, these initial stages are frequently by-environment interaction. Modeling genotype- subjected to severe drought stress worldwide due to by-environment interaction into genomic prediction global warming, which decreases the emergence rate needs to be further developed for improving drought and retards seedling growth. Therefore, the develop- tolerance in maize. The results obtained from the pre- ment of drought-tolerant maize hybrids is essential to sent study provides valuable pathway for improving address the global temperature rise and end the global drought tolerance in maize using GS. food insecurity. The selection of elite drought-tolerant maize inbred Keywords Maize (Zea mays L.) · GBS · DArT · lines is the basis of the development of drought-tol- GS · Drought stress erant maize hybrids. However, traditional methods for selecting inbred lines are time-consuming, labor- Abbreviations intensive, and inefficient. In recent decades, with the ANOVA A nalysis of variance development of high-throughput and inexpensive BLUE Best linear unbiased estimate genotyping technologies, genomic selection (GS) BP Breeding population has become a practical and effective tool in animal DArT Diversity arrays technology and plant breeding (Michel et  al. 2016; Zhang et al. DL Deep learning 2017b). The principle of GS is to establish the pre- ER Emergence rate diction model based on the genotypic and phenotypic GBS Genotyping-by-sequencing data from a training population (TP), which is used to GEBV Genomic estimated breeding values derive genomic estimated breeding values (GEBVs) GS Genomic selection for all the individuals in the breeding population (BP) GY Grain yield from their genomic profiles (Meuwissen et al. 2001), H2 Broad-sense heritability guiding breeders to select excellent individuals from Vol:. (1234567890) 1 3 Euphytica (2022) 218:154 Page 3 of 21 154 the population without phenotypic data. GS has been the field observations, is used to evaluate the reliabil- shown to improve the efficiency of recurrent selection ity of GS results (Combs and Bernardo 2013). A vari- of bi-parental populations (Beyene et al. 2015; Mass- ety of factors are known to influence GS prediction man et al. 2013; Môro et al. 2019; Vivek et al. 2017) accuracy, which includes training population size, and multi-parent populations (Zhang et al. 2017b) to relatedness between training and test individuals, reduce breeding cycle time and accelerate the genetic DNA marker type, marker quality and density, trait gain per unit cost and time. At present, GS has been heritability, statistical models, linkage disequilibrium widely used in maize for various traits, including (LD) between markers, and population structure, grain yield (Liu et  al. 2018), developmental traits etc. (Liu et al. 2018; Norman et al. 2018). The qual- (Cui et al. 2020; Zhang et al. 2019), abiotic and biotic ity and density of genomic markers are influenced by stress tolerances (Beyene et al. 2015; Cao et al. 2017; genotyping platforms. The ideal sequencing platform Vivek et al. 2017) and grain nutritional quality (Guo should be inexpensive, provide high-throughput and et al. 2020). Compared with the widely used marker- good genomic coverage, as well as be replicable and assisted selection (MAS) technology, the GS strategy stable. Genotyping-by-sequencing (GBS), which can has the advantage of without detection of quantitative simplify complex genomes, has now been used as a trait loci (QTL) in advance, and it can be used to pre- high-throughput and cost-effective tool for generating dict the performance of the breeding population with- high-density molecular markers in many crop species out genetic structure analysis (Bernardo 2016). More- (Elshire et  al. 2011). The Diversity Arrays Technol- over, GS utilizes all genes/loci affecting phenotypes, ogy (DArT) is a DNA hybridization-based molecu- including both major genes/loci and minor genes/loci lar marker technique, which can detect variation at (Crossa et al. 2017; Desta and Ortiz 2014; Xiao et al. numerous genomic loci without sequence information 2017). Therefore, it is a promising genomic tool for of reference genome. DArT marker has been used improving the complex quantitative traits. for the implementation of MAS in maize (Dos San- Genomic prediction (GP) can be implemented tos et al. 2016). In addition, DArT can be combined with statistical models that can estimate the marker with next-generation sequencing platforms known as effects accurately in a training population. Over DArT-seq, which permits simultaneous detection of the past 2 decades, various statistical models and several thousands of DNA polymorphisms by scor- machine learning methods have been proposed for the ing the presence or absence of DNA fragments in implementation of GP. Among the parametric mod- genomic representations and through a process of els, ridge regression best linear unbiased predictions reducing the genomic complexity (Kilian et al. 2012). (rrBLUP) and genomic-BLUP (GBLUP) are most The maize growth from seed germination stage commonly used in plant breeding (Heslot et al. 2012). to seedling establishment stage involves a variety of Bayes A assumes a prior distribution of effects with a complicated metabolic transformations, physiologi- higher probability of moderate to large effects, while cal activities, and cellular and tissue differentiation, Bayes B and Bayes Cπ assume some marker effects to which are regulated by many genomic loci, includ- be zero. Reproducing Kernel Hilbert Spaces (RKHS) ing both major and minor effect loci expressed under regression is equivalent to a GBLUP with a linear drought stress. These abundant genetic variations kernel (de los Campos et  al. 2009). Deep Learning exist in maize germplasms with different genetic (DL), a nonparametric model, is a type of machine backgrounds. In this study, three marker datasets of learning (ML) approach and a subfield of artificial a natural maize population obtained from the DArT intelligence (Montesinos-López et  al. 2021). For and GBS genotyping platforms, as well as the phe- most traits with different genetic architectures, the notypic data of emergence rate (ER) and plant height differences between different models are slight. rrB- (SPH) in the seedling stage, and grain yield (GY) in LUP is a classical method with lower computational the mature stage evaluated under two drought stress requirements and usually performs well across dif- environments, were used to perform GS analysis. The ferent traits (Kwong et al. 2017; Maulana et al. 2021; main objectives of the present study were to: (1) eval- Resende et al. 2012). uate the phenotypic variation of several target traits Prediction accuracy of GS, the correlation between in the NCCP in natural drought stress environments; the genomic estimated breeding values (GEBV) and (2) compare the genomic prediction accuracies of the Vol.: (0123456789) 1 3 154 Page 4 of 21 Euphytica (2022) 218:154 target traits of ER, SPH, and GY estimated from the the proportion of the number of surviving plants to three marker datasets obtained from the DArT and the total number of seeds planted. SPH was measured GBS genotyping platforms; (3) assess the prediction as the distance from the plant base to the highest of accuracies of three traits by improving the marker the seedling plant. The average dry grain weight from quality; and (4) improve the prediction accuracies by five plants was used to represent the GY for each controlling population structure and incorporating entry. trait-marker associations. The best linear unbiased estimator (BLUE) values and broad-sense heritability (H2) of ER, SPH, and GY were calculated within and across years using Materials and methods the META-R software version 6.04 (Alvarado et  al. 2020) (http:// hdl. handle. net/ 11529/ 10201). The linear Plant materials, phenotype evaluation, and mixed model used in META-R was implemented in heritability estimation the LME4 R-package, functions of lmer and REML were used to estimate the variance components. A collection of 391 inbred lines, designated as the northeast China core population (NCCP), were used Yijk =  + Geni + Envj + Geni × Envj + Repk + ijk for genomic prediction analyses in the current study. where Yijk is the trait of interest, μ is the overall mean, All these inbred lines were selected from China, Geni , Envj , and Geni × Envj are the effects of the i-th Mexico, and America, and they adapted to the spring genotype, j-th year, and i-th genotype by j-th year maize area of Northeast China. The drought stress interaction, respectively. All the effects are consid- tolerance of this population was determined in the ered random, which is assumed to be normally and Fuxin Mongolian autonomous county, Liaoning independently distributed, with mean zero and homo- Province, China (42°06′ N, 122°55′ E) in 2019 and scedastic variance σ2. Genotype and environment is 2020, respectively, where drought stress occurs fre- a fixed effect when calculating BLUEs, the covariate quently during the spring season. The drought trial is always a fixed effect. Repk is the effect of k-th rep- was planted on 12 May and harvested on 7 October lication. ijk is the random error term of the i-th geno- in 2019, and planted on 11 May and harvested on 9 type, j-th year, k-th replication. Traits with heritability October in 2020. During the maize growing season, below 0.05 within the individual environment were the average temperature was 21 °C both in 2019 and excluded from the across-location analysis. 2020. The average rainfall was 4  mm and 3  mm in Broad-sense heritability ( H2) based on the entry 2019 and 2020, respectively. The highest tempera- means within the trial was estimated as follows: ture was above 35 °C in both years, and the precipita- tion was less than 20 mm per month, causing serious 2 g drought stress in the seedling stage (Fig.  1). Irriga- H2 = 2 e 2 g 2 tion was not performed before planting, and plant- + + g nEnvs nEnvs×nreps ing was scheduled before precipitation according to the weather forecast for getting good germination. A where 2 , 2 , and 2 are the genotypic variance, error g ge completely randomized block design with three repli- variance, and genotype-by-environment interaction cations was applied in each environment. The inbred variance, respectively, and nreps and nEnvs are the lines were sown using one seed per hole in a single numbers of replications and environments, respec- plot, 3 m long, with plant spacing of 10 cm and row tively (Carena et al. 2010). spacing of 60  cm. Target traits of emergence rate (ER), seedling plant height (SPH) and grain yield Genotyping and quality control (GY) were evaluated to represent the drought toler- ance of the tested plant materials. The target traits of Leaf samples of three individual plants were collected ER and SPH were measured at the seedling stage, i.e. for each inbred line during the seeding stage for DNA 20 days after planting. While GY was measured at the extraction with a CTAB procedure, and genotyp- maturity stage, the moisture content was measured by ing was carried out by DArT (https:// www. diver sitya LDS-1G Grain Moisture Meter. ER was measured as rrays. com/) and GBS (Elshire et al. 2011) platforms. Vol:. (1234567890) 1 3 Euphytica (2022) 218:154 Page 5 of 21 154 Fig. 1 Distribution of temperature and rainfall in maize planting plots of Fuxin Mongolian Autono- mous County in 2019 and 2020. The horizontal axis shows the date from sowing to harvest. The panels of a and c are temperature (°C) data, including maximum temperature (tem_max), minimum temperature (tem_min), and average temperature (tem_avg). The panels of b and d are rainfall data, and the unit is mm Vol.: (0123456789) 1 3 154 Page 6 of 21 Euphytica (2022) 218:154 Genotyping of all inbred lines with the GBS were generated by DArT-seq: SilicoDArT and SNP. method was performed at Wuhan medical laboratory, The SilicoDArT dataset is the presence/absence vari- BGI Co., Ltd, following a protocol widely used by the ation of the tag sequences, while the SNP dataset maize research community (Elshire et al. 2011). The was always used for genetics analysis and molecular genomic DNA was digested with the ApeKI restric- marker-assisted selection in previous studies. The tion enzyme, and a DNA library was constructed SNP dataset was obtained by mapping the SNPs to the in 96-plex and sequenced on Illumina-hiseq XTen B73 RefGen_v4 reference genome using the BLAST sequencing system. Each sample obtained an average tool, and some of the fragments in SilicoDArT were of about seven hundred and fifty thousand reads. The not able to be aligned to the reference genome. The B73 RefGen_v4 was the reference genome, and BWA number of markers in SilicoDArT dataset was 62,794, software was applied to match all sequencing data to while the number of markers in the SNP dataset was the reference genome. Details in single nucleotide 39,659, and 547 of them were not able to be anchored polymorphism (SNP) calling and imputation have to any ten maize chromosomes (Table S1). been previously described (Cao et  al. 2017). Each In both DArT and GBS datasets, TASSEL 5 (Brad- line has an average of approximately 77,000 SNPs bury et al. 2007) was used to filter out markers with distributed on each chromosome. In total, 776,285 minor allele frequency (MAF) < 0.05 and a missing loci were aggregated to cover all SNPs, and 768,558 rate > 20%. The samples with a missing rate of more of them were mapped to the ten maize chromosomes than 20% were discarded. The number of lines in (Table S1), while 7727 could not be anchored to any DArT and GBS datasets is 379 and 378, respectively. chromosome (data not displayed). In total, the number of markers is 11,865 in the Sili- Genotyping of all the inbred lines with DArT- coDArT dataset, 7837 in the DArT SNP dataset, and seq method was carried out at the SAGA (https:// 91,003 in the GBS SNP dataset (Table 1). Imputation seeds ofdis covery. org/ about/ genot yping- platf orm/) was performed using the method of Cao et al. (2017). sequencing lab, jointly established by the DArT com- pany and CIMMYT. Two enzymes of PstI (CTG CAG SNP distribution across the whole maize genome ) and HpaII (CCGG) were used to digest the DNA samples to reduce the complexity of the genome. For The distribution of SNPs in the genome is one of the each 96 wells plate, 16% of the samples were repli- indicators of the quality of the genotyping platforms. cated to assess reproducibility (Pereira et  al. 2020). A package named RIdeogram (Hao et al. 2020), writ- After enzyme digestion, DNA from different samples ten in R programming (R Core Team, 2021), was was linked with barcodes of different base combina- used to assess the distribution of SNPs obtained from tions of sequences to construct a simplified sequenc- both the DArT-seq and GBS platforms. This analysis ing DNA library. Equimolar amounts of amplification was not applied to the SilicoDArT dataset, due to a products from each sample were pooled by plate and lack of information on physical positions on the refer- amplified by c-Bot (Illumina) bridge PCR, followed ence genome. The plot of the SNP distribution was by fragment sequencing on Illumina Hiseq 2500 drawn using the ideogram function. The customized (www. illum ina. com). The short fragment sequenc- R scripts to plot the heatmap of the SNP distribution ing technology (150 bp) and the simplified sequenc- ing DNA library of mixed samples was sequenced on a single lane (Kilian et  al. 2012) (https:// www. Table 1 The number of materials and sites, proportion hete- diver sitya rrays. com/). SNPs were called using the rozygous and MAF for markers of SilicoDArT, DArT and GBS DArTsoft analytical pipeline (http:// www. diver sitya datasets after filtering rrays. com/ softw are. html) (Chen et al. 2016). DArT’s SilicoDArT DArT GBS SNP development is different from GBS, which does not rely on the reference genome information, Number of Taxa 379 379 378 but mainly depends on the sequencing library data Number of Sites 11,865 7837 91,003 from the DArT company. All reads were aligned by Proportion Missing 0.07 0.07 0.069 sequence analysis based on maize tags of metage- Proportion Heterozygous 0.03 0.02 0.018 nome representation. Two types of marker datasets Average Minor Allele Frequency 0.24 0.24 0.23 Vol:. (1234567890) 1 3 Euphytica (2022) 218:154 Page 7 of 21 154 on maize chromosomes using the HapMap format random cross-validation analysis was implemented input dataset were provided (https:// aozha ngchi na. for all traits. For prediction, regression models were github. io/R/ chrom esome heatm apTool/ Chrom esome evaluated by the prediction accuracy, a Pearson’s cor- Heatm ap. html). relation between the GEBVs and the estimated breed- ing values in the testing population (Thavamanikumar Genomic prediction model et al. 2015). All predictions were repeated 100 times, and the mean value was taken as the average predic- The rrBLUP model assumes homogenous variance tion accuracy. of all markers and shrinks all marker effects equally to zero. rrBLUP is equivalent to BLUP and uses the Effect of marker quality on the estimation of genomic realized relationship matrix estimated from the mark- prediction accuracy ers. The genomic prediction analysis was performed with the ridge regression BLUP (rrBLUP) package For the same target trait, genomic prediction accuracy (Endelman, 2011). The mixed model is described as: estimated by the data from the drought stress condi- tion is always relatively low than that estimated by y = X + Zu +  the data from the optimal condition. Higher marker quality was proved to improve prediction accuracy. where y is the vector (n × 1) of BLUPs from the phe- As mentioned in in materials and methods section, notypic data analysis, X is a design matrix for fixed different levels of MAF and missing rates were fil- effects; ε is the vector (n × 1) of independently ran- tered in the DArT and GBS marker datasets to control dom errors with assumed distribution N (0, Iσ 2 ε ), Z the marker quality and quantity. is the design matrix (n × n) for random effects and u For evaluating the effect of linkage disequilib- is the vector of random effects with u ∼ N (0, Kσ 2 u ) rium (LD) among markers on the estimation of the and K was an a realized (additive) relationship matrix genomic prediction accuracy, neighbor SNPs not in computed from marker genotypes proposed by Van- a strong LD phase, i.e., LD values lower than a spe- Raden (2008). In addition, n is the number of indi- cific threshold (0.1 and 0.2) within an LD block, were viduals (Endelman 2011; Liu et  al. 2018). Variance excluded from the prediction analyses. TASSEL 5.0 components are estimated by REML using the spec- (Bradbury et al. 2007) was used to perform LD analy- tral decomposition algorithm of Kang et  al. (2008). sis, and LD decay plots were developed using the cus- The mixed.solve function, a linear mixed-model equa- tomized R scripts (https:// aozha ngchi na. github. io/R/ tion estimates marker effects and GEBVs. GEBVs LDdec ay/ LDdec ayPlo tTool. html). Genomic predic- are derived from the realized (additive) relationship tion accuracy was calculated from 100 times repeti- matrix of individuals calculated from marker geno- tion using twofold cross-validation. A twofold cross- types (Tecle et al. 2014). validation analysis was implemented for all traits. For prediction, regression models were evaluated by the Effect of training population size (TPS) on the prediction accuracy. All predictions were repeated estimation of genomic prediction accuracy 100 times, and the mean value was taken as the aver- age prediction accuracy. To evaluate the effect of TPS on the estimation of genomic prediction accuracy, the cross-validation Effect of genetic structure on the estimation of scheme was used to randomly generate different train- genomic prediction accuracy ing and prediction sets and assess the prediction accu- racy for different target traits in prediction sets. The It was shown that the relationship between train- training populations were extracted from the whole ing and prediction sets had a significant impact on population The TPS ranged from 10 to 90%, with an the estimation of prediction accuracy (Zhang et  al. interval of 10%. Each cross-validation scheme was 2017a). In the current study, NCCP consists of multi- repeated 100 times. For the same trait, the predic- ple heterotic groups, such as the Reid, Reid_Chinese, tion accuracies estimated from the DArT and GBS Lancaster, Tangsipingtou, Lvda Red Cob, Longdan, molecular marker datasets were compared. A gradient Jidan, CML, Mixed, etc. Some elite breeding lines Vol.: (0123456789) 1 3 154 Page 8 of 21 Euphytica (2022) 218:154 from Northeast China were also included in this estimate the Q value in this study. The MLM model is population, including Shen118, Shen3336, Liao6049, described as follows: Dan340, Ji818, etc. A phylogenetic tree was used to demonstrate the y = X + Qv + Zu +  genetic relationship among all the inbred lines used where X is the SNP marker matrix, Q and Z represent in NCCP. The genetic distances among all the inbred the subpopulation membership matrix and kinship lines were calculated in Bio-R (Biodiversity Analysis matrix respectively,  and v are the coefficient vec- with R) software developed by CIMMYT using the tors of SNP markers and subpopulation membership filtered DArT and GBS datasets’ (https:// hdl. han- respectively, u is the random genetic effect vector, and dlenet/11529/10820). The formula to compute the  is the random error vector, which is Var[] = I2 modified Rogers distance (MRD) used in the present e (Yu et al. 2006; Fan et al. 2016). study is as follows: The thresholds of significant markers were set � as P < 0.5, 0.1, 0.01, 0.001. The prediction accura- ∑L ∑n � �2 l p 1 lax − p l=1 a= lay cies were estimated using significant markers, equal MRDxy = 2L amount of non-significant markers, and the rest of non-significant markers. Genomic prediction accu- where plax is the estimated frequency of allele a, racy was calculated from 100 times repetitions using which is located in locus l, and genotype x; l is the twofold cross-validation. number of loci, nl is the number of alleles in the locus l. Markers from both platforms were used to create the phylogenetic tree. Based on the genetic distance Results matrix, the Neighbor-Joining (NJ) method of MEGA X (https:// www. megas oftwa re. net/) and FigTree Phenotypic data and correlation analysis (http:// tree. bio. ed. ac. uk/ softw are/ figtr ee/) software were used to create the phylogenetic trees (Saitou and The phenotypic variations and standard deviations Nei 1987). of the three traits in 2019, 2020, and across 2 years The principal component analysis (PCA) was per- were shown in Table  2. Sufficient genetic variations formed to assess the effects of genetic structure on for genomic selection were found under the drought the estimation of prediction accuracy. All the inbred environment. The phenotypic values of ER ranged lines in NCCP were divided into two subgroups. The from 0.16 to 1.00, with an average value of 0.81 in prcomp function from the stats package in R and the 2019; ranged from 0.32 to 0.90, with an average value plot function from the base package in R were applied of 0.70 in 2020; and ranged from 0.60 to 0.84 with to generate the PCA plot. Genomic prediction accu- an average value of 0.76 across 2 years. For SPH, the racy was estimated within each subgroup from 100 phenotypic values ranged from 9.14 to 20.64, with times repetitions using twofold cross-validation. an average value of 14.96 in 2019; ranged from 8.61 to 15.06 with an average value of 11.67 in 2020; and ranged from 12.00 to 14.48, with an average value Genomic prediction analysis with the significant trait of 13.36 across 2 years. The GY values ranged from associated markers 12.22 to 93.96, from 7.59 to 23.32, and from 18.32 to 38.19 in 2019, 2020, and across 2 years, respectively. For each trait, all SNPs were divided into the sig- The average value of GY was 40.00, 10.97, and 25.04 nificant marker and non-significant marker groups in 2019, 2020, and across 2 years, respectively. The according to the GWAS results. The GWAS was coefficient of variation (CV) in 2020 was greater than conducted by GAPIT (version 3) (Wang and Zhang that in 2019 for each trait (Table 2). Correspondingly, 2021)  using the MLM model to obtain the associa- the variance of the environment component was great, tion markers from GBS and DArT platforms. This low to moderate heritabilities for all the three traits model simultaneously incorporates population struc- were obtained in 2020 and across 2 years, and relative ture (Yu et al. 2006). The first three PCs were used to high heritabilities were observed in 2019 for all the Vol:. (1234567890) 1 3 Euphytica (2022) 218:154 Page 9 of 21 154 Table 2 Phenotypic performance, variance component and broad-sense heritability of three traits in the maize association popula- tion Traita Means ± SD Range Unit CV Variance componentb H2c Rep Genotype Environment Gen × Env 19ER 0.81 ± 0.16 0.16–1.00 13.69 0.03** 0.88 3 20ER 0.70 ± 0.11 0.32–0.90 24.29 0.02** 0.65 3 ER 0.76 ± 0.04 0.60–0.84 18.81 0.01** 0.01** 0.02** 0.32 3 19SPH 14.96 ± 1.78 9.14–20.64 cm 10.11 3.57** 0.82 3 20SPH 11.67 ± 1.15 8.61–15.06 cm 22.04 2.48** 0.53 3 SPH 13.36 ± 0.42 12.00–14.48 cm 15.43 0.67** 5.27** 2.39** 0.26 3 19GY 40.00 ± 14.91 12.22–93.96 g 29.36 267.16** 0.85 3 20GY 10.97 ± 2.65 7.59–23.32 g 119.21 28.66* 0.33 3 GY 25.04 ± 3.57 18.32–38.19 g 48.34 43.27** 420.80** 126.06** 0.33 3 ER BLUE of emergence rate across two environments, SPH BLUE of seedling plant height across two environments, GY BLUE of grain yield across two environments a 19ER: emergence rate (ER) was measured in 19FX (2019 Fuxin); 20ER: emergence rate (ER) was measured in 20FX (2020 Fuxin); 19SPH: seedling plant height (SPH) was measured in 19FX (2019 Fuxin); 20SPH: seedling plant height (SPH) was measured in 20FX (2020 Fuxin); 19GY: grain yield (GY) was measured in 19FX (2019 Fuxin); 20GY: grain yield (GY) was measured in 20FX (2020 Fuxin) b *Significant at P ≤ 0.05; **Significant at P ≤ 0.01 c Family mean-based broad-sense heritability three traits. The H2 of ER, SPH, and GY in 2019 was before filtering was observed at the MAF interval of 0 0.88, 0.82, and 0.85, respectively. In 2020, the H2 of to 0.05 for all the three marker sets. The MAF distri- ER, SPH, and GY was 0.65, 0.53, and 0.33, respec- butions in other intervals from 0.05 to 0.50 were rela- tively. Across 2 years, the H 2 of ER, SPH, and GY tively uniform. Average MAF after filtering was 0.24, was 0.32, 0.26, and 0.33, respectively (Table 2). The 0.24, and 0.23 for SilicoDArT, SNPs from DArT-seq, phenotypic data analysis based on BLUEs revealed and SNPs from GBS, respectively (Table  1). Con- that ER, SPH, and GY had normal distributions or sistent MAF distributions between the two genotyp- skewed normal distributions (Fig. 2). The phenotypic ing platforms potentially indicated that the quality of correlations among all the three traits were positive sequencing and molecular marker datasets generated and significant, and the correlation coefficients were by both genotyping platforms were high and reliable 0.36, 0.35, and 0.25 (P < 0.01) between ER and SPH, (Fig. 3a–c). between ER and GY, and between SPH and GY, Both in the SilicoDArT and DArT-seq SNP data- respectively. sets, distributions of missing rates showed a decreas- ing trend from the interval of 0–10% to the interval Quality control on genotypic data and marker of 90–100%, and close to 40% of molecular markers distribution had the missing rates less than 10%. In contrast, close to 40% of molecular markers in the GBS SNP dataset Two kinds of marker sets were obtained in the current had the missing rates more than 90%, and 10% of the research. The SilicoDArT markers were from DArT- GBS SNPs appeared in each interval of missing rate seq, and SNPs were from the GBS and DArT-seq. The of 0–10%, 10–20%, 20–30%, 30–40%, and 40–50% plots of MAF and missing rate distribution before fil- (Fig. 3d–f). The irregular distribution of missing rates tering for three marker sets were shown in Fig. 3. The in the GBS SNP dataset indicated that the quality of SilicoDArT markers had an average MAF of 0.10, this SNP dataset is not high. and the DArT-seq SNPs set had an average MAF of High-quality marker sets were obtained after filter- 0.11 before filtering. For the GBS SNP set, the aver- ing, and the average missing rate was 0.070, 0.073, age MAF before filtering was 0.11. The highest peak and 0.068 in the SilicoDArT dataset, DArT-seq SNP Vol.: (0123456789) 1 3 154 Page 10 of 21 Euphytica (2022) 218:154 markers at both ends of the chromosomes was high and decreased toward the centromere regions. The SNPs in both datasets were enriched at the ends of chromosomes 5 and 8. More uniform distribution was observed in the GBS SNP dataset, whereas the DArT- seq SNP dataset had higher marker density on the long arms of chromosomes 8 and 9. Uneven distribu- tions may potentially lead to a decrease in prediction accuracy. The phylogenetic tree of NCCP The phylogenetic trees were generated with the DArT-seq SNPs and GBS SNPs (Fig. 5). In total, the NCCP was divided into 13 groups, corresponding to the heterotic groups of NSS, SS, Huanglvxi, Lvxi, Reid, Mixed, France, PA, Jidan, Longdan, Lvda- honggu, Tangsipingtou and Unknown. The inbred Fig. 2 Frequency distributions and correlations of BLUE (best lines from CIMMYT are tropical material, only two unbiased linear estimator) as phenotype values were calcu- of them were regenerated in northeast China and lated from the maize seedling emergence rate (ER), seedling classified into the Mixed group. Some small groups plant height (SPH), and grain yield (GY) of maize measured in 19FX (2019 Fuxin) and 20FX (2020 Fuxin) under drought and inbred lines selected from the multiple paren- environment. The plots on the diagonal represent the pheno- tal lines also were classified as the unknown group. typic distribution frequency of ER, SPH, GY. The values above DArT SNPs produced a better phylogenetic tree, and the diagonal line are the Pearson’s correlation coefficients heterotic groups of Reid, Lvxi, and Huanglvxi were between every two traits. The values below the diagonal line are scattered plots for every two traits. *represents a significant separated clearly. However, the phylogenetic tree pro- difference at the 0.05 level; **represents a significant differ- duced with the GBS SNPs did not separate the heter- ence at the 0.01 level otic groups clearly in the present study. Some inbred lines from local breeders lost their source informa- tion, so they were classified into the unknown group. dataset, and GBS SNP dataset, respectively (Table 1). Some materials were collected from third parties with Heterozygosity is usually a critical index for marker inaccurate source information, it is not easy to clas- filtering, especially for inbred lines. In this research, sify them into heterotic groups properly. For some the average heterozygosity rates were very low in all inbred lines, pedigree information conflicted with the the three marker datasets, i.e., less than 0.01 before phylogenetic tree results generated with molecular filtering, and less than 0.03 after filtering. It indicated markers, it is hard to generate a perfect phylogenetic that the breeding lines in the NCCP are homozygous tree in a practical breeding program. (Table S1 and Table 1). After filtering, the number of markers decreased by The genomic prediction accuracy estimated from the 81%, 80%, and 88% in the SilicoDArT dataset, DArT- DArT and GBS markers seq SNP dataset, and GBS SNP dataset, respectively (Table S1 and Table 1). The missing rate was mainly Genomic prediction accuracies were estimated for responsible for the reduction of markers, the missing all the three target traits with the BLUE values and rate before filtering was 25%, 24%, and 56% in the the SNP datasets filtered with different parameters SilicoDArT dataset, DArT-seq SNP dataset, and GBS (Fig.  6). Using DArT SNP datasets, the predic- SNP dataset, respectively (Table S1). tion accuracy increased continuously as the TPS The markers of DArT-seq SNP dataset and increased across all three traits evaluated under a nat- GBS SNP dataset were distributed across the entire ural drought stress environment (Fig. 6). The predic- maize genome (Fig.  4). Generally, the density of tion accuracy slightly increased across all three traits Vol:. (1234567890) 1 3 Euphytica (2022) 218:154 Page 11 of 21 154 Fig. 3 Distribution of MAF and missing rate before filtering rate distribution of the SilicoDArT marker dataset; e missing in three marker datasets: a MAF distribution of the SilicoDArT rate distribution of the DArT marker dataset; f missing rate dis- marker dataset; b MAF distribution of the DArT marker data- tribution of the GBS marker dataset set; c MAF distribution of the GBS marker dataset; d missing Fig. 4 The markers density thermogram of DArT and GBS datasets. a The markers density thermogram of DArT’s SNP; b the markers density thermogram of GBS when the TPS increased from 50 to 90%. The smallest traits of SPH and GY, 40% of the training population standard error was observed in prediction accuracy size gained the smallest standard error, which indi- when 50% to 60% of the training population size was cated that 40 to 60% of the total genotypes assigned used to predict the target traits of ER. For the target as the training set could achieve good prediction Vol.: (0123456789) 1 3 154 Page 12 of 21 Euphytica (2022) 218:154 Fig. 5 Phylogenetic Trees for SNP markers from DArT-seq and GBS platforms. The letters a and b represent NCCP’s phylogenetic trees from DArT’s SNPs and GBS’s SNPs accuracy in the DArT SNP markers. For the GBS accuracies using the SNPs from the DArT-seq were SNP markers, a similar trend was observed for GY. 0.27, 0.19, and 0.33 for ER, SPH, and GY traits, With the increase of the TPS, the average and median respectively. The prediction result estimated with Sil- prediction accuracies showed slightly different for the icoDArT markers was similar to that estimated with target traits of ER and SPH (Fig. 6). The prediction DArT-seq SNPs, which were 0.26, 0.22, and 0.33 accuracies estimated in a single year were similar to for ER, SPH, and GY traits, respectively. While, the those using BLUEs across 2 years (Figs. S1, S2). The prediction accuracies estimated from the GBS SNP heritabilities of the three traits were higher in 2019, markers were − 0.02, − 0.06, and 0.20 for the traits of slightly improving the prediction accuracy. Predic- ER, SPH, and GY, respectively. tion accuracies estimated with SilicoDArT mark- ers were consistent with those estimated with DArT Different prediction accuracies affected by marker SNP markers (Fig. S3). Therefore, 50% of TPS was quality selected for further analysis. Genomic prediction accuracies estimated using For all the three traits, the results of prediction accu- twofold cross-validation (50% of TPS) for all three racies estimated from the three molecular marker traits were shown in Fig.  7. The prediction accura- datasets filtered with the different combinations of cies estimated with the markers from different geno- MAF and missing rate were presented in Table  3, typic platforms showed significant differences for Tables S2, and S3. For the ER trait, the average pre- three interesting traits. For traits of SPH, the predic- diction accuracies range from 0.19 to 0.29, and the tion accuracies estimated with the SilicoDArT data- highest prediction accuracy was observed using 2,762 set and DArT-seq SNP dataset showed significant SilicoDArT markers. Interestingly, increasing the fil- differences, which means alignment of the markers tering of missing rates improved the prediction accu- to the reference genome resulting in a loss of predic- racy, while MAF had little effect on the estimation tion accuracy. For three traits, the prediction accura- of the prediction accuracy. Similar trends were also cies estimated from the DArT-seq SNPs were sig- observed for ER and GY. In most cases, the prediction nificantly higher than those estimated from the GBS accuracies did not show significant differences under SNPs (P < 0.01). Among the three traits, the pre- different combinations of MAF and missing rate. The diction accuracy of SPH was the lowest in both the LD decay distances estimated with the DArT SNPs DArT and GBS markers, which was consistent with and GBS SNPs were 6.16 kb and 3.83 kb, respectively the lowest heritability of SPH. The average prediction (Fig. S4). The result of prediction accuracy estimated Vol:. (1234567890) 1 3 Euphytica (2022) 218:154 Page 13 of 21 154 Fig. 6 Genomic prediction accuracies of ER, SPH, and GY in markers; b ER estimated with GBS markers; c SPH estimated NCCP across two natural drought conditions, when the train- with DArT markers; d SPH estimated with GBS markers; e ing population size was set from 10 to 90% of total genotypes, GY estimated with DArT markers; f GY estimated with GBS with an interval of 10%. Panel a ER estimated with DArT markers Vol.: (0123456789) 1 3 154 Page 14 of 21 Euphytica (2022) 218:154 distribution of two subgroups using PC1 and PC2 estimated with the DArT and GBS markers, and the first two PC explained 35% and 31% of the genetic variation, respectively (Fig.  8). The PCA results estimated with the markers from the two platforms were very similar. Finally, 180 inbred lines weres classified into Subgroup 1, and 199 inbred lines were classified into Subgroup 2. For all the traits, the prediction accuracies esti- mated within each subgroup were shown in Table 5. Using the DArT SNP markers, the prediction accuracies in Subgroup 1 with 50% of TPS were 0.17, 0.11, and 0.11 for ER, SPH, and GY traits, respectively. For Subgroup 2, the prediction accu- racies were 0.29, 0.20, and 0.33 for ER, SPH and Fig. 7 Genomic prediction accuracies of ER, SPH, and GY GY traits, respectively. The prediction accuracies estimated with the SilicoDArT, DArT and GBS marker data- in Subgroup 1 were significantly lower than those sets under twofold cross-validation. The significant difference was analyzed by one-way ANOVA of SPSS using the whole population for all three traits, whereas the prediction accuracies in Subgroup 2 were higher than those using the entire population with the different DArT/GBS marker datasets filtered for all three traits. For the GBS markers, no dif- by various LD values was presented in Table 4. Filter- ferences were observed in the prediction accuracy ing with different LD values caused different marker between the two subgroups. densities, and the number of markers decreased as the values increased. SPH obtained the highest aver- Prediction accuracy estimated with the trait-marker age prediction accuracy in the LD0.1 of DArT with associations 800 SNPs, followed by the accuracy in the DArT- LD0.2 with 2681 markers and without filtering with To improve the prediction accuracy under the condi- 7,837 SNPs. In contrast, the DArT-LD0.1 in ER and tion of natural drought in the maize seedling stage, GY traits were found to have the highest accuracies. the trait-marker associations detected in GWAS were Using the GBS markers, the prediction accuracies for selected for predictions using different significance traits of ER and GY estimated from 2333 SNPs fil- levels (P < 0.5, 0.1, 0.01, 0.001). The same amount tered with LD value (r2 = 0.1) were higher than those of randomly selected markers and all non-significant estimated from all the 91,003 SNPs without filtering. markers excluding significant markers were used as Slight improvements were observed, when a smaller controls for comparison analysis. The results showed number of high-quality markers filtered with different that the prediction accuracies estimated with sig- parameters was used for prediction, but the best fil- nificant markers were significantly higher than those tering parameters for improving prediction accuracy estimated from the same amount of random markers varied in different traits and genotyping platforms. for all the traits evaluated under drought tolerance environment at the seedling stage. When SNPs are Prediction accuracy estimated within subgroups from GBS, the prediction accuracies estimated with 46,168, 9074, 872 and 72 significant SNPs in P < 0.5, In general, the population with close relationships 0.1, 0.01 and 0.01were 0. 36, 0. 59, 0. 61 and 0. 57 is more likely to obtain high prediction accuracy. for ER trait, respectively. For SPH and GY traits, both Effects of controlling the population structure on had the same trend as ER, and the maximum predic- improving the prediction accuracy were assessed. tion accuracy was obtained when P < 0.01, which For keeping enough lines in each subgroup, NCCP were 0.62 and 0.63 respectively (Fig. 9). When SNPs was divided into two subgroups, according to the were from DArT-seq, 3858, 775, 90 and 11 signifi- population structure result. The PCA plots show the cant SNPs were selected for prediction for ER trait Vol:. (1234567890) 1 3 Euphytica (2022) 218:154 Page 15 of 21 154 Table 3 Genomic prediction accuracies of ER, SPH, and GY in NCCP under natural drought conditions, estimated from the Sili- coDArT marker datasets with different levels of quality filtered with missing rate and MAF Missing rate MAF Number of mark- Prediction accuracy ers ER SPH GY 0% 0.1 1356 0.24 0.23 0.31 0.2 926 0.22 0.22 0.32 0.3 583 0.19 0.22 0.29 0.4 265 0.21 0.22 0.31 10% 0.1 6484 0.25 0.22 0.34 0.2 4450 0.25 0.21 0.33 0.3 2838 0.25 0.22 0.33 0.4 1379 0.26 0.21 0.33 20% 0.1 9483 0.27 0.22 0.33 0.2 6376 0.26 0.21 0.33 0.3 3934 0.26 0.23 0.34 0.4 1851 0.27 0.22 0.31 30% 0.1 12,615 0.26 0.23 0.33 0.2 8571 0.26 0.22 0.33 0.3 5451 0.27 0.23 0.33 0.4 2650 0.26 0.21 0.33 40% 0.1 14,726 0.28 0.23 0.33 0.2 9726 0.27 0.22 0.33 0.3 5972 0.26 0.21 0.33 0.4 2762 0.29 0.23 0.35 in P < 0.5, 0.1, 0.01 and 0.001, and their predic- without using greenhouses. In the current study, the tion accuracies reached to 0.54, 0.71, 0.66 and 0.53, three traits have found enough phenotypic variation respectively. The same trends were also observed in our NCCP for ER, SPH, and GY traits. We trans- for SPH and GY traits, and the prediction accuracies formed the data to the BLUEs (best unbiased linear were the highest when P < 0.1, i.e. 0.75 for SPH, and estimator) as phenotype values for the rrBLUP model 0.60 for GY (Fig S5). assumptions (Fig. 2). The results of correlation anal- ysis show that ER and SPH potentially impact GY under a drought environment. The phylogenetic tree Discussion shows our population has a rich genetic variation for improving drought resistance, although the two phy- Drought tolerance, especially in the seedling stage, logenetic trees are not the same. is a complex and inherent trait of maize (Wang et al. GS is an effective breeding tool for improving 2016). The global climate has changed recently, complex traits in maize (Crossa et al. 2014). The GS resulting in drastic fluctuations in rainfall patterns can accelerate the genetic gain per unit time and unit and increasing temperature. Sudden climate changes cost by reducing the selection cycle time and the phe- can cause significant economic losses worldwide. notyping cost when the prediction accuracy is high. There are signs of grain yield stagnation in maize, In the present study, a natural maize population geno- especially in drought-stressed and semi-arid regions. typed with GBS and DArT markers were used to esti- Fuxin County is a region providing a natural test site mate the genomic prediction accuracies of ER, SPH, located in the main maize producing area of North- and GY under drought stress environments. Results east China that has suffered years of drought. In this indicated that the prediction for the inbred line per- region, we can deploy more large-scale experiments mance under a drought environment is tough through Vol.: (0123456789) 1 3 154 Page 16 of 21 Euphytica (2022) 218:154 Table 4 Genomic prediction accuracies of ER, SPH, and GY traits were more than 0.8 in 2019, but the average in NCCP under natural drought conditions, estimated from the prediction accuracies were only slightly higher than DArT and GBS marker datasets with different levels of quality filtered with linkage disequilibrium (LD) that in 2020 and across 2 years. This implied that the prediction of drought resistance needs to accumu- Number of Prediction accuracy late more year data to improve the heritability across markers ER SPH GY years and achieve stable phenotype. In the same trait, the prediction accuracy esti- DArT-LD0.1 800 0.25 0.26 0.31 mated from the DArT markers was higher than that DArT-LD0.2 2681 0.27 0.24 0.33 estimated from the GBS marker dataset, and the dif- DArT 7837 0.27 0.19 0.33 ference was significant. Both DArT and GBS are cost GBS-LD0.1 2333 0.02 − 0.07 0.23 efficient and high-throughput genotyping platforms GBS-LD0.2 8116 0.01 − 0.07 0.20 that can be used to implement GS. The GBS markers GBS 91,003 − 0.02 − 0.06 0.20 were implemented successfully in maize to improve DArT/GBS: The markers with an MAF < 0.05 and a missing multiple traits with different levels of genetic com- rate > 20% were filtered out plexity (Crossa et al. 2013; Wang et al. 2020). Com- LD0.2: On the basis of DArT/GBS filtering, it was filtered out according to linkage disequilibrium r 2 ≥ 0.2 paratively, only a few studies were reported using DArT markers to perform GS in wheat (Crossa et al. LD0.1: On the basis of DArT/GBS filtering, it was filtered out according to linkage disequilibrium r 2 ≥ 0.1 2016; Liu et al. 2020). We spent a similar amount of money on two genotyping platforms, about $30 per sample, to control the budget within an acceptable both GBS and DArT genotyping platforms. The aver- range for breeders. This price can obtain DArT mark- age prediction accuracies of ER, SPH and GY in the ers with normal quality, but a little bit lower quality panel estimated using SNPs of DArT were 0.27, 0.19 for GBS markers. That means the genome coverage and 0.33, respectively, SilicoDArT markers were from GBS has not reached 1×, which is probably the slightly higher than the DArT SNPs; while the pre- reason why the prediction accuracies are not suit- diction accuracies with GBS markers were 0.14, and able when using GBS markers in this research. Our 0.20 (Fig. 7). Low prediction accuracy may be mainly research indicated that DArT genotyping platform is affected by the quality of markers and heritabilities. more suitable than GBS when working with a lower Our heritabilities and prediction accuracies for the budget. GS using DArT markers is also being imple- GY trait were within the range of the previous study mented in CIMMYT maize breeding programs for that used 22 populations under water stress environ- improving grain yield and the other major agronomic ments (Zhang et  al. 2017a). Interestingly, unlike traits. under normal conditions, the heritabilities of the three Fig. 8 The first two principal components of DArT and GBS datasets. a The first two principal components of DArT. b The first two principal components of GBS. Subgroup 1 is marked with black and subgroup 2 is marked with red Vol:. (1234567890) 1 3 Euphytica (2022) 218:154 Page 17 of 21 154 Table 5 Prediction accuracies under different subgroups of DArT and GBS Trait DArT GBS All Subgroup1 Subgroup2 All Subgroup1 Subgroup2 ER 0.27 0.17 0.29 − 0.02 0.00 − 0.01 SPH 0.19 0.11 0.20 − 0.06 − 0.09 − 0.07 GY 0.33 0.11 0.33 0.20 0.21 0.22 This study showed that the average prediction doesn’t always work, and may be due to the fact accuracy was improved along with the TPS increas- that some complex traits require markers with spe- ing in all the traits in both genotyping platforms. In cific effects. Also, there is a tradeoff between the contrast, the standard error was at first raised and number of markers and marker quality or popula- then decreased as the TPS increased (Fig.  6). Rela- tion structure because marker quality decreases as tively high prediction accuracies with the slightest the number of markers increase in a specific marker standard error were observed in all traits when 50% dataset. This study suggests that genomic prediction to 60% of the total genotypes were used as a train- can also be performed using high-quality and low- ing set. Results of this study are consistent with a density markers in the initial breeding stage under previous report (Guo et al. 2020) that can be applied the drought environment, especially when heritabil- in the practical breeding program. The marker den- ity is low. sity is an essential effect on prediction accuracy if The low-cost DArT platform increases the pos- population sets and heritabilities are the same. Our sibility of integrating genomic selection strate- research found that the prediction accuracy of about gies into practical breeding programs for small 8000 markers from DArT is significantly higher than breeding companies and the private sector. The that of 90,000 markers from GBS, which emphasized SilicoDArT markers are possible to capture lost the importance of marking quality and indicated effects using a single reference genome for a 8000 markers is enough to implement genomic pre- diversity-rich population, which is essential for diction in maize. Low-density marker populations challenging drought tolerance prediction. The appeal to GS because they decrease multicollinear- key to the genomic prediction of maize drought ity and computational time consumption and allow resistance is to improve heritability and stabilize more individuals to be genotyped for the same cost. the phenotype. Natural drought experimental sites However, this prediction was made in a breeding pro- can increase the heritability of planting materi- gram’s initial stage, and higher marker density could als by accumulating data for many years without play a key role in the following breeding cycle for the a greenhouse. Fuxin county provides a chance to long term. As a mature genotyping platform, GBS study genomic prediction for drought tolerance. markers from the sufficient sequencing depth have a The SNPs significantly associated with traits can superb performance on genomic prediction (Liu et al. remarkably improve the prediction accuracy to 2021). The higher marker density of GBS also has a drought tolerance by removing low effect or inva- high application prospect in the multi-cycle breeding lid markers. This result is consistent with a previ- process. ous study (Cerrudo et al. 2018). The GP accuracies We tried to increase the prediction accuracy obtained from the marker-trait associated SNPs by improving the marker quality and controlling were relatively higher than those obtained from the the population structure, but at the same time, genome-wide SNPs for most of the target traits, by the marker density was also reduced. Our results about 5–30% (Yuan et  al. 2019).New models for show that the prediction accuracy was improved integrating genotype-by-environment interaction in some cases (Tables  3, 4, 5), even if the density into genomic prediction needs to be further devel- of the marker was reduced. However, this strategy oped to improve the selection of maize drought Vol.: (0123456789) 1 3 154 Page 18 of 21 Euphytica (2022) 218:154 Fig. 9 Prediction accuracy of GS under different situations and number of markers at different significance levels using the SNP of GBS. “SM” represents the signifi- cance markers, “ENSM” represents the randomly selected same amount of non-significant markers with the significance mark- ers, “NSM” represents all non-significant markers excluding significant mark- ers from genotype datasets and “ALL” represents the prediction of all markers under 60% of the training population. a and b respec- tively represent the predic- tion accuracy obtained by GS and number of markers at different significance levels by GWAS of ER trait. c and d respectively represent the prediction accuracy obtained by GS and number of markers at different significance levels by GWAS of SPH trait. e and f respectively represent the prediction accuracy obtained by GS and number of markers at different sig- nificance levels by GWAS of GY trait tolerance, although the genomic prediction using XD and JF revised the manuscript. All authors have read and significant markers has dramatically promoted the approved the final version of the manuscript. prediction accuracy. Funding This work was supported by Shanghai Agri- culture Applied Technology Development Program, China Author’s contributions AZ, SY, and ZC collected materials; (Z20190101), the National Science Foundation for Young Sci- AZ, SC, DD analyzed data; AZ, SC, CL, JN, XD, YG, YZ and entists of China (31801442), the CIMMYT-China Specialty JF carried out the field experiments; HZ and YR designed the Maize Research Center (KF201802), the Natural Sciences study; SC, AZ and YL wrote the manuscript; XZ, JC, LZ, YR, Vol:. (1234567890) 1 3 Euphytica (2022) 218:154 Page 19 of 21 154 Foundation of Liaoning Provincial Department of Education Chen J, Zavala C, Ortega N, Petroli C, Franco J, Burgueño J, (LJKZ0657). Costich DE, Hearne SJ (2016) The development of qual- ity control genotyping approaches: a case study using Data availability The datasets generated during and/or ana- elite maize lines. PLoS ONE 11:e0157236 lysed during the current study are available from the corre- Combs E, Bernardo R (2013) Accuracy of genomewide sponding author on reasonable request. selection for different traits with constant population size, heritability, and number of markers. Plant Genome Declarations 6:plantgenome2012-11 Crossa J, Beyene Y, Kassa S, Pérez P, Hickey JM, Chen C, de Conflict of interest The authors declare that they have no los Campos G, Burgueño J, Windhausen VS, Buckler E, conflict of interest. Jannink J-L, Lopez Cruz MA, Babu R (2013) Genomic prediction in maize breeding populations with genotyp- Open Access This article is licensed under a Creative Com- ing-by-sequencing. G3 Genes Genomes Genet 3:1903 mons Attribution 4.0 International License, which permits Crossa J, Pérez P, Hickey J, Burgueño J, Ornella L, Cerón- use, sharing, adaptation, distribution and reproduction in any Rojas J, Zhang X, Dreisigacker S, Babu R, Li Y, Bon- medium or format, as long as you give appropriate credit to the nett D, Mathews K (2014) Genomic prediction in CIM- original author(s) and the source, provide a link to the Crea- MYT maize and wheat breeding programs. Heredity tive Commons licence, and indicate if changes were made. The (edinb) 112:48–60 images or other third party material in this article are included Crossa J, Jarquín D, Franco J, Pérez-Rodríguez P, Bur- in the article’s Creative Commons licence, unless indicated gueño J, Saint-Pierre C, Vikram P, Sansaloni C, Petroli otherwise in a credit line to the material. If material is not C, Akdemir D, Sneller C, Reynolds M, Tattaris M, included in the article’s Creative Commons licence and your Payne T, Guzman C, Peña RJ, Wenzl P, Singh S (2016) intended use is not permitted by statutory regulation or exceeds Genomic prediction of gene bank wheat landraces. G3 the permitted use, you will need to obtain permission directly Genes Genomes Genet 6:1819 from the copyright holder. To view a copy of this licence, visit Crossa J, Pérez-Rodríguez P, Cuevas J, Montesinos-López http:// creat iveco mmons. org/ licen ses/ by/4. 0/. O, Jarquín D, de los Campos G, Burgueño J, González- Camacho JM, Pérez-Elizalde S, Beyene Y, Dreisigacker S, Singh R, Zhang XC, Gowda M, Roorkiwal M, Rutkoski J, Varshney RK (2017) Genomic selection in plant breed- References ing: methods, models, and perspectives. Trends Plant Sci 22:961–975 Alvarado G, Rodríguez FM, Pacheco A, Burgueño J, Crossa Cui ZH, Dong HX, Zhang A, Ruan YY, He Y, Zhang ZW J, Vargas M, Pérez-Rodríguez P, Lopez-Cruz MA (2020) Assessment of the potential for genomic selec- (2020) META-R: a software to analyze data from multi- tion to improve husk traits in maize. G3-Genes Genomes environment plant breeding trials. Crop J 8:745–756 Genet 10:g3.401600.402020 Bernardo R (2016) Bandwagons I, too, have known. Theor de los Campos G, Gianola D, Rosa GJM (2009) Reproducing Appl Genet 129:2323–2332 Kernel Hilbert spaces regression: a general framework for Beyene Y, Semagn K, Mugo S, Tarekegne A, Babu R, Meisel genetic evaluation. J Anim Sci 87:1883–1887 B, Sehabiague P, Makumbi D, Magorokosho C, Oikeh Desta ZA, Ortiz R (2014) Genomic selection: genome- S, Gakunga J, Vargas M, Olsen M, Prasanna BM, Ban- wide prediction in plant improvement. Trends Plant Sci ziger M, Crossa J (2015) Genetic gains in grain yield 19:592–601 through genomic selection in eight bi-parental maize Dos Santos JPR, Pires LPM, de Castro Vasconcellos RC, populations under drought stress. Crop Sci 55:154–163 Pereira GS, Von Pinho RG, Balestre M (2016) Genomic Bradbury PJ, Zhang ZW, Kroon DE, Casstevens TM, Ram- selection to resistance to stenocarpella maydis in maize doss Y, Buckler ES (2007) TASSEL: software for asso- lines using DArTseq markers. BMC Genet 17:86 ciation mapping of complex traits in diverse samples. Elshire RJ, Glaubitz JC, Sun Q, Poland JA, Kawamoto K, Bioinformatics 23:2633–2635 Buckler ES, Mitchell SE (2011) A robust, simple genotyp- Cao SL, Loladze A, Yuan Y, Wu YS, Zhang A, Chen JF, ing-by-sequencing (GBS) approach for high diversity spe- Huestis G, Cao JS, Chaikam V, Olsen M, Prasanna BM, cies. PLoS ONE 6:e19379 San Vicente F, Zhang XC (2017) Genome-wide analysis Endelman JB (2011) Ridge regression and other kernels for of tar spot complex resistance in maize using genotyp- genomic selection with R package rrBLUP. Plant Genome ing-by-sequencing SNPs and whole-genome prediction. 4:250–255 Plant Genome 10:plantgenome2016-10 Fan Y, Zhou G, Sergey S, Chen Z, Cai S, Li C, Zhou M (2016) Carena MJ, Hallauer AR, Filho JM (2010) Quantitative Genome-wide association study reveals a new qtl for genetics in maize breeding. Springer, New York salinity tolerance in barley (Hordeum vulgare L.). Front Cerrudo D, Cao S, Yuan Y, Martinez C, Suarez EA, Babu Plant Sci 7:946 R, Zhang TS (2018) Genomic selection outperforms Guo R, Dhliwayo T, Mageto EK, Palacios-Rojas N, Lee M, marker assisted selection for grain yield and physiologi- Yu D, Ruan Y, Zhang A, San Vicente F, Olsen M, Crossa cal traits in a maize doubled haploid population across J, Prasanna BM, Zhang LJ, Zhang XC (2020) Genomic water treatments. Front Plant Sci 9:366 prediction of kernel zinc concentration in multiple maize Vol.: (0123456789) 1 3 154 Page 20 of 21 Euphytica (2022) 218:154 populations using genotyping-by-sequencing and repeat Môro GV, Santos MF, de Souza Júnior CL (2019) Comparison amplification sequencing markers. Front Plant Sci 11:534 of genome-wide and phenotypic selection indices in maize. Hao Z, Lv D, Ge Y, Shi J, Weijers D, Yu G, Chen J (2020) Euphytica 215:76 RIdeogram: drawing SVG graphics to visualize and map Norman A, Taylor J, Edwards J, Kuchel H (2018) Optimising genome-wide data on the idiograms. PeerJ Comput Sci genomic selection in wheat: effect of marker density, popu- 6:e251 lation size and population structure on prediction accuracy. Heslot N, Yang H-P, Sorrells ME, Jannink J-L (2012) Genomic G3 Genes Genomes Genet 8:2889–2899 selection in plant breeding: a comparison of models. Crop Pereira WJ, de CastroRodriguesPappas M, Grattapaglia D, Pap- Sci 52:146–160 pas GJ (2020) A cost-effective approach to DNA methyla- Jiang B, Wang P, Zhuang S, Li MS, Li Z, Gong ZH (2018) tion detection by methyl sensitive DArT sequencing. PLoS Detection of maize drought based on texture and morpho- ONE 15:e0233800 logical geatures. Comput Electron Agric 151:50–60 R Core Team (2021) R: a language and environment for statisti- Kang HM, Zaitlen NA, Wade CM, Kirby A, Heckerman D, cal computing, Vienna. https:// www.R- proje ct. org/. Daly MJ, Eskin E (2008) Efficient control of population Resende MFR Jr, Muñoz P, Resende MDV, Garrick DJ, Fer- structure in model organism association mapping. Genet- nando RL, Davis JM, Jokela EJ, Martin TA, Peter GF, Kirst ics 178:1709–1723 M (2012) Accuracy of genomic selection methods in a Kilian A, Wenzl P, Huttner E, Carling J, Xia L, Blois H, Caig standard data set of loblolly pine (Pinus taeda L.). Genetics V, Heller-Uszynska K, Jaccoud D, Hopper C, Aschen- 190:1503–1510 brenner-Kilian M, Evers M, Peng K, Cayla C, Hok P, Saitou N, Nei M (1987) The neighbor-joining method: a new Uszynski G (2012) Diversity arrays technology: S generic method for reconstructing phylogenetic trees. Mol Biol Evol genome profiling technology on open platforms. Methods 4:406–425 Mol Biol (clifton, NJ) 888:67–89 Tecle IY, Edwards JD, Menda N, Egesi C, Rabbi IY, Kulakow Kwong QB, Ong AL, Teh CK, Chew FT, Tammi M, Mayes S, P, Kawuki R, Jannink J-L, Mueller LA (2014) solGS: a Kulaveerasingam H, Yeoh SH, Harikrishna JA, Appleton web-based tool for genomic selection. BMC Bioinform DR (2017) Genomic selection in commercial perennial 15:398–398 crops: applicability and improvement in oil palm (Elaeis Thavamanikumar S, Dolferus R, Thumma BR (2015) Compari- guineensis Jacq.). Sci Rep 7:2872 son of genomic selection models to predict flowering time Liu XG, Wang HW, Wang H, Guo ZF, Xu XJ, Liu JC, Wang SH, and spike grain number in two hexaploid wheat doubled Li WX, Zou C, Prasanna BM, Olsen MS, Huang CL, Xu haploid populations. G3 (bethesda) 5:1991–1998 YB (2018) Factors affecting genomic selection revealed by Tian Y, Guan B, Zhou DW, Yu JB, Li GD, Lou YJ (2014) empirical evidence in maize. Crop J 6:341–352 Responses of seed germination, seedling growth, and seed Liu C, Sukumaran S, Jarquin D, Crossa J, Dreisigacker S, Sansaloni yield traits to seed pretreatment in maize (Zea mays L.). Sci C, Reynolds M (2020) Comparison of array- and sequencing- World J 2014:834630 based markers for genome-wide association mapping and VanRaden PM (2008) Efficient methods to compute genomic genomic prediction in spring wheat. Crop Sci 60:211–225 predictions. J Dairy Sci 91:4414–4423 Liu Y, Hu G, Zhang A, Loladze A, Hu Y, Wang H, Qu J, Zhang Vivek BS, Krishna GK, Vengadessan V, Babu R, Zaidi PH, Kha X, Olsen M, San Vicente F, Crossa J, Lin F, Prasanna BM LQ, Mandal SS, Grudloyma P, Takalkar S, Krothapalli K, (2021) Genome-wide association study and genomic predic- Singh IS, Ocampo ETM, Xingming F, Burgueño J, Azrai M, tion of fusarium ear rot resistance in tropical maize germ- Singh RP, Crossa J (2017) Use of genomic estimated breed- plasm. Crop J 9:325–341 ing values results in rapid genetic gains for drought tolerance Massman JM, Gordillo A, Lorenzana RE, Bernardo R (2013) in maize. Plant Genome 10:plantgenome2016.2007.0070 Genomewide predictions from maize single-cross data. Wang J, Zhang Z (2021) GAPIT version 3: boosting power and Theor Appl Genet 126:13–22 accuracy for genomic association and prediction. Genom Maulana F, Kim K-S, Anderson JD, Sorrells ME, Butler TJ, Liu Proteomics Bioinform 19:1–12 S, Baenziger PS, Byrne PF, Ma X-F (2021) Genomic selec- Wang X, Wang H, Liu S, Ferjani A, Li J, Yan J, Yang X, Qin F tion of forage agronomic traits in winter wheat. Crop Sci (2016) Genetic variation in ZmVPP1 contributes to drought 61:410–421 tolerance in maize seedlings. Nat Genet 48:1233–1241 Meuwissen THE, Hayes BJ, Goddard ME (2001) Prediction of Wang N, Yuan Y, Wang H, Yu D, Liu Y, Zhang A, Gowda M, total genetic value using genome-wide dense marker maps. Nair SK, Hao Z, Lu Y, San Vicente F, Prasanna BM, Li X, Genetics 157:1819–1829 Zhang X (2020) Applications of genotyping-by-sequencing Michel S, Ametz C, Gungor H, Epure D, Grausgruber H, (GBS) in maize genetics and breeding. Sci Rep 10:16308 Löschenberger F, Buerstmayr H (2016) Genomic selec- Xiao YJ, Liu HJ, Wu LJ, Warburton M, Yan JB (2017) Genome- tion across multiple breeding cycles in applied bread wheat wide association studies in maize: praise and stargaze. Mol breeding. Theor Appl Genet 129:1179–1189 Plant 10:359–374 Montesinos-López OA, Montesinos-López A, Pérez-Rodríguez Yu J, Pressoir G, Briggs WH, Vroh Bi I, Yamasaki M, Doebley P, Barrón-López JA, Martini JWR, Fajardo-Flores SB, JF, McMullen MD, Gaut BS, Nielsen DM, Holland JB, Gaytan-Lugo LS, Santana-Mancilla PC, Crossa J (2021) A Kresovich S, Buckler ES (2006) A unified mixed-model review of deep learning applications for genomic selection. method for association mapping that accounts for multiple BMC Genom 22:19 levels of relatedness. Nat Genet 38:203–208 Vol:. (1234567890) 1 3 Euphytica (2022) 218:154 Page 21 of 21 154 Yuan Y, Cairns JE, Babu R, Gowda M, Makumbi D, Magoroko- (2017b) Rapid cycling genomic selection in a multiparental sho C, Zhang A, Liu Y, Wang N, Hao Z, San Vicente F, tropical maize population. G3 (bethesda) 7:2315–2326 Olsen MS, Prasanna BM, Lu Y, Zhang X (2019) Genome- Zhang HH, Yin LL, Wang MY, Yuan XH, Liu XL (2019) Factors wide association mapping and genomic prediction analyses affecting the accuracy of genomic selection for agricultural reveal the genetic architecture of grain yield and flowering economic traits in maize, cattle, and pig populations. Front time under drought and heat stress conditions in maize. Genet 10:189 Front Plant Sci 9:1919 Zhang A, Wang H, Beyene Y, Semagn K, Liu Y, Cao S, Cui Z, Publisher’s Note Springer Nature remains neutral with regard Ruan Y, Burgueno J, San Vicente F, Olsen M, Prasanna BM, to jurisdictional claims in published maps and institutional Crossa J, Yu H, Zhang X (2017a) Effect of trait heritability, affiliations. training population size and marker density on genomic pre- diction accuracy estimation in 22 bi-parental tropical maize populations. Front Plant Sci 8:1916 Zhang XC, Pérez-Rodríguez P, Burgueño J, Olsen M, Buckler E, Atlin G, Prasanna BM, Vargas M, San Vicente F, Crossa J Vol.: (0123456789) 1 3