Euphytica (2022) 218:154 
https://doi.org/10.1007/s10681-022-03103-y
Genomic prediction of drought tolerance during seedling 
stage in maize using low‑cost molecular markers
Ao Zhang · Shan Chen · Zhenhai Cui · Yubo Liu · Yuan Guan · Shuang Yang · Jingtao Qu · Juchao Nie · 
Dongdong Dang · Cong Li · Xiaomei Dong · Jinjuan Fan · Yanshu Zhu · Xuecai Zhang · Jose Crossa · 
Huiying Cao · Yanye Ruan · Hongjian Zheng
Received: 15 December 2021 / Accepted: 8 September 2022 / Published online: 8 October 2022 
© The Author(s) 2022
Abstract Drought tolerance in maize is a com- of 379 inbred lines were genotyped with diver-
plex and polygenic trait, especially in the seedling sity arrays technology (DArT) and genotyping-by-
stage. In plant breeding, complex genetic traits can sequencing (GBS) platforms. Target traits of seedling 
be improved by genomic selection (GS), which has emergence rate (ER), seedling plant height (SPH), 
become a practical and effective breeding tool. In and grain yield (GY) were evaluated under two natu-
the present study, a natural maize population named ral drought stress environments in northeast China. 
Northeast China core population (NCCP) consisting Adequate genetic variations were observed for all the 
target traits, but they were divergent across environ-
ments. Similarly, the heritability of the target trait 
Ao Zhang and Shan Chen have contributed equally to the 
research presented in this article. also varied across years and environments, the herit-
abilities in 2019 (0.88, 0.82, 0.85 for ER, SPH, GY) 
Supplementary Information The online version were higher than those in 2020 (0.65, 0.53, 0.33) and 
contains supplementary material available at https:// doi. cross-2-years (0.32, 0.26, 0.33). In total, three marker 
org/ 10. 1007/ s10681- 022- 03103-y.
A. Zhang · Y. Liu · Y. Guan · D. Dang · X. Zhang · Z. Cui 
H. Zheng (*)  Key Laboratory of Soybean Molecular Design Breeding, 
CIMMYT-China Specialty Maize Research Center, Crop Northeast Institute of Geography and Agroecology, 
Breeding and Cultivation Research Institute, Shanghai Chinese Academy of Sciences,  Changchun,  Jilin, China
Academy of Agricultural Sciences, Shanghai, China
e-mail: hjzh6188@163.com S. Yang 
Shenyang Academy of Agricultural Sciences, Shenyang, 
A. Zhang · S. Chen · D. Dang · C. Li · X. Dong · J. Fan · Liaoning, China
Y. Zhu · H. Cao · Y. Ruan (*) 
College of Bioscience and Biotechnology, Shenyang J. Qu 
Agricultural University, Shenyang, Liaoning, China Maize Research Institute, Sichuan Agricultural University, 
e-mail: yanyeruan@syau.edu.cn Wenjiang, Sichuan, China
A. Zhang · S. Chen · D. Dang · C. Li · X. Dong · J. Fan · J. Nie 
Y. Zhu · H. Cao · Y. Ruan  Fumeng County Modern Agricultural Development 
Shenyang City Key Laboratory of Maize Genomic Service Center, Fuxin, Liaoning, China
Selection Breeding, Shenyang, Liaoning, China
A. Zhang · Y. Liu · X. Zhang · J. Crossa 
International Maize and Wheat Improvement Center 
(CIMMYT), El Batan, Texcoco, México
Vol.: (0123456789)
1 
 154 Page 2 of 21 Euphytica (2022) 218:154
datasets, 11,865 SilicoDArT markers obtained from LD  Linkage disequilibrium
the DArT-seq platform, 7837 SNPs obtained from MAF  Minor allele frequency
the DArT-seq platform, and 91,003 SNPs obtained MAS  Marker-assisted selection
from the GBS platform, were used for GS analy- MRD  Modified Rogers’ distance
sis after quality control. The results of phylogenetic NCCP  Northeast China core population
trees showed that broad genetic diversity existed in NJ  Neighbor-Joining
the NCCP population. Genomic prediction results PCA  Principal component analysis
showed that the average prediction accuracies esti- QTL  Quantitative trait loci
mated using the DArT SNP dataset under the two- rrBLUP  Ridge regression BLUP
fold cross-validation scheme were 0.27, 0.19, and SNP  Single nucleotide polymorphism
0.33, for ER, SPH, and GY, respectively. The result SPH  Seedling plant height
of SilicoDArT is close to the SNPs from DArT-seq, TP  Training population
those were 0.26, 0.22, and 0.33. For the trait with TPS  Training population size
lower heritability, the prediction accuracy can be 
improved using the dataset filtered by linkage dis-
equilibrium. For the same trait, the prediction accura- Introduction
cies estimated with two DArT marker datasets were 
consistently higher than that estimated with the GBS Maize (Zea mays L.) is an important source of food, 
SNP dataset under the same genotyping cost. The animal feed, and energy because of its high yield 
prediction accuracy was improved by controlling potential (Jiang et  al. 2018). Consequently, poor 
population structure and marker quality, even though maize production caused by abiotic stresses has a tre-
the marker density was reduced. The prediction accu- mendously detrimental impact on global food secu-
racies were improved by more than 30% using the rity. Seed germination and seedling establishments 
significant-associated SNPs. Due to the complexity are the initial stages of maize growth, which affect the 
of drought tolerance under the natural stress environ- seedling emergence rate, uniformity, and robustness, 
ments, multiple years of data need to be accumulated and then determine maize yield potential (Tian et al. 
to improve prediction accuracy by reducing genotype- 2014). However, these initial stages are frequently 
by-environment interaction. Modeling genotype- subjected to severe drought stress worldwide due to 
by-environment interaction into genomic prediction global warming, which decreases the emergence rate 
needs to be further developed for improving drought and retards seedling growth. Therefore, the develop-
tolerance in maize. The results obtained from the pre- ment of drought-tolerant maize hybrids is essential to 
sent study provides valuable pathway for improving address the global temperature rise and end the global 
drought tolerance in maize using GS. food insecurity.
The selection of elite drought-tolerant maize inbred 
Keywords Maize (Zea mays L.) · GBS · DArT · lines is the basis of the development of drought-tol-
GS · Drought stress erant maize hybrids. However, traditional methods 
for selecting inbred lines are time-consuming, labor-
Abbreviations intensive, and inefficient. In recent decades, with the 
ANOVA A nalysis of variance development of high-throughput and inexpensive 
BLUE  Best linear unbiased estimate genotyping technologies, genomic selection (GS) 
BP  Breeding population has become a practical and effective tool in animal 
DArT  Diversity arrays technology and plant breeding (Michel et  al. 2016; Zhang et al. 
DL  Deep learning 2017b). The principle of GS is to establish the pre-
ER  Emergence rate diction model based on the genotypic and phenotypic 
GBS  Genotyping-by-sequencing data from a training population (TP), which is used to 
GEBV  Genomic estimated breeding values derive genomic estimated breeding values (GEBVs) 
GS  Genomic selection for all the individuals in the breeding population (BP) 
GY  Grain yield from their genomic profiles (Meuwissen et al. 2001), 
H2  Broad-sense heritability guiding breeders to select excellent individuals from 
Vol:. (1234567890)
1 3
Euphytica (2022) 218:154 Page 3 of 21 154
the population without phenotypic data. GS has been the field observations, is used to evaluate the reliabil-
shown to improve the efficiency of recurrent selection ity of GS results (Combs and Bernardo 2013). A vari-
of bi-parental populations (Beyene et al. 2015; Mass- ety of factors are known to influence GS prediction 
man et al. 2013; Môro et al. 2019; Vivek et al. 2017) accuracy, which includes training population size, 
and multi-parent populations (Zhang et al. 2017b) to relatedness between training and test individuals, 
reduce breeding cycle time and accelerate the genetic DNA marker type, marker quality and density, trait 
gain per unit cost and time. At present, GS has been heritability, statistical models, linkage disequilibrium 
widely used in maize for various traits, including (LD) between markers, and population structure, 
grain yield (Liu et  al. 2018), developmental traits etc. (Liu et al. 2018; Norman et al. 2018). The qual-
(Cui et al. 2020; Zhang et al. 2019), abiotic and biotic ity and density of genomic markers are influenced by 
stress tolerances (Beyene et al. 2015; Cao et al. 2017; genotyping platforms. The ideal sequencing platform 
Vivek et al. 2017) and grain nutritional quality (Guo should be inexpensive, provide high-throughput and 
et al. 2020). Compared with the widely used marker- good genomic coverage, as well as be replicable and 
assisted selection (MAS) technology, the GS strategy stable. Genotyping-by-sequencing (GBS), which can 
has the advantage of without detection of quantitative simplify complex genomes, has now been used as a 
trait loci (QTL) in advance, and it can be used to pre- high-throughput and cost-effective tool for generating 
dict the performance of the breeding population with- high-density molecular markers in many crop species 
out genetic structure analysis (Bernardo 2016). More- (Elshire et  al. 2011). The Diversity Arrays Technol-
over, GS utilizes all genes/loci affecting phenotypes, ogy (DArT) is a DNA hybridization-based molecu-
including both major genes/loci and minor genes/loci lar marker technique, which can detect variation at 
(Crossa et al. 2017; Desta and Ortiz 2014; Xiao et al. numerous genomic loci without sequence information 
2017). Therefore, it is a promising genomic tool for of reference genome. DArT marker has been used 
improving the complex quantitative traits. for the implementation of MAS in maize (Dos San-
Genomic prediction (GP) can be implemented tos et al. 2016). In addition, DArT can be combined 
with statistical models that can estimate the marker with next-generation sequencing platforms known as 
effects accurately in a training population. Over DArT-seq, which permits simultaneous detection of 
the past 2 decades, various statistical models and several thousands of DNA polymorphisms by scor-
machine learning methods have been proposed for the ing the presence or absence of DNA fragments in 
implementation of GP. Among the parametric mod- genomic representations and through a process of 
els, ridge regression best linear unbiased predictions reducing the genomic complexity (Kilian et al. 2012).
(rrBLUP) and genomic-BLUP (GBLUP) are most The maize growth from seed germination stage 
commonly used in plant breeding (Heslot et al. 2012). to seedling establishment stage involves a variety of 
Bayes A assumes a prior distribution of effects with a complicated metabolic transformations, physiologi-
higher probability of moderate to large effects, while cal activities, and cellular and tissue differentiation, 
Bayes B and Bayes Cπ assume some marker effects to which are regulated by many genomic loci, includ-
be zero. Reproducing Kernel Hilbert Spaces (RKHS) ing both major and minor effect loci expressed under 
regression is equivalent to a GBLUP with a linear drought stress. These abundant genetic variations 
kernel (de los Campos et  al. 2009). Deep Learning exist in maize germplasms with different genetic 
(DL), a nonparametric model, is a type of machine backgrounds. In this study, three marker datasets of 
learning (ML) approach and a subfield of artificial a natural maize population obtained from the DArT 
intelligence (Montesinos-López et  al. 2021). For and GBS genotyping platforms, as well as the phe-
most traits with different genetic architectures, the notypic data of emergence rate (ER) and plant height 
differences between different models are slight. rrB- (SPH) in the seedling stage, and grain yield (GY) in 
LUP is a classical method with lower computational the mature stage evaluated under two drought stress 
requirements and usually performs well across dif- environments, were used to perform GS analysis. The 
ferent traits (Kwong et al. 2017; Maulana et al. 2021; main objectives of the present study were to: (1) eval-
Resende et al. 2012). uate the phenotypic variation of several target traits 
Prediction accuracy of GS, the correlation between in the NCCP in natural drought stress environments; 
the genomic estimated breeding values (GEBV) and (2) compare the genomic prediction accuracies of the 
Vol.: (0123456789)
1 3
 154 Page 4 of 21 Euphytica (2022) 218:154
target traits of ER, SPH, and GY estimated from the the proportion of the number of surviving plants to 
three marker datasets obtained from the DArT and the total number of seeds planted. SPH was measured 
GBS genotyping platforms; (3) assess the prediction as the distance from the plant base to the highest of 
accuracies of three traits by improving the marker the seedling plant. The average dry grain weight from 
quality; and (4) improve the prediction accuracies by five plants was used to represent the GY for each 
controlling population structure and incorporating entry.
trait-marker associations. The best linear unbiased estimator (BLUE) values 
and broad-sense heritability  (H2) of ER, SPH, and 
GY were calculated within and across years using 
Materials and methods the META-R software version 6.04 (Alvarado et  al. 
2020) (http:// hdl. handle. net/ 11529/ 10201). The linear 
Plant materials, phenotype evaluation, and mixed model used in META-R was implemented in 
heritability estimation the LME4 R-package, functions of lmer and REML 
were used to estimate the variance components.
A collection of 391 inbred lines, designated as the 
northeast China core population (NCCP), were used Yijk =  + Geni + Envj + Geni × Envj + Repk + ijk
for genomic prediction analyses in the current study. where Yijk is the trait of interest, μ is the overall mean, 
All these inbred lines were selected from China, Geni , Envj , and Geni × Envj are the effects of the i-th 
Mexico, and America, and they adapted to the spring genotype, j-th year, and i-th genotype by j-th year 
maize area of Northeast China. The drought stress interaction, respectively. All the effects are consid-
tolerance of this population was determined in the ered random, which is assumed to be normally and 
Fuxin Mongolian autonomous county, Liaoning independently distributed, with mean zero and homo-
Province, China (42°06′ N, 122°55′ E) in 2019 and scedastic variance σ2. Genotype and environment is 
2020, respectively, where drought stress occurs fre- a fixed effect when calculating BLUEs, the covariate 
quently during the spring season. The drought trial is always a fixed effect. Repk is the effect of k-th rep-
was planted on 12 May and harvested on 7 October lication. ijk is the random error term of the i-th geno-
in 2019, and planted on 11 May and harvested on 9 type, j-th year, k-th replication. Traits with heritability 
October in 2020. During the maize growing season, below 0.05 within the individual environment were 
the average temperature was 21 °C both in 2019 and excluded from the across-location analysis.
2020. The average rainfall was 4  mm and 3  mm in Broad-sense heritability ( H2) based on the entry 
2019 and 2020, respectively. The highest tempera- means within the trial was estimated as follows:
ture was above 35 °C in both years, and the precipita-
tion was less than 20 mm per month, causing serious 2
g
drought stress in the seedling stage (Fig.  1). Irriga- H2 =
2
e
2 g 2
tion was not performed before planting, and plant- + +
g nEnvs nEnvs×nreps
ing was scheduled before precipitation according to 
the weather forecast for getting good germination. A where 2 , 2 , and 2  are the genotypic variance, error 
g ge
completely randomized block design with three repli- variance, and genotype-by-environment interaction 
cations was applied in each environment. The inbred variance, respectively, and nreps and nEnvs are the 
lines were sown using one seed per hole in a single numbers of replications and environments, respec-
plot, 3 m long, with plant spacing of 10 cm and row tively (Carena et al. 2010).
spacing of 60  cm. Target traits of emergence rate 
(ER), seedling plant height (SPH) and grain yield Genotyping and quality control
(GY) were evaluated to represent the drought toler-
ance of the tested plant materials. The target traits of Leaf samples of three individual plants were collected 
ER and SPH were measured at the seedling stage, i.e. for each inbred line during the seeding stage for DNA 
20 days after planting. While GY was measured at the extraction with a CTAB procedure, and genotyp-
maturity stage, the moisture content was measured by ing was carried out by DArT (https:// www. diver sitya 
LDS-1G Grain Moisture Meter. ER was measured as rrays. com/) and GBS (Elshire et al. 2011) platforms.
Vol:. (1234567890)
1 3
Euphytica (2022) 218:154 Page 5 of 21 154
Fig. 1  Distribution of 
temperature and rainfall 
in maize planting plots of 
Fuxin Mongolian Autono-
mous County in 2019 and 
2020. The horizontal axis 
shows the date from sowing 
to harvest. The panels of a 
and c are temperature (°C) 
data, including maximum 
temperature (tem_max), 
minimum temperature 
(tem_min), and average 
temperature (tem_avg). 
The panels of b and d are 
rainfall data, and the unit 
is mm
Vol.: (0123456789)
1 3
 154 Page 6 of 21 Euphytica (2022) 218:154
Genotyping of all inbred lines with the GBS were generated by DArT-seq: SilicoDArT and SNP. 
method was performed at Wuhan medical laboratory, The SilicoDArT dataset is the presence/absence vari-
BGI Co., Ltd, following a protocol widely used by the ation of the tag sequences, while the SNP dataset 
maize research community (Elshire et al. 2011). The was always used for genetics analysis and molecular 
genomic DNA was digested with the ApeKI restric- marker-assisted selection in previous studies. The 
tion enzyme, and a DNA library was constructed SNP dataset was obtained by mapping the SNPs to the 
in 96-plex and sequenced on Illumina-hiseq XTen B73 RefGen_v4 reference genome using the BLAST 
sequencing system. Each sample obtained an average tool, and some of the fragments in SilicoDArT were 
of about seven hundred and fifty thousand reads. The not able to be aligned to the reference genome. The 
B73 RefGen_v4 was the reference genome, and BWA number of markers in SilicoDArT dataset was 62,794, 
software was applied to match all sequencing data to while the number of markers in the SNP dataset was 
the reference genome. Details in single nucleotide 39,659, and 547 of them were not able to be anchored 
polymorphism (SNP) calling and imputation have to any ten maize chromosomes (Table S1).
been previously described (Cao et  al. 2017). Each In both DArT and GBS datasets, TASSEL 5 (Brad-
line has an average of approximately 77,000 SNPs bury et al. 2007) was used to filter out markers with 
distributed on each chromosome. In total, 776,285 minor allele frequency (MAF) < 0.05 and a missing 
loci were aggregated to cover all SNPs, and 768,558 rate > 20%. The samples with a missing rate of more 
of them were mapped to the ten maize chromosomes than 20% were discarded. The number of lines in 
(Table S1), while 7727 could not be anchored to any DArT and GBS datasets is 379 and 378, respectively. 
chromosome (data not displayed). In total, the number of markers is 11,865 in the Sili-
Genotyping of all the inbred lines with DArT- coDArT dataset, 7837 in the DArT SNP dataset, and 
seq method was carried out at the SAGA (https:// 91,003 in the GBS SNP dataset (Table 1). Imputation 
seeds ofdis covery. org/ about/ genot yping- platf orm/) was performed using the method of Cao et al. (2017).
sequencing lab, jointly established by the DArT com-
pany and CIMMYT. Two enzymes of PstI (CTG CAG SNP distribution across the whole maize genome
) and HpaII (CCGG) were used to digest the DNA 
samples to reduce the complexity of the genome. For The distribution of SNPs in the genome is one of the 
each 96 wells plate, 16% of the samples were repli- indicators of the quality of the genotyping platforms. 
cated to assess reproducibility (Pereira et  al. 2020). A package named RIdeogram (Hao et al. 2020), writ-
After enzyme digestion, DNA from different samples ten in R programming (R Core Team, 2021), was 
was linked with barcodes of different base combina- used to assess the distribution of SNPs obtained from 
tions of sequences to construct a simplified sequenc- both the DArT-seq and GBS platforms. This analysis 
ing DNA library. Equimolar amounts of amplification was not applied to the SilicoDArT dataset, due to a 
products from each sample were pooled by plate and lack of information on physical positions on the refer-
amplified by c-Bot (Illumina) bridge PCR, followed ence genome. The plot of the SNP distribution was 
by fragment sequencing on Illumina Hiseq 2500 drawn using the ideogram function. The customized 
(www. illum ina. com). The short fragment sequenc- R scripts to plot the heatmap of the SNP distribution 
ing technology (150 bp) and the simplified sequenc-
ing DNA library of mixed samples was sequenced 
on a single lane (Kilian et  al. 2012) (https:// www. Table 1  The number of materials and sites, proportion hete-
diver sitya rrays. com/). SNPs were called using the rozygous and MAF for markers of SilicoDArT, DArT and GBS 
DArTsoft analytical pipeline (http:// www. diver sitya datasets after filtering
rrays. com/ softw are. html) (Chen et al. 2016). DArT’s SilicoDArT DArT GBS
SNP development is different from GBS, which 
does not rely on the reference genome information, Number of Taxa 379 379 378
but mainly depends on the sequencing library data Number of Sites 11,865 7837 91,003
from the DArT company. All reads were aligned by Proportion Missing 0.07 0.07 0.069
sequence analysis based on maize tags of metage- Proportion Heterozygous 0.03 0.02 0.018
nome representation. Two types of marker datasets Average Minor Allele Frequency 0.24 0.24 0.23
Vol:. (1234567890)
1 3
Euphytica (2022) 218:154 Page 7 of 21 154
on maize chromosomes using the HapMap format random cross-validation analysis was implemented 
input dataset were provided (https:// aozha ngchi na. for all traits. For prediction, regression models were 
github. io/R/ chrom esome heatm apTool/ Chrom esome evaluated by the prediction accuracy, a Pearson’s cor-
Heatm ap. html). relation between the GEBVs and the estimated breed-
ing values in the testing population (Thavamanikumar 
Genomic prediction model et al. 2015). All predictions were repeated 100 times, 
and the mean value was taken as the average predic-
The rrBLUP model assumes homogenous variance tion accuracy.
of all markers and shrinks all marker effects equally 
to zero. rrBLUP is equivalent to BLUP and uses the Effect of marker quality on the estimation of genomic 
realized relationship matrix estimated from the mark- prediction accuracy
ers. The genomic prediction analysis was performed 
with the ridge regression BLUP (rrBLUP) package For the same target trait, genomic prediction accuracy 
(Endelman, 2011). The mixed model is described as: estimated by the data from the drought stress condi-
tion is always relatively low than that estimated by 
y = X + Zu +  the data from the optimal condition. Higher marker 
quality was proved to improve prediction accuracy. 
where y is the vector (n × 1) of BLUPs from the phe- As mentioned in in materials and methods section, 
notypic data analysis, X is a design matrix for fixed different levels of MAF and missing rates were fil-
effects; ε is the vector (n × 1) of independently ran- tered in the DArT and GBS marker datasets to control 
dom errors with assumed distribution N (0, Iσ 2
ε ), Z the marker quality and quantity.
is the design matrix (n × n) for random effects and u For evaluating the effect of linkage disequilib-
is the vector of random effects with u ∼ N (0, Kσ 2
u ) rium (LD) among markers on the estimation of the 
and K was an a realized (additive) relationship matrix genomic prediction accuracy, neighbor SNPs not in 
computed from marker genotypes proposed by Van- a strong LD phase, i.e., LD values lower than a spe-
Raden (2008). In addition, n is the number of indi- cific threshold (0.1 and 0.2) within an LD block, were 
viduals (Endelman 2011; Liu et  al. 2018). Variance excluded from the prediction analyses. TASSEL 5.0 
components are estimated by REML using the spec- (Bradbury et al. 2007) was used to perform LD analy-
tral decomposition algorithm of Kang et  al. (2008). sis, and LD decay plots were developed using the cus-
The mixed.solve function, a linear mixed-model equa- tomized R scripts (https:// aozha ngchi na. github. io/R/ 
tion estimates marker effects and GEBVs. GEBVs LDdec ay/ LDdec ayPlo tTool. html). Genomic predic-
are derived from the realized (additive) relationship tion accuracy was calculated from 100 times repeti-
matrix of individuals calculated from marker geno- tion using twofold cross-validation. A twofold cross-
types (Tecle et al. 2014). validation analysis was implemented for all traits. For 
prediction, regression models were evaluated by the 
Effect of training population size (TPS) on the prediction accuracy. All predictions were repeated 
estimation of genomic prediction accuracy 100 times, and the mean value was taken as the aver-
age prediction accuracy.
To evaluate the effect of TPS on the estimation of 
genomic prediction accuracy, the cross-validation Effect of genetic structure on the estimation of 
scheme was used to randomly generate different train- genomic prediction accuracy
ing and prediction sets and assess the prediction accu-
racy for different target traits in prediction sets. The It was shown that the relationship between train-
training populations were extracted from the whole ing and prediction sets had a significant impact on 
population The TPS ranged from 10 to 90%, with an the estimation of prediction accuracy (Zhang et  al. 
interval of 10%. Each cross-validation scheme was 2017a). In the current study, NCCP consists of multi-
repeated 100 times. For the same trait, the predic- ple heterotic groups, such as the Reid, Reid_Chinese, 
tion accuracies estimated from the DArT and GBS Lancaster, Tangsipingtou, Lvda Red Cob, Longdan, 
molecular marker datasets were compared. A gradient Jidan, CML, Mixed, etc. Some elite breeding lines 
Vol.: (0123456789)
1 3
 154 Page 8 of 21 Euphytica (2022) 218:154
from Northeast China were also included in this estimate the Q value in this study. The MLM model is 
population, including Shen118, Shen3336, Liao6049, described as follows:
Dan340, Ji818, etc.
A phylogenetic tree was used to demonstrate the y = X + Qv + Zu + 
genetic relationship among all the inbred lines used where X is the SNP marker matrix, Q and Z represent 
in NCCP. The genetic distances among all the inbred the subpopulation membership matrix and kinship 
lines were calculated in Bio-R (Biodiversity Analysis matrix respectively,  and v are the coefficient vec-
with R) software developed by CIMMYT using the tors of SNP markers and subpopulation membership 
filtered DArT and GBS datasets’ (https:// hdl. han- respectively, u is the random genetic effect vector, and 
dlenet/11529/10820). The formula to compute the  is the random error vector, which is Var[] = I2 
modified Rogers distance (MRD) used in the present e
(Yu et al. 2006; Fan et al. 2016).
study is as follows: The thresholds of significant markers were set 
� as P < 0.5, 0.1, 0.01, 0.001. The prediction accura-
∑L ∑n � �2
l p
1 lax − p
l=1 a= lay cies were estimated using significant markers, equal 
MRDxy =
2L amount of non-significant markers, and the rest of 
non-significant markers. Genomic prediction accu-
where plax is the estimated frequency of allele a, racy was calculated from 100 times repetitions using 
which is located in locus l, and genotype x; l is the twofold cross-validation.
number of loci, nl is the number of alleles in the locus 
l. Markers from both platforms were used to create 
the phylogenetic tree. Based on the genetic distance Results
matrix, the Neighbor-Joining (NJ) method of MEGA 
X (https:// www. megas oftwa re. net/) and FigTree Phenotypic data and correlation analysis
(http:// tree. bio. ed. ac. uk/ softw are/ figtr ee/) software 
were used to create the phylogenetic trees (Saitou and The phenotypic variations and standard deviations 
Nei 1987). of the three traits in 2019, 2020, and across 2 years 
The principal component analysis (PCA) was per- were shown in Table  2. Sufficient genetic variations 
formed to assess the effects of genetic structure on for genomic selection were found under the drought 
the estimation of prediction accuracy. All the inbred environment. The phenotypic values of ER ranged 
lines in NCCP were divided into two subgroups. The from 0.16 to 1.00, with an average value of 0.81 in 
prcomp function from the stats package in R and the 2019; ranged from 0.32 to 0.90, with an average value 
plot function from the base package in R were applied of 0.70 in 2020; and ranged from 0.60 to 0.84 with 
to generate the PCA plot. Genomic prediction accu- an average value of 0.76 across 2 years. For SPH, the 
racy was estimated within each subgroup from 100 phenotypic values ranged from 9.14 to 20.64, with 
times repetitions using twofold cross-validation. an average value of 14.96 in 2019; ranged from 8.61 
to 15.06 with an average value of 11.67 in 2020; and 
ranged from 12.00 to 14.48, with an average value 
Genomic prediction analysis with the significant trait of 13.36 across 2 years. The GY values ranged from 
associated markers 12.22 to 93.96, from 7.59 to 23.32, and from 18.32 to 
38.19 in 2019, 2020, and across 2 years, respectively. 
For each trait, all SNPs were divided into the sig- The average value of GY was 40.00, 10.97, and 25.04 
nificant marker and non-significant marker groups in 2019, 2020, and across 2 years, respectively. The 
according to the GWAS results. The GWAS was coefficient of variation (CV) in 2020 was greater than 
conducted by GAPIT (version 3) (Wang and Zhang that in 2019 for each trait (Table 2). Correspondingly, 
2021)  using the MLM model to obtain the associa- the variance of the environment component was great, 
tion markers from GBS and DArT platforms. This low to moderate heritabilities for all the three traits 
model simultaneously incorporates population struc- were obtained in 2020 and across 2 years, and relative 
ture (Yu et al. 2006). The first three PCs were used to high heritabilities were observed in 2019 for all the 
Vol:. (1234567890)
1 3
Euphytica (2022) 218:154 Page 9 of 21 154
Table 2  Phenotypic performance, variance component and broad-sense heritability of three traits in the maize association popula-
tion
Traita Means ± SD Range Unit CV Variance  componentb H2c Rep
Genotype Environment Gen × Env
19ER 0.81 ± 0.16 0.16–1.00 13.69 0.03** 0.88 3
20ER 0.70 ± 0.11 0.32–0.90 24.29 0.02** 0.65 3
ER 0.76 ± 0.04 0.60–0.84 18.81 0.01** 0.01** 0.02** 0.32 3
19SPH 14.96 ± 1.78 9.14–20.64 cm 10.11 3.57** 0.82 3
20SPH 11.67 ± 1.15 8.61–15.06 cm 22.04 2.48** 0.53 3
SPH 13.36 ± 0.42 12.00–14.48 cm 15.43 0.67** 5.27** 2.39** 0.26 3
19GY 40.00 ± 14.91 12.22–93.96 g 29.36 267.16** 0.85 3
20GY 10.97 ± 2.65 7.59–23.32 g 119.21 28.66* 0.33 3
GY 25.04 ± 3.57 18.32–38.19 g 48.34 43.27** 420.80** 126.06** 0.33 3
ER BLUE of emergence rate across two environments, SPH BLUE of seedling plant height across two environments, GY BLUE of 
grain yield across two environments
a 19ER: emergence rate (ER) was measured in 19FX (2019 Fuxin); 20ER: emergence rate (ER) was measured in 20FX (2020 Fuxin); 
19SPH: seedling plant height (SPH) was measured in 19FX (2019 Fuxin); 20SPH: seedling plant height (SPH) was measured in 
20FX (2020 Fuxin); 19GY: grain yield (GY) was measured in 19FX (2019 Fuxin); 20GY: grain yield (GY) was measured in 20FX 
(2020 Fuxin)
b *Significant at P ≤ 0.05; **Significant at P ≤ 0.01
c Family mean-based broad-sense heritability
three traits. The  H2 of ER, SPH, and GY in 2019 was before filtering was observed at the MAF interval of 0 
0.88, 0.82, and 0.85, respectively. In 2020, the  H2 of to 0.05 for all the three marker sets. The MAF distri-
ER, SPH, and GY was 0.65, 0.53, and 0.33, respec- butions in other intervals from 0.05 to 0.50 were rela-
tively. Across 2 years, the H 2 of ER, SPH, and GY tively uniform. Average MAF after filtering was 0.24, 
was 0.32, 0.26, and 0.33, respectively (Table 2). The 0.24, and 0.23 for SilicoDArT, SNPs from DArT-seq, 
phenotypic data analysis based on BLUEs revealed and SNPs from GBS, respectively (Table  1). Con-
that ER, SPH, and GY had normal distributions or sistent MAF distributions between the two genotyp-
skewed normal distributions (Fig. 2). The phenotypic ing platforms potentially indicated that the quality of 
correlations among all the three traits were positive sequencing and molecular marker datasets generated 
and significant, and the correlation coefficients were by both genotyping platforms were high and reliable 
0.36, 0.35, and 0.25 (P < 0.01) between ER and SPH, (Fig. 3a–c).
between ER and GY, and between SPH and GY, Both in the SilicoDArT and DArT-seq SNP data-
respectively. sets, distributions of missing rates showed a decreas-
ing trend from the interval of 0–10% to the interval 
Quality control on genotypic data and marker of 90–100%, and close to 40% of molecular markers 
distribution had the missing rates less than 10%. In contrast, close 
to 40% of molecular markers in the GBS SNP dataset 
Two kinds of marker sets were obtained in the current had the missing rates more than 90%, and 10% of the 
research. The SilicoDArT markers were from DArT- GBS SNPs appeared in each interval of missing rate 
seq, and SNPs were from the GBS and DArT-seq. The of 0–10%, 10–20%, 20–30%, 30–40%, and 40–50% 
plots of MAF and missing rate distribution before fil- (Fig. 3d–f). The irregular distribution of missing rates 
tering for three marker sets were shown in Fig. 3. The in the GBS SNP dataset indicated that the quality of 
SilicoDArT markers had an average MAF of 0.10, this SNP dataset is not high.
and the DArT-seq SNPs set had an average MAF of High-quality marker sets were obtained after filter-
0.11 before filtering. For the GBS SNP set, the aver- ing, and the average missing rate was 0.070, 0.073, 
age MAF before filtering was 0.11. The highest peak and 0.068 in the SilicoDArT dataset, DArT-seq SNP 
Vol.: (0123456789)
1 3
 154 Page 10 of 21 Euphytica (2022) 218:154
markers at both ends of the chromosomes was high 
and decreased toward the centromere regions. The 
SNPs in both datasets were enriched at the ends of 
chromosomes 5 and 8. More uniform distribution was 
observed in the GBS SNP dataset, whereas the DArT-
seq SNP dataset had higher marker density on the 
long arms of chromosomes 8 and 9. Uneven distribu-
tions may potentially lead to a decrease in prediction 
accuracy.
The phylogenetic tree of NCCP
The phylogenetic trees were generated with the 
DArT-seq SNPs and GBS SNPs (Fig. 5). In total, the 
NCCP was divided into 13 groups, corresponding to 
the heterotic groups of NSS, SS, Huanglvxi, Lvxi, 
Reid, Mixed, France, PA, Jidan, Longdan, Lvda-
honggu, Tangsipingtou and Unknown. The inbred 
Fig. 2  Frequency distributions and correlations of BLUE (best lines from CIMMYT are tropical material, only two 
unbiased linear estimator) as phenotype values were calcu- of them were regenerated in northeast China and 
lated from the maize seedling emergence rate (ER), seedling classified into the Mixed group. Some small groups 
plant height (SPH), and grain yield (GY) of maize measured 
in 19FX (2019 Fuxin) and 20FX (2020 Fuxin) under drought and inbred lines selected from the multiple paren-
environment. The plots on the diagonal represent the pheno- tal lines also were classified as the unknown group. 
typic distribution frequency of ER, SPH, GY. The values above DArT SNPs produced a better phylogenetic tree, and 
the diagonal line are the Pearson’s correlation coefficients heterotic groups of Reid, Lvxi, and Huanglvxi were 
between every two traits. The values below the diagonal line 
are scattered plots for every two traits. *represents a significant separated clearly. However, the phylogenetic tree pro-
difference at the 0.05 level; **represents a significant differ- duced with the GBS SNPs did not separate the heter-
ence at the 0.01 level otic groups clearly in the present study. Some inbred 
lines from local breeders lost their source informa-
tion, so they were classified into the unknown group. 
dataset, and GBS SNP dataset, respectively (Table 1). Some materials were collected from third parties with 
Heterozygosity is usually a critical index for marker inaccurate source information, it is not easy to clas-
filtering, especially for inbred lines. In this research, sify them into heterotic groups properly. For some 
the average heterozygosity rates were very low in all inbred lines, pedigree information conflicted with the 
the three marker datasets, i.e., less than 0.01 before phylogenetic tree results generated with molecular 
filtering, and less than 0.03 after filtering. It indicated markers, it is hard to generate a perfect phylogenetic 
that the breeding lines in the NCCP are homozygous tree in a practical breeding program.
(Table S1 and Table 1).
After filtering, the number of markers decreased by The genomic prediction accuracy estimated from the 
81%, 80%, and 88% in the SilicoDArT dataset, DArT- DArT and GBS markers
seq SNP dataset, and GBS SNP dataset, respectively 
(Table S1 and Table 1). The missing rate was mainly Genomic prediction accuracies were estimated for 
responsible for the reduction of markers, the missing all the three target traits with the BLUE values and 
rate before filtering was 25%, 24%, and 56% in the the SNP datasets filtered with different parameters 
SilicoDArT dataset, DArT-seq SNP dataset, and GBS (Fig.  6). Using DArT SNP datasets, the predic-
SNP dataset, respectively (Table S1). tion accuracy increased continuously as the TPS 
The markers of DArT-seq SNP dataset and increased across all three traits evaluated under a nat-
GBS SNP dataset were distributed across the entire ural drought stress environment (Fig. 6). The predic-
maize genome (Fig.  4). Generally, the density of tion accuracy slightly increased across all three traits 
Vol:. (1234567890)
1 3
Euphytica (2022) 218:154 Page 11 of 21 154
Fig. 3  Distribution of MAF and missing rate before filtering rate distribution of the SilicoDArT marker dataset; e missing 
in three marker datasets: a MAF distribution of the SilicoDArT rate distribution of the DArT marker dataset; f missing rate dis-
marker dataset; b MAF distribution of the DArT marker data- tribution of the GBS marker dataset
set; c MAF distribution of the GBS marker dataset; d missing 
Fig. 4  The markers density thermogram of DArT and GBS datasets. a The markers density thermogram of DArT’s SNP; b the 
markers density thermogram of GBS
when the TPS increased from 50 to 90%. The smallest traits of SPH and GY, 40% of the training population 
standard error was observed in prediction accuracy size gained the smallest standard error, which indi-
when 50% to 60% of the training population size was cated that 40 to 60% of the total genotypes assigned 
used to predict the target traits of ER. For the target as the training set could achieve good prediction 
Vol.: (0123456789)
1 3
 154 Page 12 of 21 Euphytica (2022) 218:154
Fig. 5  Phylogenetic Trees for SNP markers from DArT-seq and GBS platforms. The letters a and b represent NCCP’s phylogenetic 
trees from DArT’s SNPs and GBS’s SNPs
accuracy in the DArT SNP markers. For the GBS accuracies using the SNPs from the DArT-seq were 
SNP markers, a similar trend was observed for GY. 0.27, 0.19, and 0.33 for ER, SPH, and GY traits, 
With the increase of the TPS, the average and median respectively. The prediction result estimated with Sil-
prediction accuracies showed slightly different for the icoDArT markers was similar to that estimated with 
target traits of ER and SPH (Fig. 6). The prediction DArT-seq SNPs, which were 0.26, 0.22, and 0.33 
accuracies estimated in a single year were similar to for ER, SPH, and GY traits, respectively. While, the 
those using BLUEs across 2 years (Figs. S1, S2). The prediction accuracies estimated from the GBS SNP 
heritabilities of the three traits were higher in 2019, markers were − 0.02, − 0.06, and 0.20 for the traits of 
slightly improving the prediction accuracy. Predic- ER, SPH, and GY, respectively.
tion accuracies estimated with SilicoDArT mark-
ers were consistent with those estimated with DArT Different prediction accuracies affected by marker 
SNP markers (Fig. S3). Therefore, 50% of TPS was quality
selected for further analysis.
Genomic prediction accuracies estimated using For all the three traits, the results of prediction accu-
twofold cross-validation (50% of TPS) for all three racies estimated from the three molecular marker 
traits were shown in Fig.  7. The prediction accura- datasets filtered with the different combinations of 
cies estimated with the markers from different geno- MAF and missing rate were presented in Table  3, 
typic platforms showed significant differences for Tables S2, and S3. For the ER trait, the average pre-
three interesting traits. For traits of SPH, the predic- diction accuracies range from 0.19 to 0.29, and the 
tion accuracies estimated with the SilicoDArT data- highest prediction accuracy was observed using 2,762 
set and DArT-seq SNP dataset showed significant SilicoDArT markers. Interestingly, increasing the fil-
differences, which means alignment of the markers tering of missing rates improved the prediction accu-
to the reference genome resulting in a loss of predic- racy, while MAF had little effect on the estimation 
tion accuracy. For three traits, the prediction accura- of the prediction accuracy. Similar trends were also 
cies estimated from the DArT-seq SNPs were sig- observed for ER and GY. In most cases, the prediction 
nificantly higher than those estimated from the GBS accuracies did not show significant differences under 
SNPs (P < 0.01). Among the three traits, the pre- different combinations of MAF and missing rate. The 
diction accuracy of SPH was the lowest in both the LD decay distances estimated with the DArT SNPs 
DArT and GBS markers, which was consistent with and GBS SNPs were 6.16 kb and 3.83 kb, respectively 
the lowest heritability of SPH. The average prediction (Fig. S4). The result of prediction accuracy estimated 
Vol:. (1234567890)
1 3
Euphytica (2022) 218:154 Page 13 of 21 154
Fig. 6  Genomic prediction accuracies of ER, SPH, and GY in markers; b ER estimated with GBS markers; c SPH estimated 
NCCP across two natural drought conditions, when the train- with DArT markers; d SPH estimated with GBS markers; e 
ing population size was set from 10 to 90% of total genotypes, GY estimated with DArT markers; f GY estimated with GBS 
with an interval of 10%. Panel a ER estimated with DArT markers
Vol.: (0123456789)
1 3
 154 Page 14 of 21 Euphytica (2022) 218:154
distribution of two subgroups using PC1 and PC2 
estimated with the DArT and GBS markers, and the 
first two PC explained 35% and 31% of the genetic 
variation, respectively (Fig.  8). The PCA results 
estimated with the markers from the two platforms 
were very similar. Finally, 180 inbred lines weres 
classified into Subgroup 1, and 199 inbred lines 
were classified into Subgroup 2.
For all the traits, the prediction accuracies esti-
mated within each subgroup were shown in Table 5. 
Using the DArT SNP markers, the prediction 
accuracies in Subgroup 1 with 50% of TPS were 
0.17, 0.11, and 0.11 for ER, SPH, and GY traits, 
respectively. For Subgroup 2, the prediction accu-
racies were 0.29, 0.20, and 0.33 for ER, SPH and 
Fig. 7  Genomic prediction accuracies of ER, SPH, and GY GY traits, respectively. The prediction accuracies 
estimated with the SilicoDArT, DArT and GBS marker data- in Subgroup 1 were significantly lower than those 
sets under twofold cross-validation. The significant difference 
was analyzed by one-way ANOVA of SPSS using the whole population for all three traits, 
whereas the prediction accuracies in Subgroup 2 
were higher than those using the entire population 
with the different DArT/GBS marker datasets filtered for all three traits. For the GBS markers, no dif-
by various LD values was presented in Table 4. Filter- ferences were observed in the prediction accuracy 
ing with different LD values caused different marker between the two subgroups.
densities, and the number of markers decreased as 
the values increased. SPH obtained the highest aver- Prediction accuracy estimated with the trait-marker 
age prediction accuracy in the LD0.1 of DArT with associations
800 SNPs, followed by the accuracy in the DArT-
LD0.2 with 2681 markers and without filtering with To improve the prediction accuracy under the condi-
7,837 SNPs. In contrast, the DArT-LD0.1 in ER and tion of natural drought in the maize seedling stage, 
GY traits were found to have the highest accuracies. the trait-marker associations detected in GWAS were 
Using the GBS markers, the prediction accuracies for selected for predictions using different significance 
traits of ER and GY estimated from 2333 SNPs fil- levels (P < 0.5, 0.1, 0.01, 0.001). The same amount 
tered with LD value  (r2 = 0.1) were higher than those of randomly selected markers and all non-significant 
estimated from all the 91,003 SNPs without filtering. markers excluding significant markers were used as 
Slight improvements were observed, when a smaller controls for comparison analysis. The results showed 
number of high-quality markers filtered with different that the prediction accuracies estimated with sig-
parameters was used for prediction, but the best fil- nificant markers were significantly higher than those 
tering parameters for improving prediction accuracy estimated from the same amount of random markers 
varied in different traits and genotyping platforms. for all the traits evaluated under drought tolerance 
environment at the seedling stage. When SNPs are 
Prediction accuracy estimated within subgroups from GBS, the prediction accuracies estimated with 
46,168, 9074, 872 and 72 significant SNPs in P < 0.5, 
In general, the population with close relationships 0.1, 0.01 and 0.01were 0. 36, 0. 59, 0. 61 and 0. 57 
is more likely to obtain high prediction accuracy. for ER trait, respectively. For SPH and GY traits, both 
Effects of controlling the population structure on had the same trend as ER, and the maximum predic-
improving the prediction accuracy were assessed. tion accuracy was obtained when P < 0.01, which 
For keeping enough lines in each subgroup, NCCP were 0.62 and 0.63 respectively (Fig. 9). When SNPs 
was divided into two subgroups, according to the were from DArT-seq, 3858, 775, 90 and 11 signifi-
population structure result. The PCA plots show the cant SNPs were selected for prediction for ER trait 
Vol:. (1234567890)
1 3
Euphytica (2022) 218:154 Page 15 of 21 154
Table 3  Genomic prediction accuracies of ER, SPH, and GY in NCCP under natural drought conditions, estimated from the Sili-
coDArT marker datasets with different levels of quality filtered with missing rate and MAF
Missing rate MAF Number of mark- Prediction accuracy
ers
ER SPH GY
0% 0.1 1356 0.24 0.23 0.31
0.2 926 0.22 0.22 0.32
0.3 583 0.19 0.22 0.29
0.4 265 0.21 0.22 0.31
10% 0.1 6484 0.25 0.22 0.34
0.2 4450 0.25 0.21 0.33
0.3 2838 0.25 0.22 0.33
0.4 1379 0.26 0.21 0.33
20% 0.1 9483 0.27 0.22 0.33
0.2 6376 0.26 0.21 0.33
0.3 3934 0.26 0.23 0.34
0.4 1851 0.27 0.22 0.31
30% 0.1 12,615 0.26 0.23 0.33
0.2 8571 0.26 0.22 0.33
0.3 5451 0.27 0.23 0.33
0.4 2650 0.26 0.21 0.33
40% 0.1 14,726 0.28 0.23 0.33
0.2 9726 0.27 0.22 0.33
0.3 5972 0.26 0.21 0.33
0.4 2762 0.29 0.23 0.35
in P < 0.5, 0.1, 0.01 and 0.001, and their predic- without using greenhouses. In the current study, the 
tion accuracies reached to 0.54, 0.71, 0.66 and 0.53, three traits have found enough phenotypic variation 
respectively. The same trends were also observed in our NCCP for ER, SPH, and GY traits. We trans-
for SPH and GY traits, and the prediction accuracies formed the data to the BLUEs (best unbiased linear 
were the highest when P < 0.1, i.e. 0.75 for SPH, and estimator) as phenotype values for the rrBLUP model 
0.60 for GY (Fig S5). assumptions (Fig. 2). The results of correlation anal-
ysis show that ER and SPH potentially impact GY 
under a drought environment. The phylogenetic tree 
Discussion shows our population has a rich genetic variation for 
improving drought resistance, although the two phy-
Drought tolerance, especially in the seedling stage, logenetic trees are not the same.
is a complex and inherent trait of maize (Wang et al. GS is an effective breeding tool for improving 
2016). The global climate has changed recently, complex traits in maize (Crossa et al. 2014). The GS 
resulting in drastic fluctuations in rainfall patterns can accelerate the genetic gain per unit time and unit 
and increasing temperature. Sudden climate changes cost by reducing the selection cycle time and the phe-
can cause significant economic losses worldwide. notyping cost when the prediction accuracy is high. 
There are signs of grain yield stagnation in maize, In the present study, a natural maize population geno-
especially in drought-stressed and semi-arid regions. typed with GBS and DArT markers were used to esti-
Fuxin County is a region providing a natural test site mate the genomic prediction accuracies of ER, SPH, 
located in the main maize producing area of North- and GY under drought stress environments. Results 
east China that has suffered years of drought. In this indicated that the prediction for the inbred line per-
region, we can deploy more large-scale experiments mance under a drought environment is tough through 
Vol.: (0123456789)
1 3
 154 Page 16 of 21 Euphytica (2022) 218:154
Table 4  Genomic prediction accuracies of ER, SPH, and GY traits were more than 0.8 in 2019, but the average 
in NCCP under natural drought conditions, estimated from the prediction accuracies were only slightly higher than 
DArT and GBS marker datasets with different levels of quality 
filtered with linkage disequilibrium (LD) that in 2020 and across 2 years. This implied that the 
prediction of drought resistance needs to accumu-
Number of Prediction accuracy late more year data to improve the heritability across 
markers
ER SPH GY years and achieve stable phenotype.
In the same trait, the prediction accuracy esti-
DArT-LD0.1 800 0.25 0.26 0.31 mated from the DArT markers was higher than that 
DArT-LD0.2 2681 0.27 0.24 0.33 estimated from the GBS marker dataset, and the dif-
DArT 7837 0.27 0.19 0.33 ference was significant. Both DArT and GBS are cost 
GBS-LD0.1 2333 0.02 − 0.07 0.23 efficient and high-throughput genotyping platforms 
GBS-LD0.2 8116 0.01 − 0.07 0.20 that can be used to implement GS. The GBS markers 
GBS 91,003 − 0.02 − 0.06 0.20 were implemented successfully in maize to improve 
DArT/GBS: The markers with an MAF < 0.05 and a missing multiple traits with different levels of genetic com-
rate > 20% were filtered out plexity (Crossa et al. 2013; Wang et al. 2020). Com-
LD0.2: On the basis of DArT/GBS filtering, it was filtered out 
according to linkage disequilibrium r 2 ≥ 0.2 paratively, only a few studies were reported using 
DArT markers to perform GS in wheat (Crossa et al. 
LD0.1: On the basis of DArT/GBS filtering, it was filtered out 
according to linkage disequilibrium r 2 ≥ 0.1 2016; Liu et al. 2020). We spent a similar amount of 
money on two genotyping platforms, about $30 per 
sample, to control the budget within an acceptable 
both GBS and DArT genotyping platforms. The aver- range for breeders. This price can obtain DArT mark-
age prediction accuracies of ER, SPH and GY in the ers with normal quality, but a little bit lower quality 
panel estimated using SNPs of DArT were 0.27, 0.19 for GBS markers. That means the genome coverage 
and 0.33, respectively, SilicoDArT markers were from GBS has not reached 1×, which is probably the 
slightly higher than the DArT SNPs; while the pre- reason why the prediction accuracies are not suit-
diction accuracies with GBS markers were 0.14, and able when using GBS markers in this research. Our 
0.20 (Fig. 7). Low prediction accuracy may be mainly research indicated that DArT genotyping platform is 
affected by the quality of markers and heritabilities. more suitable than GBS when working with a lower 
Our heritabilities and prediction accuracies for the budget. GS using DArT markers is also being imple-
GY trait were within the range of the previous study mented in CIMMYT maize breeding programs for 
that used 22 populations under water stress environ- improving grain yield and the other major agronomic 
ments (Zhang et  al. 2017a). Interestingly, unlike traits.
under normal conditions, the heritabilities of the three 
Fig. 8  The first two principal components of DArT and GBS datasets. a The first two principal components of DArT. b The first two 
principal components of GBS. Subgroup 1 is marked with black and subgroup 2 is marked with red
Vol:. (1234567890)
1 3
Euphytica (2022) 218:154 Page 17 of 21 154
Table 5  Prediction accuracies under different subgroups of DArT and GBS
Trait DArT GBS
All Subgroup1 Subgroup2 All Subgroup1 Subgroup2
ER 0.27 0.17 0.29 − 0.02 0.00 − 0.01
SPH 0.19 0.11 0.20 − 0.06 − 0.09 − 0.07
GY 0.33 0.11 0.33 0.20 0.21 0.22
This study showed that the average prediction doesn’t always work, and may be due to the fact 
accuracy was improved along with the TPS increas- that some complex traits require markers with spe-
ing in all the traits in both genotyping platforms. In cific effects. Also, there is a tradeoff between the 
contrast, the standard error was at first raised and number of markers and marker quality or popula-
then decreased as the TPS increased (Fig.  6). Rela- tion structure because marker quality decreases as 
tively high prediction accuracies with the slightest the number of markers increase in a specific marker 
standard error were observed in all traits when 50% dataset. This study suggests that genomic prediction 
to 60% of the total genotypes were used as a train- can also be performed using high-quality and low-
ing set. Results of this study are consistent with a density markers in the initial breeding stage under 
previous report (Guo et al. 2020) that can be applied the drought environment, especially when heritabil-
in the practical breeding program. The marker den- ity is low.
sity is an essential effect on prediction accuracy if The low-cost DArT platform increases the pos-
population sets and heritabilities are the same. Our sibility of integrating genomic selection strate-
research found that the prediction accuracy of about gies into practical breeding programs for small 
8000 markers from DArT is significantly higher than breeding companies and the private sector. The 
that of 90,000 markers from GBS, which emphasized SilicoDArT markers are possible to capture lost 
the importance of marking quality and indicated effects using a single reference genome for a 
8000 markers is enough to implement genomic pre- diversity-rich population, which is essential for 
diction in maize. Low-density marker populations challenging drought tolerance prediction. The 
appeal to GS because they decrease multicollinear- key to the genomic prediction of maize drought 
ity and computational time consumption and allow resistance is to improve heritability and stabilize 
more individuals to be genotyped for the same cost. the phenotype. Natural drought experimental sites 
However, this prediction was made in a breeding pro- can increase the heritability of planting materi-
gram’s initial stage, and higher marker density could als by accumulating data for many years without 
play a key role in the following breeding cycle for the a greenhouse. Fuxin county provides a chance to 
long term. As a mature genotyping platform, GBS study genomic prediction for drought tolerance. 
markers from the sufficient sequencing depth have a The SNPs significantly associated with traits can 
superb performance on genomic prediction (Liu et al. remarkably improve the prediction accuracy to 
2021). The higher marker density of GBS also has a drought tolerance by removing low effect or inva-
high application prospect in the multi-cycle breeding lid markers. This result is consistent with a previ-
process. ous study (Cerrudo et al. 2018). The GP accuracies 
We tried to increase the prediction accuracy obtained from the marker-trait associated SNPs 
by improving the marker quality and controlling were relatively higher than those obtained from the 
the population structure, but at the same time, genome-wide SNPs for most of the target traits, by 
the marker density was also reduced. Our results about 5–30% (Yuan et  al. 2019).New models for 
show that the prediction accuracy was improved integrating genotype-by-environment interaction 
in some cases (Tables  3, 4, 5), even if the density into genomic prediction needs to be further devel-
of the marker was reduced. However, this strategy oped to improve the selection of maize drought 
Vol.: (0123456789)
1 3
 154 Page 18 of 21 Euphytica (2022) 218:154
Fig. 9  Prediction accuracy 
of GS under different 
situations and number 
of markers at different 
significance levels using 
the SNP of GBS. “SM” 
represents the signifi-
cance markers, “ENSM” 
represents the randomly 
selected same amount of 
non-significant markers 
with the significance mark-
ers, “NSM” represents all 
non-significant markers 
excluding significant mark-
ers from genotype datasets 
and “ALL” represents the 
prediction of all markers 
under 60% of the training 
population. a and b respec-
tively represent the predic-
tion accuracy obtained by 
GS and number of markers 
at different significance 
levels by GWAS of ER 
trait. c and d respectively 
represent the prediction 
accuracy obtained by GS 
and number of markers at 
different significance levels 
by GWAS of SPH trait. e 
and f respectively represent 
the prediction accuracy 
obtained by GS and number 
of markers at different sig-
nificance levels by GWAS 
of GY trait
tolerance, although the genomic prediction using XD and JF revised the manuscript. All authors have read and 
significant markers has dramatically promoted the approved the final version of the manuscript.
prediction accuracy. Funding This work was supported by Shanghai Agri-
culture Applied Technology Development Program, China 
Author’s contributions AZ, SY, and ZC collected materials; (Z20190101), the National Science Foundation for Young Sci-
AZ, SC, DD analyzed data; AZ, SC, CL, JN, XD, YG, YZ and entists of China (31801442), the CIMMYT-China Specialty 
JF carried out the field experiments; HZ and YR designed the Maize Research Center (KF201802), the Natural Sciences 
study; SC, AZ and YL wrote the manuscript; XZ, JC, LZ, YR, 
Vol:. (1234567890)
1 3
Euphytica (2022) 218:154 Page 19 of 21 154
Foundation of Liaoning Provincial Department of Education Chen J, Zavala C, Ortega N, Petroli C, Franco J, Burgueño J, 
(LJKZ0657). Costich DE, Hearne SJ (2016) The development of qual-
ity control genotyping approaches: a case study using 
Data availability The datasets generated during and/or ana- elite maize lines. PLoS ONE 11:e0157236
lysed during the current study are available from the corre- Combs E, Bernardo R (2013) Accuracy of genomewide 
sponding author on reasonable request. selection for different traits with constant population 
size, heritability, and number of markers. Plant Genome 
Declarations 6:plantgenome2012-11
Crossa J, Beyene Y, Kassa S, Pérez P, Hickey JM, Chen C, de 
Conflict of interest The authors declare that they have no los Campos G, Burgueño J, Windhausen VS, Buckler E, 
conflict of interest. Jannink J-L, Lopez Cruz MA, Babu R (2013) Genomic 
prediction in maize breeding populations with genotyp-
Open Access This article is licensed under a Creative Com- ing-by-sequencing. G3 Genes Genomes Genet 3:1903
mons Attribution 4.0 International License, which permits Crossa J, Pérez P, Hickey J, Burgueño J, Ornella L, Cerón-
use, sharing, adaptation, distribution and reproduction in any Rojas J, Zhang X, Dreisigacker S, Babu R, Li Y, Bon-
medium or format, as long as you give appropriate credit to the nett D, Mathews K (2014) Genomic prediction in CIM-
original author(s) and the source, provide a link to the Crea- MYT maize and wheat breeding programs. Heredity 
tive Commons licence, and indicate if changes were made. The (edinb) 112:48–60
images or other third party material in this article are included Crossa J, Jarquín D, Franco J, Pérez-Rodríguez P, Bur-
in the article’s Creative Commons licence, unless indicated gueño J, Saint-Pierre C, Vikram P, Sansaloni C, Petroli 
otherwise in a credit line to the material. If material is not C, Akdemir D, Sneller C, Reynolds M, Tattaris M, 
included in the article’s Creative Commons licence and your Payne T, Guzman C, Peña RJ, Wenzl P, Singh S (2016) 
intended use is not permitted by statutory regulation or exceeds Genomic prediction of gene bank wheat landraces. G3 
the permitted use, you will need to obtain permission directly Genes Genomes Genet 6:1819
from the copyright holder. To view a copy of this licence, visit Crossa J, Pérez-Rodríguez P, Cuevas J, Montesinos-López 
http:// creat iveco mmons. org/ licen ses/ by/4. 0/. O, Jarquín D, de los Campos G, Burgueño J, González-
Camacho JM, Pérez-Elizalde S, Beyene Y, Dreisigacker S, 
Singh R, Zhang XC, Gowda M, Roorkiwal M, Rutkoski 
J, Varshney RK (2017) Genomic selection in plant breed-
References ing: methods, models, and perspectives. Trends Plant Sci 
22:961–975
Alvarado G, Rodríguez FM, Pacheco A, Burgueño J, Crossa Cui ZH, Dong HX, Zhang A, Ruan YY, He Y, Zhang ZW 
J, Vargas M, Pérez-Rodríguez P, Lopez-Cruz MA (2020) Assessment of the potential for genomic selec-
(2020) META-R: a software to analyze data from multi- tion to improve husk traits in maize. G3-Genes Genomes 
environment plant breeding trials. Crop J 8:745–756 Genet 10:g3.401600.402020
Bernardo R (2016) Bandwagons I, too, have known. Theor de los Campos G, Gianola D, Rosa GJM (2009) Reproducing 
Appl Genet 129:2323–2332 Kernel Hilbert spaces regression: a general framework for 
Beyene Y, Semagn K, Mugo S, Tarekegne A, Babu R, Meisel genetic evaluation. J Anim Sci 87:1883–1887
B, Sehabiague P, Makumbi D, Magorokosho C, Oikeh Desta ZA, Ortiz R (2014) Genomic selection: genome-
S, Gakunga J, Vargas M, Olsen M, Prasanna BM, Ban- wide prediction in plant improvement. Trends Plant Sci 
ziger M, Crossa J (2015) Genetic gains in grain yield 19:592–601
through genomic selection in eight bi-parental maize Dos Santos JPR, Pires LPM, de Castro Vasconcellos RC, 
populations under drought stress. Crop Sci 55:154–163 Pereira GS, Von Pinho RG, Balestre M (2016) Genomic 
Bradbury PJ, Zhang ZW, Kroon DE, Casstevens TM, Ram- selection to resistance to stenocarpella maydis in maize 
doss Y, Buckler ES (2007) TASSEL: software for asso- lines using DArTseq markers. BMC Genet 17:86
ciation mapping of complex traits in diverse samples. Elshire RJ, Glaubitz JC, Sun Q, Poland JA, Kawamoto K, 
Bioinformatics 23:2633–2635 Buckler ES, Mitchell SE (2011) A robust, simple genotyp-
Cao SL, Loladze A, Yuan Y, Wu YS, Zhang A, Chen JF, ing-by-sequencing (GBS) approach for high diversity spe-
Huestis G, Cao JS, Chaikam V, Olsen M, Prasanna BM, cies. PLoS ONE 6:e19379
San Vicente F, Zhang XC (2017) Genome-wide analysis Endelman JB (2011) Ridge regression and other kernels for 
of tar spot complex resistance in maize using genotyp- genomic selection with R package rrBLUP. Plant Genome 
ing-by-sequencing SNPs and whole-genome prediction. 4:250–255
Plant Genome 10:plantgenome2016-10 Fan Y, Zhou G, Sergey S, Chen Z, Cai S, Li C, Zhou M (2016) 
Carena MJ, Hallauer AR, Filho JM (2010) Quantitative Genome-wide association study reveals a new qtl for 
genetics in maize breeding. Springer, New York salinity tolerance in barley (Hordeum vulgare L.). Front 
Cerrudo D, Cao S, Yuan Y, Martinez C, Suarez EA, Babu Plant Sci 7:946
R, Zhang TS (2018) Genomic selection outperforms Guo R, Dhliwayo T, Mageto EK, Palacios-Rojas N, Lee M, 
marker assisted selection for grain yield and physiologi- Yu D, Ruan Y, Zhang A, San Vicente F, Olsen M, Crossa 
cal traits in a maize doubled haploid population across J, Prasanna BM, Zhang LJ, Zhang XC (2020) Genomic 
water treatments. Front Plant Sci 9:366 prediction of kernel zinc concentration in multiple maize 
Vol.: (0123456789)
1 3
 154 Page 20 of 21 Euphytica (2022) 218:154
populations using genotyping-by-sequencing and repeat Môro GV, Santos MF, de Souza Júnior CL (2019) Comparison 
amplification sequencing markers. Front Plant Sci 11:534 of genome-wide and phenotypic selection indices in maize. 
Hao Z, Lv D, Ge Y, Shi J, Weijers D, Yu G, Chen J (2020) Euphytica 215:76
RIdeogram: drawing SVG graphics to visualize and map Norman A, Taylor J, Edwards J, Kuchel H (2018) Optimising 
genome-wide data on the idiograms. PeerJ Comput Sci genomic selection in wheat: effect of marker density, popu-
6:e251 lation size and population structure on prediction accuracy. 
Heslot N, Yang H-P, Sorrells ME, Jannink J-L (2012) Genomic G3 Genes Genomes Genet 8:2889–2899
selection in plant breeding: a comparison of models. Crop Pereira WJ, de CastroRodriguesPappas M, Grattapaglia D, Pap-
Sci 52:146–160 pas GJ (2020) A cost-effective approach to DNA methyla-
Jiang B, Wang P, Zhuang S, Li MS, Li Z, Gong ZH (2018) tion detection by methyl sensitive DArT sequencing. PLoS 
Detection of maize drought based on texture and morpho- ONE 15:e0233800
logical geatures. Comput Electron Agric 151:50–60 R Core Team (2021) R: a language and environment for statisti-
Kang HM, Zaitlen NA, Wade CM, Kirby A, Heckerman D, cal computing, Vienna. https:// www.R- proje ct. org/.
Daly MJ, Eskin E (2008) Efficient control of population Resende MFR Jr, Muñoz P, Resende MDV, Garrick DJ, Fer-
structure in model organism association mapping. Genet- nando RL, Davis JM, Jokela EJ, Martin TA, Peter GF, Kirst 
ics 178:1709–1723 M (2012) Accuracy of genomic selection methods in a 
Kilian A, Wenzl P, Huttner E, Carling J, Xia L, Blois H, Caig standard data set of loblolly pine (Pinus taeda L.). Genetics 
V, Heller-Uszynska K, Jaccoud D, Hopper C, Aschen- 190:1503–1510
brenner-Kilian M, Evers M, Peng K, Cayla C, Hok P, Saitou N, Nei M (1987) The neighbor-joining method: a new 
Uszynski G (2012) Diversity arrays technology: S generic method for reconstructing phylogenetic trees. Mol Biol Evol 
genome profiling technology on open platforms. Methods 4:406–425
Mol Biol (clifton, NJ) 888:67–89 Tecle IY, Edwards JD, Menda N, Egesi C, Rabbi IY, Kulakow 
Kwong QB, Ong AL, Teh CK, Chew FT, Tammi M, Mayes S, P, Kawuki R, Jannink J-L, Mueller LA (2014) solGS: a 
Kulaveerasingam H, Yeoh SH, Harikrishna JA, Appleton web-based tool for genomic selection. BMC Bioinform 
DR (2017) Genomic selection in commercial perennial 15:398–398
crops: applicability and improvement in oil palm (Elaeis Thavamanikumar S, Dolferus R, Thumma BR (2015) Compari-
guineensis Jacq.). Sci Rep 7:2872 son of genomic selection models to predict flowering time 
Liu XG, Wang HW, Wang H, Guo ZF, Xu XJ, Liu JC, Wang SH, and spike grain number in two hexaploid wheat doubled 
Li WX, Zou C, Prasanna BM, Olsen MS, Huang CL, Xu haploid populations. G3 (bethesda) 5:1991–1998
YB (2018) Factors affecting genomic selection revealed by Tian Y, Guan B, Zhou DW, Yu JB, Li GD, Lou YJ (2014) 
empirical evidence in maize. Crop J 6:341–352 Responses of seed germination, seedling growth, and seed 
Liu C, Sukumaran S, Jarquin D, Crossa J, Dreisigacker S, Sansaloni yield traits to seed pretreatment in maize (Zea mays L.). Sci 
C, Reynolds M (2020) Comparison of array- and sequencing- World J 2014:834630
based markers for genome-wide association mapping and VanRaden PM (2008) Efficient methods to compute genomic 
genomic prediction in spring wheat. Crop Sci 60:211–225 predictions. J Dairy Sci 91:4414–4423
Liu Y, Hu G, Zhang A, Loladze A, Hu Y, Wang H, Qu J, Zhang Vivek BS, Krishna GK, Vengadessan V, Babu R, Zaidi PH, Kha 
X, Olsen M, San Vicente F, Crossa J, Lin F, Prasanna BM LQ, Mandal SS, Grudloyma P, Takalkar S, Krothapalli K, 
(2021) Genome-wide association study and genomic predic- Singh IS, Ocampo ETM, Xingming F, Burgueño J, Azrai M, 
tion of fusarium ear rot resistance in tropical maize germ- Singh RP, Crossa J (2017) Use of genomic estimated breed-
plasm. Crop J 9:325–341 ing values results in rapid genetic gains for drought tolerance 
Massman JM, Gordillo A, Lorenzana RE, Bernardo R (2013) in maize. Plant Genome 10:plantgenome2016.2007.0070
Genomewide predictions from maize single-cross data. Wang J, Zhang Z (2021) GAPIT version 3: boosting power and 
Theor Appl Genet 126:13–22 accuracy for genomic association and prediction. Genom 
Maulana F, Kim K-S, Anderson JD, Sorrells ME, Butler TJ, Liu Proteomics Bioinform 19:1–12
S, Baenziger PS, Byrne PF, Ma X-F (2021) Genomic selec- Wang X, Wang H, Liu S, Ferjani A, Li J, Yan J, Yang X, Qin F 
tion of forage agronomic traits in winter wheat. Crop Sci (2016) Genetic variation in ZmVPP1 contributes to drought 
61:410–421 tolerance in maize seedlings. Nat Genet 48:1233–1241
Meuwissen THE, Hayes BJ, Goddard ME (2001) Prediction of Wang N, Yuan Y, Wang H, Yu D, Liu Y, Zhang A, Gowda M, 
total genetic value using genome-wide dense marker maps. Nair SK, Hao Z, Lu Y, San Vicente F, Prasanna BM, Li X, 
Genetics 157:1819–1829 Zhang X (2020) Applications of genotyping-by-sequencing 
Michel S, Ametz C, Gungor H, Epure D, Grausgruber H, (GBS) in maize genetics and breeding. Sci Rep 10:16308
Löschenberger F, Buerstmayr H (2016) Genomic selec- Xiao YJ, Liu HJ, Wu LJ, Warburton M, Yan JB (2017) Genome-
tion across multiple breeding cycles in applied bread wheat wide association studies in maize: praise and stargaze. Mol 
breeding. Theor Appl Genet 129:1179–1189 Plant 10:359–374
Montesinos-López OA, Montesinos-López A, Pérez-Rodríguez Yu J, Pressoir G, Briggs WH, Vroh Bi I, Yamasaki M, Doebley 
P, Barrón-López JA, Martini JWR, Fajardo-Flores SB, JF, McMullen MD, Gaut BS, Nielsen DM, Holland JB, 
Gaytan-Lugo LS, Santana-Mancilla PC, Crossa J (2021) A Kresovich S, Buckler ES (2006) A unified mixed-model 
review of deep learning applications for genomic selection. method for association mapping that accounts for multiple 
BMC Genom 22:19 levels of relatedness. Nat Genet 38:203–208
Vol:. (1234567890)
1 3
Euphytica (2022) 218:154 Page 21 of 21 154
Yuan Y, Cairns JE, Babu R, Gowda M, Makumbi D, Magoroko- (2017b) Rapid cycling genomic selection in a multiparental 
sho C, Zhang A, Liu Y, Wang N, Hao Z, San Vicente F, tropical maize population. G3 (bethesda) 7:2315–2326
Olsen MS, Prasanna BM, Lu Y, Zhang X (2019) Genome- Zhang HH, Yin LL, Wang MY, Yuan XH, Liu XL (2019) Factors 
wide association mapping and genomic prediction analyses affecting the accuracy of genomic selection for agricultural 
reveal the genetic architecture of grain yield and flowering economic traits in maize, cattle, and pig populations. Front 
time under drought and heat stress conditions in maize. Genet 10:189
Front Plant Sci 9:1919
Zhang A, Wang H, Beyene Y, Semagn K, Liu Y, Cao S, Cui Z, Publisher’s Note Springer Nature remains neutral with regard 
Ruan Y, Burgueno J, San Vicente F, Olsen M, Prasanna BM, to jurisdictional claims in published maps and institutional 
Crossa J, Yu H, Zhang X (2017a) Effect of trait heritability, affiliations.
training population size and marker density on genomic pre-
diction accuracy estimation in 22 bi-parental tropical maize 
populations. Front Plant Sci 8:1916
Zhang XC, Pérez-Rodríguez P, Burgueño J, Olsen M, Buckler E, 
Atlin G, Prasanna BM, Vargas M, San Vicente F, Crossa J 
Vol.: (0123456789)
1 3