Publisher: AGRONOMY; Journal: AGROJNL:Agronomy Journal; Copyright: Will notify... Volume: Will notify...; Issue: Will notify...; Manuscript: aj-2017-08-0123-a; DOI: ; PII: TOC Head: ; Section Head: ; Article Type: ARTICLE This article has been accepted for publication and undergone full peer review but has not been through the copyediting, typesetting, pagination and proofreading process, which may lead to differences between this version and the Version of Record. Please cite this article as doi: 10.1002/csc2.20003. This article is protected by copyright. All rights reserved. Page 1 of 42 Running Title Candidates for training populations and implication on accuracy Title Genomic prediction and QTL discovery in a cassava training population constructed from multiple breeding stages Authors Mohamed Somo 1* Heneriko Kulembeka 2 Kiddo Mtunda 2 Emmanuel Mrema 2 Kasele Salum 2 Marnin D. Wolfe 1 Ismail Y. Rabbi 3 Chiedozie Egesi 3 Robert Kawuki 4 This article is protected by copyright. All rights reserved. Alfred Ozimati 4 Roberto Lozano 1 Jean-Luc Jannink 1, 5 Affiliations (1) Dept. of Plant Breeding and Genetics, Cornell University., Ithaca, NY, USA (2) Tanzania Agricultural Research Institute, Kibaha and Mwanza, Tanzania (3) International Institute for Tropical Agriculture, Ibadan, Oyo, Nigeria (4) National Crops Resources Research Institute, Namulonge, Uganda (5) USDA--ARS, Ithaca, NY, USA Abbreviations Please list abbreviations in alphabetical order with the abbreviation first, separated from its definition by a comma. Please use semicolons to distinguish separate abbreviations. BLUPs, Best Linear Unbiased Predictors GBS, Genotyping-by-Sequencing GEBVs, Genomic Estimated Breeding Values GS, Genomic Selection This article is protected by copyright. All rights reserved. MAF, Minor Allele Frequency MAP, Months After Planting PCA, Principal Component Analysis SNP, Single Nucleotide Polymorphism TP, Training Population This article is protected by copyright. All rights reserved. Abstract Assembly of training population (TP) is an important component of effective genomic selection (GS) based breeding program. In this study, we examined the power of diverse germplasm assembled from two breeding programs in Tanzania, at different breeding stages to predict traits and discover QTL. This is the first GS and Genome-wide association study (GWAS) analysis on Tanzanian cassava data. We detected QTL associated with resistance to Cassava mosaic disease (CMD) on chromosomes 12 and 16, QTL conferring resistance to Cassava brown streak disease (CBSD) on chromosomes 9 and 11, and QTL on chromosomes 2, 3, 8, and 10 associated with resistance to CBSD for root necrosis (CBSDR). We also detected a QTL on chromosome 4 and two QTL on chromosome 12 conferring dual resistance to CMD and CBSD. Using clones in the same breeding stage to construct TP provided higher trait prediction accuracies than using TP with mixture of clones from multiple breeding stages. Moreover, clones in the early breeding stage provided more reliable trait prediction accuracies and therefore are better candidates for constructing TP. Although larger training population sizes have been associated with improved accuracies, in this study, adding clones from Kibaha location to Ukiriguru and vice versa did not improve prediction accuracy of either population. Similarly including Uganda TP in either population did not improve trait prediction accuracies. This study applied genomic prediction to understand the implication of constituting TP using clones at different breeding stages and pooled from different locations on trait accuracies in cassava. This article is protected by copyright. All rights reserved. Introduction Cassava is an important source of dietary calories for millions of people in tropical regions (Howeler et al., 2013; Salvador et al., 2014). Because of cassava‟s adaptation and resiliency to dry and marginal environments, it is likely to continue to be a lead contributor to food security for many. Traditionally, cassava breeders have used phenotypes for selection, but this strategy has not always generated adequate genetic gain for yield especially in Africa (Ceballos et al., 2012). Similarly, the use of marker assisted selection (MAS) on complex traits in cassava has remained largely ineffective because of the multiple minor loci which are difficult to detect and deploy (Ceballos et al., 2012). Genomic selection (GS) allows superior plants to be selected earlier on at seedling stage before phenotyping is started, and then used for crossing (Heffner et al., 2009; Lorenzana and Bernardo, 2009; Jannink et al., 2010). In principle, adoption of GS is expected to increase the rate of genetic gain and also reduce selection cycle time. For the case of clonal plants, the added advantage is that once elite clones are identified, they are genetically fixed, and thus can immediately be advanced to downstream evaluation for faster variety replacement. Recent implementation of GS in three breeding programs in Africa have improved trait prediction accuracies. Wolfe et al. (2017) reported a prediction accuracy increase of 57% for CMD in Nigerian cassava population. Similarly, Kayondo et al. (2017) reported improved prediction accuracy of 0.42 for both CBSD foliar severity and root severity in two Ugandan cassava populations. This article is protected by copyright. All rights reserved. In principle, successful implementation of GS program hinges on several critical factors, one of which is how TP is constituted. Mindful of breeding objectives, breeders can select individuals from available diverse germplasm, improved breeding lines, within and across breeding stages to construct TPs. They could also compose them using clones from different breeding programs or even clones from different countries. Most early crop GS studies used genotypes from the same breeding stages to form TPs. These individuals were either tested in replicated trials in multiple environments or were part of larger historical phenotypic data (Dawson et al., 2013; Storlie and Charmet. 2013; Ly et al., 2013; Rutkoski et al., 2015). Often breeding programs transitioning to GS-based approach may not have access to sufficient and diverse genetic pool and adequate historical data to choose TP candidates from. For the case of clonal crops, the inadequacy of planting materials of potential candidates could further reduce the number of genotypes available for selection. In GS breeding, large TP size is generally preferred because it is associated with better trait prediction accuracies (Zhong et al., 2009; Cericola et al., 2017). Recently, The Tanzania Agriculture Research Institute (TARI) transitioned to GS-based cassava breeding pipeline. Traditionally, TARI has maintained two distinct cassava breeding programs at Chambezi, Kibaha (KIB) and Ukiriguru (UKG), Mwanza, serving Coastal and Lake Zone parts of the country, respectively. The two breeding programs evolved independently due to restrictions on germplasm movement between regions to curb the spread of cassava brown streak disease (CBSD). Although the breeding objectives are quite similar between the two programs, cassava breeders at KIB largely selected for fresh root yield while those at UKG focused mainly on dry matter content. This dichotomy of trait This article is protected by copyright. All rights reserved. selection was largely driven by the consumption pattern of the local communities in the respective regions. The TARI TPs were constructed using clones from different trial types. The KIB trials included clones at preliminary yield trial (PYT) and advanced yield trial (AYT) selection stages while those at UKG consisted of clones at clonal evaluation trial (CET) and PYT. In a breeding pipeline, CET genotypes are minimally selected in seedling nurseries before cloning, whereas PYT and AYT would have undergone one and two specific selection rounds, respectively. In the present study, each breeding cycle represented several trials (Table 1). Each trial had different plot sizes, replications, and different number of plants per plot. Additionally, there were no common checks for the trials across the two programs. Because there were few clones in each trial to form sizeable TP, clones from different trial types, breeding stages, and locations were pooled together to form TPs. The Uganda (UG) training population was added to the clones from the TARI two programs to evaluate whether their inclusion could improve trait prediction accuracies. There is limited knowledge on the implication of using cassava clones from multiple trials, multiple locations, and clones from multiple breeding stages to construct TP on trait predictions. This study was therefore designed to: (1) examine which combination of breeding materials could provide the best training set, (2) evaluate trait prediction accuracies for within and across breeding stages, locations, and combined TARI data sets, (3) assess the impact of adding Ugandan clones to clones of either program on trait predictions, and (4) identify loci associated with disease resistance particularly for CMD, CBSD, and CBSDRS in Tanzanian cassava. The findings of this study will guide breeders in selecting the most suitable candidates as well as the best source of TP candidates when initiating GS-assisted breeding. We also expect the markers identified in this study will be valuable addition for MAS breeding in cassava. This article is protected by copyright. All rights reserved. Materials And Methods Germplasm and field design A diverse panel of breeding lines from two independent breeding programs were used to construct the training population. The panel comprised of 432 and 408 clones from Kibaha and Ukiriguru breeding programs, respectively. Kibaha clones were at PYT (PYTKIB) and AYT (AYTKIB) breeding stages while those from Ukiriguru were at CET (CETUKG) and PYT (PYTUKG). The KIB clones were drawn from nine PYTs and six AYTs while UKG clones consisted of two CETs and a PYT. Clones in each trial type arose from different progenitors and as such each trial in both programs had a unique set of individuals, varied number of replicates, and different plot sizes (Table 1). An additional 402 clones from Uganda (UG) training population previously described by Wolfe et al. (2016) were included in the study. For each trial set, clones were evaluated in a randomized complete block design during the 2016/2017 cropping season at Kibaha (lat/lng: 6.8138° S, 38.6949° E, Alt: 156 m above sea level) and Ukiriguru (lat/lng: 2° 43' 0" S, 33° 1' 0" E, Alt: 1194 m above sea level). Kibaha PYT trials were laid as a 2-row plot with 14 plants while AYT trials were 42 plants planted in 6-row plots. The CET plots at UKG consisted of a single-row of 5 plants while the PYT were 2 row plots with 10 plants. At both locations, the clones were spaced at 1 m x 1m. With exception of two AYT trials which were replicated three times, the rest of the trials in both locations were replicated twice (Table 1). For all KIB trials, Kiroba, Mfaransa, and Mkuranga1 were used as common checks while Mkombozi, Lwakitangaza, and Liongokwimba were the checks in UKG trials. For both locations, standard agronomic practices under rainfed cropping were applied. All the field management and phenotyping took place between March 2016 and May 2017. This article is protected by copyright. All rights reserved. Phenotyping Nine traits were evaluated for this study. Cassava mosaic disease severity (CMDS), foliar cassava brown streak disease severity (CBSDS), and cassava green mite severity (CGMS) were scored at 3, 6, and 9 months after planting (MAP) on a scale of 1 (No symptoms) to 5 (severe symptoms). For analysis we used the season wide mean severity (MCMDS, MCBSDS, and MCGMS) for 3, 6, and 9 MAP. For root necrosis severity (CBSDRS), necrotic symptoms on a scale of 1-5 were scored 12 MAP using cross sections of root (Hillocks and Thresh, 2000), where 1 = no necrosis and 5 is > 25% necrotic and severe root constriction. We then used average disease scores for each experimental plot for analysis. The root number (RTNO) are the number of fresh roots harvested per plot and used to calculate RTNO per hectare. Root weight (RTWT) and shoot weight (SHTWT) were both measured in kg per plot and then used to calculate below and above ground yield in tons per hectare, respectively. Harvest index (HI) was the ratio between RTWT and total biomass (RTWT+SHTWT) per unit area. Dry matter (DM) was expressed as a percentage of fresh root (FYLD) weight using the specific gravity method. The specific gravity of each sample (of the whole root, including the peel) was determined from the weights in air and in water (Kawano et al., 1987). The experimentation and phenotypic data capture of UG population is as described previously (Wolfe et al., 2016; Ozimati et al. 2018; Kayondo et al., 2018). Eight variables were common to both UG and TARI populations except MCGMS which is specific to TARI. Genotyping SNP marker genotypes were obtained using genotyping-by-sequencing (GBS; Elshire et al., 2011). The markers were called using the TASSEL 5.0 GBS pipeline version 2 (Glaubitz et al., 2014) after aligning the resulting reads to the Manihot esculenta Version 6 assembly available at This article is protected by copyright. All rights reserved. http://phytozome.jgi.doe.gov/. Genotype calls were accepted only when there was a minimum of two reads, otherwise the genotype was set to missing and imputed downstream. The markers were initially converted to a matrix of allele dosages, with REF/REF, REF/ALT, ALT/ALT genotypes coded as -1, 0 and +1, respectively. The GBS data were filtered so that clones with >80% missing markers and markers with >60% missing genotype calls were removed. Beagle version 4.1 was used for imputation of the data (Browning and Browning, 2009). After imputation we retained 121,246 bi- allelic SNP markers with AR2 (Estimated Allelic r-squared) threshold higher than 0.3. Of the imputed markers, 116,837 markers with minor allele frequency (MAF) higher than 0.01 were retained. These markers were then used for genomic prediction. Similarly, a total of 88,434 markers were retained after filtering with MAF threshold higher than at 0.05 and used for GWAS analysis. 37,776 markers common to both UG and TARI populations were filtered for minor allele frequency (0.01) and the resulting 36,847 markers were then used for downstream joint analysis. Statistical analysis Estimation of observed values To estimate these observed values, we fitted a mixed linear model across trial types and locations using lme4, R package (R Development Core Team 2010; Vazquez et al., 2010; Bates et al., 2015). For trials within locations, we fitted the model: y ZTrial (Trial: Rep)r , where y is raw phenotypic observations, β included a fixed effect for the population mean and for plot-basis traits (FYLD, RTNO, and SHTWT), the proportion of plants harvested per plot was included as a covariate; vector and corresponding incidence matrix represented a random effect for clone This article is protected by copyright. All rights reserved. where ( ) and I represented the identity matrix. The incidence of replication was nested in trial and was represented by matrix ZTrial (Trial.Rep) and random effects vector such that r ( ) and residuals were considered to be distributed as ( ). The model for combined analysis across locations was: y Zrep (Loc.Trial. Rep)b . The fixed and clone effects for combined across location were the same as those for each location described above. Trial and rep effects were nested in locations and incorporated as random with incidence matrix Zrep and the effects vector b ( ). All the random effects in each of the models are independent and identically distributed (iid). For both models described above, the best linear unbiased predictors (BLUPs) for clones were extracted and used as the “observed values” that were correlated with genomic estimated breeding values (GEBVs) to determine trait prediction accuracies. Under combined location model, our emphasis was on prediction of new genetic materials from another breeding program, and therefore we considered the environment effects to be nuisance parameters. Since we were not interested in location, trial or rep effects per se, but instead interested in genomic estimated breeding values (GEBVs). We assumed the three-way interaction of location, trial and rep is sufficient to correct for these nuisance terms, and no main effects are required to be fit as they are simply linear combinations of the interaction predictors. We also considered the issue of heterogeneity of error variances for different trials across the two locations. Although error variance estimates did differ across trials, those differences had little effect on the estimated BLUP values (the correlation between BLUPs with and without heterogeneous error variance were between 0.92 and 0.96 except for fresh yield (FYLD) which had reduced BLUP correlations (~0.48 data not shown). To overcome the challenges and complexity of modelling for heterogeneity of error variances particularly for yield, we did fit model for each trial type in each location separately and then extracted BLUPs from each experiment and then used combined BLUPs to correlate with genomic estimated breeding values to determine prediction accuracies. For the rest of the trait, because of the large number of trials (18), we felt that fitting an error variance for each trial This article is protected by copyright. All rights reserved. would result in too many parameters to estimate, and therefore we used a simpler homogenous error variance model. Variance Components, Heritability and Trait Correlation The variance components were extracted from the model used for the trials within location. The model statement is as described above under “Estimation of observed values.” Accordingly, broad sense heritability (H 2 ) was then computed for each trial using ( )⁄ where = clone variance, and = residual variance. The phenotypic and genotypic correlations between the nine traits were obtained using combined data from the two locations. We also assessed trait phenotypic correlations using data from each breeding stage in each program; UKG (CETUKG & PYTUKG), and KIB (PYTKIB & AYTKIB) to determine the consistency of trait correlations across different breeding stages. For the combined locations, we used raw plot data to estimate phenotypic correlations, without accounting for the experimental designs. For estimating the genetic correlations, we accounted for the trial, location, and replication effects as described above. We then extracted and used BLUPs to estimate genetic correlations. The estimates were plotted (Fig. 2) in R 3.4.1 (R Core Team 2017) using the „corrplot‟ package (Wei, 2013). For all the cases, correlation values were considered significantly different from zero at P value ≤ 0.05. Genomic Prediction To perform genomic prediction, we fitted separate genomic best linear unbiased prediction (GBLUP) model as for within breeding stages, across breeding stages, and across breeding programs. where y is a vector of the raw phenotype is a vector of fixed effect (different for different This article is protected by copyright. All rights reserved. scenarios) with design matrix X, the vector g is the random effect representing GEBV for each individual, Z is a design matrix linking observations to genomic values, and is a vector of residuals. For within breeding stages, the fixed is the grand mean, while the fixed effects for across breeding stages were grand mean and trial. For across breeding programs the fixed effect included trial and location. The GEBV was obtained assuming ( ), where the additive genetic variance and K is the square, symmetric genomic relationship matrix based on SNP markers. The genomic relationship matrix was constructed with the function A.mat in the R package rrBLUP (Endelman, 2011) that used VanRaden‟s (2008) method 1. The GBLUP predictions were made with the function emmreml in the R 3.1.0 package EMMREML (Akdemir and Okeke, 2015). Assessment of Prediction Accuracy In order to assess prediction accuracy, we used 30 replicates of 5-fold cross-validation. For each replicate, we divided the population randomly into five equal and mutually-exclusive subsets or folds. We then trained the prediction model on four (training sets) of the five folds to predict the fifth (validation set). For scenarios where we used one location to predict another location, our training and test sets were fixed and therefore were not replicated. The following prediction scenarios were explored in our analysis: (1) within breeding stages in each location: CETUKG, PYTUKG, PYTKIB, and AYTKIB, (2) combined locations (UKG+KIB), (3) within locations: prediction of UKG alone (CETUKG + PYTUKG) and KIB alone (PYTKIB + AYTKIB), (4) inter-location prediction: UKG to predict KIB and KIB to predict UKG, and UKG+KIB to predict UKG and vice versa. We further used UKG+UG to predict UKG, KIB+UG to predict KIB and UKG+KIB+UG to predict UKG and KIB. This was done to determine if adding UG clones to either TARI program would improve trait prediction accuracies. For scenarios 3 and 4, we maintained a fixed test set (20%) picked randomly from the population under prediction in each iteration. For each prediction, accuracies were computed as the Pearson correlation coefficient between the GEBV predicted for the test set and the corresponding estimated observed breeding values; BLUPs. Correlation values were considered significantly different from zero at P value ≤ 0.05. This article is protected by copyright. All rights reserved. Population structure To visualize population structure, principal component analysis (PCA) using common markers in the three breeding programs; KIB, UG, and UKG was performed. The prcomp function in R was used to generate PCs. The first two PCs were then used for plotting to visualize structure between the three populations (KIB, UKG and UG). Thereafter the loadings (eigenvector coefficients) for all the markers on 18 chromosomes on PC1 and PC2 were assessed to determine the markers contributing to the greatest variations in TARI alone and combined TARI and UG in the respective PCs. Genome-Wide Association Study GWAS was performed on different subsets of TARI accessions. The subsets included trial types (CET, PYT, AYT), locations (KIB and UKG), and four clusters generated based on marker kinship matrices of the individuals. We used these sets of data combination because we suspected that certain chunks of data could provide better results than others. Clustering was considered because we thought it is an objective way to find population structure. By doing GWAS within each cluster we were hoping to avoid population structure and find better allele frequencies. In implementing GWAS, we used BLUPs extracted from the linear mixed model described under estimated observed breeding values section as phenotypes. For each trial, the GWAS was carried out using genome-wide complex trait analysis tool (GTCA, Yang et al., 2011). This followed a leave one chromosome out approach in which the chromosome with the tested candidate SNP markers is excluded from the genomic relationship calculation. The linear mixed model: with var (y) = V = Kσ2g + Iσ 2 ɛ, were fitted for each case. Here is an n × 1 vector of phenotype (BLUPs) with n being the sample size, β is a vector of fixed effect (genetic marker information), g is an n × 1 vector of the total genetic effects of the individuals with g∼ N (0, Kσ2g), K is the genetic relationship matrix between individuals, same as symmetric genomic realized relationship matrix based on SNP markers, I is an n × n identity matrix, and ɛ is a vector of residual effects with ɛ ∼ N (0, Iσ2ɛ). This article is protected by copyright. All rights reserved. Bonferroni threshold was applied to correct for multiple testing for each data set used for GWAS. The cut off was computed as −log 10(α/t*n) where α is 0.05, which is the standard significance threshold, t is the number of different subset of data, and n is the number of SNPs. In this study, the Bonferroni correction significance cut off was − log10 (0.05/15*88,434) = 7.42. We used the marker-wised P value of 0.001 as the threshold to declare significant market trait association (MTA) and the false discovery rate (FDR) corrected P value of 0.2 to select the highly significant MTA. QTL boundaries were defined with linkage disequilibrium (LD) pairwise correlation coefficient r2 ≥ 0.7 coupled with the MTA information. We chose the tagging marker for each QTL with the criteria of the smallest marker-wised P value. Manhattan and QQplots plots were generated using R package ggplots2 (Wickham et al., 2016) with customization for joint Manhattan-QQplot display using R package ggpubr (Kassambara, 2018). We used the M. esculenta_305_v6.1 available in Phytozome to report presence of annotated genes that are related to plant defense system within the QTL region. Results and Discussion Population Structure We performed principal component analysis to assess structure within and among the UKG, KIB, and UG populations (Fig.1). This is necessarily to determine how individuals from these programs are related to each other in order to understand the effect of germplasm sharing between Uganda and Tanzania. It will also be useful to understand whether the existing restrictions on clone movement between KIB and UKG might have facilitated population differentiation within Tanzanian germplasm. In the combined data (UKG+KIB+UG), the first and second PC accounted for 11% and 5% of the genetic variation, respectively. In TARI (KIB+UKG), PC1 and PC2 explained 7% and 5% of the variation, respectively. Some population stratification within each breeding program was observed (Fig. 1). This article is protected by copyright. All rights reserved. Three clusters of clones were evident arrayed along PC1 (Fig. 1b). The loadings of markers on PC1 (Fig. 1d) showed a strong effect of chromosome 1 on this component, suggesting that the clusters are due to the known introgression from M. glaziovii on this chromosome (Bredeson et al., 2016). This clustering along PC1 has been observed previously in Ugandan cassava (Ozimati et al., 2019) and we suggest that it depends on the dosage of the chromosome 1 introgression, either 0, 1, or 2 copies. Interestingly, PC2 (Supplementary Fig. 1) strongly separated UKG from KIB while the UG accessions were intermediate between these two. When analyzing TARI alone, differences between UKG and KIB aligned along PC1 (Fig. 1a) whereas the chromosome 1 introgression aligned along PC2 (Fig. 1c, Supplementary Fig. 1). This analysis may also provide valuable information on the role of introgression in driving population stratification. Despite, Tanzania being the presumed origin of chromosome introgressions in the 1930s, (Hahn et al., 1979, 1980; Fauquet et al., 1990), based on this analysis, this introgression may have been subjected to more recombination events in Tanzanian population (Fig 1:c &d). The Uganda population was constituted in 2011 by combining progenitors sourced from CIAT, IITA, Tanzanian, and a few clones from within Uganda (Kawuki, 2019 personal communication). Because of this recent constitution of UG TP, it is possible that the UG introgression is newer than Tanzania resulting in differentiation between sub populations with introgression versus ones without it. However, further investigation is needed to validate these hypotheses. Trait Correlations This article is protected by copyright. All rights reserved. Trait phenotypic and genetic correlations were quite similar (Fig. 2; Supplementary Fig. 2). Phenotypic and genotypic correlations of the yield traits were moderately positive except for DM. The highest correlation was between FYLD and SHTWT, followed by RTNO and SHTWT, and RTNO and FYLD, (r=0.6, r=0.5, r=0.4, respectively). These values are only somewhat lower than previous findings; Kundy et al. (2014) reported a strong positive correlation between RTNO and FYLD (r=0.7), but fewer clones were tested which could have skewed their findings. In addition, effects associated with locations and trial types in the present study might have contributed to the observed deviations. Generally, the positive correlations of FYLD, RTNO, and SHTWT could be exploited for simultaneous trait improvements. We also observed a slight positive genetic correlation between MCMDS and MCBSDS (r=0.3) but a very low association between MCBSDS and CBSD root necrosis which agrees with findings in other studies (Rwegasira and Rey, 2012; Ozimati et al., 2019). The positive relationship between these two foliar diseases provides an opportunity to increase resistance against both concurrently. MCGMS showed weak negative correlations with MCMDS, SHTWT, and FYLD (r=-0.3, -0.3, -0.4, respectively). The negative correlation between MCGMS and FYLD corroborates Chipeta et al., (2008) result (r= -0.53). Breeders should keep these negative associations in mind and incorporate sources of resistance when breeding for yield. Assessments of trait correlations in each breeding stage showed similar patterns observed in combined data sets from across breeding stages. However, we also noticed slight improvement in some trait although we think it could just be noise in the data (Supplementary Fig. 2). Variance Components and Broad-Sense Heritability Estimates This article is protected by copyright. All rights reserved. Variance components and broad-sense heritability estimates for each of the nine traits are in Table 2. Generally, KIB trials yielded the largest genetic variance estimates for MCMDS, CBSDRS, FYLD, and SHTWT while UKG trials had the largest estimates for MCBSDS and RTNO. Conversely, both FYLD and SHTWT had the largest residual estimates in the AYTKIB and UKG trials, respectively. Errors associated with CBSDRS, MCGMS, RTNO, HI, and DM in AYTKIB were lower than other trials. Meanwhile both genetic and error estimates for RTNO in the UKG trials were higher than in the KIB trials. Additionally, HI and MCBSDS had similar error estimates across all the trials. Heritability estimates ranged between 0.06 and 0.91. Based on cassava heritability classification, only HI in AYTKIB, MCBSDS in PYTUKG, and MCMDS in both AYTKIB and PYTUKG could be classified as high (>0.50) (Bhateria et al., 2006). Other researchers reported higher heritability for MCMDS, MCBSDS, and HI in different cassava populations (Wolfe et al., 2017; Kanyondo et al., 2018; Adeniji et al., 2011). High heritability estimates for MCMDS and MCBSDS could be associated with varying levels of resistance in Tanzanian accessions owing to specific responses to various strains of CBSVs and CGMs. In the Lake Zone, there is a high prevalence of Africa cassava mosaic virus, Eastern African cassava mosaic virus (EACMV), the Ugandan variant of cassava mosaic virus, cassava brown streak virus (CBSV), and the Ugandan variant cassava brown streak virus. However, only EACMV and CBSV are known to occur in the Eastern Zone; includes Kibaha (Raya et al., 1993; Jeremiah et al., 2015). Extensive pathogen variation within and between the two regions of Tanzania could have caused the variability detected in the germplasm. Significant differences in the HI were reported among cassava cultivars (Kawano et al., 1978). There is intensive selection against clones with heavy branching in early generations. Clones with fewer branches are known to have low HI. It appears from this data that the clones from AYTKIB trials were heavily selected for low branching and consequently resulting in higher heritability (h2=0.80). The CBSDRS, SHTWT, FYLD, and DM in the CETUKG trials and MCBSDS and DM in the KIB trials, all had medium heritability estimates. Similar estimates were reported in other cassava This article is protected by copyright. All rights reserved. populations (Wolfe et al., 2017; Kayondo et al., 2018; Ozimati et al., 2019). We observed low FYLD heritability in AYTKIB than in any other trials. We can only speculate that intensive selection for disease resistance and DM in early breeding stages could have limited and reduced the amount of FYLD genetic variation in advanced lines (AYTKIB). Therefore, we need to balance between trait selection preference and genetic variation in breeding stages. For most traits, CET trials with a smaller plot size (5m 2 ) had lower error variance and a better heritability than trials with medium and large plots (e.g. 14m 2 and 20m 2 ). High trait genetic variances were reported in sugarcane clonal evaluation trials at Louisiana State University Agricultural Center, but the heritability estimates in different plot sizes were similar (Milligan et al., 2007). The substantial increase in genetic variance and trait heritability in smaller plots than larger plot sizes needs further investigations. This will help establish optimal plot sizes for cassava breeding for better prediction accuracies. Generally, trait heritability estimates varied across trials similar to the findings of Wolfe et al. (2017), Kayondo et al. (2018), and Ozimati et al. (2019). Confounding effects (locations and trials), environmental condition differences, trait selection priority in each of the TARI breeding programs, and differences in data collection between the two programs could have affected heritability estimates. Genomic prediction Within Breeding Stage Prediction Within breeding stage prediction accuracies for all the traits in CETUKG, PYTUKG, PYTKIB and AYTKIB were 0.23, 0.20, 0.15, and 0.20, respectively (Table 3). The accuracies within the breeding stages were lower than accuracies from other cassava populations for most traits except DM and FYLD (Wolfe et al., 2017). In this study, the use of data from CETUKG stage significantly improved prediction accuracies for all the traits except MCMDS and CBSDRS compared to PYTUKG, PYTKIB, and AYTKIB data sets. Overall prediction accuracies obtained from cross-validations were This article is protected by copyright. All rights reserved. higher for HI and FYLD when Individuals in CETUKG and PYTUKG were used for model training. On the contrary, the two breeding stages evaluated at Kibaha produced markedly lower prediction accuracies for most traits except for MCGMS and DM. Interestingly, there was consistent prediction of FYLD (~35%) in both CETUKG and PYTUKG. The stability in yield prediction in these breeding cycles at Ukiriguru suggests that breeders can opt to select candidates for training population from either cycle particularly when the interest is yield. An added advantage of equal prediction in CETUKG and PYTUKG is that clones can directly be transferred CET to AYT stage for evaluation. Skipping of PYT stage could help accelerate the varietal replacement process. Dry matter prediction accuracy improved significantly by 53% when clones in advanced Kibaha trials (AYTKIB) were used to model training than those in PYTKIB. Comparison between breeding stages in the two locations showed high FYLD prediction accuracy in CETUKG and a high DM in AYTKIB. One would have expected opposite results because of trait selection preferences between the two locations as well as the larger plot size associated with AYTs. We are not certain why clones in early generation evaluated in smaller plot s had higher yield than clones in advanced generation evaluated in larger plots. The difference could be attributed to high genetic variation and high heritability estimates for CET clones than AYT clones. We also observed that using clones from the same breeding stage for TP gave higher prediction accuracy estimates than clones aggregated from multiple breeding stages. Our estimates agree with those found by Hofheinz et al. (2012) using data from two consecutive breeding cycles of sugar beets and Michel et al. (2016) study on genomic selection using multiple breeding cycles in bread wheat. Ceballos et al. (2016) recommends the used of phenotypic information from clones in advanced breeding stage during genomic selection because of their “stable” genotypic performance, in this This article is protected by copyright. All rights reserved. study we observed that phenotypic records of clones in early breeding stage predicted test set better for most traits than advanced ones. From this study we suggest that consideration should be given to the within the breeding stage clones particularly from early generation when forming training populations. Additionally, CET clones are generally tested in smaller plots because of limited number of planting stakes. The use of smaller plots for trait evaluation can further help reduce phenotyping cost. Additional investigation using different cassava population is needed to validate our results. Within and Cross Location Prediction Using one location data to predict the performance in an independent location can help accelerate breeding process. We evaluated across-location prediction and compared the trait prediction accuracy estimates to the within location prediction accuracy estimates. Except for MCGMS, the cross location predictions were generally low, averaging between 0.10 and 0.14 when KIB was used to predict UKG and vice versa, respectively (Table 4). MCGMS was the highest and most consistently predicted trait for both within and between locations. Adding the two populations together to increase the training population size did not improve trait prediction accuracies. We also noticed that the use of KIB to predict UKG reduced accuracies for CBSDRS, RTNO, SHTWT, HI, and FYLD compared to the within-UKG scenario. The unrelatedness of materials across cycles, confounding effects (trials and locations), and variable heritability across locations could have caused low accuracies. Perhaps existing quarantine between the two TARI programs associated with subtle population differentiation (Fig.1) as well as the absence of common checks between trials may have caused low trait prediction estimates between locations. Reduced common ancestors overtime results in reduced kinship, which reduces accuracy (de los Campos et al., 2013). Lorenz and Smith. (2015) reported a decrease in prediction accuracy when unrelated lines are added to the calibration set. Similarly, Song et al. (2017) reported a decrease in This article is protected by copyright. All rights reserved. prediction accuracy when predicting yield across cycles compared to within cycles in wheat. Our results are also in agreement with Ly et al. (2013) findings. They observed a decrease in prediction accuracies across locations in cassava. Therefore, based on this population, cross-location prediction may not be useful for programs implementing GS. Although generally, it would be useful to implement GS across locations, using populations with structure and weak SNP-QTL LD associations across populations could limit GS-assisted breeding in cassava and other species. This limitation has also been observed in livestock, Hayes et al. (2009), reported that pooling animals from different populations did not improve trait predictions due to non-persistent association between SNPs and QTL across breeds (populations). In our study we did not find consistent relationship between heritability and prediction accuracy estimates across breeding cycles. In fact, we observed same accuracies for FYLD in both CET and PYT at Ukiriguru. Other studies have reported positive relationship between heritability and prediction accuracy across breeding cycles. For example, Sallam et al. (2015) reported high prediction accuracy for fusarium head blight resistance than yield in cross breeding cycles. This is because highly heritable traits have a less complex genetic architecture and therefore considered to be stable across multiple cycles. Combs and Bernardo (2013) and Daetwyler et al. (2010) have both attributed the positive relationship between heritability and accuracy to the preserved haplotype structures and relatedness across breeding cycles. Across-Program Prediction The UKG population had slightly better prediction accuracies for most traits than KIB (Table 5). We added the UKG clones to KIB and vice versa and predicted a fixed number (20%) of KIB or UKG This article is protected by copyright. All rights reserved. accessions to determine whether including these clones could improve trait prediction. Adding UKG clones to the KIB training set did not improve prediction accuracies for most traits. On the other hand, adding the KIB clones to the UKG training set reduced the prediction accuracies for most traits. This effect was more severe on FYLD than any other trait (from r=0.30 to r=0.08). Similarly, adding the UG clones to either KIB or UKG did not improve prediction accuracies for either program. However, the UKG+UG to predict UKG test set slightly improved the results compared to the KIB+UG to predict KIB test set. Further, we used UKG+KIB+UG to predict UKG and KIB to determine if adding UG clones to TARI clones would improve trait prediction accuracies. There was no improvement in trait prediction. In fact, there were decreases for CBSDS, CBSDRS, and SHTWT (r=0.16 to 0.07, r=0.14 to 0.07, r=0.16 to 0.08, respectively) when UKG+KIB+UG was used to predict KIB. We observed similar decreases for CBSDRS, SHTWT, and FYLD (r=0.23 to 0.15, r=0.20 to 0.14, 0.30 to 0.26, respectively) when UKG+KIB+UG was used to predict UKG. Decrease in accuracies as result of combining unrelated populations have been reported in other crops. For example, Lorenz and Smith (2015) reported low prediction accuracy estimates when they combined population from North Dakota state University barley program and a second population from University of Minnesota barley breeding program to form training population. Other researchers have also reported low accuracies in cross-population predictions (Wolfe et al., 2017; Crossa et al., 2010; Endelman, 2011). Our results and evidences from other studies suggests that breeders can achieve better and a more reliable prediction accuracy estimates using smaller population with closely related genotypes than using large population with unrelated individuals. Population structure, environmental condition differences, differences in experimental design, direction of trait selection in each of the TARI breeding program, and variation in heritability could have impacted the accuracy estimates. Based on published results, the Ugandan training population had slightly higher accuracies for most traits compared to TARI (Wolfe et al., 2017; Ozimati et al., 2018). More locations and replication of clones across environments could have improved their accuracies. The poor prediction results across-programs reported in our study will This article is protected by copyright. All rights reserved. make it harder for breeders to use training data from different locations, breeding programs, and countries. Genetic Architecture of Disease Resistance Genetic associations with MCMDS were distributed across all chromosomes except 4, 5 6, 8, 11, 13, and 18 (Fig. 3, Table 6, and Supplementary File1). Previous GWAS study for cassava mosaic disease severity resistance revealed CMD2 locus on chromosome 12 in other cassava populations (Wolfe et al., 2016). SNPs S12_5529819 and S12_11351823 delimit this QTL region (6.3-8.7 Mbps). S12_7270219 was the most significant marker tagging the QTL in all the TARI sub populations and accounted for 14 to 37% of the phenotypic variance explained (PVE). This result validates the presence of previously reported CMD2 locus. In addition to CMD2 locus, we also detected another major CMD QTL (QTL-cmds16-1) on chromosome 16 tagged by marker S16_421670 when we used PYTUKG population. This region accounted for 87% of PVE (Fig.3, Table 6, and Supplementary File1). There are no known CMD resistance genes in this region. However, Manes.16G036200.1, Manes.16G036300.1, and Manes.16G036400.1 genes have been annotated in this region and known to encode pentatricopeptide repeats. These genes are positively expressed when plants are under attack from pathogens (Park et al., 2014). A QTL on chromosome 4 (QTL-cbsd4|cmd-1) and two on chromosome 12 (QTL-cbsd12|cmd-1 and QTL-cbsd12|cmd-2) showed significant association with both MCBSDS and MCMDS resistance (Fig. 3, Table 6, and Supplementary File 1). Marker S12_7929439 tagging cbsd12|cmd-2 QTL accounted for 8% PVE for MCMDS and 33% MCBSDS in several subsets of TARI population. In the same region, a single marker (S12_929320) was found associated with MCMDS and MCBSDS resistance QTL (QTL-cbsd12|cmd-1) in PYTKIB dataset and accounted for 37% and 8% PVE, respectively. QTL-cbsd4|cmd-1 tagged by marker S4_24670203 was detected in K4_cluster1 dataset and explained 14% of PVE. Manes.04G113900.1 gene occur in the marker region detected on chromosome 4. This gene is known to encode for phospholipased α and activates the plant responses to pathogen attacks This article is protected by copyright. All rights reserved. (De Torres et al., 2002). There are no known annotated genes in the region of QTL-cbsd12|cmd-1 and QTL-cbsd12|cmd-2 QTL. Further studies need to be conducted to confirm the presence of QTL conferring resistance to both CMD and CBSD on chromosome 4 and 12. On chromosome 9 and 11, QTL-cbsd9-1 and QTL-cbsd11-2 tagged by SNP S9_10707044 and S11_22942418 were associated with responses to MCBSDS in UKG and PYTKIB sub populations, respectively. QTL-cbsd9-1 accounted for 9% PVE while QTL-cbsd11-2 accounted for 5% PVE. The annotated gene within the QTL-cbsd11-2 is Manes.11G120800.1. This gene is known to encode for a protein kinase. An overexpression of similar gene in tobacco stimulated plant defense response (Li et al., 2018). Similarly, the region on chromosome 9, with a significant marker S9_10707044, contains a single gene (Manes.09G074200.1) that encodes a kinase family protein. Wen-Yuan Song et al. (1995) reported that a receptor kinase-like protein is encoded by the rice disease resistance gene Xa21. The two genes on chromosomes 9 and 11 could play a role in cassava defense mechanism to confer CBSD resistance. However, further investigation to validate the presence of CBSD resistant is needed. Nine markers representing four loci on chromosomes 2, 3, 8, and 10 were significantly associated with CBSDRS responses (Table 6, Supplementary File 1). Two loci on chromosomes 3 (QTL- cbsdrs3-1) and 8 (QTL-cbsdrs8-1), occurring in AYTKIB accessions, accounted for 30% and 32% of the phenotypic variation (chromosomes 3 and 8, respectively). While the other two minor loci on chromosomes 2 (QTL-cbsdrs2-1) and 10 (QTL-cbsdrs10-1), detected in UKG accessions, explained 8% and 11% of the phenotypic variation (chromosomes 2 and 10, respectively). The Manes.08G079900.1 gene within the QTL region on chromosome 8 is known to express a wall- associated kinase like receptor. Shi et al. (2016) cloned Snn1 which is a member of the wall- associated kinase class receptors and found that they drive pathways for biotrophic pathogen resistance in wheat. No annotated genes within the region significant for resistance to CBSDRS on chromosome 3 are in cassava reference genome. This article is protected by copyright. All rights reserved. In conclusion, discovery of new loci and associated markers will facilitate early selection during the season so breeders can have adequate information early enough to make decisions. Although CBSDS resistance genes in some Ugandan accessions are thought to have originated from Tanzania during germplasm exchange, the MCBSDS genes discovered in this study are not localized on chromosomes 5, 11, and 18 as reported for Ugandan germplasm (Kayondo et al., 2018). From this study we observed that using fewer but closely related individuals particularly from same clusters improves the discovery of QTL compared to the use of large number of individuals. These results are similar to the GWAS findings of Bradbury et al. (2011) who reported that accounting for individual relatedness in barley improved detection of true QTL. Conclusions Genomic prediction and selection have been touted as tools that could greatly modernize plant breeding and accelerate genetic gain. In this study, we examined the power of diverse breeding lines assembled from two breeding programs, at different breeding stages to predict traits and discover QTL. TARI population differentiation could have resulted from existing restrictions on clonal movement between programs. This restriction was imposed to contain the spread of cassava foliar diseases between Lake and Coastal Zone (Legg and Thresh, 2000). Although, there is long tradition of germplasm sharing between Tanzania and Uganda, the introgression occurring in Uganda TP is large and restricted to chromosome 1 while those in Tanzania are spread across the entire 18 chromosomes. This could suggest that the Uganda introgression could be more recent than in Tanzania. This may explain the population differentiation between sub populations with the introgression versus ones without. There was no relationship between increase in plot size and decrease in error variance across all traits. For some traits, larger plots were preferable while for other traits smaller plots were. However, this conclusion needs more proven evidence to verify the results. If this finding is confirmed, then breeders need to reconsider whether utilizing larger plots is cost effective. This article is protected by copyright. All rights reserved. An inverse relationship between heritability estimates and trait prediction accuracies were observed for some traits, contrary to other plant studies (Combs and Bernardo, 2013; Lian et al., 2014). We are not certain whether this inverse relationship is due to noise in the data or is true and therefore further assessment is needed to determine this relationship. Prediction accuracies within and between locations (KIB and UKG) were generally low compared to other cassava populations. Although larger training population sizes have been associated with improved accuracies, in this study, adding clones from KIB to UKG and vice versa did not improve prediction accuracy of either population. Similarly, adding the UG clones to either KIB or UKG did not improve accuracies of either. The lack of relatedness between germplasm, population structure, and the impact of genotype-by-environment interaction negatively impacted accuracy estimates. Generally, across breeding cycle GS is more difficult in plants than in animals. This is because every year, plant breeders largely use new parents, some with unknow background from other breeding programs or competitors while animal breeders work in closed populations. This could make it challenging for the breeders to keep materials adequately related across breeding cycle GS. The successful use of GS is dependent on the close relationship between individuals in training and test set. Clones in similar breeding stages were more valuable than mixing clones from different breeding stages when constructing training population. This because clones in the same generation are likely to share ancestor few generations back and therefore marker-QTL linkage are preserved due to limited number of recombination events (Habier et al., 2013). It is also possible that closely related population share more polymorphic loci and share large fraction of genetic background causing sufficient genetic variation (Lorenz and Cohen, 2012; Mohammadi et al., 2015). Consistent with the findings of other researchers, we conclude that clones from within-breeding cycles are presently better option as candidates for GS training population (Song et al., 2017; Michel et al., 2016; Cericola et al., 2017; Riedelsheimer et al., 2013; Albrecht et al., 2014; Lehermeier et al., 2014). Also, it may not be useful to constitute training populations from programs with divergent populations or populations separated by barriers like the existing restriction on clone movement This article is protected by copyright. All rights reserved. between the two Tanzania programs. Across-breeding cycle GS especially those in more advanced breeding stages needs to be investigated further because prediction accuracies are very low. The impact of GxE modeling on unreplicated clones across environments need to be investigated further. Moreover, clones in the early breeding stage provided more reliable trait prediction accuracies because of their inherent genetic variation. Therefore, these clones are better candidates for training population construction. We identified accessions carrying MCMDS, MCBSDS, and CBSDRS resistance. Some of the loci identified in these accessions have been previously reported. However, other loci are new. These results will be valuable for cassava breeding against CMD, CBSDS, and CBSDRS. Although we have learned valuable lesson from this study, we still need to continue to improve our experimental designs, data capture, and constitution of training populations so that genomic prediction and accuracies will be more reliable. We echo the lessons from Wolfe et al. (2017) to continue to improve on data quality and selection of individuals to make training populations so that we can maximize genetic gain. Conflict of Interest The authors declare that there is no conflict of interest Acknowledgments This work was supported by the Next Generation Cassava Breeding Project through funds from the Bill and Melinda. Gates Foundation and the Department for International Development of the United Kingdom. We thank the technical field staff of the TARI and NaCRRI cassava breeding programs, who helped in the trial management and/or data collection (Nsajigwa Mwakyusa, Karoline Leonard Sichalwe, Joseph Orone, Charles Majara, This article is protected by copyright. All rights reserved. Gerald Adiga, and Vincent Kyaligonza). Special thanks to Guillaume Bauchet for initial curation of sequence data and the entire Next Generation Cassava Breeding team (www.nextgencassava.org) as a whole Data Availability The phenotypic and genotypic data generated and analyzed during this study are available in the CASSAVABASE repository, (https://www.cassavabase.org/) References Akdemir, D., and O.U. Godfrey. 2015. EMMREML: fitting mixed models with known covariance structures. R package version 3.1. Adeniji, O.T., Odo, P.E. and B. Ibrahim. 2011. Genetic relationships and selection indices for cassava root yield in Adamawa State, Nigeria. African Journal of Agricultural Research, 6(13):2931-2934. Bates, D., M. Maechler, B. Bolker, S. Walker, R.H.B. Christensen, H. Singmann, B. Dai, and G. Grothendieck. 2015. Package „lme4‟. Convergence, 12(1). Bhateria, S., Sood, S.P. and A. Pathania, 2006. Genetic analysis of quantitative traits across environments in linseed (Linum usitatissimum L.). Euphytica, 150(1-2):185-194. Bradbury, P., T. Parker, M.T. Hamblin, and J. L. Jannink. 2011. Assessment of power and false discovery rate in genome-wide association studies using the BarleyCAP germplasm. Crop Sci. 51(1):52–59. Bredeson, J., V. Lyons, J. B. Prochnik, S. E. Wu, G. A., Ha, C. M. Edsinger-Gonzales, E. and D. S. Rokhsar. 2016. Sequencing wild and cultivated cassava and related species reveals extensive interspecific hybridization and genetic diversity. Nature Biotechnology 34(5):562–570. Browning, B.L., and S.R. Browning. 2009. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. The American Journal of Human Genetics, 84(2):210-223. Ceballos, H., P. Kulakow, and C. Hershey. 2012. Cassava breeding: current status, bottlenecks and the potential of biotechnology tools. Tropical plant biology, 5(1):73-87. Cericola, F., A. Jahoor, J. Orabi, J. R. Andersen, L. L. Janss, and J. Jensen. 2017. Optimizing training population size and genotyping strategy for genomic prediction using association study results and pedigree information. A case of study in advanced wheat breeding lines. PloS one, 12(1)0169606. Chipeta, M. M., J. M. Bokosi, V.W. Saka, and I. R. Benesi. 2013. Combining ability and mode of gene action in cassava for resistance to cassava green mite and cassava mealy bug in Malawi. J Plant Breed Crop Sci, 5(9):195-202. This article is protected by copyright. All rights reserved. Combs, E. and R. Bernardo. 2013. Accuracy of genomewide selection for different traits with constant population size, heritability, and number of markers. The Plant Genome, 6(1). Crossa. J., G. de los Campos, P. Pérez, D. Gianola, J. Burgueño, J. L. Araus, D. Makumbi, R. P. Singh, S. Dreisigacker, J. Yan, V. Arief, M. Banzinger, and H. J. Braun. 2010. Prediction of genetic values of quantitative traits in plant breeding using pedigree and molecular markers. Genetics 186:713–724. Dawson, J. C., J. B. Endelman, N. Heslot, J. Crossa, J. Poland, S. Dreisigacker, Y. Manès, M. E. Sorrells, and J. L. Jannink. 2013. The use of unbalanced historical data for genomic selection in an international wheat breeding program." Field Crops Research, 154:12-22. de los Campos, G., J.M. Hickey, R. Pong-Wong, H. D. Daetwyler, and M. P. Calus. 2013. Whole-genome regression and prediction methods applied to plant and animal breeding. Genetics, 193(2):327-345. Elshire, R.J., J.C. Glaubitz, Q. Sun, J.A Poland, K. Kawamoto, E.S. Buckler, and S.E. Mitchell. 2011. A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PloS one, 6(5):19379. Endelman, J.B. 2011. Ridge regression and other kernels for genomic selection with R package rrBLUP. The Plant Genome 4(3):250-255. Fauquet, C., D. Fargette, and C. Munihor. 1990. African cassava mosaic virus: Etiology, epidemiology, and control. Plant Disease 74:404–411. Glaubitz, J.C., T.M. Casstevens, F. Lu, J. Harriman, R.J. Elshire, Q. Sun, and E.S. Buckler. 2014. TASSEL-GBS: a high capacity genotyping by sequencing analysis pipeline. PloS one, 9(2):e90346. Habier, D., R. L. Fernando, and D. J. Garrick. 2013. Genomic BLUP decoded: a look into the black box of genomic prediction. Genetics, 194(3):597-607. Hahn, S., E. Terry, and K. Leuschner. 1980. Breeding cassava for resistance to cassava mosaic disease. Euphytica 29:673–683. Hahn, S., E. Terry, K. Leuschner, I. Akobundu, C. Okali, and R. Lal. 1979. Cassava improvement in Africa. Field Crops Res. 2:193–226. Hayes, B.J., P. J. Bowman, A. J. Chamberlain, and M. E. Goddard. 2009. Invited review: Genomic selection in dairy cattle: Progress and challenges. Journal of dairy science, 92(2):433-443. Heffner, E.L., M.E. Sorrells, and J.L. Jannink. 2009. Genomic selection for crop improvement. Crop Science, 49(1):1-12. Hillocks, R.J. and J.M. Thresh. 2000. Cassava mosaic and cassava brown streak virus diseases in Africa: a comparative guide to symptoms and aetiologies. Roots, 7(1):1-8. Hofheinz, N., D. Borchardt, K. Weissleder, and M. Frisch. 2012. Genome-based prediction of test cross performance in two subsequent breeding cycles. Theoretical and Applied Genetics, 125(8):1639-1645. Howeler, R., N. Lutaladio, and G. Thomas. 2013. Save and grow: cassava. A guide to sustainable production intensification. FAO. Jannink, J.L., A.J. Lorenz, and H. Iwata. 2010. Genomic selection in plant breeding: from theory to practice. Briefings in functional genomics, 9(2):166-177. Jeremiah, S.C., I.L. Ndyetabula, G.S. Mkamilo, S. Haji, M.M. Muhanna, C. Chuwa, S. Kasele, H. Bouwmeester, J. N. Ijumba, J. P. Legg. 2015. The dynamics and This article is protected by copyright. All rights reserved. environmental influence on interactions between cassava brown streak disease and the whitefly, Bemisia tabaci. Phytopathology, 105(5):646-655. Kassambara, A. 2018. ggpubr:„ggplot2‟Based Publication Ready Plots. R package version 0.2. Kawano, K., W.M.G. Fukuda, and U. Cenpukdee. 1987. Genetic and Environmental Effects on Dry Matter Content of Cassava Root 1. Crop Science, 27(1):69-74. Kawano, K., P. Daza, A. Amaya, M. Rios and W. M. Goncalves. 1978. Evaluation of Cassava Germplasm for Productivity 1. Crop Science, 18(3):377-380. Kayondo, S.I., D.P. Del Carpio, R. Lozano, A. Ozimati, M. Wolfe, Y. Baguma, V. Gracen, S. Offei, M. Ferguson, R. Kawuki, and J.L. Jannink. 2018. Genome-wide association mapping and genomic prediction for CBSD resistance in Manihot esculenta. Sci. Rep. 8(1): 1–11. Kundy, A.C., G.S. Mkamilo, and R.N. Misangu. 2014. Correlation and Path Analysis between Yield and Yield Components in Cassava (Manihot esculenta Crantz) in Southern Tanzania. Journal of Natural Sciences Research, 4:6-10. Legg, J.P. and T, J.M Thresh. 2000. Cassava mosaic virus disease in East Africa: a dynamic disease in a changing environment. Virus research, 71(1-2):135-149. Li, W., X. Li, J. Chao, Z. Zhang, W. Wang, and Y. Guo. 2018. NAC family transcription factors in tobacco and their potential role in regulating leaf senescence. Frontiers in plant science, 9, p.1900. Lian, L., A. Jacobson, S. Zhong, and R. Bernardo. 2014. Genomewide prediction accuracy within 969 maize biparental populations. Crop Science, 54(4):1514-1522. Lian, L., A. Jacobson, S. Zhong and R. Bernardo. 2015. Prediction of genetic variance in biparental maize populations: Genomewide marker effects versus mean genetic variance in prior populations. Crop Science, 55(3):1181-1188. Lorenzana, R. E., and R. Bernardo. 2009. Accuracy of genotypic value predictions for marker-based selection in biparental plant populations. Theoretical and applied genetics 120(1):151-161. Lorenz, A.J. and K.P. Smith, 2015. Adding genetically distant individuals to training populations reduces genomic prediction accuracy in barley. Crop science, 55(6):2657- 2667. Lorenz, K. and B. A. Cohen. 2012. Small-and large-effect quantitative trait locus interactions underlie variation in yeast sporulation efficiency. Genetics, 192(3):1123-1132. Ly, D., M. Hamblin, I. Rabbi, G. Melaku, M. Bakare, H. G. Gauch, R. Okechukwu, A. G. Dixon, P. Kulakow, and J. L. Jannink. 2013. Relatedness and genotype× environment interaction affect prediction accuracies in genomic selection: a study in cassava. Crop Science, 53(4):1312-1325. Michel, S., C. Kummer, M. Gallee, J. Hellinger, C. Ametz, B. Akgol, D. Epure, H. Gungor, F. Loschenberger, H. Buerstmayr. 2018. Improving the baking quality of bread wheat by genomic selection in early generations. Theor. Appl. Genet. (131): 477–493. Milligan, S.B., M. Balzarini, K. A. Gravois, and K. P. Bischoff. 2007. Early stage sugarcane selection using different plot sizes. Crop science, 47(5):1859-1864. This article is protected by copyright. All rights reserved. Mohammadi M., T. Tiede, and K. P. Smith.2015. "PopVar: A genome-wide procedure for predicting genetic variance and correlated response in biparental breeding populations." Crop Science 55(5):2068-2077.Ozimati, A., R. Kawuki, W. Esuma, S. I. Kayondo, A. Pariyo, M. Wolfe, and J. L. Jannink. 2019. Genetic Variation and Trait Correlations in an East African Cassava Breeding Population for Genomic Selection. Crop Science. Park, Y.J., H.J., Lee, K.J., Kwak, K., Lee, S.W. Hong, and H. Kang2014. MicroRNA400- guided cleavage of pentatricopeptide repeat protein mRNAs renders Arabidopsis thaliana more susceptible to pathogenic bacteria and fungi. Plant and Cell Physiology, 55(9):1660- 1668. Raya, M.D., S. C. Jeremiah, and J. Legg. 1993. African Cassava Mosaic Virus (ACMV) and Brown Streak survey in Tanzania. Robertsen, C.D., R.L. Hjortshøj, and L. L. Janss, 2019. Genomic Selection in Cereal Breeding. Agronomy, 9(2):95. Rutkoski, J., R. P. Singh, J. Huerta-Espino, S. Bhavani, J. Poland, J. L. Jannink, J.L. and M. E. Sorrells. 2015. Efficient use of historical data for genomic selection: a case study of stem rust resistance in wheat. The Plant Genome,8(1). Rwegasira, G. M. and C. M. Rey, 2012. Response of selected cassava varieties to the incidence and severity of cassava brown streak disease in Tanzania. J Agric Sci, 4(7):237- 245. Salvador, E.M., V. Steenkamp, and C. M. E. McCrindle, C.M., E. 2014. Production, Consumption and Nutritional Value of Cassava (Manihot esculenta, Crantz) in Mozambique: An Overview, 6(3):29-38. Shi, G., Z., Zhang, T.L. Friesen, D. Raats, T. Fahima, R. S. Brueggeman, S. Lu, H. N. Trick, Z. Liu, W. Chao, and Z. Frenkel. 2016. The hijacking of a receptor kinase–driven pathway by a wheat fungal pathogen leads to disease. Science advances, 2(10):1600822. Song, J., B.F. Carver, C. Powers, L. Yan, J. Klápště, Y. A. El-Kassaby, and C. Chen. 2017. Practical application of genomic selection in a doubled-haploid winter wheat breeding program. Molecular breeding, 37(10):117. Song, Wen-Yuan, Guo-Liang Wang, Li-Li Chen, Han-Suk Kim, Li-Ya Pi, Tom Holsten, J. Gardner et al. 1995. A receptor kinase-like protein encoded by the rice disease resistance gene, Xa21." Science, 270(5243):1804-1806. Storlie E. and G. Charmet. 2013. Genomic selection accuracy using historical data generated in a wheat breeding program. The Plant Genome. VanRaden, P.M. 2008. Efficient methods to compute genomic predictions. Journal of dairy science, 91(11):4414-4423. Vazquez, A.I., D.M. Bates, G.J.M. Rosa, D. Gianola and K.A Weigel. 2010. An R package for fitting generalized linear mixed models in animal breeding 1. Journal of animal science, 88(2):497-504. Wei. T. 2013. Corrplot: Visualization of correlation matrix. R package version 0.2. Wickham, H., W. Chang, and M.H. Wickham, 2016. Package „ggplot2‟. Create Elegant Data Visualisations Using the Grammar of Graphics. Version, 2(1):1-189. Wolfe, M.D., D.P. Del Carpio, O. Alabi, L.C. Ezenwaka, U.N. Ikeogu, I.S. Kayondo, R. Lozano, U.G. Okeke, A.A. Ozimati, E. Williams, C. Egesi, R.S. Kawuki, P. Kulakow, This article is protected by copyright. All rights reserved. I.Y. Rabbi, and J.-L. Jannink. 2017. Prospects for Genomic Selection in Cassava Breeding. Plant Genome. Wolfe, M.D., I.Y. Rabbi, C. Egesi, M. Hamblin, R. Kawuki, P. Kulakow, R. Lozano, D.P. Del Carpio, P. Ramu, and J. L. Jannink.2016. Genome-Wide Association and Prediction Reveals Genetic Architecture of Cassava Mosaic Disease Resistance and Prospects for Rapid Genetic Improvement. Plant Genome 9. Yang, J. S. Hong Lee, M. E. Goddard, and P. M. Visscher. 2011. GCTA: a tool for genome- wide complex trait analysis. The American Journal of Human Genetics 88, (1):76-82. Zhong, S., J.C. Dekkers, R. L. Fernando, and J.L. Jannink. 2009. Factors affecting accuracy from genomic selection in populations derived from multiple inbred lines: a barley case study. Genetics, 182(1):355-364. This article is protected by copyright. All rights reserved. Figure Legends Figure 1. PCA plot of the TARI clones alone (A) and the combined TARI and UG clones (B) based on SNP marker matrix and the loadings values (eigenvector coefficients) for each marker on PC1 against marker positions on all cassava chromosomes for TARI alone (C) and combined TARI and UG (D). For TARI PCA, 121K SNPs were used and for combined PCA, we used 37K SNPs common to both TARI and UG. For both cases, marker loadings were based on common markers (37K) This article is protected by copyright. All rights reserved. Figure 2. Phenotypic (left) and genetic (right) correlations between nine traits in Cassava using combined TARI data. Blue and red colors represent positive and negative correlations respectively. The strength of trait relationships is depicted by the intensity of the color. Cells with correlation values that are not significant at P-value < 0.05 were left blank. This article is protected by copyright. All rights reserved. Figure 3. The Manhattan (A) and quantile–quantile (B) plots from mixed linear models summarizing genome-wide association results for three significant traits in all subsets of combined TARI population. The QQ plot demonstrate the differences between various population structure controls. CET=clonal evaluation trial, PYT=preliminary yield trial, AYT=advanced yield trial, K3 & K4 =cluster 3 and 4 generated using kinship relationship matrix. The horizontal solid line indicates the genome-wide Bonferroni significance threshold (−log10 (P-value) = 7.27), the dashed line indicates FDRpt1 and Dotted line indicate=FDRpt2. Additional detail for significant markers, Pvalue, and pFDR values are detailed in supplementary File 1 (Sheet 1). Additional plot for all the other traits below threshold are in Supplementary Figure 3. This article is protected by copyright. All rights reserved. Table 1. The trial structure for the materials used for training population. Reps are the number of replicates in each trial. The number of individual clones in each of the three replicates after sprouting are represented by Rep1, Rep2, and Rep3. Trial Stage Location No. Reps Clones/Rep Plot Area (M2) Rep1 Rep2 Rep3 Planted Harvested CET1 CET Ukiriguru 2 108 91 5 5 CET2 CET Ukiriguru 2 213 214 5 5 PYT1 PYT Ukiriguru 2 96 81 10 10 PYT1 PYT Kibaha 2 64 65 14 14 PYT2 PYT Kibaha 2 74 77 14 14 PYT3 PYT Kibaha 2 45 48 14 14 PYT4 PYT Kibaha 2 21 23 14 14 PYT5 PYT Kibaha 2 24 23 14 14 PYT6 PYT Kibaha 2 15 15 14 14 PYT7 PYT Kibaha 2 20 19 14 14 PYT8 PYT Kibaha 2 55 51 14 14 PYT9 PYT Kibaha 2 58 63 14 14 AYT1 PYT Kibaha 2 26 25 42 20 AYT2 PYT Kibaha 3 12 12 12 42 20 AYT3 PYT Kibaha 3 16 13 15 42 20 AYT4 PYT Kibaha 2 23 23 42 20 AYT5 PYT Kibaha 2 25 20 42 20 AYT6 PYT Kibaha 2 21 21 42 20 Table 2. Heritability, genetic and residual variance components for four plot sizes This article is protected by copyright. All rights reserved. Plot Size Variance Componen t MCMD S MCBSD S CBSDR S MCGM S RTNO SHTW T HI FYLD DM 5m2 (CETUKG ) σ2G 0.042 0.103 0.097 0.009 2195 0 12.71 0.00 5 11.01 7.11 σ2TR 0.000 0.002 0.000 0.184 9516 9.10 0.00 5 0.46 1.69 σ2e 0.049 0.121 0.215 0.142 7192 4 28.66 0.02 0 25.12 15.20 H2 0.46 0.46 0.31 0.06 0.23 0.31 0.20 0.30 0.32 10m2 (PYTUKG ) σ2G 0.050 0.114 0.008 0.037 2205 9 0.215 0.00 3 1.80 3.72 σ2TR 0.000 0.000 0.005 0.000 1571 8 4.155 0.00 3 0.06 2.67 σ2e 0.005 0.097 0.129 0.131 7714 2 10.605 0.01 7 5.80 16.84 H2 0.91 0.54 0.06 0.22 0.22 0.02 0.15 0.24 0.18 14m2 (PYTKIB) σ2G 0.349 0.077 0.042 0.029 1376 3 14.89 0.00 4 50.09 5.23 σ2TR 0.026 0.018 0.03 0.013 9203 28.08 0.00 2 24.69 1.74 σ2e 0.432 0.100 0.21 0.118 4391 1 105.48 0.01 8 129.2 1 12.28 H2 0.43 0.39 0.14 0.18 0.24 0.10 0.17 0.25 0.27 20m2 (AYTKIB) σ2G 0.28 0.083 0.031 0.051 4760 36.00 0.00 4 59.84 2.0885 σ2TR 0.04 0.020 0.001 0.036 6688 70.08 0.00 3 66.79 0.9732 σ2e 0.30 0.122 0.121 0.117 5530 3 197.90 0.00 1 211.4 5 10.252 6 H2 0.44 0.38 0.20 0.25 0.08 0.12 0.80 0.18 0.16 σ2G =Genetic variance among clones, σ 2 TR=Variance among reps in trials, σ 2 e=Residual error variance, H2 = plot-based broad-sense heritability estimates. CETUKG=Clonal Evaluation Trial at Ukiriguru, This article is protected by copyright. All rights reserved. PYTUKG=Preliminary Yield Trial at Ukiriguru, PYTKIB=Preliminary Yield Trial at Kibaha; and AYTKIB=Advanced Yield Trial at Kibaha. Table 3. Summary of cross-validated prediction accuracies by trait and breeding stages Trait CETUKG PYTUKG PYTKIB AYTKIB Mean MCMDS 0.11ns 0.06ns 0.17ns 0.32ns 0.14 MCBSDS 0.26* 0.33ns 0.08ns 0.24ns 0.23 CBSDRS 0.26ns 0.10ns 0.09ns 0.10ns 0.14 MCGMS 0.22* 0.09ns 0.21ns 0.43* 0.24 RTNO 0.28* 0.27ns 0.16ns 0.03ns 0.19 SHTWT 0.22ns 0.03ns 0.12ns 0.08ns 0.11 HI 0.22* 0.38* 0.13ns 0.19n 0.23 FYLD 0.36* 0.35* 0.10ns 0.09ns 0.23 DM 0.14ns 0.18ns 0.28* 0.43* 0.26 Mean 0.23 0.20 0.15 0.20 0.19 ns = non-significant prediction accuracies (r), * accuracy significantly different from zero (P ≤ 0.05). MCMDS=mean cassava mosaic disease severity, MCBSDS=mean cassava brown streak disease severity, CBSDRS=mean Cassava brown streak disease root severity, MCGMS=Mean Cassava green mite severity, RTNO=root number, SHTWT=shoot weight, HI=harvest index, FYLD=Fresh root yield, DM=Dry matter. Publisher: AGRONOMY; Journal: AGROJNL:Agronomy Journal; Copyright: Will notify... Volume: Will notify...; Issue: Will notify...; Manuscript: aj-2017-08-0123-a; DOI: ; PII: TOC Head: ; Section Head: ; Article Type: ARTICLE This article has been accepted for publication and undergone full peer review but has not been through the copyediting, typesetting, pagination and proofreading process, which may lead to differences between this version and the Version of Record. Please cite this article as doi: 10.1002/csc2.20003. This article is protected by copyright. All rights reserved. Page 40 of 42 Table 4. Average prediction trait accuracies for cross-location and combined TARI Trait TARI UKG to UKG KIB to UKG KIB to KIB UKG to KIB MCMDS 0.15* 0.09 0.02ns 0.18* 0.06ns MCBSDS 0.23* 0.25 0.11* 0.08ns 0.09ns CBSDRS 0.09* 0.21 0.16* 0.05ns 0.10* MCGMS 0.25* 0.22 0.23* 0.23* 0.25* RTNO 0.16* 0.24 0.08ns 0.09ns 0.11* SHTWT 0.11ns 0.22 0.19ns 0.09ns 0.01ns HI 0.16ns 0.18 0.15* 0.11* 0.12* FYLD 0.18* 0.29 0.13* 0.08ns 0.05ns DM 0.23* 0.13 0.16* 0.28* 0.24* Mean 0.17 0.20 0.14 0.13 0.10 ns = non-significant prediction accuracies, * accuracy significantly different from zero (P ≤ 0.05). The values in the bracket are the accuracies for within the location prediction. Table 5. Summary of cross-validated prediction accuracies by trait and cross programs Scenario MCMD S MCBSD S CBSDR S RTN O SHTW T HI FYLD DM Mea n KIB_to_KIB 0.19ns 0.16ns 0.14ns 0.16n s 0.15ns 0.12n s 0.09n s 0.29* 0.16 UKG_to_UKG 0.01ns 0.28* 0.23* 0.26* 0.20* 0.17n s 0.30* 0.13n s 0.20 KIB+UKG_to_KIB 0.17ns 0.14ns 0.14ns 0.16n s 0.14ns 0.14n s 0.09n s 0.30* 0.16 This article is protected by copyright. All rights reserved. UKG+KIB_to_UKG 0.01ns 0.28* 0.18ns 0.27* 0.14ns 0.19* 0.08n s 0.15n s 0.16 KIB+UG_to_KIB 0.20* 0.07ns 0.06ns 0.14n s 0.08ns 0.15n s 0.08n s 0.27* 0.13 UKG+UG_to_UKG 0.02ns 0.30* 0.15ns 0.24* 0.14ns 0.19* 0.26* 0.15n s 0.18 KIB+UKG+UG_to_KI B 0.17ns 0.11ns 0.06ns 0.18n s 0.10ns 0.15n s 0.08n s 0.28* 0.14 UKG+ KIB+UG_to_UKG -0.01ns 0.31* 0.17* 0.26* 0.12ns 0.20* 0.11n s 0.17n s 0.17 Mean 0.09 0.21 0.14 0.21 0.13 0.16 0.14 0.22 0.16 Table 6. Significant markers associated with MCMDS, MCBSDS, and CBSDRS resistance detected in TARI TP QTL Trait C hr . Reg ion (Mb s) Tag- SNP Positi on All ele Fr eq b Log10P value PVE (%) Popula tion QTL- cbsdrs2-1 CBSDRS 2 9.15 - 9.16 S2_9258 334 92583 34 A/ C 0.1 5 0. 11 6.843 8 UKG QTL- cbsdrs3-1 CBSDRS 3 10.1 6 S3_1016 5675 10165 675 A/ G 0.0 5 0. 20 6.137 3 AYTKI B QTL- cbsdrs8-1 CBSDRS 8 1.74 S8_1736 3721 17363 721 A/ G 0.0 5 0. 21 6.35 32 AYTKI B QTL- cbsdrs10-1 CBSDRS 10 0.16 S10_160 775 16077 5 G/ A 0.0 8 0. 16 5.24 11 CETU KG QTL-cbsd9- 1 MCBSDS 9 10.7 1 S9_1070 7044 10707 044 G/ A 0.5 0. 57 6.457 9 PYTKI B QTL- cbsd11-1 MCBSDS 11 22.8 8- 22.9 4 S11_229 42418 22942 418 A/ G 0.4 5 0. 08 6.01 5 UKG QTL- cmds16-1 MCMDS 16 0.42 S16_421 670 42167 0 T/ C 0.4 9 1. 86 9.844 87 PYTU KG QTL- cmds12-1 MCMDS 12 6.3- 8.7 S12_727 0219 72702 19 T/ C 0.4 9 0. 24 11.246 14 - 37 TARI QTL- cbsd4/cmd- MCBSDS| 4 24.6 S4_2467 24670 T/ 0.2 0. 5.441 14 K4_Clu This article is protected by copyright. All rights reserved. 1 MCMDS 7 0203 203 G 4 14 ster1 QTL- cbsd12/cmd -1 MBSDS|M CMDS 12 0.93 S12_929 320 92932 0 T/ G 0.0 6 0. 13 5.941 37|8 PYTKI B QTL- cbsd12/cmd -2 MBSDS|M CMDS 12 7.93 - 7.95 S12_792 9439 79294 39 G/ C 0.3 7 0. 23 8.749 8|33 Several