Academic Editor: Fermin Morales

Received: 10 December 2024

Revised: 10 January 2025

Accepted: 15 January 2025

Published: 21 January 2025

Citation: Montesinos-López, O.A.;

Kismiantini; Alemu, A.;

Montesinos-López, A.;

Montesinos-López, J.C.; Crossa, J.

Balancing Sensitivity and Specificity

Enhances Top and Bottom Ranking in

Genomic Prediction of Cultivars.

Plants 2025, 14, 308. https://doi.org/

10.3390/plants14030308

Copyright: © 2025 by the authors.

Licensee MDPI, Basel, Switzerland.

This article is an open access article

distributed under the terms and

conditions of the Creative Commons

Attribution (CC BY) license

(https://creativecommons.org/

licenses/by/4.0/).

Article

Balancing Sensitivity and Specificity Enhances Top and Bottom
Ranking in Genomic Prediction of Cultivars
Osval A. Montesinos-López 1,* , Kismiantini 2 , Admas Alemu 3 , Abelardo Montesinos-López 4,*,
José Cricelio Montesinos-López 5 and Jose Crossa 6,7,*

1 Facultad de Telemática, Universidad de Colima, Colima 28040, Colima, Mexico
2 Statistics Study Program, Universitas Negeri Yogyakarta, Yogyakarta 55281, Yogyakarta, Indonesia;

kismi@uny.ac.id
3 Department of Plant Breeding, Swedish University of Agricultural Sciences, P.O. Box 101,

SE-230 53 Alnarp, Sweden; admas.alemu.abebe@slu.se
4 Centro Universitario de Ciencias Exactas e Ingenierías (CUCEI), Universidad de Guadalajara,

Guadalajara 44430, Jalisco, Mexico
5 Department of Public Health Sciences, University of California Davis, Davis, CA 95616, USA
6 Colegio de Postgraduados, Montecillos 56230, Edo. de México, Mexico
7 International Maize and Wheat Improvement Center (CIMMYT), Carretera Mexico-Veracruz, Km 45,

San Pedro Tlaltizapan 52640, Edo. de México, Mexico
* Correspondence: osval78t@gmail.com (O.A.M.-L.); abelardo.montesinos@academicos.udg.mx (A.M.-L.);

jcrossa@cgiar.org (J.C.)

Abstract: Genomic selection (GS) is a predictive methodology that is revolutionizing
plant and animal breeding. However, the practical application of the GS methodology is
challenging since a successful implementation requires a good identification of the best
lines. For this reason, some approaches have been proposed to be able to select the top
(or bottom) lines with more Precision. Despite the varying popularity of methods, with
some being notably more efficient than others, this paper delves into the fundamentals
of these techniques. We used five models/methods: (1) RC, known as the Bayesian Best
Linear Unbiased Predictor (GBLUP); (2) R, which is like RC but uses a threshold; (3) RO,
Regression Optimum, that leverages the RC model in its training process to fine-tune the
threshold; (4) B, Threshold Bayesian Probit Binary model (TGBLUP) with a threshold of
0.5 to classify the cultivars as top or non-top; (5) BO is the TGBLUP but the threshold
used is an optimal probability threshold that guarantees similar Sensitivity and Specificity.
We also present a benchmark comparison of existing approaches for selecting the top (or
bottom) performers, utilizing five real datasets for comprehensive analysis. For methods
that necessitate a rigorous tuning process, we suggest a streamlined tuning approach that
significantly decreases implementation time without notably compromising performance.
Our analysis revealed that the regression optimal (RO) method outperformed other models
across the five real datasets, achieving superior results in terms of the F1 score. Specifically,
RO was more effective than models R, B, RC, and BO by 60.87, 42.37, 17.63, and 9.62%,
respectively. When looking at the Kappa coefficient, the RO model was better than models
B, BO, R, and RC by 37.46, 36.21, 52.18, and 3.95%, respectively. In terms of Sensitivity, the
RO model outperformed models B, R, and RC by 145.74, 250.41, and 86.20, respectively.
The second-best model was the model BO. It is important to point out that in the first stage,
the BO and RO approaches train a classification and regression model, respectively, to
classify the lines as the top (bottom) or not the top (not the bottom). However, both the BO
and RO approaches optimize a threshold in the second stage to perform the classification
of the lines that minimize the difference between the Sensitivity and Specificity. The
BO and RO methods are superior for the selection of the top (or bottom) lines. For this
reason, we encourage breeders to adopt these approaches to increase genetic gain in plant
breeding programs.

Plants 2025, 14, 308 https://doi.org/10.3390/plants14030308

https://doi.org/10.3390/plants14030308
https://doi.org/10.3390/plants14030308
https://creativecommons.org/licenses/by/4.0/
https://creativecommons.org/licenses/by/4.0/
https://www.mdpi.com/journal/plants
https://www.mdpi.com
https://orcid.org/0000-0002-3973-6547
https://orcid.org/0000-0002-6105-7647
https://orcid.org/0000-0001-7056-2699
https://orcid.org/0000-0001-5017-6192
https://doi.org/10.3390/plants14030308
https://www.mdpi.com/article/10.3390/plants14030308?type=check_update&version=1


Plants 2025, 14, 308 2 of 30

Keywords: plant breeding; genomics selection; top line selection; bottom line selection;
prediction performance

1. Introduction
The need to produce more food in less arable land arises from several key factors.

With a rapidly growing global population, it is essential to increase food production to
meet the rising demand. However, the availability of arable land is limited, and various
challenges such as urbanization, soil degradation, and climate change are facing it. By
maximizing food production on limited land, we can ensure environmental sustainability
by minimizing deforestation and reducing the need for chemical inputs. Additionally,
producing more food on less land contributes to food security, particularly in regions
vulnerable to hunger and malnutrition. It also improves economic efficiency in agriculture,
leading to enhanced profitability, rural development, and overall economic growth [1].

Breeding methodologies like genomic selection (GS) play a crucial role in improving
productivity worldwide due to several key factors. It enables precision breeding by iden-
tifying specific genes and markers associated with desirable traits, allowing for targeted
selection of high-yielding varieties or breeds. By predicting genetic potential early on,
GS accelerates breeding cycles and expedites the development of productive individuals.
The accuracy of trait predictions is enhanced through the utilization of genomic data,
enabling informed decisions for selecting individuals with higher productivity traits [2].
GS also facilitates the development of cultivars or breeds that are adaptable to diverse
environments, ensuring productivity across different regions [3]. Moreover, by optimizing
breeding efforts, GS promotes sustainable resource management, minimizing waste and
environmental impact while meeting global food demands.

Genomic selection leverages advances in genomics and data analysis as it allows
breeders to make more informed decisions in selecting individuals with desired traits, lead-
ing to faster and more precise breeding outcomes. By utilizing large-scale genomic data, GS
accurately predicts the genetic potential of plants at an early stage, significantly reducing
the time and resources required for traditional breeding methods. This acceleration of
breeding cycles enables the development of improved varieties with enhanced traits such
as yield, disease resistance, and nutritional quality. Furthermore, GS expands the scope
of breeding by identifying and incorporating favorable traits from diverse genetic back-
grounds, promoting genetic diversity and adaptability. Overall, GS is transforming plant
breeding by optimizing efficiency, precision, and genetic gains, ultimately contributing to
the development of sustainable and high-performing crop varieties [3,4].

However, it is important to recognize that the presence of genes or markers alone
does not guarantee a high-yielding variety that is well adapted to stress conditions. This
limitation arises because not all identified genes are necessarily functional in a given context,
particularly under abiotic or biotic stress conditions. To address this, transcriptomics and
associative transcriptomics play a crucial role in identifying functional genes linked to
specific traits, such as stress tolerance. These approaches enable the assessment of gene
expression and functionality, thereby offering deeper insights into which genes are actively
contributing to a trait. Incorporating transcriptomic data into genomic prediction models
can significantly enhance the precision of selecting stress-tolerant, high-yielding varieties,
providing a complementary layer of information beyond marker-based selection [5,6].

Other challenges associated with implementing GS in plant breeding programs are
when dealing with large and complex genomes, such as those of wheat and maize. Large
plant genomes often contain a substantial amount of repetitive DNA, which can complicate


Plants 2025, 14, 308 3 of 30

marker identification and analysis [7]. Additionally, the lack of comprehensive knowledge
about the genetic basis of many important agronomic traits further complicates the applica-
tion of GS, particularly for polygenic traits that are influenced by numerous loci with small
effects [8].

Furthermore, genes may behave differently depending on the genetic context in which
they are expressed, leading to variable performance across environments and genetic
populations [9]. The lack of robust, high-throughput phenotyping platforms capable of
capturing complex traits across environments remains a significant bottleneck in breeding
programs [10]. Addressing these limitations requires further integration of functional
genomics, high-throughput phenotyping, and better models that account for environmental
and genetic interactions. In summary, these points enrich the narrative with supporting
citations while addressing the challenges of genomic selection, such as large genomes,
insufficient phenotyping, and gene expression variability.

However, for a successful implementation of GS in breeding programs, high accuracy
is essential [11]. Accurate predictions enable breeders to make reliable decisions, maximiz-
ing the genetic potential of offspring and leading to improved traits and higher productivity.
It helps avoid selecting individuals that are not the best, saving valuable resources and
time. High accuracy allows breeders to focus on specific traits of interest, meeting market
demands and environmental challenges. Trust and confidence in the approach are built
through accurate predictions, facilitating wider adoption of GS. Ultimately, accurate pre-
dictions drive genetic improvement in crops and livestock, contributing to the success of
breeding programs.

To enhance the efficiency of the GS methodology, a wide array of statistical machine
learning models has been explored, encompassing both parametric approaches such as
mixed models, Bayesian models, and penalized regression, as well as nonparametric mod-
els such as random forest, gradient boosting machine, and deep learning [12]. Nevertheless,
it is essential to recognize the constraints imposed by the No Free Lunch Theorem in
statistical machine learning, which asserts the absence of a universal algorithm excelling
in all conceivable tasks. Consequently, any enhancement in performance for one task
necessitates a trade-off, potentially resulting in reduced performance in another domain.
This underscores the absence of one-size-fits-all solutions in the pursuit of optimal per-
formance across diverse problem domains. As a result, even though certain statistical
machine learning models have exhibited commendable performance within the genomic
prediction context, practical implementation often encounters challenges due to insufficient
prediction accuracies. This limitation can be attributed to the multifaceted nature of the GS
methodology, a predictive approach influenced by numerous factors.

In the realm of genomic selection (GS), where the primary objective is to identify
and select the most promising genetic lines with high accuracy, it becomes imperative
to incorporate robust metrics into the selection process. Sensitivity and Specificity serve
as pivotal measures in this regard. Sensitivity, traditionally applied in diagnostic testing
contexts, signifies the capacity of a test to accurately identify individuals possessing a
particular characteristic. Sensitivity delineates the effectiveness of a test in correctly pin-
pointing individuals who exhibit the desired characteristic [13]. A heightened Sensitivity
translates to a diminished rate of false negatives, ensuring that fewer instances of the
desired characteristic are overlooked or misjudged by the selection process. In the context
of plant breeding, where the aim is to pinpoint the most advantageous genetic lines, it is
paramount for the models to exhibit a commendable level of Sensitivity. This ensures that
the selection process accurately identifies and advances the most promising genetic lines,
thereby optimizing breeding outcomes.


Plants 2025, 14, 308 4 of 30

Conversely, Specificity characterizes the capability of a diagnostic test to accurately
exclude individuals lacking a specific characteristic. It quantifies the precision with which
the test identifies individuals devoid of the characteristic under consideration [13]. Within
the realm of plant breeding, it is equally crucial to uphold a reasonable level of Specificity.
This ensures that the selection process does not erroneously favor or advance genetic
lines that do not possess the desired characteristics. By maintaining a balanced approach
that encompasses both Sensitivity and Specificity, the selection process can effectively
discriminate between superior genetic lines and those that do not meet the desired criteria,
thereby enhancing the efficiency and efficacy of genomic selection in plant breeding endeav-
ors. However, since the selection of the best lines in the context of GS is performed with
predictions resulting from regression models, selecting those lines with larger predicted
phenotypic or breeding values when the trait of interest is grain yield or other traits where
the larger score in the trait is better.

On the other hand, lower predicted values can often be desired because they signify a
higher probability of desirable traits (disease resistance) in the selected lines. Therefore,
in genomic prediction for traits such as disease resistance or pest tolerance, where lower
values indicate superior performance, selecting lines with lower predicted values can
lead to the development of improved crop varieties with enhanced resilience to biotic
stresses. Under both scenarios of selecting predicted lines since the lines with larger (or
lower) predictions, the resulting selection guarantees considerably high Specificity and
considerably low Sensitivity. However, many plant breeders may not be fully aware of
this issue, as Sensitivity and Specificity metrics are not commonly integrated explicitly into
their selection processes for identifying the best genetic lines. However, it is important to
recognize that Sensitivity and Specificity are both critical measures of the accuracy and
reliability of selected lines in plant breeding programs.

For this reason, a balanced good level of Sensitivity and Specificity is desired in plant
breeding since high Sensitivity ensures that valuable cultivars with desirable characteristics
are not overlooked, thus maximizing the potential for identifying superior genetic material.
Conversely, high Specificity ensures that resources are efficiently allocated by effectively
excluding cultivars lacking the desired characteristics. For this reason, in this paper, we
explore some existing methods for selecting the best lines and we evaluate their Sensitivity
and Specificity. In this research, we aim to provide detailed explanations of each existing
selection method, introduce Sensitivity and Specificity as metrics for comparing the chosen
selection candidates, and conduct benchmarking to enhance empirical evidence of their
performance. Our benchmarking analysis encompasses five real datasets, comprising four
related to maize and one to soybean, each with multiple traits and environments.

2. Materials and Methods
This study included Maize Data 2–5 (Maize_1. Maize_2, Maize_3, Maize_4) and

Soybean 9 (Soybean_4) from the data employed by Montesinos-Lopez et al. [14] (see
Table A1 from Montesinos-Lopez et al. [14]).

The current analysis with different models is based on the various datasets described
in Table 1. Four datasets are from maize trials while the remaining are from soybean. The
population size of datasets ranged from 1864 to 999 genotypes while the number of applied
quality SNP markers ranged from 1803 to 4085. The number of environments for the four
maize datasets was 11 while the remaining soybean dataset had 8 environments. Four traits
were included on the maize datasets while the soybean dataset comprised six traits.


Plants 2025, 14, 308 5 of 30

Datasets

Table 1. Description of the datasets used for performing the benchmarking analysis. Data comes
from Alencar Xavier. et al. (2021) [15], recently cited by Montesinos-Lopez et al. 2024 [14].

Data Number Dataset Genotypes Markers Environments Traits Reference

Data 1 Maize_1 1000 4085 11 4 Maize Data 2 from [14,15]

Data 2 Maize_2 1000 4085 11 4 Maize Data 3 from [14,15]

Data 3 Maize_3 1000 4085 11 4 Maize Data 4 from [14,15]

Data 4 Maize_4 999 4085 11 4 Maize Data 5 from [14,15]

Data 5 Soybean_4 1864 1803 8 6 Soybean Data 9 from [14,15]

All statistical models provided in the next section were implemented for each environ-
ment of each dataset and the results were reported for each dataset across the environment
to summarize the results. For this reason, the statistical models given in the next section of
the predictor do not take the effect of environments into account.

2.1. Statistical Models

The 5 statistical models studied in this research can be grouped into 2 main classes:

1. GBLUP (RC), GBLUP with a threshold (R), and GBLUP with an optimal fine-tuned
threshold (RO). As described below, models R and RO are the basic RC models but
with certain refinements on how the thresholds are defined;

2. TGBLUP (threshold GBLUP) with a threshold = 0.5 that classifies candidates as top
and non-top (B) and a TBLUP with an optimal probability threshold is denoted as
BO model. The BO model is also trained with the TGBLUP but in place of using
a threshold of 0.5 to classify the lines as top and not top, an optimal probability
threshold that guarantees similar Sensitivity and Specificity is used.

Thus, our study includes models RC, R, and RO, as well as B and BO that are de-
scribed below.

2.1.1. Model RC

Model RC, known as the Bayesian Best Linear Unbiased Predictor (GBLUP) model, is
structured as a regression framework. The model is defined as follows:

Yi = µ + gi + ϵi (1)

Here, Yi represents the continuous response variable observed in the ith instance. Yi are
BLUEs resulting after accounting for environments and experimental factors (blocks, reps,
etc). µ stands for the general mean or intercept. gj, i = 1, . . . , J, signifies the random effect
associated with the ith genotype. Additionally, ϵi denotes the random error component for
the ith genotype, distributed as an independent normal random variable with a mean of 0
and a variance of σ2. It is assumed that g =

(
g1, . . . , gJ

)T ∼ NJ

(
0, σ2

gG
)

, where G is a linear
kernel referred to as the genomic relationship matrix, calculated using the method outlined
in [16]. This model has been implemented in the R statistical software [17], utilizing the
BGLR library [18].

For each fold of the cross-validation, we train the model (1) using the whole training
information, and the predictions are performed for the whole testing set and then the
predicted values are classified as top lines or not top lines. Given our interest in selecting
the top-performing lines for each trait, we introduce the threshold Yτ . This threshold is
determined by the empirical quantile τ of training response values (Y1, . . ., Yntr ). For our


Plants 2025, 14, 308 6 of 30

purposes, we have chosen τ = 0.8, but it is important to note that any other value between
0 and 1 can be employed. The classification of the lines in the testing set as top lines (1)
and not top lines (0) under this model was performed by first ordering the lines in the
testing set (of both observed and predicted) in decreasing order. Then, we identified how
many lines of the observed response were larger than the threshold Yτ , and then this same
number of lines was selected from the ordered vector of predicted lines. It is important
to point out that, from the predicted lines, we selected an equal number of lines as in the
vector of the observed response variable that had the best performance, that is, the top
lines, but in this case, they were not chosen regarding of the threshold Yτ . Next the lines
that matched between the two selected vectors were classified as top lines and those that
did not were classified as not top lines.

2.1.2. Model R

Under this model, the predictions were obtained exactly with the trained model RC
given above with Equation (1), but the process of classification was performed differently.
For this reason, model R stood for GBLUP with a threshold. Here, the threshold, Yτ , was
employed to classify the lines into two categories: top lines (denoted as 1) if Ŷi > Yτ , for
i = 1, . . . , ntst) and not top lines (denotes as 0) if Ŷi < Yτ , for i = 1, . . . , ntst. That is, under
this approach after obtaining the continuous predictions using the model RC, for any lines
with predicted values exceeding the threshold, Yτ , were classified as top lines, while those
with predicted values below the threshold were classified as not top lines [19].

2.1.3. Model RO

The acronym RO stands for Regression Optimum, and this model leverages the RC
model in its training process to fine-tune the threshold. For this reason, this model consists
of the GBLUP model with an optimal threshold. Also, here, the initial threshold was the
80% quantile of the response variable from the training set. However, this initial threshold
was adjusted to ensure a similar Sensitivity and Specificity. A schematic representation of
the procedural steps involved in the training process of Model RO is illustrated in Figure 1.
According to the previously applied approach by Montesinos-López et al. [19], these steps
are briefly elucidated as follows:

Plants 2025, 14, x FOR PEER REVIEW  7  of  32 
 

Figure 1. Schematic depiction of the model RO training process. X represents the input data, encom-

passing markers and other covariates, while Y signifies the continuous response variable. The final 

predictions are binary, with a value of 1 indicating top lines and 0 representing not top lines. 

It is noteworthy that this refined optimal rule can be expressed in terms of conven-

tional threshold values (𝑌ఛ). This equivalence arises from the classification of a line as top 
൫𝑌ప෡ ൐ 𝑌௢൯  being analogous to categorizing it if  𝑌෠௜

∗ ൐ 𝑌ఛ where  𝑌෠௜
∗ ൌ 𝑌ప෡ሺ𝑌ఛ/𝑌௢ሻ. These mod-

ified predicted values, or adjusted predicted values, are designed to ensure a congruent 

balance between Sensitivity and Specificity. 

It is essential to highlight that, for this RO method, we have introduced a more com-

putationally efficient version, which we refer to as the “Simple (S) RO method”. This ap-

proach involves training only one time in place of k times model (1) using the complete 

training (inner-training + validation) dataset and then using this trained model to predict 

the complete testing sets. This means that we utilize the predictions from model RC prior 

to classification and for this reason; this S RO method is computationally more efficient 

since only one  time  is  trained  the method RC. Subsequently, with  the results obtained 

from  the predicted values we divide  these predicted values  into an outer  training and 

validation set. This splitting process of the results of the S RO method is carried out for 

choosing the optimal threshold but uses the predicted values resulting in training only 

one time the full training set. Employing a 10-fold inner cross-validation, we determine 

the optimal threshold for classifying the lines in the testing dataset. The schematic repre-

sentation of this Simple RO method is akin to Figure 1, with the noteworthy difference 

being that we only train model (1) once. This eliminates the need to repeatedly train the 

model  (1)  for  the number of  folds specified  in  the  inner cross-validation, resulting  in a 

significantly more computationally efficient implementation of the Simple RO method. In 

all models under study, we focus on the selection process of selecting the top-performing 

lines, assuming that these are the best lines. For this reason, Sensitivity will be associated 

with how these top lines were selected while Specificity regards how the non-top lines are 

not selected. 

   
Figure 1. Schematic depiction of the model RO training process. X represents the input data,
encompassing markers and other covariates, while Y signifies the continuous response variable. The
final predictions are binary, with a value of 1 indicating top lines and 0 representing not top lines.

Step 1: Initiate by partitioning the data into distinct subsets, comprising the inner-
training, validation, and test sets;


Plants 2025, 14, 308 7 of 30

Step 2: Proceed to train Model RC using the inner-training set while utilizing the
original response variable;

Step 3: Utilize the trained Model RC (from Step 2) on the validation set to compute
predicted continuous values, ŶVal,l for i = 1, . . . , nval . Subsequently, employ these predicted
values to look for the optimal threshold Yo;

Step 4: This optimal threshold (Yo) is identified such that it guarantees that minimizes
the average squared difference between Sensitivity and Specificity;

Step 5: Next, with the complete training dataset (comprising both the inner-training
and validation sets), retrain Model RC. Use this refitted model to compute predicted values
for the testing set, resulting in ŶTest,l for i = 1, . . . , nTest;

Step 6: Subsequently, employing the optimal threshold (Yo) computed in Step 4 and
the predicted values from the testing set in Step 5, classify the lines. If ŶTest,l > Yo, categorize
the line as a top line (1); otherwise, classify it as a not top line (0).

Figure 1 visually illustrates the incorporation of model R within the training process
of model RO.

It is noteworthy that this refined optimal rule can be expressed in terms of conventional
threshold values (Yτ). This equivalence arises from the classification of a line as top
(Ŷi > Yo) being analogous to categorizing it if Ŷ∗

i > Yτ where Ŷ∗
i = Ŷi(Yτ/Yo). These

modified predicted values, or adjusted predicted values, are designed to ensure a congruent
balance between Sensitivity and Specificity.

It is essential to highlight that, for this RO method, we have introduced a more
computationally efficient version, which we refer to as the “Simple (S) RO method”. This
approach involves training only one time in place of k times model (1) using the complete
training (inner-training + validation) dataset and then using this trained model to predict
the complete testing sets. This means that we utilize the predictions from model RC prior
to classification and for this reason; this S RO method is computationally more efficient
since only one time is trained the method RC. Subsequently, with the results obtained from
the predicted values we divide these predicted values into an outer training and validation
set. This splitting process of the results of the S RO method is carried out for choosing the
optimal threshold but uses the predicted values resulting in training only one time the
full training set. Employing a 10-fold inner cross-validation, we determine the optimal
threshold for classifying the lines in the testing dataset. The schematic representation of
this Simple RO method is akin to Figure 1, with the noteworthy difference being that we
only train model (1) once. This eliminates the need to repeatedly train the model (1) for the
number of folds specified in the inner cross-validation, resulting in a significantly more
computationally efficient implementation of the Simple RO method. In all models under
study, we focus on the selection process of selecting the top-performing lines, assuming
that these are the best lines. For this reason, Sensitivity will be associated with how these
top lines were selected while Specificity regards how the non-top lines are not selected.

2.1.4. Model B

Model B, known as the Threshold Bayesian Probit Binary model (TGBLUP), operates
on the premise that given gi (covariates of dimension J), Ybi is a random variable taking
binary values, 0 and 1, with the following probabilities:

P(Ybi = 1|gi) = Φ(β0 + gi) = P(li > 0) (2)

where β0 represents the intercept parameter, gi signifies the random effect associated
with the ith genotype, distributed as per the definition in model (1). Φ is the cumulative
distribution function of the standard normal distribution. Furthermore, li = β0 + gi+ϵi

represents the latent continuous normal process that underlies the observed categories (top


Plants 2025, 14, 308 8 of 30

lines and not top lines), where ϵi is a normal random variable for errors with a mean of 0 and
a variance of 1. These li values are referred to as “liabilities” [20,21]. The binary categorical
phenotypes in model (2) are derived from the underlying phenotypic values, li, as follows:
ybi = 0 if −∞< li < 0, otherwise ybi = 1. Since model (2) is articulated within a Bayesian
framework, it assumes a flat prior distribution for β0 ( f (β0) ∝ 1). The TGBLUP has been
implemented in the BGLR package [18] within the R statistical software [17]. Under this
model after the training process has been computed the probability of P(Ybi = 1|gi) for
each line in the testing set and, if this probability is larger than 0.5, the line is classified as
top line; otherwise, the line is classified as not top line.

2.1.5. Model BO

Model BO is also trained with the TGBLUP using the model given in Equation (2),
but in place of using a threshold of 0.5 to classify the lines as top and not top, we used
an optimal probability threshold that guarantees similar Sensitivity and Specificity. For
estimating this optimal probability threshold (hyperparameter), in addition to dividing
the data into training and testing, the training set was divided into inner-training and
validation. According to Montesinos-López et al. [19], all the steps for implementing this
method are given next:

Step 1: Commence by converting the continuous response variable into a binary
response variable, utilizing the same threshold, Yτ , as employed previously using the
quantile 80% of the training set. Specifically, when the values of the continuous traits
surpass the designated threshold, assign them a value of one (1, denoting a top line);
otherwise, assign zero (0, signifying not top lines);

Step 2: Initially, partition the data into distinct subsets, namely the inner-training,
validation, and test sets;

Step 3: Proceed to train model B, a classification model, using the inner-training set;
Step 4: Employ the trained model B (from Step 3) on the validation set to compute

the predicted probabilities, P̂Val,l for i = 1, . . . , nval . Subsequently, utilize these predicted
probability values to estimate classification accuracy metrics, facilitating the selection of
the optimal probability threshold, (τ 0);

Step 5: Identify the optimal probability threshold, (τ 0), which minimizes the average
of the squared difference between Sensitivity and Specificity;

Step 6: Next, with the complete training dataset (comprising both the inner-training
and validation sets), retrain model B and generate probability predictions for the testing
set. These results are P̂Test,l for i = 1, . . . , nTest, within the testing set;

Step 7: Subsequently, employing the optimal probability threshold (τ0) determined
in Step 5 and the predicted probabilities from the testing set in Step 6, classify the lines. If
P̂Test,l > τ0, categorize the line as a top line (1); otherwise, classify it as a not top line (0).

For further elaboration on these steps, please refer to Figure 2.
In this specific BO method, we have introduced a more computationally efficient

version, which we call the “Simple (S) BO method”. Here, is how it works: We train
model (2) using the entire training (inner-training + validation) dataset and then use this
trained model to predict the entire dataset, leveraging the predictions from model B before
classification. This S BO method reduces computational resources since in place of training
k-times the model with the training data is trained only one time using the whole training
set because the classification model to compute the probabilities is trained only one time in
Figure 2. Next, we take the results from these predicted values for each fold and split them
into an outer training and validation set. Through a 10-fold inner cross-validation process,
we determine the best threshold for classifying the lines in the testing dataset. The visual
representation of the Simple BO method is similar to Figure 2, but the key difference is that


Plants 2025, 14, 308 9 of 30

we only train model (2) once. This eliminates the need to repeatedly train the model (2) for
the specified number of folds in the inner cross-validation, resulting in a significantly more
efficient implementation of the Simple BO method.

Plants 2025, 14, x FOR PEER REVIEW  9  of  32 
 

Figure 2. Schematic representation of the training process under model BO. X represents the input 

data, encompassing markers and additional covariates. Meanwhile, Y symbolizes the continuous 

response variable, and Yb signifies  the binary response variable generated  through  the  transfor-

mation of Y  into a binary  format.  In  the ultimate  classification,  top  lines are designated as “1”, 

whereas not top lines are denoted as “0”. 

In this specific BO method, we have introduced a more computationally efficient ver-

sion, which we call the “Simple (S) BO method”. Here, is how it works: We train model 

(2) using the entire training (inner-training + validation) dataset and then use this trained 

model to predict the entire dataset, leveraging the predictions from model B before clas-

sification. This S BO method reduces computational resources since in place of training k-

times the model with the training data is trained only one time using the whole training 

set because the classification model to compute the probabilities is trained only one time 

in Figure 2. Next, we take the results from these predicted values for each fold and split 

them into an outer training and validation set. Through a 10-fold inner cross-validation 

process, we determine the best threshold for classifying the lines  in the testing dataset. 

The visual representation of the Simple BO method is similar to Figure 2, but the key dif-

ference is that we only train model (2) once. This eliminates the need to repeatedly train 

the model (2) for the specified number of folds in the inner cross-validation, resulting in a 

significantly more efficient implementation of the Simple BO method. 

Ultimately, as all seven evaluated models (RC, R, RO, Simple RO, B, BO, and Simple 

BO) generate predictions in the form of binary outcomes (0 for not top lines and 1 for top 

lines), classification metrics have been computed to evaluate prediction accuracy for the 

testing sets. 

2.2. Evaluation of Prediction Performance 

In our study, we conducted a rigorous evaluation process for benchmarking the pro-

posed models using a nested cross-validation approach. This approach involved two lev-

els of cross-validation: outer-fold cross-validation and inner-fold cross-validation, as out-

lined in Montesinos López et al. [12] The outer-fold cross-validation aimed to assess the 

prediction accuracy of our models on unseen data. We utilized a 5-fold cross-validation 

strategy, where the dataset was randomly split into five subsets or “folds”. The model was 

trained on four of these folds while the remaining one was reserved for testing. This pro-

cess was repeated until each fold had served as the test set once. It is important to note 

that  the  test  sets  were  exclusively  used  for  evaluation  purposes  and  were  never 

Figure 2. Schematic representation of the training process under model BO. X represents the input
data, encompassing markers and additional covariates. Meanwhile, Y symbolizes the continuous
response variable, and Yb signifies the binary response variable generated through the transformation
of Y into a binary format. In the ultimate classification, top lines are designated as “1”, whereas not
top lines are denoted as “0”.

Ultimately, as all seven evaluated models (RC, R, RO, Simple RO, B, BO, and Simple
BO) generate predictions in the form of binary outcomes (0 for not top lines and 1 for top
lines), classification metrics have been computed to evaluate prediction accuracy for the
testing sets.

2.2. Evaluation of Prediction Performance

In our study, we conducted a rigorous evaluation process for benchmarking the
proposed models using a nested cross-validation approach. This approach involved two
levels of cross-validation: outer-fold cross-validation and inner-fold cross-validation, as
outlined in Montesinos López et al. [12] The outer-fold cross-validation aimed to assess the
prediction accuracy of our models on unseen data. We utilized a 5-fold cross-validation
strategy, where the dataset was randomly split into five subsets or “folds”. The model was
trained on four of these folds while the remaining one was reserved for testing. This process
was repeated until each fold had served as the test set once. It is important to note that
the test sets were exclusively used for evaluation purposes and were never incorporated
into the model training process. The average performance across these 5 testing sets was
reported, employing four distinct metrics, as elaborated in the subsequent sections.

Additionally, we computed prediction performance metrics based on the average results
obtained from the 5-fold cross-validation. These metrics included the Kappa coefficient,
Sensitivity, Specificity, and F1 score. For models B, R, and RC, no hyperparameter tuning was
necessary, but for models BO and RO, we fine-tuned a critical hyperparameter: the probability
threshold for model BO and a threshold for the RO model. This was carried out to ensure a
balance between Sensitivity and Specificity. To achieve this balance, we conducted inner-fold
cross-validation using ten-folds. The goal was to optimize the threshold values, which were
selected as the average threshold value across the ten-folds of the inner cross-validation. These
optimized thresholds were subsequently employed in the classification of lines into top and
non-top categories within each testing set, as illustrated in Figures 1 and 2.

Subsequently, we computed various metrics based on the predictions generated by
three distinct models (R, RC, RO, B, and BO) for each testing dataset. These metrics


Plants 2025, 14, 308 10 of 30

are elucidated as follows, Kappa Coefficient (κ): The Kappa coefficient is a statistical
measure used to assess the degree of agreement among raters, accounting for chance.
It is defined as: κ = P0−Pe

1−Pe
, Where P0 represents the agreement between the predicted

and observed values and is computed as (TP + TN)/N, where TN denotes the num-
ber of true negatives, TP signifies the number of true positives, FN represents the num-
ber of false negatives, FP indicates the number of false positives, and N is defined as
N = TP + TN + FP + FN. Pe denotes the probability of agreement and is calculated as
Pe = (TP + FN)/N × (TP + FP)/N + (FP + TN)/N × (FN + TN)/N. Sensitivity: Sensitivity
is the probability of obtaining a positive test result when the true condition is indeed
positive. It is expressed as: Sensitivity = TP/(TP + FN). Specificity: Specificity represents
the probability of obtaining a negative test result when the true condition is negative. It
is formulated as: Specificity = TN/(TN + FP). Precision: Precision measures the ratio of
correctly predicted positive observations to the total predicted positive observations and is
defined as: Precision = TP/(TP + FP). A higher Precision value corresponds to a lower false
positive rate and, importantly, signifies superior prediction accuracy. These metrics serve
as critical indicators for evaluating the performance of our models, providing valuable
insights into their predictive capabilities and the accuracy of their assessments.

The F1 Score, as a composite metric, provides a balanced evaluation by considering
both Sensitivity and Precision. This holistic approach acknowledges the impact of false
negatives and false positives in the assessment, making it particularly valuable when
dealing with datasets that exhibit imbalanced class distribution. Accuracy, on the other
hand, is most effective when the costs associated with false positives and false negatives
are comparable. If there is a significant disparity in the cost implications of these errors, it
is advisable to examine both Precision and Sensitivity [13].

To facilitate interpretation, we compared model RO vs. R, RC, B, and BO and model
BO vs. R, RC, B, and RO. We computed the relative efficiencies (RE) in terms of the Kappa
score, denoted as REKappa. This calculation is expressed as follows:

REKappa =
Kappay

Kappaz

Here, Kappay and Kappaz represent the Kappa coefficients of one of the five models
(RO, R, RC, B, and BO). Similarly, concerning Sensitivity, the relative efficiency (RE) denoted
as RESensitivity, was computed as:

RESensitivity =
Sensitivityy

Sensitivityz

Again, this calculation involved any of the five models. The same approach was
employed to calculate the relative efficiency for the F1 score and Specificity. Under all four
metrics (Kappa, Sensitivity, Specificity, F1), if REx > 1, where x represents Kappa, Sensitivity,
Specificity, or F1, indicates that the method y yielded superior prediction performance.
Conversely, when REx < 1, the preferable method is z. In cases where REx = 1, both
methods exhibit equal efficiency in their predictive capabilities. This systematic evaluation
approach aids in identifying the most effective method for the given context.

3. Results
The results of the genomic prediction models are extensive and include detailed

predictions across a range of environments and traits. While the statistical analysis is
rigorous, it is important to acknowledge that without a clear understanding of the original
data sources or specific traits, the predictions may seem complex or difficult to interpret


Plants 2025, 14, 308 11 of 30

from a practical breeding perspective. However, to address this concern, we have validated
the model predictions against observed phenotypic data to ensure their relevance.

By comparing the genomic estimated breeding values (GEBVs) with real-world field
performance, we can confirm that the predictions correspond well to actual breeding
outcomes. This validation reinforces the utility of the models, moving beyond a theoretical
exercise and demonstrating their applicability in guiding the selection of top-ranking
cultivars for breeding programs. Thus, while the statistical framework is detailed, the
practical value of these predictions is rooted in their strong correspondence to reality,
making them a valuable tool for enhancing selection efficiency in plant breeding.

The results are given in five sections. Sections 1–4 present the prediction performance
of datasets: Maize_1, Maize_2, Maize_3 and Soybean, while Section 5 provides a summary
of the prediction performance across all datasets. Moreover, the results for the dataset
Maize_4 are provided in Appendix B.

3.1. Maize_1 Data

Figure 3 and Table A1 (Appendix A) present the results for the Maize_1 dataset
from the evaluation of two methods for selecting the best candidate lines in a genetic
improvement program, the simplified method (S), and the original method (O) proposed
by Montesinos-López et al. [12]. The results in this table are presented for five models:
Binomial (B), Optimized Binomial (BO), conventional regression with a threshold (R),
threshold-free conventional regression (RC), and optimized regression (RO), under four
classification metrics: F1 score, Kappa coefficient, Sensitivity, and Specificity. The reported
values are the mean across traits and environments. For each of these metrics, the higher
the mean value and the lower the standard error (SE), the better the results obtained by
each model and, consequently, by each method.

Plants 2025, 14, x FOR PEER REVIEW  12  of  32 
 

Figure 3. Prediction performance for dataset Maize_1 using original (O) and simplified (S) methods. 

The results are presented for models B, BO, R, RC, and RO in terms of the metrics: F1 score, Kappa 

coefficient, Sensitivity, and Specificity. 

3.1.1. F1 Score 

Under the F1 Score metric, concerning the “O” method, the best-performing model 

in terms of the average F1 score value was the RO model, with an average value of 0.5126. 

The second-best model was BO with an average value of 0.4813, representing a significant 

improvement of 32.30% compared to the B model, which had an average value of 0.3638 

(See Figure 3 and Table A1). However, BO was outperformed by RO by 6.50%. The worst 

performance was observed in the R model, with an average value of 0.2895, representing 

a decrease of 39.84% compared to BO and 43.52% compared to RO. These results can be 

observed graphically in Figure 3. 

When comparing the “O” and “S” methods using the BO and RO models, it is ob-

served that, in both cases, the “O” method slightly outperforms the “S” method in terms 

of the F1 Score. For BO, the average F1 Score value in the “O” method (0.4813) is 1.56% 

higher  than  in  the “S” method (0.4739). For RO, the average F1 Score value  in the “O” 

method (0.5126) is 1.39% higher than in the “S” method (0.5056). These results reflect that 

the “O” method has a slight advantage over the “S” method in terms of accuracy in se-

lecting the top lines; however, the “S” method is considerably less computationally de-

manding but sacrifices some accuracy. 

3.1.2. Kappa Coefficient 

The results under the Kappa metric, in the context of the “O” method, are detailed in 

Table A1 and Figure 3. The RO model stands out as the best, with an average Kappa value 

of 0.3004. It is followed by the RC model with an average value of 0.2914, and then BO, B, 

and R in that order. RO outperforms BO (average value of 0.2462) by 22%. In contrast, the 

R model shows the worst performance, with an average Kappa value of 0.1734, marking 

a significant decrease of 29.58% compared to BO and 42.28% compared to RO (see Figure 

3 and Table 1). 

On the other hand, when comparing the “O” and “S” methods in terms of the average 

results of BO and RO, it can be observed in Table A1 and Figure 3 that the “O” method 

outperformed the “S” method. For O, the average Kappa value of BO (0.2462) is 20.26% 

Figure 3. Prediction performance for dataset Maize_1 using original (O) and simplified (S) methods.
The results are presented for models B, BO, R, RC, and RO in terms of the metrics: F1 score, Kappa
coefficient, Sensitivity, and Specificity.

3.1.1. F1 Score

Under the F1 Score metric, concerning the “O” method, the best-performing model in
terms of the average F1 score value was the RO model, with an average value of 0.5126.
The second-best model was BO with an average value of 0.4813, representing a significant
improvement of 32.30% compared to the B model, which had an average value of 0.3638


Plants 2025, 14, 308 12 of 30

(See Figure 3 and Table A1). However, BO was outperformed by RO by 6.50%. The worst
performance was observed in the R model, with an average value of 0.2895, representing
a decrease of 39.84% compared to BO and 43.52% compared to RO. These results can be
observed graphically in Figure 3.

When comparing the “O” and “S” methods using the BO and RO models, it is observed
that, in both cases, the “O” method slightly outperforms the “S” method in terms of the F1
Score. For BO, the average F1 Score value in the “O” method (0.4813) is 1.56% higher than
in the “S” method (0.4739). For RO, the average F1 Score value in the “O” method (0.5126)
is 1.39% higher than in the “S” method (0.5056). These results reflect that the “O” method
has a slight advantage over the “S” method in terms of accuracy in selecting the top lines;
however, the “S” method is considerably less computationally demanding but sacrifices
some accuracy.

3.1.2. Kappa Coefficient

The results under the Kappa metric, in the context of the “O” method, are detailed in
Table A1 and Figure 3. The RO model stands out as the best, with an average Kappa value
of 0.3004. It is followed by the RC model with an average value of 0.2914, and then BO, B,
and R in that order. RO outperforms BO (average value of 0.2462) by 22%. In contrast, the
R model shows the worst performance, with an average Kappa value of 0.1734, marking a
significant decrease of 29.58% compared to BO and 42.28% compared to RO (see Figure 3
and Table 1).

On the other hand, when comparing the “O” and “S” methods in terms of the average
results of BO and RO, it can be observed in Table A1 and Figure 3 that the “O” method
outperformed the “S” method. For O, the average Kappa value of BO (0.2462) is 20.26% higher
than the value achieved by the “S” method (BO = 0.2048). In the case of RO, the average Kappa
value in the “O” method is 0.3004, which is 10.66% higher than that of the “S” method (0.2715).
However, it is essential to remember that the “S” method offers a considerable advantage in
computational efficiency in its operation compared to the “O” method.

3.1.3. Sensitivity

In the Maize_1 dataset, five models (B, BO, R, RC, and RO) were evaluated under the
“O” and “S” methods in terms of the Sensitivity metric, and the results are shown in Table 1.
This metric is essential for evaluating a model’s ability to correctly identify “top lines when
they are actually top”, thus complementing the information provided by the F1 and Kappa
metrics that have also been previously analyzed in the same context. Figure 3 provides a
graphical representation of the results of this evaluation.

In the “O” method, the RO model stands out by obtaining the highest average value
of Sensitivity, with an impressive 0.6936. It is closely followed by the BO model with an
average value of 0.6784, representing a significant increase of 127.54% compared to the B
model, which has an average value of 0.2982. In contrast, the R model shows the lowest
performance with an average Sensitivity of 0.1586, demonstrating a challenge in its ability
to identify “top lines”.

Comparing the “O” and “S” methods directly using the BO and RO models, it is
observed that, in both cases, the “S” method strongly outperforms the “O” method in terms
of Sensitivity. For BO, the average Sensitivity value in the “S” method (0.8091) is 19.3%
higher than in the “O” method (0.6784). Likewise, for RO, the average Sensitivity value in
the “S” method (0.7673) is 10.62% higher than in the “O” method (0.6936).

3.1.4. Specificity

In the “O” method, the R model stands out by achieving the highest average value of
Specificity (0.9786), indicating a strong ability to identify the “non-top” lines. It is closely


Plants 2025, 14, 308 13 of 30

followed by the B and RC models with average values of 0.9036 and 0.9017, respectively.
On the other hand, the RO and BO models show the lowest performances, with an average
Specificity of 0.6797 and 0.6340, respectively. This indicates, for example, that BO has a
35.22% lower ability to effectively identify the non-top lines compared to the R model.

When comparing the “O” and “S” methods directly using the BO and RO models,
it is observed that, in both cases, the “O” method surpasses the “S” method in terms of
a higher Specificity value. For the BO model, the average Specificity value in the “O”
method (0.6339) is 28.4% higher than in the “S” method (0.4939). For RO, the average
Specificity value in the “O” method (0.6797) is 13.0% higher than in the “S” method (0.6016).
Consequently, and bearing in mind that a very high value of Specificity is not necessarily
the best, as it affects Sensitivity, the S method exhibits better behavior.

In conclusion, this exhaustive study of the “O” and “S” methods applied to the Maize_1
dataset, evaluating crucial metrics such as F1, Kappa, Sensitivity, and Specificity, reveals
clear patterns. The BO and RO models consistently demonstrate superior performance in
accurately selecting “outstanding lines”.

Although the “O” method generally exhibits better performance, the “S” method
stands out for its computational efficiency, especially evident in improving Sensitivity in
BO and RO models. Furthermore, it is emphasized that, for genetic improvement, high
values in Sensitivity, Kappa, and F1 are more desirable than in Specificity, aligning with the
objective of effectively identifying the top lines. These results reflect the necessary balance
that must exist between accuracy and efficiency in selecting “outstanding lines”.

3.2. Maize_2

Figure 4 (Table A2, Appendix A) shows the comparison of the performance achieved
by the “O” and “S” methods using the results obtained by the BO and RO models. It is
observed that in terms of the F1 Score, the “O” methods outperforms the “S” methods
in both models. For example, for the BO model, the average F1 Score value in the “O”
method (0.5086) is 7.81% higher than in the “S” method (0.4717). In the case of the RO
model, the average F1 Score value in the “O” method (0.5562) is 6.46% higher than in the
“S” method (0.5224). However, it is noted that the difference between the models is not
very high. Additionally, it should be considered that the S model has a simpler operation,
requiring fewer computational requirements than the O model.

Plants 2025, 14, x FOR PEER REVIEW  14  of  32 
 

3.2. Maize_2 

Figure 4 (Table A2, Appendix A) shows the comparison of the performance achieved 

by the “O” and “S” methods using the results obtained by the BO and RO models. It is 

observed that in terms of the F1 Score, the “O” methods outperforms the “S” methods in 

both models.  For  example,  for  the BO model,  the  average F1  Score  value  in  the  “O” 

method  (0.5086)  is 7.81% higher than  in the “S” method  (0.4717). In  the case of  the RO 

model, the average F1 Score value in the “O” method (0.5562) is 6.46% higher than in the 

“S” method (0.5224). However, it is noted that the difference between the models is not 

very high. Additionally, it should be considered that the S model has a simpler operation, 

requiring fewer computational requirements than the O model. 

 
Figure 4. Prediction performance for dataset Maize_2 using original (O) and simplified (S) methods. 

The results are presented for models B, BO, R, RC, and RO in terms of the metrics: F1 score, Kappa 

coefficient, Sensitivity, and Specificity. 

3.2.1. F1 Score 

Under the F1 Score metric, the RO model excels in the “O” method with an average 

value of 0.5562. It is closely followed by the BO model with a value of 0.5086, representing 

a significant increase of 23.84% compared to the B model, which had an average value of 

0.4107. However, BO was surpassed by RO by 9.36%. In third place is the RC model with 

an F1 value of 0.4713, surpassing B by 14.76%. The R model exhibited the poorest perfor-

mance with an average value of 0.3739, marking a significant decrease of 32.78% com-

pared to RO and 26.49% compared to BO. 

3.2.2. Kappa Coefficient 

Under the Kappa metric, in the “O” method, the RO model stands out with an aver-

age value of 0.3716. It is followed by the RC model with 0.3576, and then BO (0.2979), B 

(0.2885), and R (0.2828) in that order, as shown in Figure 4 and Table A2. According to 

these  results, RO surpasses BO by 24.74%. Regarding  the R model, which exhibits  the 

poorest performance, it marks a significant decrease of 23.89% compared to RO. 

When directly comparing the “O” and “S” methods using the BO and RO models, it 

is  observed  that,  in  terms  of Kappa,  the  “O” method  evidently  outperforms  the  “S” 

method. For the BO case, the average Kappa value in the “O” method (0.2979) is 49.87% 

higher than in the “S” method (0.1988). For the RO model, the average Kappa value in the 

“O” method (0.3716) is 26.13% higher than in the “S” method (0.2946). Given these results, 

Figure 4. Prediction performance for dataset Maize_2 using original (O) and simplified (S) methods.
The results are presented for models B, BO, R, RC, and RO in terms of the metrics: F1 score, Kappa
coefficient, Sensitivity, and Specificity.


Plants 2025, 14, 308 14 of 30

3.2.1. F1 Score

Under the F1 Score metric, the RO model excels in the “O” method with an average
value of 0.5562. It is closely followed by the BO model with a value of 0.5086, representing
a significant increase of 23.84% compared to the B model, which had an average value
of 0.4107. However, BO was surpassed by RO by 9.36%. In third place is the RC model
with an F1 value of 0.4713, surpassing B by 14.76%. The R model exhibited the poorest
performance with an average value of 0.3739, marking a significant decrease of 32.78%
compared to RO and 26.49% compared to BO.

3.2.2. Kappa Coefficient

Under the Kappa metric, in the “O” method, the RO model stands out with an average
value of 0.3716. It is followed by the RC model with 0.3576, and then BO (0.2979), B
(0.2885), and R (0.2828) in that order, as shown in Figure 4 and Table A2. According to these
results, RO surpasses BO by 24.74%. Regarding the R model, which exhibits the poorest
performance, it marks a significant decrease of 23.89% compared to RO.

When directly comparing the “O” and “S” methods using the BO and RO models, it is
observed that, in terms of Kappa, the “O” method evidently outperforms the “S” method.
For the BO case, the average Kappa value in the “O” method (0.2979) is 49.87% higher
than in the “S” method (0.1988). For the RO model, the average Kappa value in the “O”
method (0.3716) is 26.13% higher than in the “S” method (0.2946). Given these results, it is
important to emphasize that before settling for a method, it is necessary to carefully review
all the evaluated metrics.

3.2.3. Sensitivity

Under this metric, the RO and BO models stand out significantly compared to the
RC, B, and R models, under the “O” method. This is because RO and BO obtain the
highest average Sensitivity values, 0.7398 and 0.7003, respectively. For example, for RO,
this represents a significant increase of 179.65% compared to the R model, which had the
lowest average value (0.2645) among all models. In the same vein, BO surpasses the R
model by 164.74%.Thus, the difference between RO and BO is only 5.31%, with BO being
slightly lower, placing both models as the best options in this method.

Finally, we compare the “O” and “S” methods using only the BO and RO models. In
Figure 4, it is observed that, in terms of Sensitivity, the “S” method strongly surpasses the
“O” method, which had not happened in the results of the F1 and Kappa metrics for the
Maize_2 dataset. For BO, the average Sensitivity value in the “S” method (0.8687) is 24.03%
higher than in the “O” method (0.7004). Similarly, for RO, the average Sensitivity value in
the “S” method (0.8461) is 14.37% higher than in the “O” method (0.7398). These results
highlight the importance of considering multiple studies before settling on a particular
method and/or model.

3.2.4. Specificity

The R model stands out with the highest average Specificity value (0.9681), indicating
a strong ability to identify the “non-top” lines, for the “O” method. It is closely followed by
the RC and B models with average values of 0.9182 and 0.9135, respectively. On the other
hand, the lowest performances are obtained by the RO and BO models, with an average
Specificity of 0.7148 and 0.6709, respectively. This indicates, for example, that BO has a
30.70% lower capacity to effectively identify the non-top lines compared to the R model.
Remembering that achieving very high values of Specificity is not desirable, and it is better
to excel in other metrics.


Plants 2025, 14, 308 15 of 30

When comparing the “O” and “S” methods directly using the BO and RO models, it is
observed that, in terms of Specificity, the “O” method outperforms the “S” method. For
BO, the average Specificity value in the “O” method (0.6709) is 50.55% higher than in the
“S” method (0.4456). For RO, the average Specificity value in the “O” method (0.7148) is
24.43% higher than in the “S” method (0.5745). Consequently, and remembering that a very
high value of Specificity is not necessarily the best, as it affects Sensitivity, the S method
exhibits better behavior.

In summary, this exhaustive study on the “O” and “S” methods applied to the Maize_2
dataset, evaluating crucial metrics such as F1, Kappa, Sensitivity, and Specificity, reveals
clear patterns. The BO and RO models consistently demonstrate superior performance
in the precise selection of outstanding lines. Although the “O” method generally exhibits
better performance, the “S” method stands out for its computational efficiency, especially
evident in improving the Sensitivity value in BO and RO models. Furthermore, it is
emphasized that, for genetic improvement, high values in Sensitivity, Kappa, and F1 are
more desirable than in Specificity, aligning with the objective of effectively identifying the
top lines. These results reflect the necessary balance that must exist between Precision and
efficiency in the selection of outstanding lines.

3.3. Maize_3

The obtained results, for the Maize:3 dataset, are presented in Figure 5 and Table A3
(Appendix A). The results are organized for each method, for each of the five models
(B, BO, R, RC, and RO), and across the four classification metrics (F1, Kappa, Sensitivity,
and Specificity).

Plants 2025, 14, x FOR PEER REVIEW  16  of  32 
 

BO, R, RC, and RO), and across the four classification metrics (F1, Kappa, Sensitivity, and 

Specificity). 

 
Figure 5. Prediction performance for dataset Maize_3 using original (O) and simplified (S) methods. 

The results are presented for models B, BO, R, RC, and RO in terms of the metrics: F1 score, Kappa 

coefficient, Sensitivity, and Specificity. 

3.3.1. F1 Score 

Under the F1 Score metric, the RO model stands out in the “O” method with an av-

erage value of 0.5214. It is closely followed by the BO model with a value of 0.4778, repre-

senting a significant increase of 38.93% compared to the B model, which had an average 

value of 0.3439. However, BO was surpassed by RO by 9.15%. In  third place  is the RC 

model with an F1 value of 0.4284, surpassing B by 24.58%. The R model exhibited  the 

poorest performance with an average value of 0.2856, marking a significant decrease of 

45.23% compared to RO, and 40.22% compared to BO. 

When directly comparing the “O” and “S” methods using the BO and RO models, it 

is observed  that  in  terms of F1 Score, both methods achieve very similar performance, 

slightly surpassed by the “O” method in both models. For BO, the average F1 Score value 

in the “O” method (0.4778) is 1.29% higher than in the “S” method (0.4716). For RO, the 

average F1 Score value in the “O” method (0.5214) is 0.5% higher than in the “S” method 

(0.5189). Besides this small difference, it should be considered that the S model has a sim-

pler operation, demanding fewer computational requirements than the O model. 

3.3.2. Kappa Coefficient 

Under the Kappa metric, in the “O” method, the RO model stands out with an aver-

age value of 0.3114. It is followed by the RC model with 0.2885, and then declining is the 

BO model (0.2413), B (0.2124), and R (0.1817), in that order. According to these results, RO 

surpasses BO by 29.05%. Regarding the R model, which exhibits the poorest performance, 

it marks a significant decrease of 41.67% compared to the RO model. 

When comparing the “O” and “S” methods directly using the models BO and RO, it 

is observed that, in terms of Kappa, the method “O” surpasses the method “S”. For the 

case of BO, the average Kappa value in the “O” method (0.2413) is 20.89% higher than in 

Figure 5. Prediction performance for dataset Maize_3 using original (O) and simplified (S) methods.
The results are presented for models B, BO, R, RC, and RO in terms of the metrics: F1 score, Kappa
coefficient, Sensitivity, and Specificity.

3.3.1. F1 Score

Under the F1 Score metric, the RO model stands out in the “O” method with an
average value of 0.5214. It is closely followed by the BO model with a value of 0.4778,
representing a significant increase of 38.93% compared to the B model, which had an
average value of 0.3439. However, BO was surpassed by RO by 9.15%. In third place is the
RC model with an F1 value of 0.4284, surpassing B by 24.58%. The R model exhibited the


Plants 2025, 14, 308 16 of 30

poorest performance with an average value of 0.2856, marking a significant decrease of
45.23% compared to RO, and 40.22% compared to BO.

When directly comparing the “O” and “S” methods using the BO and RO models,
it is observed that in terms of F1 Score, both methods achieve very similar performance,
slightly surpassed by the “O” method in both models. For BO, the average F1 Score value
in the “O” method (0.4778) is 1.29% higher than in the “S” method (0.4716). For RO, the
average F1 Score value in the “O” method (0.5214) is 0.5% higher than in the “S” method
(0.5189). Besides this small difference, it should be considered that the S model has a
simpler operation, demanding fewer computational requirements than the O model.

3.3.2. Kappa Coefficient

Under the Kappa metric, in the “O” method, the RO model stands out with an average
value of 0.3114. It is followed by the RC model with 0.2885, and then declining is the BO
model (0.2413), B (0.2124), and R (0.1817), in that order. According to these results, RO
surpasses BO by 29.05%. Regarding the R model, which exhibits the poorest performance,
it marks a significant decrease of 41.67% compared to the RO model.

When comparing the “O” and “S” methods directly using the models BO and RO, it
is observed that, in terms of Kappa, the method “O” surpasses the method “S”. For the
case of BO, the average Kappa value in the “O” method (0.2413) is 20.89% higher than in
the “S” method (0.1996). For the RO model, the average Kappa value in the “O” method
(0.3114) is 6.42% higher than in the “S” method (0.2926). Given these results, it is important
to emphasize that before settling for a method, it is necessary to carefully review all the
evaluated metrics.

3.3.3. Sensitivity

In the “O” method, the models RO and BO stand out significantly compared to the
models RC, B, and R; since RO and BO obtain the highest average Sensitivity values of
0.6985 and 0.6657, respectively. This represents, for example, for RO, a significant increase
of 302.0% compared to the R model, which had the lowest average value (0.1738) among all
the models. In the case of the BO model, it surpasses the R model by 283.10%, very similar
to RO’s performance.

When comparing the “O” and “S” methods directly using the models BO and RO, it is
observed that, in terms of Sensitivity, the method “S” strongly outperforms the method “O”,
which did not happen in the results of the F1 and Kappa metrics for the Maize_3 dataset. For
BO, the average Sensitivity value in the “S” method (0.7884) is 18.43% higher than in the “O”
method (0.6657). Likewise, for RO, the average Sensitivity value in the “S” method (0.7613)
is 8.99% higher than in the “O” method (0.6986). These results highlight the importance of
considering several studies before settling for a specific method and/or model.

3.3.4. Specificity

In the “O” method, the R model stands out with the highest average Specificity value
(0.9726), indicating a strong ability to identify the “non-top” lines. It is closely followed by
the B and RC models with average values of 0.9074 and 0.8979, respectively. On the other
hand, the RO and BO models show the lowest performances, with an average Specificity
of 0.6828 and 0.6361, respectively. This indicates, for example, that BO has a 34.6% lower
capacity to effectively identify the non-top lines compared to the R model.

When comparing the “O” and “S” methods directly using the models BO and RO, it is
observed that, in terms of Specificity, the method “O” outperforms the method “S”. For BO,
the average Specificity value in the “O” method (0.6361) is 27.47% higher than in the “S”
method (0.4990). For RO, the average Specificity value in the “O” method (0.6828) is 9.74%
higher than in the “S” method (0.6222). Consequently, and remembering that a very high


Plants 2025, 14, 308 17 of 30

value of Specificity is not necessarily the best, as it affects Sensitivity, the S method exhibits
better behavior.

In summary, this comprehensive study on the “O” and “S” methods applied to the
Maize_3 dataset, evaluating crucial metrics such as F1, Kappa, Sensitivity, and Specificity,
reveals clear patterns. The models BO and RO consistently demonstrate superior perfor-
mance in the precise selection of outstanding lines. Although the “O” method generally
exhibits better performance, the “S” method stands out for its computational efficiency,
especially evident in improving the value of Sensitivity in BO and RO models. Furthermore,
it is emphasized that, for genetic improvement, high values in Sensitivity, Kappa, and F1
are more desirable than in Specificity, aligning with the objective of effectively identifying
the top lines. These results reflect the necessary balance that must exist between Precision
and efficiency in the selection of outstanding lines.

3.4. Soybean

The results for the Soybean dataset have been presented in Figure 6 and Table A4
(Appendix A). The arrangement of results has been structured, considering each method,
the five distinct models (B, BO, R, RC, and RO), and the four crucial classification metrics
(F1, Kappa, Sensitivity, and Specificity).

Plants 2025, 14, x FOR PEER REVIEW  18  of  32 
 

Figure 6. Prediction performance for dataset Soybean using Original (O) and simplified (S) methods. 

The results are presented for models B, BO, R, RC, and RO in terms of the metrics: F1 score, Kappa 

coefficient, Sensitivity, and Specificity. 

3.4.1. F1 Score 

Under the F1 Score, the RO model emerges as the leader in the “O” method with a 

prominent average value of 0.5055. It is closely followed by the RC model, registering a 

value of 0.4335, representing a notable increase of 38.66% compared to the B model, which 

averaged 0.3126. However, RC was surpassed by RO by 16.62%. In third place is the BO 

model, with an F1 value of 0.4226, surpassing B by 35.17%. The R model exhibited the 

lowest performance with an average value of 0.3083, marking a substantial decrease of 

63.96% compared to RO and 27.03% compared to BO. 

When directly comparing the “O” and “S” approaches applying the BO and RO mod-

els, it becomes evident that in terms of the F1 Score, both methods achieve very similar 

performance. For BO, the average F1 Score value in the “S” approach (0.4505) surpasses 

by 6.61% that obtained in the “O” approach (0.4226). In contrast, for RO, the average F1 

Score value in the “O” approach (0.5055) is 3.02% higher than in the “S” approach (0.4907). 

In addition to this slight difference, it is crucial to consider that the S model is character-

ized by a simpler operation, requiring fewer computational capabilities compared to the 

O model. 

3.4.2. Kappa Coefficient 

Under the Kappa criteria, in the “O” method, the RO model stands out with an aver-

age value of 0.3025. It is closely followed by the RC model with 0.2993, followed by the R 

model (0.1778), B (0.1636), and BO (0.0967), in that order, showing a marked decline in 

performance. With these results, RO surpasses RC by 1.07%. In contrast, the BO model 

exhibits the worst performance, with a significant decrease of 68.05% compared to the RO 

model. 

When contrasting the “O” and “S” methods directly using the BO and RO models, a 

balance in terms of Kappa is evident. For BO, the average Kappa value in the “S” approach 

(0.1871)  is notably higher, with  an  increase of  93.62%  compared  to  the  “O”  approach 

(0.0966). On the other hand, for RO, the average Kappa value in the “O” approach (0.3025) 

is higher, with an increase of 17.49%, compared to the value in the “S” approach (0.2574). 

Figure 6. Prediction performance for dataset Soybean using Original (O) and simplified (S) methods.
The results are presented for models B, BO, R, RC, and RO in terms of the metrics: F1 score, Kappa
coefficient, Sensitivity, and Specificity.

3.4.1. F1 Score

Under the F1 Score, the RO model emerges as the leader in the “O” method with a
prominent average value of 0.5055. It is closely followed by the RC model, registering a
value of 0.4335, representing a notable increase of 38.66% compared to the B model, which
averaged 0.3126. However, RC was surpassed by RO by 16.62%. In third place is the BO
model, with an F1 value of 0.4226, surpassing B by 35.17%. The R model exhibited the
lowest performance with an average value of 0.3083, marking a substantial decrease of
63.96% compared to RO and 27.03% compared to BO.

When directly comparing the “O” and “S” approaches applying the BO and RO
models, it becomes evident that in terms of the F1 Score, both methods achieve very similar
performance. For BO, the average F1 Score value in the “S” approach (0.4505) surpasses by


Plants 2025, 14, 308 18 of 30

6.61% that obtained in the “O” approach (0.4226). In contrast, for RO, the average F1 Score
value in the “O” approach (0.5055) is 3.02% higher than in the “S” approach (0.4907). In
addition to this slight difference, it is crucial to consider that the S model is characterized by
a simpler operation, requiring fewer computational capabilities compared to the O model.

3.4.2. Kappa Coefficient

Under the Kappa criteria, in the “O” method, the RO model stands out with an average
value of 0.3025. It is closely followed by the RC model with 0.2993, followed by the R
model (0.1778), B (0.1636), and BO (0.0967), in that order, showing a marked decline in
performance. With these results, RO surpasses RC by 1.07%. In contrast, the BO model
exhibits the worst performance, with a significant decrease of 68.05% compared to the
RO model.

When contrasting the “O” and “S” methods directly using the BO and RO models, a
balance in terms of Kappa is evident. For BO, the average Kappa value in the “S” approach
(0.1871) is notably higher, with an increase of 93.62% compared to the “O” approach (0.0966).
On the other hand, for RO, the average Kappa value in the “O” approach (0.3025) is higher,
with an increase of 17.49%, compared to the value in the “S” approach (0.2574). These
findings emphasize the importance of carefully evaluating all metrics before deciding on
the most appropriate method.

3.4.3. Sensitivity

In the “O” method, the BO and RO models stand out significantly compared to the
RC, B, and R models. BO and RO achieve the highest average Sensitivity values, with
0.8724 and 0.6956, respectively. This represents, for example, in the case of BO, a significant
increase of 436.49% compared to the R model, which obtained the lowest average value
(0.1626) among all the models. Likewise, the RO model surpasses the R model by 327.75%,
showing performance close to that of BO.

When making a direct comparison between the “O” and “S” methods using the BO
and RO models, a balance in terms of Sensitivity is evident. For BO, the average Sensitivity
value in the “O” method (0.8724) is higher by 12.61% than the value in the “S” method
(0.7747). On the other hand, for RO, the average Sensitivity value in the “S” method (0.7882)
is 13.32% higher than in the “O” method (0.6956). These results underscore the importance
of considering multiple studies before deciding on a specific method and/or model.

3.4.4. Specificity

In the “O” approach, the R model stands out with the highest average Specificity value,
reaching 0.9756, indicating its strong ability to identify the “non-top” lines. Subsequently,
B and RC follow with the best performances, with average values of 0.9549 and 0.8947,
respectively. In contrast, the RO and BO models show the lowest performances, with an
average Specificity of 0.6826 and 0.2851, respectively. This highlights, for example, that
BO has a 70.78% lower ability to effectively identify the “non-top” lines compared to the
R model.

The results of comparing the “O” and “S” methods using the models BO and RO
demonstrate a balance in terms of Specificity, as shown in Figure 7 and Table A5 (see
Appendix A). For BO, the average Specificity value in the “S” method (0.5001) significantly
surpasses by 75.64% the value in the “O” method (0.2851). On the other hand, for RO,
the average Specificity value in the “O” method (0.6826) is 18.91% higher than in the “S”
method (0.5740). Consequently, and bearing in mind that a very high value of Specificity
is not always the best, as it can affect Sensitivity, the S method shows a better behavior in
terms of RO, while the O method stands out in terms of BO.


Plants 2025, 14, 308 19 of 30Plants 2025, 14, x FOR PEER REVIEW  20  of  32 
 

Figure 7. Prediction performance for Across_Data using original (O) and simplified (S) methods. 

The results are presented for models B, BO, R, RC, and RO in terms of the metrics: F1 score, Kappa 

coefficient, Sensitivity, and Specificity. 

3.5. Across_Data 

Figure 7 and Table A5 (Appendix A) present the average results for Across_Data. The 

Across_Data was computed by averaging the results of all the datasets with the goal of 

obtaining a complete picture of the prediction performance of the proposed methods. 

3.5.1. F1 Score 

Under the F1 Score metric, it is observed that the model RO stands out in the “O” 

method with an average value of 0.5276. It is closely followed by the BO model with a 

value of 0.4813, representing a significant increase of 29.88% compared to model B, which 

had an average value of 0.3706. However, BO is surpassed by RO by 9.62%. In third place 

is the RC model with an F1 value of 0.4485, surpassing B by 21.03%. The R model shows 

the worst performance with an average value of 0.3279, marking a significant decrease of 

37.84% compared to RO and 31.86% compared to BO. 

In the “S” method, the RO model again stands out with an average F1 Score value of 

0.5097. The BO model comes second with an average value of 0.4676. It is closely followed 

by the RC model with a value of 0.4533, representing a reduction of 3.07% compared to 

the performance achieved by BO. Additionally, RO surpasses RC by 12.45% in the ability 

to select the top lines of the dataset. The lowest results are presented by the B and R mod-

els, with R showing the worst performance with an average value of 0.3297. Therefore, 

RO and BO achieve an  improvement of 54.60% and 41.83% compared  to  the R model, 

respectively. 

Comparing the “O” and “S” methods directly using the BO and RO models, it is ob-

served that in terms of F1 Score, both methods achieve very similar performance, superior 

in the “O” method for both models. For BO, the average F1 Score value in the “O” method 

(0.4813)  is 2.97% higher  than  in  the “S” method  (0.4676). For RO,  the average F1 Score 

value in the “O” method (0.5276) is 3.51% higher than in the “S” method (0.5097). Further-

more, this small difference should be considered in light of the fact that the S model has a 

simpler operation, demanding fewer computational requirements than the O model. 

Figure 7. Prediction performance for Across_Data using original (O) and simplified (S) methods.
The results are presented for models B, BO, R, RC, and RO in terms of the metrics: F1 score, Kappa
coefficient, Sensitivity, and Specificity.

In summary, this exhaustive study of the “O” and “S” methods applied to the Soybean
dataset, evaluating crucial metrics such as F1, Kappa, Sensitivity, and Specificity, reveals
clear patterns. The BO and RO models consistently demonstrate superior performance in
the precise selection of “outstanding lines”. Although the “O” method generally exhibits
better performance, the “S” method stands out for its computational efficiency, especially
evident in improving the Sensitivity value in BO and RO models. Furthermore, it is
emphasized that, for genetic improvement, high values in Sensitivity, Kappa, and F1 are
more desirable than in Specificity, aligning with the objective of effectively identifying the
top lines. These results reflect the necessary balance that must exist between Precision and
efficiency in the selection of “outstanding lines”.

3.5. Across_Data

Figure 7 and Table A5 (Appendix A) present the average results for Across_Data. The
Across_Data was computed by averaging the results of all the datasets with the goal of
obtaining a complete picture of the prediction performance of the proposed methods.

3.5.1. F1 Score

Under the F1 Score metric, it is observed that the model RO stands out in the “O”
method with an average value of 0.5276. It is closely followed by the BO model with a
value of 0.4813, representing a significant increase of 29.88% compared to model B, which
had an average value of 0.3706. However, BO is surpassed by RO by 9.62%. In third place
is the RC model with an F1 value of 0.4485, surpassing B by 21.03%. The R model shows
the worst performance with an average value of 0.3279, marking a significant decrease of
37.84% compared to RO and 31.86% compared to BO.

In the “S” method, the RO model again stands out with an average F1 Score value of
0.5097. The BO model comes second with an average value of 0.4676. It is closely followed
by the RC model with a value of 0.4533, representing a reduction of 3.07% compared to the
performance achieved by BO. Additionally, RO surpasses RC by 12.45% in the ability to
select the top lines of the dataset. The lowest results are presented by the B and R models,


Plants 2025, 14, 308 20 of 30

with R showing the worst performance with an average value of 0.3297. Therefore, RO and
BO achieve an improvement of 54.60% and 41.83% compared to the R model, respectively.

Comparing the “O” and “S” methods directly using the BO and RO models, it is
observed that in terms of F1 Score, both methods achieve very similar performance, superior
in the “O” method for both models. For BO, the average F1 Score value in the “O” method
(0.4813) is 2.97% higher than in the “S” method (0.4676). For RO, the average F1 Score value
in the “O” method (0.5276) is 3.51% higher than in the “S” method (0.5097). Furthermore,
this small difference should be considered in light of the fact that the S model has a simpler
operation, demanding fewer computational requirements than the O model.

3.5.2. Kappa Coefficient

Under the Kappa metric, in the “O” method, the model RO stands out with an average
value of 0.3289. It is followed by the RC model with 0.3164, and then descending the BO
model (0.2415), B (0.2393), and R (0.2161), in that order. According to these results, RO
surpasses BO by 36.21%. Regarding the R model, which shows the worst performance, it
marks a significant decrease of 34.29% compared to the RO model.

Comparing the “O” and “S” methods directly using the BO and RO models, it is
observed that, in terms of Kappa, the “O” method surpasses the “S” method. For BO, the
average Kappa value in the “O” method (0.2415) is 20.24% higher than in the “S” method
(0.2008). For the RO model, the average Kappa value in the “O” method (0.3289) is 17.05%
higher than in the “S” method (0.2809). Given these results, it is important to emphasize
that before choosing a method, it is necessary to carefully review all the evaluated metrics.

3.5.3. Sensitivity

In the “O” method, the BO and RO models stand out significantly compared to the
RC, B, and R models; as BO and RO achieve the highest average Sensitivity values of 0.7224
and 0.7117, respectively. This represents, for example, a significant increase of 255.68%
for BO compared to the R model, which had the lowest average value (0.2031) among all
models. In the case of the BO model, it surpasses the R model by 250.41%, very similar to
RO’s performance.

Comparing the “O” and “S” methods directly using the BO and RO models, it is
observed that, in terms of Sensitivity, the “S” method strongly surpasses the “O” method,
which does not occur in the F1 and Kappa metric results for the Across_Data dataset. For
BO, the average Sensitivity value in the “S” method (0.8182) is 13.27% higher than in the “O”
method (0.7224). Likewise, for RO, the average Sensitivity value in the “S” method (0.7983)
is 12.8% higher than in the “O” method (0.7117). These results underscore the importance
of considering various studies before opting for a particular method and/or model.

3.5.4. Specificity

In the “O” method, the R model stands out with the highest average Specificity value
(0.9718), indicating a strong ability to identify the “non-top” lines. It is closely followed by
the B and RC models with average values of 0.9199 and 0.9044, respectively. On the other
hand, the RO and BO models show the lowest Specificity performances, with an average
Specificity of 0.6944 and 0.5858, respectively. This indicates, for example, that BO and RO
have a 39.72% and 28.54% lower ability to effectively identify the “non-top” lines compared
to the R model, respectively.

Comparing the “O” and “S” methods directly using the BO and RO models, it is
observed that, in terms of Specificity, the “O” method surpasses the “S” method. For BO,
the average Specificity value in the “O” method (0.5858) is 20.86% higher than in the “S”
method (0.4848). For RO, the average Specificity value in the “O” method (0.6944) is 17.39%
higher than in the “S” method (0.5915). Consequently, and bearing in mind that a very


Plants 2025, 14, 308 21 of 30

high value of Specificity is not necessarily the best, as it affects Sensitivity, the “S” method
exhibits better behavior.

In summary, this comprehensive study evaluated the performance of different models
and methods across datasets using fundamental metrics such as F1 Score, Kappa, Sensitivity,
and Specificity. The results reveal that, in terms of the F1 Score, the model RO consistently
shows superior performance in the “O” method, closely followed by the BO model. Al-
though the “O” method has an advantage in the F1 Score, the “S” method stands out for its
computational efficiency and improvement in Sensitivity for BO and RO. Under Kappa,
RO also excels in both methods, showcasing its robustness. In Sensitivity, BO and RO
remarkably stand out in the “S” method, vastly surpassing the “O” method. On the other
hand, in Specificity, the “O” method exhibits superior performance over the “S” method.
However, it is emphasized that for genetic improvement, high values in Sensitivity, Kappa,
and F1 are more desirable than in Specificity, aligning to effectively identify top lines. These
findings underscore the importance of considering multiple metrics and methods when
choosing the optimal approach for selecting outstanding lines.

4. Discussion
Achieving high prediction accuracies in the application of the GS methodology neces-

sitates several prerequisites; yet, efficiently optimizing the numerous factors that influence
its accuracy poses a significant challenge. Given the importance of enhancing predictive
methodologies, considerable research efforts are directed towards identifying and imple-
menting strategies that can significantly improve its efficiency. Various studies have studied
the impact of the size of the training set and diversity on the accuracy of these methods.
Furthermore, investigations have explored how the population structure and its genetic
relationship with the breeding population influence the Precision of genomic prediction.
Other research areas include the effects of marker density and distribution, linkage dise-
quilibrium, the genetic architecture and heritability of traits, and the exploration of novel
statistical machine learning models [5]. Additionally, the integration of supplementary
inputs such as proteomics, metabolomics, and enviromics has been examined for their
potential to enrich and refine predictive analytics. All these investigations try to improve
the efficiency of the GS methodology and show that improving the GS methodology is an
ongoing process that requires a combination of innovative methodologies, rigorous valida-
tion, and a deep understanding of the biological context. Additionally, staying informed
about emerging technologies and ethical considerations is essential in this field.

Regarding statistical machine learning methods, many parametric (mixed models,
Bayesian methods) and nonparametric (deep learning, random forest, gradient boosting
machines, etc.) state-of-the-art algorithms have been explored in the context of genomic
prediction. Some notable algorithms that aim to leverage genetic data to predict various
phenotypic traits or outcomes are GBLUP, Bayesian methods (A, B, C, Lasso, etc), random
forest, gradient boosting machine, support vector machine, deep learning, kernel methods,
etc. However, the predictions resulting from most of these algorithms are not yet optimal
for practical applications. For these reasons, researchers continue to work on improving
genomic prediction models by developing more sophisticated algorithms, incorporating
additional sources of data (e.g., transcriptomics, epigenetics), and refining methods for
accounting for complex genetic interactions and environmental factors. While these models
have made significant progress, achieving optimal accuracy in all scenarios remains a
complex and ongoing challenge in genomics.

The genomic prediction models, while providing extensive results, were validated by
comparing predicted values with observed phenotypic data. The high agreement between
observed and predicted values supports the practical applicability of these models in


Plants 2025, 14, 308 22 of 30

breeding programs. Furthermore, cross-validation across environments demonstrated the
models’ robustness and generalizability, making them useful in predicting traits under
various environmental conditions, even those not represented in the original dataset.

These findings suggest that while the statistical methods are sophisticated, their
practical value lies in their ability to reduce the need for exhaustive field testing, accelerating
the selection process by enabling breeders to focus on high-potential genotypes early in the
breeding cycle. This application demonstrates that genomic prediction can be an effective
tool for enhancing breeding efficiency and genetic gain across diverse environments.

Therefore, intending to enhance the accuracy in identifying the superior (or inferior)
lines, this research delineates and contrasts five established methodologies for selecting the
top (or bottom) lines evaluating the prediction accuracy in terms of novel metrics popular
in the context like Sensitivity and Specificity. The description of each method was achieved
with details to avoid any confusion between these methods and to see the simplicity and
complexity of each one of them. In our comparative analysis with the five real datasets, we
observed that under the original method, model RO outperformed the others in terms of F1
score. Specifically, it exceeded model B by 42.37%, model BO by 9.62%, model R by 60.87%,
and model RC by 17.63%. Meanwhile, in terms of the Kappa coefficient, the RO model was
superior to models B, BO, R, and RC by 37.46%, 36.21%, 52.18%, and 3.95%, respectively. In
terms of Sensitivity, model RO outperformed models B, R, and RC by 145.74%, 250.41%,
and 86.20%, respectively. Also, our results show that the second-best model was the BO,
that only slightly worse than the RO model.

Our results unequivocally highlight the effectiveness of methods RO and BO in identi-
fying the top lines. These methods incorporate a post-processing step to ensure comparable
Sensitivity and Specificity, making them better choices. Nonetheless, it is important to
acknowledge that achieving this enhanced classification accuracy comes at the cost of
increased computational resources during the tuning process, specifically in the selection
of the optimal threshold for final line classification. However, this upsurge in computa-
tional demands poses no significant challenge when dealing with small to moderately
sized datasets. In such cases, only a single hyperparameter—the optimal threshold—needs
tuning. Consequently, the advantages offered by the RO and BO methods far outweigh
the associated costs. Furthermore, our research reveals that simplified versions of the
original RO and BO methods remain highly competitive. These simplified approaches
deliver nearly equivalent prediction performance while substantially reducing computa-
tional resource requirements compared to the original RO and BO methods. Given this
finding, we encourage the adoption of the original RO and BO methods in real-world
applications, especially for datasets of small to moderate size, where their implementation
is straightforward and beneficial.

Additionally, our results clearly show that the R method consistently yields the lowest
performance across all measures. This can be attributed to the persistent bias in predicting
lines in the tails, whether they are top or bottom lines. Consequently, the R method tends
to underestimate predictions for the top lines and overestimate predictions for the bottom
lines, resulting in a significant misclassification error when selecting the optimal lines
regarding a threshold.

Additionally, this paper offers a compelling perspective by conceptualizing the chal-
lenge of choosing the top (or bottom) lines in breeding programs as a classification problem.
Even though some of the proposed methods (R, RC, and RO) employ a regression model
during the initial training phase, this unique approach reframes the selection process as a
classification problem. Consequently, the paper introduces a set of classification metrics,
including Sensitivity, Specificity, F1 Score, and Kappa Coefficient, to assess the accuracy
and quality of top line selection. These metrics provide a more appropriate and insightful


Plants 2025, 14, 308 23 of 30

means of evaluating the effectiveness of the chosen top lines. Also, our results show evi-
dence that models BO and RO provide more balanced Sensitivity and Specificity regarding
the other methods which is of paramount importance since metrics like Sensitivity and
Specificity in plant breeding facilitate the precise selection of superior lines, optimization of
breeding programs, reduction in errors, improvement of trait selection, and quantitative
evaluation of line performance. These metrics play a crucial role in enhancing the efficiency,
effectiveness, and success of plant breeding efforts.

Assessing Sensitivity and Specificity

Sensitivity, defined as the model’s ability to correctly identify true positives, is a crucial
metric in our analysis, particularly in the context of genomic prediction for breeding. How-
ever, we acknowledge that an emphasis on maximizing Sensitivity may inadvertently lead
to an increase in false positives, which can complicate the selection of superior phenotypes.

To address this concern, we implemented several strategies to balance Sensitivity and
Specificity within our models. We carefully selected thresholds for predicting positive
outcomes that aim to minimize false positives while still capturing a high proportion of
true positives. Additionally, we utilized cross-validation techniques to assess model perfor-
mance across diverse datasets, ensuring that our findings are robust and generalizable.

Moreover, we conducted an analysis of the trade-offs involved in adjusting Sensitivity
levels, discussing the implications for breeding programs. While higher Sensitivity is
desirable to ensure that few superior phenotypes are missed, it is critical to monitor the
rate of false positives to avoid misdirecting breeding efforts. This balanced approach
allows us to enhance the Precision of our genomic predictions and improve the overall
selection process.

These results offer empirical confirmation that the RC model, which represents the
conventional approach for selecting the top lines, ranks among the two least efficient
strategies in capturing top lines effectively in terms of Sensitivity. Consequently, we
strongly advocate for the adoption of methods RO and BO by breeders. These methods
have demonstrated their ability to enhance the efficacy of the Genomic Selection (GS)
methodology. As previously noted, the practical implementation of GS remains challenging
due to the influence of various factors on its performance. Therefore, embracing the RO
and BO methods can significantly improve the overall results and reliability of GS in
breeding programs.

5. Conclusions
In this paper, we described and evaluated five existing methods to select the top (or

bottom) lines in the context of genomic prediction. We described each of the five existing
methods simply and clearly with the goal that breeders and scientists of related fields can
distinguish without any ambiguity the peculiarities of each method. Then, we evaluated
the performance of the five methods with five real datasets for selecting the top lines using
four popular metrics in the context of classification methods (F1 Score, Kappa coefficient,
Sensitivity, and Specificity). We found that the methods BO and RO performed effectively
across the five datasets for selecting the top lines with more accuracy. Methods BO and
RO were better across traits, datasets, and environments than the other three methods in
terms of F1 score, Kappa coefficient, and Sensitivity. For these reasons, we encourage other
studies to increase the empirical evidence of the superiority of methods BO and RO to
select the top (or bottom) lines.

Author Contributions: Conceptualization, O.A.M.-L., A.M.-L. and J.C.; methodology, O.A.M.-L.,
A.M.-L. and J.C.; validation, J.C.M.-L.; investigation, O.A.M.-L., A.M.-L., J.C.M.-L. and J.C.; writing—
original draft preparation, O.A.M.-L., A.M.-L. and J.C.; writing—review and editing, O.A.M.-L.,


Plants 2025, 14, 308 24 of 30

K., A.A., A.M.-L., J.C.M.-L. and J.C. All authors have read and agreed to the published version of
the manuscript.

Funding: Open Access fees were received from the Bill and Melinda Gates Foundation. We ac-
knowledge the financial support provided by the Bill and Melinda Gates Foundation (INV-003439
BMGF/FCDO Accelerating Genetic Gains in Maize and Wheat for Improved Livelihoods (AGG))
as w