RESEARCH  Open Access

AgroLD: a knowledge graph for the plant sciences
Larmande Pierre1,3*, Pittolat Bertrand2,3, Tando Ndomassi1,3, Pomie Yann1, Happi Happi Bill Gates1, Guignon Valentin2,3,4 and Ruiz Manuel2,3

Pierre et al. BMC Genomic Data (2025) 26:73
https://doi.org/10.1186/s12863-025-01359-6

*Correspondence:
Larmande Pierre
pierre.larmande@ird.fr
1DIADE, IRD, CIRAD, Univ. Montpellier, Ave Agropolis, Montpellier, France
2AGAP, CIRAD, INRAE, Univ. Montpellier, Ave Agropolis, Montpellier, France
3French Institute of Bioinformatics (IFB) - South Green Bioinformatics Platform, Bioversity, CIRAD, INRAE, IRD, F-34398 Montpellier, France
4Bioversity International, Bioversity, Parc Scientifique Agropolis II, Montpellier, France

Abstract
Background  The demand for food is expected to grow substantially in the coming years. To address this challenge, especially in the context of climate change, a deeper understanding of genotype-phenotype relationships is crucial for improving crop yields. Recent advances in high-throughput technologies have transformed the landscape of plant science research. However, there is an urgent need to integrate and consolidate complementary data to understand the biological system.

Results  We introduce AgroLD, a knowledge graph that uses Semantic Web technologies to seamlessly integrate plant science data. AgroLD is designed to facilitate hypothesis formulation and validation within the scientific community. With approximately 1.08 billion triples, it integrates and annotates data from more than 151 datasets across 19 distinct sources.

Conclusion  The overarching goal is to provide a specialized knowledge platform addressing complex biological questions in the plant sciences, including gene participation in plant disease resistance and adaptive responses to climate change.

Keywords  Knowledge graphs, FAIR, Linked data, Bioinformatics, Plant sciences

© The Author(s) 2025. Open Access  This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Introduction
Agronomic research is witnessing an unprecedented revolution in the acquisition of various data, such as phenotypic and genomic data, as well as data related to the functional characterization of specific genes. Understanding the intricate interactions between genotypes and phenotypes that lead to particular traits is a key focal point within this field. However, these interactions are difficult to identify because they occur at different molecular levels in plants and are strongly influenced by environmental factors (e.g., climate change). The new challenges include identifying these interactions across the diverse molecular entities that contribute to phenotypic expression. This endeavor necessitates a holistic approach, incorporating insights from different data stacks into a comprehensive model to unravel the true functioning of biological systems.

For researchers, navigating through vast amounts of dispersed information across multiple online databases, each with distinct data models, scales, and access modes, is a major challenge. This is particularly evident in genetic association studies such as genome-wide association studies (GWAS), which establish links between genomic regions (loci) and phenotypic traits. GWAS loci often encompass multiple genes, necessitating thorough analysis to identify relevant genes. A similar challenge exists in transcriptomic studies, where researchers must interpret extensive lists of differentially expressed genes. In both cases, researchers must decide which genes warrant further investigation in the laboratory, a decision often based on subjective judgment and incomplete data reviews. Today’s significant challenges are related to developing methods to integrate these heterogeneous data and enrich biological knowledge. Scientists also need methods to explore this large amount of data and to highlight relevant information that can be used to identify key genes.

The Semantic Web introduces techniques and technologies to transform vast amounts of data into actionable knowledge. It is a fundamental component of the Findable, Accessible, Interoperable, and Re-usable (FAIR) principles [1] because it enhances data interoperability. This achievement hinges on establishing standardized vocabularies and ontologies, which systematically capture domain knowledge and translate it into semantic resources, empowering computers to index, search, and reason over data. Notably, the Resource Description Framework (RDF) [2] has been widely adopted for web-based data publication, leading to the creation of the Web of Data. Recently, numerous initiatives have emerged within the biomedical and bioinformatics domains, each aiming to provide comprehensive platforms for building scientific hypotheses around gene functions, phenotypic expression, and disease emergence. Illustrative examples include Bio2RDF [3], UniProt RDF [4], PubChem [5] and WikiPathways [6]. In the domain of human biology, notable contributions have been made through the establishment of platforms such as the DisGeNET RDF [7] and the Monarch Initiative [8]. Similarly, the field of plant science has yielded KnetMiner [9], a graph database designed to unravel plant molecular networks for analogous objectives. In this context, AgroLD [10, 11] was developed with the ambition of providing the tools and methods needed to exploit the data and knowledge produced within the plant community. AgroLD remains under active development and currently contains more than 1.08 billion triples, resulting from the integration of approximately 151 datasets gathered in 33 named graphs.

Methods
Information content
AgroLD was designed to accommodate the molecular and phenotypic information available on various plant species with a large focus on tropical crops. Since the first release [10], 40 new species (6 since the previous release [11]) have been integrated, including cereals, legumes, and fruit trees. The list of the 51 species is available in Table 1.

AgroLD is built incrementally and spans many aspects 
of plant molecular interactions. Initially, it integrated 
information on genes, proteins, metabolic pathways, 
and genetic studies built from several resources such 
as Ensembl Plants  [12], UniProtKB  [4], Gene Ontology 
Annotation  [13], Gramene  [14], Oryzabase  [15], RAP-
DB  [16], and MSU  [17]. In its current version, AgroLD 
adds predictions of homologous genes from Ensembl 
Compara and biological networks from StringDB  [18], 
RiceNetV2 [19], PlantTFDB [20], and PlantRegMap [21]. 
The size of the knowledge base has expanded by 16% 
since the last release [11], reaching 1.08 billion triples.

The choice of these sources was guided by the biological community: they are widely used, which strengthens users' confidence in the integrated data. We have also integrated resources developed by the local SouthGreen platform [22] such as TropGeneDB [23], a tropical plant genetics database; GreenPhylDB [24], a comparative genomics database for tropical plants; OryzaTagLine [25], a rice phenotype database; and SNiPlay [26], a rice genomic variation database. These resources combine experimental data from research groups in Montpellier and southern France. Table 2 and the online documentation [27] provide an overview of the data sources integrated into AgroLD.

We initially built the conceptual framework of AgroLD on a custom vocabulary, which also included mappings to well-established ontologies and controlled vocabularies in the fields of molecular biology and plant sciences such as Sequence Ontology [28], Gene Ontology [29], Plant Ontology [30] or Plant Trait Ontology [31]. Most of these ontologies are hosted by the Open Bio-Ontologies (OBO) Foundry project [32]. For this updated version, we modified the backbone schema (i.e., its vocabulary) by reusing other existing ontologies such as Semantic Science Ontology (SIO) [33], Feature Annotation Location Description Ontology (FALDO) [34], and Relation Ontology (RO) [35] to increase its interoperability with other knowledge graphs. Additionally, we included general ontologies such as Resource Description Framework Schema (RDFS), Simple Knowledge Organization System (SKOS), and Dublin Core to describe some properties of the biological entities. The online documentation shows the complete list of the ontologies used. Figure 1 shows a subset of the global schema of AgroLD [36].

Table 1  The 51 plant species integrated in AgroLD
Species name | Common name | Taxon ID
Aegilops tauschii subsp. strangulata | rough-spike hard grass | 200361
Amborella trichopoda | Amborella | 13333
Ananas comosus | pineapple | 4615
Arabidopsis halleri subsp. gemmifera |  | 63677
Arabidopsis lyrata subsp. lyrata | Cardaminopsis lyrata | 81972
Arabidopsis thaliana | thale cress | 3702
Beta vulgaris ssp. vulgaris | sugar beet | 3555
Brachypodium distachyon | stiff brome | 15368
Brassica napus | rape | 3708
Brassica oleracea var. oleracea | wild cabbage | 109376
Brassica rapa | field mustard | 3711
Citrus x clementina | clementine | 85681
Coffea canephora | robusta coffee | 49390
Daucus carota subsp. sativus | carrot | 79200
Digitaria exilis | White fonio | 1010633
Glycine max | soybean | 3847
Gossypium raimondii | Peruvian cotton | 29730
Helianthus annuus | domesticated sunflower | 4232
Hordeum vulgare subsp. vulgare | two-rowed barley | 112509
Malus domestica | apple | 3750
Manihot esculenta | cassava | 3983
Musa acuminata subsp. malaccensis | wild Malaysian banana | 214687
Nicotiana attenuata | wild tobacco | 49451
Olea europaea | Mediterranean olive tree | 158383
Oryza barthii | African wild rice | 65489
Oryza brachyantha | malo sina | 4533
Oryza glaberrima | African rice | 4538
Oryza glumipatula |  | 40148
Oryza longistaminata | long-staminate rice | 4528
Oryza meridionalis | Australian wild rice | 40149
Oryza nivara |  | 4536
Oryza punctata | red rice | 4537
Oryza rufipogon | common wild rice | 4529
Oryza sativa Indica Group | long-grained rice | 39946
Oryza sativa Japonica Group | Japanese rice | 39947
Phaseolus vulgaris | common bean | 3885
Prunus avium | Sweet cherry | 42229
Prunus dulcis | almond | 3755
Prunus persica | peach | 3760
Saccharum spontaneum | wild sugarcane | 62335
Setaria italica | foxtail millet | 4555
Solanum lycopersicum | tomato | 4081
Solanum tuberosum | potato | 4113
Sorghum bicolor | sorghum | 4558
Theobroma cacao | cacao | 3641
Triticum aestivum | bread wheat | 4565
Triticum dicoccoides | wild emmer wheat | 85692
Triticum turgidum subsp. durum | durum wheat | 4567
Triticum urartu | red wild einkorn wheat | 4572
Vitis vinifera | wine grape | 29760
Zea mays | maize | 4577

AgroLD integration pipelines
We developed various RDF conversion pipelines for large genomic and agronomic datasets. Although several generic tools exist within the Semantic Web community, such as Tarql [37], RML.io [38] or SPARQL-Generate [39], to name a few, none of them has been adapted to the complexity of data formats in the biological domain (e.g., Variant Call Format (VCF) [40]) or to the complexity of the information these formats can contain. The Generic Feature Format (GFF) [41], which represents genomic data in a Tab Separated Value (TSV) format, illustrates this complexity: one of its columns holds a variable-length list of key = value pairs whose content differs depending on the data source. The transformation must therefore be adapted to each data source. Moreover, the large volume of data was a limiting factor for the abovementioned tools.
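The variable-length attribute column can be illustrated with a minimal Python sketch. It assumes GFF3 syntax (semicolon-separated key=value pairs, comma-separated multiple values, percent-encoded special characters); the function names and the example line are hypothetical and are not taken from the AgroLD scripts.

```python
from urllib.parse import unquote


def parse_gff3_attributes(column9: str) -> dict[str, list[str]]:
    """Parse the ninth GFF3 column into a {key: [values]} mapping."""
    attributes: dict[str, list[str]] = {}
    for field in column9.strip().split(";"):
        if not field.strip():
            continue  # tolerate trailing or doubled semicolons
        key, sep, value = field.partition("=")
        if not sep:
            continue  # skip malformed fields without '='
        # Multiple values are comma-separated; special characters are percent-encoded.
        attributes[key.strip()] = [unquote(v) for v in value.strip().split(",")]
    return attributes


def parse_gff3_line(line: str) -> dict:
    """Split one GFF3 feature line into its nine tab-separated columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, found {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    return {
        "seqid": seqid, "source": source, "type": ftype,
        "start": int(start), "end": int(end), "strand": strand,
        "attributes": parse_gff3_attributes(attrs),
    }


# Hypothetical feature line; keys such as Dbxref vary from one data source to another.
example = "1\tsourceA\tgene\t2983\t10815\t.\t+\t.\tID=gene:G0001;Name=demo%20gene;Dbxref=DB1:X1,DB2:Y2"
record = parse_gff3_line(example)
print(record["attributes"])
```

Because the set of keys is open-ended, a source-specific mapping from attribute keys to RDF properties would still be needed on top of such a parser, which is the adaptation step described above.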

In this context, we developed RDF conversion tools adapted to various genomics data standards such as GFF, Gene Ontology Annotation File (GAF) [42], and VCF. We are also currently working on packaging these Extract, Transform, and Load (ETL) tools in an Application Programming Interface (API) [43]. The RDF conversion tools are Python-based scripts that can be run independently, either locally or on high-performance computing resources. More than forty scripts are available to process either data standards (e.g., GFF) or database-specific data (e.g., TAIR, RAP-DB, and Oryzabase). Some parameters, such as the base Uniform Resource Identifier (URI), local paths, and RDF prefixes, can be defined globally, whereas parameters specific to a script can be defined at runtime. Each script carries a docstring that explains how to run it, and the GitHub repository documents how to deploy and use the tools. Table 3 lists all the resources and tools available for AgroLD.
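The split between global and script-specific parameters can be sketched as follows. This is an illustrative layout only, not the actual AgroLD_ETL code: the `GLOBAL_CONFIG` dictionary, the `--taxon` option, and the `taxon` property are hypothetical, and the output is written as plain N-Triples strings so that the example needs only the standard library.

```python
import argparse

# Hypothetical global settings shared by all conversion scripts.
GLOBAL_CONFIG = {
    "base_uri": "http://purl.agrold.org/resource/",
    "vocabulary_uri": "http://purl.agrold.org/vocabulary/",
    "rdfs_uri": "http://www.w3.org/2000/01/rdf-schema#",
}


def escape_literal(text: str) -> str:
    """Escape characters that may not appear raw inside an N-Triples literal."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))


def gene_to_ntriples(gene_id: str, label: str, taxon_id: str, config: dict) -> list[str]:
    """Return N-Triples lines describing one gene (illustrative properties only)."""
    subject = f"<{config['base_uri']}{gene_id}>"
    return [
        f'{subject} <{config["rdfs_uri"]}label> "{escape_literal(label)}" .',
        # 'taxon' is a made-up property name used only for this sketch.
        f'{subject} <{config["vocabulary_uri"]}taxon> "{escape_literal(taxon_id)}" .',
    ]


def build_parser() -> argparse.ArgumentParser:
    """Script-specific parameters are supplied at runtime."""
    parser = argparse.ArgumentParser(description="Toy record-to-RDF converter")
    parser.add_argument("--taxon", required=True, help="NCBI taxon ID of the input data")
    parser.add_argument("--base-uri", default=GLOBAL_CONFIG["base_uri"])
    return parser


# Simulate a command line such as: converter.py --taxon 39947
args = build_parser().parse_args(["--taxon", "39947"])
config = {**GLOBAL_CONFIG, "base_uri": args.base_uri}
triples = gene_to_ntriples("G0001", 'demo "gene"', args.taxon, config)
print("\n".join(triples))
```

Keeping the namespaces in one shared place means that a change of base URI would propagate to every script without editing each one.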

To ensure that AgroLD remains updated with the lat-
est data, the entire knowledge base is reconstructed 
annually. Additionally, new datasets are incorporated 
multiple times a year, typically every four months. Regu-
larly updating the data presents challenges, as the origi-
nal databases often lack automatic tracking of changes 
between versions. On the basis of our experience, com-
pletely reconstructing the knowledge base regularly is an 
effective strategy to bypass the complexities of handling 
data differences (Fig. 2).

URI design and data linking
In the transformation pipelines, RDF graphs share a common namespace and are named according to the corresponding data sources. Entities in RDF graphs are linked by the common URI principle. We generally build URIs by referring to Identifiers.org [44], which provides design patterns for each registered source. For example, genes integrated from Ensembl Plants take the form http://identifiers.org/ensembl.plant/Entity_ID. When Identifiers.org does not provide a pattern, new URIs are constructed in the form http://purl.agrold.org/resource/Entity_ID. In addition, the properties linking the entities are constructed in the form http://purl.agrold.org/vocabulary/property.
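The URI rules described in this subsection can be summarised in a short Python sketch. The three base URIs come from the text; the set of registered namespaces, the function names, and the identifiers in the example are assumptions made for illustration.

```python
from typing import Optional
from urllib.parse import quote

IDENTIFIERS_BASE = "http://identifiers.org/"
AGROLD_RESOURCE_BASE = "http://purl.agrold.org/resource/"
AGROLD_VOCABULARY_BASE = "http://purl.agrold.org/vocabulary/"

# Namespaces assumed to be registered at Identifiers.org, for this sketch only.
REGISTERED_NAMESPACES = {"ensembl.plant"}


def entity_uri(entity_id: str, namespace: Optional[str] = None) -> str:
    """Prefer an Identifiers.org URI; otherwise mint one under the AgroLD base."""
    safe_id = quote(entity_id, safe=":._-")  # percent-encode spaces and other unsafe characters
    if namespace in REGISTERED_NAMESPACES:
        return f"{IDENTIFIERS_BASE}{namespace}/{safe_id}"
    return f"{AGROLD_RESOURCE_BASE}{safe_id}"


def property_uri(name: str) -> str:
    """Build the URI of a linking property in the AgroLD vocabulary namespace."""
    return f"{AGROLD_VOCABULARY_BASE}{quote(name, safe='')}"


print(entity_uri("GENE0001", "ensembl.plant"))
print(entity_uri("local id 7"))
print(property_uri("has_trait"))  # 'has_trait' is a hypothetical property name
```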

Table 2  Data sources integrated in AgroLD
Data sources | Nb of datasets | File format | Ontology used | Nb of triples
Oryzabase | 2 | TSV | GO, PO, TO | 347 K
GO Associations | 2 | GAF | GO | 6,440 K
Genome Hub | 7 | GFF | GO, SO | 12,233 K
Gramene | 6 | Custom flat file | All | 159 K
Ensembl | 51 | GFF | All | 838,874 K
UniprotKB | 2 | Uniprot | GO, PO | 60,034 K
Oryza Tag Line | 2 | Custom flat file | PO, TO, CO | 282 K
TropGeneDB | 2 | Custom flat file | PO, TO, CO | 20 K
GreenPhylDB | 2 | Custom flat file | GO, PO | 3,627 K
SNiPlay | 1 | HapMap, VCF | GO | 16,204 K
Q-TARO | 2 | TSV | PO, TO | 20 K
MSU | 2 | Custom flat file | PO, TO | 2,068 K
RiceNetDB | 6 | Custom flat file | PO, TO | 5,879 K
StringDB | 45 | Custom flat file | GO | 131,559 K
RapDB | 3 | GFF | PO, TO | 1,026 K
PlantTFDB | 12 | Custom flat file | PO, TO | 86 K
Interpro | 1 | Custom flat file | PO, TO | 196 K
CEGResources | 2 | GFF | PO, TO | 1,031 K
OBO ontologies | 12 | OWL |  | 15,131 K
TOTAL | 151 |  |  | 1,077,303 K
Ontologies are referenced as GO gene ontology, PO plant ontology, TO plant trait ontology, EO plant environment ontology, SO sequence ontology, CO crop ontology (plant specific traits)

To link identical entities from different data sources, we used an approach based on URI pattern matching, which scans the URIs for similar patterns in their terminal part (i.e., Entity_ID). We also follow the common URI approach, which recommends using the same URI pattern for two identical entities. Together, these allowed us to aggregate information from different RDF graphs for the same entity. In addition, we transformed cross-references into URIs and linked them to the corresponding resource with the rdfs:seeAlso predicate. This significantly increases the number of outbound links, which now reaches almost 80 million, making AgroLD better integrated with other data sources. We plan to implement a similarity-based entity profile approach to identify matches between entities with different URIs.
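As an illustration of URI pattern matching on the terminal part of a URI, the following Python sketch groups URIs that end with the same identifier and writes cross-references as rdfs:seeAlso links. It is a simplified reading of the approach, not the AgroLD implementation: the case-insensitive comparison, the function names, and the example URIs are assumptions.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def terminal_id(uri: str) -> str:
    """Return the last path segment (or the fragment) of a URI, case-folded."""
    parts = urlsplit(uri)
    local = parts.fragment or parts.path.rstrip("/").rsplit("/", 1)[-1]
    return local.casefold()


def group_by_terminal_id(uris):
    """Group URIs whose terminal identifier is identical."""
    groups = defaultdict(set)
    for uri in uris:
        groups[terminal_id(uri)].add(uri)
    # Keep only identifiers seen under at least two distinct URIs.
    return {key: sorted(members) for key, members in groups.items() if len(members) > 1}


def see_also_triples(subject_uri, xref_uris):
    """Express cross-references as rdfs:seeAlso links in N-Triples syntax."""
    predicate = "<http://www.w3.org/2000/01/rdf-schema#seeAlso>"
    return [f"<{subject_uri}> {predicate} <{xref}> ." for xref in xref_uris]


uris = [
    "http://identifiers.org/ensembl.plant/GENE0001",
    "http://purl.agrold.org/resource/gene0001",
    "http://purl.agrold.org/resource/GENE0002",
]
matches = group_by_terminal_id(uris)
print(matches)
print(see_also_triples(uris[1], ["http://example.org/xref/GENE0001"]))
```

Matching on the identifier alone can produce false positives when two databases reuse the same accession for different objects, which is one reason a similarity-based profile approach is a natural complement.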

Fig. 1  The AgroLD schema

Table 3  Links to AgroLD resources and tools
Name of resource or tool and description | URL
Data
AgroLD datasets | https://doi.org/10.5281/zenodo.4694518
List of graphs | http://www.agrold.org/documentation.jsp
List of ontologies | http://www.agrold.org/documentation.jsp
AgroLD vocabulary | https://github.com/SouthGreenPlatform/AgroLD_ETL/tree/master/model
AgroLD SPARQL Endpoint | http://agrold.southgreen.fr/sparql
Example queries | http://www.agrold.org/sparqleditor.jsp
Tools
Web application | https://github.com/SouthGreenPlatform/AgroLD_webapp
RDF conversion pipelines (GFF2RDF, GAF2RDF, VCF2RDF, Datasets) | https://github.com/SouthGreenPlatform/AgroLD_ETL

Results
To make AgroLD accessible to a broader user base, we developed a web application with multiple query interfaces. The first interface supports keyword searches across the entire database content, enabling users to navigate the knowledge base. A more advanced search interface allows users to combine free text and apply filters on the basis of class types, properties, and external web services. This feature supports the aggregation of distributed data.

We also introduced a SPARQL Protocol and RDF Query Language (SPARQL) editor to help users, particularly bioinformaticians and biologists, cope with the complexity of the SPARQL query language. This editor provides an interactive tool for query formulation and result manipulation. Consequently, the AgroLD platform offers several entry points:

 	• Quick Search: This plugin, powered by Virtuoso, uses faceted search capabilities to enable keyword-based searches and content navigation within AgroLD. Figure 3A illustrates the results of a keyword search, with GRP2 used as an example. The results are ranked by the frequency of occurrence across various entity fields. The Named Graph column indicates the data source, whereas the Title and Entity columns display the entity names and their URIs, respectively. Clicking on a link provides a comprehensive view of the entity, and users can traverse entities via the provided HTTP links.

 	• Advanced Search: This interface allows targeted searches on the basis of entity classes, incorporating an aggregation engine for external resources. Built upon a Representational State Transfer (REST) API (described below), the Advanced Search conceals the technical intricacies of SPARQL queries. The integration of the AgroLD API facilitates interactive searches across the knowledge base and external services such as PubMed or EMBL. Figure 3C shows the user interaction: selecting the entity type (e.g., Gene) and providing keywords (e.g., TBP1) yield results presented in a sortable and downloadable table. Each row contains entity attributes, including the ID, data source, and context of the matching keywords. To obtain more details, users click on the display link below the entity ID, which opens a new window (not shown). This window shows the name and description of the biological object in its header, followed by several panels, each showing one feature of the object. The panels can differ according to the type of entity displayed (e.g., Proteins, Pathways, Publications, Terms associated, View as Graph, Expression, and See also panels). Figure 3B shows the View as Graph panel, which was adapted from KnetMaps [45]. It displays a window divided into two parts. On the left, the entity is represented within a graph showing other entities linked to it (in this case, Pathways). On the right, more detailed information is displayed for the entity highlighted in green (not shown). When users select another entity in the graph, this content changes dynamically.

 	• AgroLD RESTful API: This programming interface supports interaction with the knowledge graph database. It comprises function calls grouped by entity classes within AgroLD (e.g., Genes, Proteins). For example, under the Gene class, functions exist for obtaining gene lists within genomic regions (genes/byLocus), genes matching specific keywords (genes/byKeyword), and genes encoding specific proteins (genes/encodingProtein).

 	• The SPARQL Editor: We developed a SPARQL query editor with an interactive environment, employing YASQE and YASR tools [46] adapted for our system. The editor features modular and customizable query patterns aligned with user requirements. Figure 4 illustrates the editor’s layout, which is divided into three areas. The main area serves as the query field with syntax highlighting, error checking, autocompletion, and editing functions. Users can load and save queries and execute predefined query templates. The results appear beneath the editor, initially as a sortable table. JSON or graphical formats are also available for display and download.

Fig. 2  AgroLD ETL pipelines

Fig. 3  Overview of AgroLD Web interfaces. A displays the Faceted search interface. B displays results from the KnetMaps tool [45]. C displays results from the advanced search interface
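For programmatic access, the SPARQL endpoint listed in Table 3 can be queried over the standard SPARQL Protocol. The Python sketch below only prepares the request and does not send it. It assumes that entity names are exposed through rdfs:label, which should be checked against the AgroLD documentation; the function names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

SPARQL_ENDPOINT = "http://agrold.southgreen.fr/sparql"  # endpoint listed in Table 3


def keyword_query(keyword: str, limit: int = 10) -> str:
    """Build a SPARQL query returning entities whose label contains the keyword."""
    escaped = keyword.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?graph ?entity ?label WHERE {\n"
        # Assumption: names are stored with rdfs:label inside named graphs.
        "  GRAPH ?graph { ?entity rdfs:label ?label }\n"
        f'  FILTER(CONTAINS(LCASE(STR(?label)), LCASE("{escaped}")))\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def build_request(query: str) -> Request:
    """Prepare (but do not send) a SPARQL Protocol GET request."""
    url = f"{SPARQL_ENDPOINT}?{urlencode({'query': query})}"
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request(keyword_query("GRP2"))
print(request.full_url)
# To execute against the live service: urllib.request.urlopen(request)
```

The `?graph` variable mirrors the Named Graph column of the Quick Search results, so each hit can be traced back to its data source.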

Fig. 4  The SPARQL query editor. The Query patterns frame allows users to select a query from a natural language question. The Query text frame allows the visualization and modification of the SPARQL query. The results frame displays results returned from the query

Discussion
The process of creating a knowledge graph is complex and challenging. In this section, we present some of the challenges we had to address, particularly those related to managing the heterogeneity of the datasets and their sizes. We also discuss the challenges in aligning the entities and assessing the data quality.

With respect to data heterogeneity, the main problem was the variety of data formats, which we resolved by adopting RDF as a unified format. We propose several pipelines that can handle this variety and manage the dataset size. Indeed, as discussed in the pipelines section, in most cases we preferred to develop our own solutions rather than use generic tools, to better manage the complexity or size of the datasets. Another problem is the heterogeneity of the genomic coordinates (e.g., different denominations of the chromosome identifier or missing information). We solved it by choosing a unique representation and transforming all coordinates into URI templates following the FALDO ontology representation [34].
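The coordinate harmonisation can be sketched in Python as follows. The FALDO class and property names are those of the published ontology, but the chromosome-naming rule and the URI template are assumptions chosen for this example and may differ from the ones used in AgroLD.

```python
import re

FALDO = "http://biohackathon.org/resource/faldo#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://purl.agrold.org/resource/"  # AgroLD resource namespace


def normalize_chromosome(name: str) -> str:
    """Map 'Chr01', 'chr1', 'chromosome_1' or '1' to the single form '1'."""
    match = re.fullmatch(r"(?:chromosome|chr)?[_\-\s]*0*(\w+)", name.strip(), re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognised chromosome name: {name!r}")
    return match.group(1).upper()


def region_uri(taxon: str, chromosome: str, start: int, end: int, strand: int) -> str:
    """Hypothetical URI template: one stable URI per genomic interval."""
    return f"{BASE}{taxon}/{normalize_chromosome(chromosome)}:{start}-{end}:{strand}"


def faldo_triples(taxon, chromosome, start, end, strand=1):
    """Describe an interval as a faldo:Region with exact begin and end positions."""
    region = region_uri(taxon, chromosome, start, end, strand)
    reference = f"{BASE}{taxon}/{normalize_chromosome(chromosome)}"
    strand_class = "ForwardStrandPosition" if strand >= 0 else "ReverseStrandPosition"
    triples = [
        (region, RDF_TYPE, f"{FALDO}Region"),
        (region, f"{FALDO}begin", f"{region}#begin"),
        (region, f"{FALDO}end", f"{region}#end"),
    ]
    for name, position in (("begin", start), ("end", end)):
        node = f"{region}#{name}"
        triples += [
            (node, RDF_TYPE, f"{FALDO}ExactPosition"),
            (node, RDF_TYPE, f"{FALDO}{strand_class}"),
            (node, f"{FALDO}position", position),
            (node, f"{FALDO}reference", reference),
        ]
    return triples


for triple in faldo_triples("39947", "Chr01", 2983, 10815):
    print(triple)
```

With a single naming rule, the same interval reported as "Chr01" by one source and "1" by another resolves to the same region URI, so annotations from both sources attach to one node.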

With respect to entity linking (i.e., the same entities with different names or identifiers), we have only partially solved the problem, using pattern matching in URIs or database cross-references to identify matches between entities. When entities have different namespace URIs (e.g., namespace1:identifier1 and namespace2:identifier1), we look for matching patterns in the URIs and create a new URI to establish the correspondence between them. If the entities have different URIs without matching patterns but with synonymous properties (i.e., skos:altLabel, skos:prefLabel, skos:synonym or specific properties), we look for matches with these properties and the patterns of the URIs. For entities that do not contain the above information, we take a more global approach based on property and value analysis. However, this remains an open challenge that we are currently addressing.
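When URIs share no pattern, matching on label-like properties can be sketched as below. This is only one plausible reading of the approach: the normalisation rule, the function names, and the example entities are hypothetical, and the resulting pairs are candidates to review rather than confirmed matches.

```python
from collections import defaultdict
from itertools import combinations


def normalize_label(label: str) -> str:
    """Case-fold and drop non-alphanumeric characters so 'GRP-2' equals 'grp2'."""
    return "".join(ch for ch in label.casefold() if ch.isalnum())


def candidate_matches(entities: dict[str, set[str]]) -> set[tuple[str, str]]:
    """Return pairs of URIs that share at least one normalised label."""
    index = defaultdict(set)
    for uri, labels in entities.items():
        for label in labels:
            key = normalize_label(label)
            if key:
                index[key].add(uri)
    pairs = set()
    for uris in index.values():
        pairs.update(combinations(sorted(uris), 2))
    return pairs


# Labels as they might be collected from preferred and alternative label properties.
entities = {
    "http://example.org/sourceA/123": {"GRP2", "glycine-rich protein 2"},
    "http://example.org/sourceB/abc": {"grp-2"},
    "http://example.org/sourceC/xyz": {"TBP1"},
}
print(candidate_matches(entities))
```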

For data quality assessment, we developed preprocessing checks on the resources used by the ETL pipeline, covering the input file format, raw lines, and missing values. Next, the syntax of the generated triples was validated via existing libraries (e.g., RDFLib). Other assessments include counting the number of entities (e.g., genes, proteins, and chromosomes) and checking the presence or absence of properties with sets of SPARQL queries. More complex quality assessments, such as type restrictions on properties, are planned for the future.
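The preprocessing checks can be illustrated with a small Python sketch for GFF input. The specific rules (nine columns, no empty column, integer coordinates with start not greater than end) and the function names are assumptions for this example, not the checks implemented in the AgroLD pipeline.

```python
from collections import Counter


def check_gff_lines(lines):
    """Yield (line_number, problem) for lines failing basic preprocessing checks."""
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # blank lines and comments/directives are not features
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            yield number, f"expected 9 columns, found {len(cols)}"
            continue
        if any(col == "" for col in cols):
            yield number, "empty column (GFF uses '.' for missing values)"
        if not (cols[3].isdigit() and cols[4].isdigit()):
            yield number, "start/end are not integers"
        elif int(cols[3]) > int(cols[4]):
            yield number, "start is greater than end"


def count_feature_types(lines):
    """Count nine-column feature lines per type (third column), e.g. genes or mRNAs."""
    counts = Counter()
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if not line.startswith("#") and len(cols) == 9:
            counts[cols[2]] += 1
    return counts


sample = [
    "##gff-version 3\n",
    "1\tsrc\tgene\t100\t200\t.\t+\t.\tID=g1\n",
    "1\tsrc\tgene\t300\t250\t.\t+\t.\tID=g2\n",
    "1\tsrc\tmRNA\t100\t200\t.\t+\n",
]
problems = list(check_gff_lines(sample))
feature_counts = count_feature_types(sample)
print(problems)
print(feature_counts)
```

Comparing such counts with the number of entities of each class found in the generated RDF gives a simple consistency check between the input files and the output graph.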

Conclusion
Data in the agronomic field are highly heterogeneous, multi-scale, and dispersed. For plant scientists to successfully address the challenges of their daily work, it is essential to integrate information on a global scale. Semantic Web technologies are central to data integration and knowledge management. The biomedical domain offers a good example to follow for capitalizing on previous experiences and considering the lessons learned. We have developed the AgroLD KG to leverage this approach in agronomy. AgroLD exploits the power of seamless data integration offered by RDF. It contains more than 1.08 billion triples, resulting from the integration of approximately 151 datasets gathered in 33 named graphs. The coverage of its species and data sources is expected to expand with subsequent releases. To our knowledge, AgroLD is one of the first initiatives to apply Semantic Web practices to the agronomic domain, playing a complementary role in the integrative approaches adopted by the community.

AgroLD is being actively developed on the basis of feedback from domain experts. Since its beginning in 2015, it has also benefited from the support of the SouthGreen Bioinformatics Platform, which provides IT support and infrastructure to host data and web applications. SouthGreen is one of the core platforms of the French Elixir-EU node and thus provides long-lasting support for AgroLD. AgroLD is strongly linked to several use cases of the D2KAB (https://d2kab.mystrikingly.com) and DIG-AI projects, both funded by the French National Research Agency, which aim to demonstrate the benefits of linked data for discovering gene-phenotype interactions. With the completion of the current phase, user feedback has revealed some limitations and challenges in the current version. Thus, several issues are a matter of ongoing or future work.

On the one hand, we must extend the KG coverage to 
more biological entities (e.g., miRNA, lncRNA, transpos-
able elements) and relations (e.g., co-expression, regu-
lation, and interaction networks) to capture a broader 
view of the molecular interactions. For example, we need 
to integrate information on gene expression and gene 
regulatory networks. On the other hand, the ETL pro-
cess for KG creation is mostly based on domain-specific 
approaches, thus limiting its reusability. We will inves-
tigate approaches that use declarative functions for its 
creation.

Knowledge augmentation methods must be applied 
and adapted to the data. Indeed, we observed that some 
information remains hidden in the literal content of 
RDF, such as biological entities or relationships between 
them. Moreover, a large amount of related knowledge 
is available from external sources. We are currently 
developing methods to extract information embedded 
in unstructured data, such as KG text fields or external 
web documents and scientific publications, and bring 
this information in a structured form to the knowledge 
base. Finally, we are extending state-of-the-art data-linking techniques to account for the specificities of the biological domain.

Abbreviations
API	Application Programming Interface
ETL	Extract, Transform, and Load
FAIR	Findable, Accessible, Interoperable, and Re-usable
FALDO	Feature Annotation Location Description Ontology
GAF	Gene Ontology Annotation File
GFF	Generic Feature Format
GWAS	Genome-Wide Association Studies
KG	Knowledge Graph
OBO	Open Bio-Ontologies
RDF	Resource Description Framework
RDFS	Resource Description Framework Schema
RO	Relation Ontology
REST	Representational State Transfer
SIO	Semantic Science Ontology
SKOS	Simple Knowledge Organization System
SPARQL	SPARQL Protocol and RDF Query Language
TSV	Tab Separated Value
URI	Uniform Resource Identifier
VCF	Variant Call Format


Acknowledgements
The authors thank the South Green Bioinformatic Platform and the I-Trop IRD 
supercomputer for their long-standing support. AgroLD is a service delivery 
plan selected resource by the ELIXIR-FR IFB (French Institute of Bioinformatics). 
This work was granted access to the HPC resources of IDRIS under the 
allocation 2024-A0160315119 made by GENCI.

About this supplement
This article has been published as part of BMC Genomic Data, Volume 26 Supplement 1, 2025: International SWAT4HCLS Conference – Semantic Web Applications and Tools for Health Care and Life Sciences 2023. The full contents of the supplement are available at https://bmcgenomdata.biomedcentral.com/articles/supplements/volume-26-supplement-1.

Authors’ contributions
PL wrote the manuscript and contributed to the construction of AgroLD KG, 
the development of ETL pipelines, and the maintenance of web applications. 
NT and BP contributed to the database and web server administration. 
BGHH, VG, and MR contributed to the data integration. YP contributed to the 
development of the Web application. All the authors read and approved the 
final manuscript.

Funding
Several research projects have supported the AgroLD platform. ETL pipelines 
have been endorsed by the D2KAB (ANR-18-CE23-0017) and FOOSIN 
(ANR-19-DATA-0019-03) projects. The AgroLD Knowledge Graph (KG) 
development has been supported by the IBC (ProjetIA-11-BINF-0002) 
and DIG-AI (ANR-22-CE23-0012) projects. The Web application has been 
supported by the IFB project (ProjetIA-11-INBS-0013) and IRD funding support.

Data availability
The AgroLD datasets can be found at the Zenodo repository
https://doi.org/10.5281/zenodo.4694518. The Web application is available at
https://github.com/SouthGreenPlatform/AgroLD_webapp and the RDF conversion pipelines
(GFF2RDF, GAF2RDF, VCF2RDF, and datasets) are available at
https://github.com/SouthGreenPlatform/AgroLD_ETL.

Declarations

Ethics approval and consent to participate
Not applicable.

Consent to publication
Not applicable.

Competing interests
The authors declare that the research was conducted without commercial or 
financial relationships that could lead to a potential conflict of interest.

Received: 16 August 2023 / Accepted: 18 August 2025

References
1.	 Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, et 
al. The FAIR guiding principles for scientific data management and 
stewardship. Sci Data. 2016;3(1):160018. https://doi.org/10.1038/sdata.2016.18.

2.	 W3C. RDF 1.1 Concepts and Abstract Syntax. 2014. Accessed 31 July 2025.
https://www.w3.org/TR/rdf11-concepts/

3.	 Nolin MA, Corbeil J, Lamontagne L, Dumontier M. Bio2RDF: Convert, Provide 
and Reuse. Nat Precedings. 2010. https://doi.org/10.1038/npre.2010.5060.1.

4.	 The UniProt Consortium. UniProt: a worldwide hub of protein knowledge. 
Nucleic Acids Res. 2018;47:D506–15. https://doi.org/10.1093/nar/gky1049.

5.	 Fu G, Batchelor C, Dumontier M, Hastings J, Willighagen E, Bolton E. 
PubChemRDF: towards the semantic annotation of PubChem compound and 
substance databases. J Cheminform. 2015.
https://doi.org/10.1186/s13321-015-0084-4.

6.	 Kutmon M, Riutta A, Nunes N, Hanspers K, Willighagen EL, Bohler A, et al. 
WikiPathways: capturing the full diversity of pathway knowledge. Nucleic 
Acids Res. 2016. https://doi.org/10.1093/nar/gkv1024.

7.	 Queralt-Rosinach N, Pinero J, Bravo A, Sanz F, Furlong LI. DisGeNET-RDF: 
harnessing the innovative power of the semantic web to explore the genetic 
basis of diseases. Bioinformatics. 2016.
https://doi.org/10.1093/bioinformatics/btw214.

8.	 Shefchek KA, Harris NL, Gargano M, Matentzoglu N, Unni D, Brush M, et al. 
The Monarch Initiative in 2019: an integrative data and analytic platform 
connecting phenotypes to genotypes across species. Nucleic Acids Res. 
2020;48:D704–15. https://doi.org/10.1093/nar/gkz997.

9.	 Hassani-Pak K, Singh A, Brandizi M, Hearnshaw J, Parsons JD, Amberkar S, et 
al. KnetMiner: a comprehensive approach for supporting evidence-based 
gene discovery and complex trait analysis across species. Plant Biotechnol J. 
2021;19(8):1670–8. https://doi.org/10.1111/pbi.13583.

10.	 Venkatesan A, Tagny Ngompe G, Hassouni NE, Chentli I, Guignon V, Jonquet 
C, et al. Agronomic Linked Data (AgroLD): A knowledge-based system to 
enable integrative biology in agronomy. PLOS ONE. 2018;13(11):1–17.
https://doi.org/10.1371/journal.pone.0198270.

11.	 Larmande P, Todorov K. AgroLD: A Knowledge Graph for the Plant 
Sciences. In: Hotho A, et al., editors. The Semantic Web – ISWC 2021. 
Lecture Notes in Computer Science, vol 12922. Cham: Springer; 2021.
https://doi.org/10.1007/978-3-030-88361-4_29.

12.	 Bolser D, Staines DM, Pritchard E, Kersey P. Ensembl Plants: Integrating Tools 
for Visualizing, Mining, and Analyzing Plant Genomics Data. Methods Mol 
Biol. 2016;1374:115–40. https://doi.org/10.1007/978-1-4939-3167-5_6.

13.	 Huntley RP, Sawford T, Mutowo-Meullenet P, Shypitsyna A, Bonilla C, Martin 
MJ, et al. The GOA database: Gene Ontology annotation updates for 2015. 
Nucleic Acids Res. 2015;43:D1057–63. https://doi.org/10.1093/nar/gku1113.

14.	 Tello-Ruiz MK, Naithani S, Stein JC, Gupta P, Campbell M, Olson A, et al. 
Gramene 2018: unifying comparative genomics and pathway resources for 
plant research. Nucleic Acids Res. 2018;46:D1181–9.
https://doi.org/10.1093/nar/gkx1111.

15.	 Kurata N, Yamazaki Y. Oryzabase. An integrated biological and genome 
information database for rice. Plant Physiol. 2006;140(1):12.
https://doi.org/10.1104/pp.105.063008.

16.	 Sakai H, Lee SS, Tanaka T, Numa H, Kim J, Kawahara Y, et al. Rice annotation 
project database (RAP-DB): an integrative and interactive database for rice 
genomics. Plant Cell Physiol. 2013;54(2):e6.
https://doi.org/10.1093/pcp/pcs183.

17.	 Kawahara Y, de la Bastide M, Hamilton JP, Kanamori H, McCombie WR, 
Ouyang S, et al. Improvement of the Oryza sativa Nipponbare reference genome 
using next generation sequence and optical map data. Rice. 2013;6(1):4.
https://doi.org/10.1186/1939-8433-6-4.

18.	 Szklarczyk D, Kirsch R, Koutrouli M, Nastou K, Mehryary F, Hachilif R, et al. The 
STRING database in 2023: protein–protein association networks and 
functional enrichment analyses for any sequenced genome of interest. Nucleic 
Acids Res. 2022;51(D1):D638–46. https://doi.org/10.1093/nar/gkac1000.

19.	 Lee T, Oh T, Yang S, Shin J, Hwang S, Kim CY, et al. RiceNet v2: an improved 
network prioritization server for rice genes. Nucleic Acids Res. 
2015;43:W122–7. https://doi.org/10.1093/nar/gkv253.

20.	 Jin J, Tian F, Yang DC, Meng YQ, Kong L, Luo J, et al. PlantTFDB 4.0: toward a 
central hub for transcription factors and regulatory interactions in plants. 
Nucleic Acids Res. 2017;45(D1):D1040–5. https://doi.org/10.1093/nar/gkw982.

21.	 Tian F, Yang DC, Meng YQ, Jin J, Gao G. PlantRegMap: charting functional 
regulatory maps in plants. Nucleic Acids Res. 2020;48(D1):D1104–13.
https://doi.org/10.1093/nar/gkz1020.

22.	 South Green collaborators. The South Green portal: a comprehensive 
resource for tropical and Mediterranean crop genomics. Curr Plant Biol. 
2016;78:6–9. https://doi.org/10.1016/j.cpb.2016.12.002.

23.	 Hamelin C, Sempere G, Jouffe V, Ruiz M. TropGeneDB, the multi-tropical crop 
information system updated and extended. Nucleic Acids Res. 2013;41.
https://doi.org/10.1093/nar/gks1105.

24.	 Valentin G, Abdel T, Gaëtan D, Jean-François D, Matthieu C, Mathieu R. 
GreenPhylDB v5: a comparative pangenomic database for plant genomes. Nucleic 
Acids Res. 2020;49:D1464–71. https://doi.org/10.1093/nar/gkaa1068.

25.	 Larmande P, Gay C, Lorieux M, Périn C, Bouniol M, Droc G, et al. Oryza Tag 
Line, a phenotypic mutant database for the Génoplante rice insertion line 
library. Nucleic Acids Res. 2008;36:1022–7.
https://doi.org/10.1093/nar/gkm762.

26.	 Dereeper A, Homa F, Andres G, Sempere G, Sarah G, Hueber Y, et al. SNiPlay3: 
a web-based application for exploration and large scale analyses of genomic 
variations. Nucleic Acids Res. 2015;43:W295–300.
https://doi.org/10.1093/nar/gkv351.


27.	 The AgroLD online documentation. Accessed 31 July 2025.
http://www.agrold.org/documentation.jsp

28.	 The Sequence Ontology consortium. Sequence Ontology. 2015. Accessed 31 
July 2025. http://www.sequenceontology.org

29.	 The Gene Ontology Consortium. The Gene Ontology Resource: 20 years and 
still GOing strong. Nucleic Acids Res. 2019;47:D330–8.
https://doi.org/10.1093/nar/gky1055.

30.	 Walls RL, Cooper L, Elser J, Gandolfo MA, Mungall CJ, Smith B, et al. The Plant 
Ontology Facilitates Comparisons of Plant Development Stages Across 
Species. Front Plant Sci. 2019;10. https://doi.org/10.3389/fpls.2019.00631.

31.	 Cooper L, Meier A, Laporte MA, Elser JL, Mungall C, Sinn BT, et al. The 
planteome database: an integrated resource for reference ontologies, plant 
genomics and phenomics. Nucleic Acids Res. 2018.
https://doi.org/10.1093/nar/gkx1152.

32.	 Smith B, Ashburner M, Rosse C, Bard J, Bug W, Ceusters W, et al. The OBO 
foundry: coordinated evolution of ontologies to support biomedical data 
integration. Nat Biotechnol. 2007;25(11):1251–5.
https://doi.org/10.1038/nbt1346.

33.	 Dumontier M, Baker CJ, Baran J, Callahan A, Chepelev L, Cruz-Toledo J, et al. 
The semanticscience integrated ontology (SIO) for biomedical research and 
knowledge discovery. J Biomed Semant. 2014;5(1):14.
https://doi.org/10.1186/2041-1480-5-14.

34.	 Bolleman JT, Mungall CJ, Strozzi F, Baran J, Dumontier M, Bonnal RJP, et al. 
FALDO: a semantic standard for describing the location of nucleotide and 
protein feature annotation. J Biomed Semant. 2016;7:39.
https://doi.org/10.1186/s13326-016-0067-z.

35.	 Relation Ontology consortium. OBO Relation Ontology. 2018. Accessed 31 
July 2025. https://oborel.github.io

36.	 The global schema of AgroLD. Accessed 31 July 2025.
https://github.com/SouthGreenPlatform/AgroLD_ETL/tree/master/model

37.	 Cyganiak R. Tarql: SPARQL for Tables. 2018. Accessed 31 July 2025.
https://tarql.github.io

38.	 Dimou A, Sande M, Colpaert P, Verborgh R, Mannens E, Van De Walles R. RML: 
A generic language for integrated RDF mappings of heterogeneous data. 
In: The 7th Workshop on Linked Data on the Web (LDOW2014). CEUR 
Workshop Proc. 2014. https://ceur-ws.org/Vol-1184/ldow2014_paper_01.pdf

39.	 Lefrançois M, Zimmermann A, Bakerally N. A SPARQL Extension for 
Generating RDF from Heterogeneous Formats. In: Blomqvist E, Maynard D, 
Gangemi A, Hoekstra R, Hitzler P, Hartig O, editors. The Semantic Web. ESWC 
2017. Lecture Notes in Computer Science, vol 10249. Cham: Springer; 2017.
https://doi.org/10.1007/978-3-319-58068-5_3

40.	 The 1000 Genome project Consortium. The Variant Call Format VCF. 2012. 
Accessed 31 July 2025. http://samtools.github.io/hts-specs/

41.	 The Sequence Ontology Consortium. The formal specification of GFF3. 2014. 
Accessed 31 July 2025. http://www.sequenceontology.org

42.	 The Gene Ontology Consortium. Gene Annotation File GAF. 2014. Accessed 
31 July 2025. http://geneontology.org/page/go-annotation-file-format-20

43.	 The ETL API of AgroLD. Accessed 31 July 2025.
https://github.com/SouthGreenPlatform/AgroLD_ETL

44.	 Laibe C, Wimalaratne S, Juty N, Le Novère N, Hermjakob H. Identifiers.org: 
integration tool for heterogeneous datasets. Dils 2014. 2014;14.
https://doi.org/10.6084/m9.figshare.1232122.v1.

45.	 Singh A, Rawlings CJ, Hassani-Pak K. Knetmaps: a BioJS component to 
visualize biological knowledge networks. F1000Res. 2018.
https://doi.org/10.12688/f1000research.16605.1.

46.	 Rietveld L, Hoekstra R. The YASGUI family of SPARQL clients. Semant Web J. 
2015;30:10127–34.

Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in 
published maps and institutional affiliations.

