METHODS ARTICLE
published: 25 August 2012
doi: 10.3389/fphys.2012.00326
Frontiers in Physiology | Plant Physiology | www.frontiersin.org | August 2012 | Volume 3 | Article 326

Bridging the phenotypic and genetic data useful for integrated breeding through a data annotation using the Crop Ontology developed by the crop communities of practice

Rosemary Shrestha 1, Luca Matteis 2, Milko Skofic 2, Arllet Portugal 3, Graham McLaren 3, Glenn Hyman 4 and Elizabeth Arnaud 2*

1 Genetic Resources Program, Centro Internacional de Mejoramiento de Maiz y Trigo, Texcoco, Edo. de México, Mexico
2 Bioversity International, Maccarese, Rome, Italy
3 Generation Challenge Programme, Centro Internacional de Mejoramiento de Maiz y Trigo, Texcoco, Edo. de México, Mexico
4 Centro Internacional de Agricultura Tropical, Cali, Colombia

Edited by: Jean-Marcel Ribaut, Generation Challenge Programme, Mexico
Reviewed by: Omar Pantoja, Universidad Nacional Autonoma de Mexico, Mexico; Ruth Bastow, University of Warwick, UK
*Correspondence: Elizabeth Arnaud, Bioversity International, Via dei Tre Denari 472/a, 00057 Maccarese (Fiumicino), Rome, Italy. e-mail: earnaud@cgiar.org

The Crop Ontology (CO) of the Generation Challenge Programme (GCP) (http://cropontology.org/) is developed for the Integrated Breeding Platform (IBP) (https://www.integratedbreeding.net/) by several centers of the Consultative Group on International Agricultural Research (CGIAR): Bioversity, CIMMYT, CIP, ICRISAT, IITA, and IRRI. Integrated breeding necessitates that breeders access genotypic and phenotypic data related to a given trait. The CO provides validated trait names used by the crop communities of practice (CoP) for harmonizing the annotation of phenotypic and genotypic data, thus supporting data accessibility and discovery through web queries. The trait information is completed by the description of the measurement methods and scales, and by images. The trait dictionaries used to produce the Integrated Breeding (IB) fieldbooks are synchronized with the CO terms for an automatic annotation of the phenotypic data measured in the field. The IB fieldbook provides breeders with direct access to the CO to get additional descriptive information on the traits. Ontologies and trait dictionaries are online for cassava, chickpea, common bean, groundnut, maize, Musa, potato, rice, sorghum, and wheat. Online curation and annotation tools (http://cropontology.org) facilitate direct maintenance of the trait information and production of trait dictionaries by the crop communities. An important feature is the cross-referencing of CO terms with the crop database trait ID and with their synonyms in Plant Ontology (PO) and Trait Ontology (TO). Web links between cross-referenced terms in CO provide online access to data annotated with similar ontological terms, particularly the genetic data in Gramene (Cornell University) or the evaluation and climatic data in the Global Repository of evaluation trials of the Climate Change, Agriculture and Food Security programme (CCAFS). Cross-referencing and annotation will be further applied in the IBP.

Keywords: Crop Ontology, breeding trait, plant phenotype, trait dictionaries, breeding fieldbook, data annotation, integrated breeding platform, crop community of practice

INTRODUCTION

In recent years, sequence information has become readily available for a variety of crop species. However, a gap is emerging between the physical genome information and the quantitative information regarding phenotypes. It is becoming clear that the application of quantitative genetic information by researchers and breeders is limited by a lack of standard nomenclature used to describe both crop development and agronomic traits. Without either a nomenclature or information that provides the equivalence links between trait descriptions, it is hard to compare information from Quantitative Trait Loci (QTL) and association studies in a way that permits systematic transfer of knowledge about genotype-phenotype relationships among crops or between crops.

In the case of crop breeding programs, plant breeders repeatedly measure a large number of traits in order to understand the crop phenotype, based on variation in genotype and environment. Some traits are common across crops, whereas others are crop specific, such as anthesis silking interval (ASI) for maize. Common traits across crops can be measured with different methods and scales. Likewise, one trait could be measured under several environmental conditions at different growth stages within a crop. Therefore, the management of crop characterization and evaluation data in databases at the global level is always complex and critical. The situation is more complex for traits like resistance to disease or to abiotic stresses such as drought and salinity tolerance. For example, a plant pathologist could score stem rust disease in the greenhouse at seedling stage or in the field (adult plants for severity and incidence), by artificial inoculation of the pathogen or via natural infestation, using different scoring rating scales. To enable comparison of these different types of measurements related to a single trait, and to support future modeling of the correlation among several traits, the following are required: (1) that a nomenclature and controlled vocabularies in the form of ontologies are applied in databases and knowledge bases and (2) that the data generated by the trials/experiments are properly annotated by crop communities practiced in using validated trait names, and adjusted to the recommended methods of measurement and scales. Data annotation is the addition of metadata (i.e., ontological terms) that describe the data file and possibly the data point. Phenotype and genotype data annotation enables researchers to attach information and data to a botanical term, a development stage and a trait name. It can also be used to specify the process through which trait data has been obtained and its provenance. Although annotation of genetic data is commonplace, data produced via phenotyping studies are usually not annotated using a controlled vocabulary to facilitate their integration into multi-crop platforms.

APPLICATION OF THE INTEGRATED BREEDING CROP ONTOLOGY IN CROP RESEARCH

The fundamental scientific question underlying research on diverse genotypes of any plant species is “What is the causal relationship between genotype and phenotype?” DNA is transcribed into RNA, which is either bioactive itself (as non-coding RNA gene products) or is translated into peptides that form part of protein gene products. Ultimately, these products act as structural elements, genetic regulatory control factors, or modulators of the biochemical fluxes within metabolic and physiological pathways, at the sub-cellular, tissue, organ, and whole organism level. This sum total of molecular expression integrates the overall structural and behavioral features of the plant, that is, its “phenotype.” The unfolding of this story also has an essential environmental context, including biotic (ecosystem) and abiotic (geophysical) factors modulating expression in a variety of ways via diverse sensory and regulatory mechanisms in the plant. Various classes of experimental data associated with this tapestry of germplasm function are summarized in Figure 1.

FIGURE 1 | Biological relationships in germplasm research adapted from Bruskiewich et al. (2006).

Phenotypes and genotypes can be characterized at various levels of abstraction and resolution (Bruskiewich et al., 2006). For plant phenotypes, this includes measurements of traits at different growth stages, in various environments and treatment conditions. Genotypes include laboratory measurements of DNA and simple observations of visible phenotypes. The molecular variation measured by genotyping can be neutral or biologically significant. Neutral molecular variation generally involves markers that simply exhibit DNA structural polymorphism that is usefully applied to answer basic questions on the extent of similarity between germplasm samples (i.e., “fingerprinting” experiments) or on the chromosome location of a marker (i.e., “mapping” experiments). Answering such questions will often lead to deeper exploration of germplasm, such as evolutionary studies, practical management of plant crosses, and genetic resource management. Whatever the nature of phenotype and genotype measurements, the primary task is to completely capture and accurately codify the raw and derived phenotype and genotype data. The role of the ontology is precisely to support the description of all the pathways between the gene and the expression of the trait, enabling data interpretation (Shrestha et al., 2011). The Crop Ontology (CO) provides additional terms and descriptions of traits, along with methods and scales, that complement the Gene Ontology (GO; http://geneontology.org), Plant Ontology (PO; http://plantontology.org) and Trait Ontology (TO; http://www.gramene.org/) for bridging a wider set of annotated genetic, genomic, and phenotypic data with formalized phenotype descriptions, leading to data discovery. Documentation of protocols related to phenotypic data is very important for enabling comparison across crops, environments and plant growth stages, and the CO aims to provide comprehensive information about the trait and the measurement of the trait.

THE CROP ONTOLOGY (CO) AND THE TRAIT DICTIONARIES IN THE INTEGRATED BREEDING FIELDBOOK

The Integrated Breeding Platform (IBP; https://integratedbreeding.net/) is developed by the Generation Challenge Programme (GCP; http://www.generationcp.org/) for crop breeders. The objective of the IBP is to provide access to modern breeding technologies, breeding material, and related information and services, in a centralized and functional manner. This should improve plant breeding efficiency in developing countries and facilitate the adoption of molecular breeding approaches (Delannay et al., 2011). The Integrated Breeding fieldbook (referred to in the text as the IB Fieldbook, Figure 2) supports the harmonized capture of trait measurements in the evaluation sites and their integration in the crop databases. The fieldbook’s trait template is based on the trait dictionary and includes a link to the corresponding trait name in the IB CO.

FIGURE 2 | Integrated Breeding Fieldbook for capturing trait measurement with mobile devices.

The objectives of the integrated workflow between the IB Fieldbook, the Trait Dictionary and the CO are (1) for breeders and data managers to define a standard list of traits; (2) for breeders to access more information on the trait and the protocols used for measurement when defining their evaluation experiment; and (3) to provide an automatic annotation of the data captured by breeders via the CO terms. The CO, in combination with the crop trait dictionaries, provides a tool to foster the phenotypic and genotypic data curation and annotation by the communities of practice (CoP) of several crops using validated common trait names, particularly breeders’ traits, protocols, and scales.

CREATING TRAIT DICTIONARIES FOR THE CROP DATABASES AND THE FIELDBOOKS

The IB Fieldbook and the crop databases based on the International Crop Information System (ICIS) contain the trait dictionaries to support the harmonization of the trait measurements across the phenotyping sites and the data annotation across databases. The trait dictionaries and the ontology are embedded into the crop databases for cassava, chickpea, rice, maize, wheat, and soon for banana, groundnut, cowpea, common beans, pigeon pea, and sorghum. Each crop-specific trait ontology and dictionary will be maintained by a crop lead center and/or a crop research community.

To assist breeders, an Excel spreadsheet template was developed to simplify the process of submitting traits, trait descriptions, allocation of categories or valid ranges, and measurement protocols. Utilization of the trait template was very helpful to obtain extended trait information and manage the quality control of trait names within the databases. Multi-location evaluation programs are conducted in several countries, so trait names need to be stored in the fieldbooks and databases in several languages. An indicator of the language has also been added to the online trait dictionaries so that crop communities can send trait names in different languages via the basic trait template. The same term identifier will be used for the same trait in different languages, so that different versions of the same trait are referred to as synonyms to facilitate the search of data across languages. Recently, the trait dictionaries were used to prioritize the traits according to the frequency of use by breeders in their research programs and importance for the crop. The objective was to provide a core standard set of crop-specific traits that will appear by default in the crop fieldbook wherever the crop is evaluated. A list of optional traits is also available and can be added by the breeder according to the evaluation objective. All existing trait dictionaries have been uploaded in the CO and are also available for download on each crop page of the IBP website. The harmonization between the CO and the trait dictionaries will be continuously performed by the CoP, and the use of the online ontology will be prioritized to avoid deviation from a single reference list of traits, methods and scales.

DEPLOYING THE TRAIT DICTIONARIES ANNOTATED WITH THE CROP ONTOLOGY TERMS

The schema of the GCP crop database, along with the trait dictionaries, is being deployed within each CoP through the installation of a central database managed by the crop lead center and several local databases installed in the research stations and partner institutions. The trait dictionaries that include the CO terms are embedded into the central database and are maintained by crop data curators. The curator manages the validation and synchronization of trait dictionaries with the online CO curation tool. The local crop databases contain the reference trait dictionaries inherited from the central database, which are used to design the fieldbook template for the handheld or the printed form. This data flow (Figure 3) ensures that traits measured in the field are harmonized across sites and are captured within the template format. The CO terms and their identifiers, which are embedded into the fieldbook template, ensure that data are already annotated without any additional effort from the database curator. The annotated data could therefore easily be synchronized from the handheld data capture device to the local database and then to the central crop database.

FIGURE 3 | Trait data flow between the ontology, the crop databases and the field book.

DEVELOPMENT OF THE CROP-SPECIFIC TRAIT ONTOLOGIES

At present, the CO provides crop-specific trait ontologies for cassava, chickpea, maize, Musa, potato, sorghum, rice, and wheat, as well as online trait dictionaries for common bean, cowpea, and groundnut developed by the crop lead centers of the GCP challenge initiatives. These simple trait lists, built in the form of controlled vocabularies with short descriptions, do not fulfill all the requirements for ontology-based access to data. Therefore, the trait dictionaries will be upgraded into ontologies by adding multiple relationships and cross-referencing to other major ontologies. Since 2007, the crop-specific ontologies have been developed in the crop lead centers by teams of breeders, biometricians and data managers using the OBO-Edit software promoted by the Open Biomedical Ontology (OBO) communities such as GO (Ashburner and Lewis, 2002; Day-Richter et al., 2007), PO and TO (Jaiswal et al., 2002). By using OBO-Edit, ontology curators are able to construct the ontology from lists of traits, create the necessary multi-relationships between terms, and simultaneously create cross-references with the terms in TO and PO. Multi-relationships between biological terms provide the semantic framework that is necessary to model the biological pathways describing the expression of the traits in plants, in various tissues, at different development stages and in different environments.

The CO describes agronomic, morphological, physiological, quality, and abiotic and biotic stress-related traits of several crops using the most common “is_a” and “part_of” relations assigned by the OBO Foundry (Shrestha et al., 2010). The methodology that was applied for developing the PO and TO was also used for developing the CO. In order to embed methods and scales in the crop-specific ontologies, new ontological relations such as “method_of,” “scale_of,” and “derived_from” were created to meaningfully describe the traits and their relations to methods and scales (Figure 4).

FIGURE 4 | Representation of the multi-relationships of “Anthesis silking interval” in OBO-Edit.

THE ONLINE CROP ONTOLOGY SITE FOR A COMMUNITY-BASED CURATION AND ANNOTATION

In 2011, the new CO website (www.cropontology.org) was released, providing a tool for participatory ontology development, curation, and annotation by the crop database curators (Figure 5). Users can browse crop-specific ontologies and access the trait definition with the bibliographic reference, synonyms, images, and term abbreviation, as well as online cross-references to PO, TO and the GCP crop databases. The tool provides features for posting comments and printing trait information. Only crop-specific curators are allowed to upload ontologies, add new terms and attributes of traits, and edit text to control quality. Video tutorials are available on the website. The application is hosted on Google App Engine and its versioned code is hosted on GitHub.

FIGURE 5 | Crop Ontology homepage (http://www.cropontology.org).

Trait measurement methods are displayed as derived terms of the related trait name with the newly created relationship “method_of,” and scales are derived terms of their related method with the relationship “scale_of” (Figure 6). Providing protocols related to traits facilitates the selection of appropriate terms for data annotation and data exchange across databases.

FIGURE 6 | Online display of new relationships “method_of” and “scale_of” for “stem rust” along with information and images about the scale used for measurement.

The prototype of the online annotation tool was inspired by Terminizer, developed by David Hancock (University of Manchester, http://terminizer.org/). This tool allows the user to associate the ontology terms with existing trait names extracted from the database or text and so overcome the heterogeneous manner of naming the traits (Figure 7).

FIGURE 7 | Screenshot of the online annotation tool showing steps in the annotation process: (A) paste data or metadata to annotate; (B) the tool generates a table and the user can select one ontology (e.g., maize trait) before annotation; (C) information and images about the corresponding ontological term are displayed below the term selected for annotation (e.g., anthesis silking interval). Users can check and validate or reject the proposition.

EXPANDING THE USE OF THE CROP ONTOLOGY INTO THE INTERNATIONAL COMMUNITY FOR DATA DISCOVERY

New CO terms were submitted for addition to PO and TO. The collaboration will continue through the cross-referencing of PO, TO and CO in order to develop an internationally shared crop trait ontology. To extend the access to genetic information, CO curators have cross-referenced most of the traits with synonyms in PO and TO. An important online feature is the active web linkages of these cross-referenced terms, which direct users to the corresponding term-specific page on Gramene (Cornell) or on PO and to the annotated genetic data (e.g., QTL) associated with the trait, if available (Figure 8).

FIGURE 8 | Direct access to the QTL information associated with the trait “anthesis silking interval” on the Gramene website through the cross referencing link placed in the Crop Ontology.

The United States Department of Agriculture (USDA) and the Solanaceae Genomics Network (SGN), who are presently the most interested in cross-referencing their respective ontology and data with the GCP CO to enable data integration, have uploaded their respective ontologies on the online curation tool: the Soybean ontology for Soybase and the Solanaceae ontology.

AN OPEN SOURCE SERVER OF CROSS-REFERENCED TRAIT NAMES FOR DATA INTEGRATION

The online Integrated Breeding CO is a freely available resource that acts as an open-source server for names of traits thanks to an Application Programming Interface (API). The API enables programmatic access to the CO by web sites, web services or data template wizards that can dynamically synchronize their lists of traits with the CO. This synchronization supports the harmonization of data annotation and then enables the discovery of annotated data through web queries based on the ontology terms. The first site to use the API is the Global Agricultural Trial Repository of the CGIAR program on Climate Change, Agriculture and Food Security (CCAFS; http://www.agtrials.org:8080/). The CCAFS initiative dynamically links the names of variables measured during the evaluation of varieties with the CO terms. The objectives are (1) to facilitate the annotation of the data files by users with harmonized trait names; and (2) to provide users with access to detailed information on the variables (Figure 9).

FIGURE 9 | Screenshot showing the dynamic link from the variable “spikelet fertility” on Agtrials to additional information in the online Rice Ontology.

This cross-referencing prepares the ground for integration of online data into a single site, and the objective is to integrate this further within the IBP. Integrating the Agtrials website with the CO would provide, for any given trait and crop, access to the phenotypic data combined with geographical and environmental data (Figure 10).

FIGURE 10 | Mockup of an ontological trait based access to the map of trials on Agtrials.

CONCLUSIONS

The development of a GCP CO for breeders’ traits is a pioneering activity that was acknowledged by major partners in agronomic research and in the landscape of phenotype ontology development, such as the USDA, the Solanaceae Genomics Consortium, Cornell University, the PO Consortium, the National Center for Biotechnology Information and the NSF Research Coordination Network on Phenotype. The CO development is currently based on trait dictionaries defined by teams of breeders and data managers for direct use in the IB Fieldbook. This initiative facilitates direct annotation of breeders’ data captured in the field and will enable the integration of phenotypic and genetic data sets. It will also help the breeders, when evaluating traits in the field, to access the correct trait information they need, including detailed standard protocols and scales. Thanks to the new online curation and annotation tool, the curators of crop-specific ontologies can interactively modify existing trait names or add new ones along with images, methods and scales. A full ontology can easily be uploaded or created online, which encourages partnership for the cross-referencing of terms. Once published online, the cross-references of traits are converted into web links that give direct access to related data on other websites like Gramene (Cornell University) or Agtrials (CCAFS-CIAT). This is the premise of the integration of phenotypic, genotypic and environmental data associated with a given trait. The IBP will further utilize the CO to integrate as much as possible of the genetic data in the genomic data management system with the phenotypic data collected in the GCP phenotyping sites. This online access to the CO provides a useful mechanism for bridging a wider set of annotated genetic, genomic and phenotypic data with formalized phenotype descriptions that will lead to new data discovery.

ACKNOWLEDGMENTS

To the breeders and data managers who develop the Crop Ontology: Peter Kulakow, Bakare Moshood, Sam Ofodile, Ousmane Boukare, Antonio Lopez Montes (IITA); Trushar Shah, Prasad Peteti, Praveen R Reddy, Ibrahima Sissoko, Eva Weltzien, Isabel Vales, Suyah Patil (ICRISAT); Reinhard Simon (CIP); Inge van den Bergh, Stephanie Channeliere (Bioversity International); Mauleon Ramil, Nikki Borgia, Ruaraidh Sackville-Hamilton (IRRI); Alberto Fabio Guerero, Steve Beebe, Roland Chirwa (CIAT). We would also like to thank the Generation Challenge Programme (GCP) for providing the funds for this collaborative crop ontology development and implementation project.

REFERENCES

Ashburner, M., and Lewis, S. E. (2002). On ontologies for biologists: the gene ontology – uncoupling the web. Novartis Found. Symp. 247, 66–80. discussion: 80–83, 84–90, 244–252.

Bruskiewich, R., Metz, T., and McLaren, G. (2006). Bioinformatics and crop information systems in rice research. IRRN 31, 5–12.

Day-Richter, J., Harris, M. A., Haendel, M., The Gene Ontology OBO Edit Working Group., and Lewis, S. (2007). Obo-Edit – An ontology editor for biologists. Bioinformatics 23, 2198–2200.

Delannay, X., McLaren, G., and Ribaut, J. M. (2011). Fostering molecular breeding in developing countries. Mol. Breed. 29, 857–873.

Jaiswal, P., Ware, D., Ni, J., Chang, K., Zhao, W., Schmidt, S., Pan, X., Clark, K., Teytelman, L., Cartinhour, S., Stein, L., and McCouch, S. (2002). Gramene: development and integration of trait and gene ontologies for rice. Comp. Funct. Genomics 3, 132–136.

Shrestha, R., Arnaud, E., Mauleon, R., Senger, M., Davenport, G. F., Hancock, D., Morrison, N., Bruskiewich, R., and McLaren, G. (2010). Multifunctional crop trait ontology for breeders’ data: field book, annotation, data discovery and semantic enrichment of the literature. AoB Plants 2010, plq008.

Shrestha, R., Davenport, G. F., Bruskiewich, R., and Arnaud, E. (2011). “Development of crop ontology for sharing crop phenotypic information,” in Drought Phenotyping in Crops: From Theory to Practice, eds P. Monneveux and J. M. Ribaut (Mexico: Generation Challenge Programme (GCP), c/o CIMMYT), 167–176.

Conflict of Interest Statement: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Received: 11 April 2012; accepted: 25 July 2012; published online: 25 August 2012.

Citation: Shrestha R, Matteis L, Skofic M, Portugal A, McLaren G, Hyman G and Arnaud E (2012) Bridging the phenotypic and genetic data useful for integrated breeding through a data annotation using the Crop Ontology developed by the crop communities of practice. Front. Physio. 3:326. doi: 10.3389/fphys.2012.00326

This article was submitted to Frontiers in Plant Physiology, a specialty of Frontiers in Physiology.

Copyright © 2012 Shrestha, Matteis, Skofic, Portugal, McLaren, Hyman and Arnaud. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in other forums, provided the original authors and source are credited and subject to any copyright notices concerning any third-party graphics etc.
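The trait, method and scale relations described in the section on crop-specific trait ontologies, together with the use of one term identifier for trait names in several languages, can be illustrated with a minimal sketch. The Python below is not the CO implementation or its API. It assumes a toy in-memory term store, and every name in it is hypothetical and chosen only for illustration: the `EX:` identifiers, the `Term` class, the `protocol` and `annotate` functions, and the sample trait, method, scale and synonyms. Under those assumptions it shows how “method_of” and “scale_of” links let one walk from a trait to its methods and scales, and how synonyms sharing one identifier allow fieldbook column headers to be annotated.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Term:
    """One ontology term. All IDs and names below are made-up placeholders."""
    term_id: str
    name: str
    # (relation, target term ID) pairs, e.g. ("method_of", "EX:0000001")
    relations: List[Tuple[str, str]] = field(default_factory=list)
    # language code -> synonyms that share this term's identifier
    synonyms: Dict[str, List[str]] = field(default_factory=dict)


# Toy ontology (illustrative only): scale --scale_of--> method --method_of--> trait
TERMS: Dict[str, Term] = {
    t.term_id: t
    for t in [
        Term("EX:0000001", "plant height",
             synonyms={"es": ["altura de planta"], "fr": ["hauteur de la plante"]}),
        Term("EX:0000002", "ruler measurement from soil surface to plant tip",
             relations=[("method_of", "EX:0000001")]),
        Term("EX:0000003", "cm",
             relations=[("scale_of", "EX:0000002")]),
    ]
}


def children(term_id: str, relation: str) -> List[Term]:
    """Return the terms that point at `term_id` through `relation`."""
    return [t for t in TERMS.values() if (relation, term_id) in t.relations]


def protocol(trait_id: str) -> List[Tuple[str, List[str]]]:
    """For a trait, list each method together with the names of its scales."""
    return [
        (m.name, [s.name for s in children(m.term_id, "scale_of")])
        for m in children(trait_id, "method_of")
    ]


def build_index() -> Dict[str, str]:
    """Map every lower-cased term name or synonym to its term ID."""
    index: Dict[str, str] = {}
    for t in TERMS.values():
        index[t.name.lower()] = t.term_id
        for names in t.synonyms.values():
            for n in names:
                index[n.lower()] = t.term_id
    return index


def annotate(headers: List[str]) -> Dict[str, Optional[str]]:
    """Attach a term ID to each column header, or None when nothing matches."""
    index = build_index()
    return {h: index.get(h.strip().lower()) for h in headers}


if __name__ == "__main__":
    print(protocol("EX:0000001"))
    print(annotate(["Plant height", "altura de planta", "grain colour"]))
```

In this sketch the English and Spanish headers resolve to the same identifier, which is the property the trait dictionaries rely on for searching data across languages. Headers with no match are returned as `None` rather than guessed, leaving validation to a curator.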