Real-time Media Analysis using Large Language Model (LLM) for the Top 5 Prioritized Pests and Diseases

Soonho Kim¹, Xingyi Song², Boyeong Park¹, Daeun Ko¹, and Yanyan Liu¹
¹International Food Policy Research Institute (IFPRI), Washington, USA; ²University of Sheffield, Sheffield, UK
December 2024

Executive Summary

This report presents an overview of the real-time media analysis system developed to assess risks associated with the top five prioritized pests and diseases affecting crops. The activity, under Work Package 2 of the CGIAR Research Initiative on Plant Health, uses text mining and machine learning techniques, including a Large Language Model (LLM), to process and analyze media articles. Key achievements include:

• the development of an automated media analysis pipeline to monitor pests and diseases globally;
• the integration of GPT-4 to classify and extract detailed information from news articles;
• the creation of a public, interactive Crop Disease Dashboard providing real-time insights;
• the implementation of a cloud-based interface and REST API for user-friendly interaction and integration; and
• the ongoing refinement of the system based on human verification and feedback.

This approach aims to strengthen crop health monitoring and support policymakers and researchers in mitigating the risks posed by crop diseases and pests.

Table of Contents

Executive Summary
1. Introduction
2. Methods
   Identifying the top 5 prioritized pests and diseases and corresponding Google articles
   Determining the information to extract from the Google articles
3. Results
   The overview of the updated system
   Upgrade Disease Analysis Automation Pipeline
   Verification of outputs by human
   Disease Analysis Cloud Service, GUI and REST API
   Interactive Dashboard on top 5 prioritized diseases
4. Conclusion
Acknowledgements

1. Introduction

Pest and disease outbreaks pose significant challenges to agricultural productivity. By employing eco-friendly strategies, the CGIAR Research Initiative on Plant Health aims to minimize crop losses caused by pests and diseases. This approach plays a crucial role in unlocking the agricultural potential of target countries, thereby improving food, feed, and nutrition security, while also enhancing the livelihoods of millions of smallholder farmers and consumers.

A cornerstone of this initiative is the pest risk monitoring conducted under Work Package (WP) 2. This monitoring concentrates on the top five prioritized pests and diseases and tracks them through daily analyses of media articles.
The analytical methodology integrates two primary elements: text mining through natural language processing and a Large Language Model (LLM). Together, these tools are designed to identify the various impacts of pests and diseases, including quantitative losses, qualitative losses, and crop fatalities, as reported in media content.

The monitoring is executed in three distinct stages:

• Stage 1, initiated in 2022, involved identifying the top 5 prioritized pests and diseases and developing a tailored media analysis system.
• Stage 2, carried out in 2023, focused on the implementation and deployment of the media analysis system.
• Stage 3, implemented in 2024, refined the system further and published a monthly crop disease dashboard within the Food Security Portal.

The collaboration with the University of Sheffield has been instrumental, particularly in leveraging the GATE system, an open-source software package designed for text analysis using natural language processing. Additionally, this report describes the integration of GPT-4 into the crop disease analysis framework to further enhance text mining capabilities.

2. Methods

Identifying the top 5 prioritized pests and diseases and corresponding Google articles

Researchers involved in WP2 activities provided a list of pests and diseases prioritized for the first phase of surveillance activities under WP1, as shown in Table 1.

Table 1. Priority pests and diseases for surveillance

ID | P&D | Countries | Crop
1 | Fusarium head blight | Mexico, Tanzania | Wheat
2 | Wheat rust | Mexico, Ethiopia, Kenya, Uganda, Tanzania, Morocco, Lebanon, Egypt | Wheat
3 | Wheat blast | Zambia, Bangladesh | Wheat
4 | B. cockerelli, CaLso & PPT | Ecuador, Peru | Potato
5 | Fusarium TR4 | Ecuador, Mozambique, Tanzania, Vietnam | Banana

Table 2 provides information about the prioritized pests and diseases, detailing their associated countries, affected crops, and the RSS feed URLs used for media analysis.
Fusarium head blight is monitored in Mexico and Tanzania, primarily affecting wheat, with its feed accessible through a Google News RSS link. Wheat rust is prevalent across Mexico, Ethiopia, Kenya, Uganda, Tanzania, Morocco, Lebanon, and Egypt, also affecting wheat. Wheat blast is monitored in Zambia and Bangladesh, again affecting wheat. B. cockerelli, CaLso & PPT is monitored in Ecuador and Peru, affecting potatoes, while Fusarium TR4 is significant in Ecuador, Mozambique, Tanzania, and Vietnam, with bananas as the affected crop. Lastly, Fusarium wilt is observed in Uganda, also affecting bananas. Each pest and disease entry includes a corresponding RSS feed URL to facilitate ongoing media tracking through keyword filtering and refinement.

Table 2. Corresponding RSS feeds for pests and diseases

P&D | Countries | Crop | Feed URL
Fusarium head blight | Mexico, Tanzania | Wheat | https://news.google.com/rss/search?q=Fusarium%20head%20blight&hl=en-US&gl=US&ceid=US:en
Wheat rust | Mexico, Ethiopia, Kenya, Uganda, Tanzania, Morocco, Lebanon, Egypt | Wheat | https://news.google.com/rss/search?q=Wheat%20rust&hl=en-US&gl=US&ceid=US:en
Wheat blast | Zambia, Bangladesh | Wheat | https://news.google.com/rss/search?q=Wheat%20rust&hl=en-US&gl=US&ceid=US:en
B. cockerelli, CaLso & PPT | Ecuador, Peru | Potato | https://news.google.com/rss/search?q=Bactericera%20cockerelli%20cockerelli&hl=en-US&gl=US&ceid=US:en
Fusarium TR4 | Ecuador, Mozambique, Tanzania, Vietnam | Banana | https://news.google.com/rss/search?q=Fusarium%20TR4&hl=en-US&gl=US&ceid=US:en
Fusarium wilt | Uganda | Banana | https://news.google.com/rss/search?q=Fusarium%20wilt&hl=en-US&gl=US&ceid=US:en

Determining the information to extract from the Google articles

We aim to extract specific information from Google articles: disease names, host plants, scientific and common names, impacted areas, affected countries, sub-regions, types of impact (qualitative, quantitative, or fatalities), duration (start and end dates), and the origin of the pest or disease. This structured approach supports consistent monitoring of the prioritized pests and diseases.

Following this, we designed a media analysis system using the General Architecture for Text Engineering (GATE) platform developed by the University of Sheffield. Plant disease names, pest names, and host terminology were integrated with the European and Mediterranean Plant Protection Organization (EPPO) database and an in-house University of Sheffield database crawled from Wikipedia. A terminology extraction algorithm was developed to extract plant names, diseases, and their relationships from Wikipedia articles, enriching the original EPPO database. The web-crawled database was subsequently cleaned to remove noise and incorrect terminology.

Damage type prediction was conducted at the sentence level, where GPT-4 classified each sentence into one of the following categories: pest causing the death of the affected plant; pest causing quantitative production losses; pest causing qualitative production losses; or no damage mentioned.
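The sentence-level damage classification described above can be illustrated with a minimal sketch. It is not the project's actual implementation: the prompt wording, the function names (build_prompt, parse_label, classify_sentences), the ask_model callable standing in for an LLM service, and the choice to default to "no damage mentioned" when a reply cannot be interpreted are all assumptions made for illustration. Only the four category labels come from the report.

```python
from typing import Callable, Dict, List

# The four damage categories named in the report.
LABELS: List[str] = [
    "pest causing the death of the affected plant",
    "pest causing quantitative production losses",
    "pest causing qualitative production losses",
    "no damage mentioned",
]

# Assumption: unusable replies default to the "no damage" category.
FALLBACK = "no damage mentioned"


def build_prompt(sentence: str) -> str:
    """Compose a classification prompt (the wording is hypothetical)."""
    options = "\n".join(f"{i}. {label}" for i, label in enumerate(LABELS, start=1))
    return (
        "Classify the sentence into exactly one category.\n"
        f"{options}\n"
        f"Sentence: {sentence}\n"
        "Answer with the category number only."
    )


def parse_label(reply: str) -> str:
    """Map a free-text model reply onto one of the four labels."""
    text = reply.strip().lower()
    # Accept a reply that starts with a category number 1-4 ...
    if text and text[0] in "1234":
        return LABELS[int(text[0]) - 1]
    # ... or one that contains the label text itself.
    for label in LABELS:
        if label in text:
            return label
    return FALLBACK


def classify_sentences(
    sentences: List[str], ask_model: Callable[[str], str]
) -> List[Dict[str, str]]:
    """Classify each sentence; ask_model is a hypothetical stand-in for the LLM call."""
    return [
        {"sentence": s, "damage_type": parse_label(ask_model(build_prompt(s)))}
        for s in sentences
    ]
```

Keeping the model call behind a plain callable means the prompt and label-parsing logic can be tested with a stub, and a different classifier can be substituted without changing the surrounding code.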
The classification system consists of two primary components:

• Pre-processing: the input document is processed through the GATE Natural Language Processing toolkits, which include text extraction (from PDF and HTML inputs) and sentence splitting.
• Classification: an in-house fine-tuned BERT model serves as the classification model. Because the BERT language model is pre-trained, it reduces the need for extensive training data.

We initially examined the extraction of impacted areas, duration, and origin country using JAPE, the grammar-based pattern extraction language in GATE. Figure 1 provides an overview of the methodology used to identify each item from Google articles.

Figure 1. Key information extracted from articles and the applied methods

• Disease name (scientific name, common name, local name): keyword matching/synonym matching
• Host (wheat, potato, banana; specify the name of hosts): keyword matching/synonym matching
• Pest name, i.e., virus, bacteria, fungus (scientific name, common name): keyword matching/synonym matching
• Impacted area, the area where the pest/disease is shown (country name; sub-region within the country): location identification
• Type of impact (pest causing the death of the affected plant; pest causing qualitative production losses; pest causing quantitative production losses; no damage mentioned): GPT-4
• Duration (starting date, ending date): pattern matching
• Origin country, the area where the pest/disease came from: pattern matching

3. Results

The overview of the updated system

The enhanced pipeline of the disease media analysis system, now utilizing GPT-4, is depicted in Figure 2. The system includes the following key components:

1. RSS and HTML Parser Component: An RSS parser extracts news URLs from the IFPRI RSS feed, followed by an HTML parser that retrieves the title and main content of the articles. The extracted articles are then stored locally.

2. Document-Level Named Entity Recognition (NER) Components: Five NER components are implemented to extract mentions of Disease, Pest, Country, Host, and Duration from the news articles. These mentions are formatted as follows:
   • Country: ISO 3-letter codes.
   • Disease: Scientific and common names.
   • Pest: Scientific and common names.
   • Host: Scientific and common names.

3. Sentence Splitter: A sentence splitter breaks the content of the news articles into sentence-level segments.

4. Damage Type Sentence-Level Classification: GPT-4 is employed to classify each sentence into one of the following categories:
   • Pest causing the death of affected plants.
   • Pest causing quantitative production losses.
   • Pest causing qualitative production losses.
   • No damage mentioned.
   The classification uses the GPT-4 API service with tailored prompts to extract the relevant information efficiently. Timeouts and safeguards were implemented to handle potential server failures.

5. Relation Linking: This component connects the mentions extracted by NER to the corresponding sentences with damage labels (e.g., pest causing death, quantitative losses) and organizes the results into a JSON format for output.

Figure 2. The pipeline of the disease media analysis system

An automated cron service is deployed on a dedicated Amazon GPU server. The pipeline retrieves the RSS feed three times daily and saves the output to the local disk with a timestamp. Evaluation and error analysis are conducted on the outputs to ensure accuracy and reliability.

Upgrade Disease Analysis Automation Pipeline

The automation pipeline with GPT-4 has undergone significant upgrades, enhancing its overall functionality and performance.
These enhancements include:

• The adaptation of the Google News feed with improved redirect link functionality to streamline data collection.
• The integration of GPT-4 as the primary analysis tool within the pipeline, improving analysis efficiency and addressing challenges associated with sparse training data.
• The implementation of a backup system that uses the proven BERT pipeline to ensure uninterrupted service and maintain the system’s reliability, particularly during unexpected GPT-4 downtimes.
• The execution of media analysis for specific types of crop diseases through manual pipeline runs on selected news articles, paving the way for more accurate and human-led analyses.

The pipeline now automatically processes any new data from the RSS feeds, enabling timely and efficient analysis of media related to crop diseases.

Verification of outputs by human

The system outputs were evaluated manually for model fine-tuning. Researchers with basic knowledge of the prioritized pests and diseases checked the outputs for inaccuracies. Because verification is time-consuming and labor-intensive, a random sample of outputs was selected for this process, using an Excel file generated by the system. The researchers independently reviewed the original Google articles (feeds) and compared them with the system outputs at the sentence level, making corrections where necessary. These corrected outputs were subsequently reintroduced into the model as training sets, enabling further refinement and improvement of the system.

Disease Analysis Cloud Service, GUI and REST API

We developed a cloud-based interface (https://cloud.gate.ac.uk/shopfront/displayItem/crop-pest-classifier) to facilitate disease analysis testing.
This platform provides scalable, remote access to analytical tools and features a graphical interface designed to simplify interaction for users with diverse technical backgrounds. Additionally, the platform offers a REST API that enables data exchange with external systems.

Interactive Dashboard on top 5 prioritized diseases

The Crop Disease Dashboard, shown in Figure 3, is an interactive visualization of the findings from the media analysis.

Figure 3. A screenshot of the interactive Crop Disease Dashboard

The dashboard features an interactive world map indicating impacted countries in different shades of brown, representing data related to crop diseases. Below the map, a detailed table lists the number of articles, dates, affected countries, host commodities, common and scientific names of diseases, the type of impact, and article titles and links. As shown in Figure x, the dashboard is accessible to public users through the Food Security Portal project facilitated by the International Food Policy Research Institute (https://www.foodsecurityportal.org/node/2784).

4. Conclusion

This activity developed a system for monitoring and analyzing risks associated with the top five prioritized pests and diseases. Using a combination of GPT-4 and text mining techniques, the system processes media articles to extract detailed information, including disease names, impacted areas, and types of damage. This structured approach has enhanced the monitoring process and facilitated data-driven decision-making.

The automation pipeline underwent substantial upgrades, including the integration of GPT-4 as the primary analysis tool, the implementation of a backup system with BERT, and the refinement of article classification through human verification.
These advancements addressed prior challenges, such as the inclusion of irrelevant content, by introducing pre-processing steps to improve accuracy and reduce false positives.

Key outcomes of this activity include the creation of an interactive Crop Disease Dashboard, providing real-time insights through a public-facing platform. This dashboard integrates outputs from the media analysis system, offering detailed visualizations and data summaries for policymakers, researchers, and stakeholders. The collaboration between IFPRI and the University of Sheffield has been pivotal, particularly through leveraging the GATE system and fine-tuning the model for accurate extraction of media information. This activity underscores the potential of AI-driven tools to advance agricultural health monitoring and support global food security initiatives.

Acknowledgements

We would like to thank all funders who supported this research through their contributions to the CGIAR Trust Fund: https://www.cgiar.org/funders/

We gratefully acknowledge the Food Security Portal for co-funding this project, an invaluable partnership that has significantly contributed to the advancement of our media analysis system and the ongoing efforts to safeguard crop health and global food security.

This publication has been prepared as an output of the CGIAR Initiative on Plant Health. This publication has not been independently peer-reviewed. Any opinions expressed here belong to the author(s) and are not necessarily representative of or endorsed by CGIAR. In line with principles defined in CGIAR’s Open and FAIR Data Assets Policy, this publication is available under a CC BY 4.0 license. © The copyright of this publication is held by IFPRI, in which the Initiative lead resides.