TECHNICAL REPORT

CGIAR Initiative on  Digital Innovation  |  on.cgiar.org/digital 1

WaterCopilot: A Water Management AI Virtual Assistant for the Limpopo 
River Basin Digital Twin - Technical Guide 
Keerththanan Vickneswaran1, Hugo Retief2, Rafael Padilha3, Chris Dickens1, Paulo Silva1,  Mariangel Garcia 
Andarcia1

1 International Water Management Institute, Colombo, Sri Lanka
2 Association for Water and Rural Development, South Africa
3 Microsoft Research, USA

Citation
Vickneswaran, K.; Retief, H.; Padilha, 
R.; Dickens, C.; Silva, P.; Garcia 
Andarcia, M. 2024. WaterCopilot: a 
water management AI virtual assistant 
for the Limpopo River Basin Digital 
Twin - technical guide. Colombo, Sri 
Lanka: International Water 
Management Institute (IWMI). 
CGIAR Initiative on Digital 
Innovation. 23p. 

INFORMATION

Keywords Artificial intelligence, 
large language models, 
natural language 
processing, natural 
resources management, 
water management, 
environmental 
monitoring 

Flagship Digital Twin

Work package Real-time monitoring

Partners  The Leona M. And 
Harry B. Helmsley 
Charitable Institute, 
IWMI, AWARD, 
Microsoft Research

ABSTRACT
The present document provides a comprehensive overview of the development, 
architecture, and capabilities of the Limpopo Digital Twin Chatbot or Copilot 
(WaterCopilot). WaterCopilot is an AI-driven virtual assistant designed to enhance 
data accessibility and support decision-making for water management in the Limpopo 
River Basin (LRB). It has been developed through collaboration between the 
International Water Management Institute (IWMI) and Microsoft Research.  
WaterCopilot integrates advanced natural language processing with real-time data 
retrieval to address key challenges in water resource management, including 
fragmented information sources, manual data processing, and delays in response.

The document outlines the project's objectives, system architecture, and modular 
plugin approach, which enables the Copilot to seamlessly connect with various 
datasets, including real-time environmental data, historical records, and policy 
documents related to water availability, rainfall patterns, and environmental flow. By 
leveraging Azure OpenAI services, WaterCopilot interprets user queries and retrieves 
relevant information.  Key features of the Copilot include real-time monitoring of 
water availability, rainfall patterns, and environmental flow alerts, as well as user-
friendly data visualizations and contextual insights.

The deployment strategy utilizes Docker containers on AWS infrastructure, ensuring 
scalability, reliability, and efficient performance of the Copilot. This document also 
addresses the technical challenges encountered during development, the solutions 
implemented to create a robust and adaptable system, and outlines future work aimed 
at further enhancing WaterCopilot's capabilities. This detailed documentation serves 
as a technical guide to understanding WaterCopilot's capabilities, architecture, and 
future directions, emphasizing its role in supporting sustainable water management 
across the LRB.

NOTE: This is a Research Project and caution should be applied when AI is used for 
decision-making, as artificial intelligence and machine learning are rapidly evolving 
fields of study. Given the probabilistic nature of machine learning, the use of this work 
may, in some situations, result in outputs that do not accurately reflect real people, 
places, or facts.


TECHNICAL REPORT

CGIAR Initiative on  Digital Innovation  |  on.cgiar.org/digital 2

INTRODUCTION

WaterCopilot is an innovative chatbot or Copilot developed to 
provide comprehensive support for stakeholders involved in 
the management and analysis of the LRB. The project is a 
collaborative effort between IWMI and Microsoft Research, 
harnessing advanced AI technologies to deliver timely, 
accurate, and context-sensitive information. Large language 
models are AI systems capable of understanding and 
generating human language by processing vast amounts of text 
data (IBM, n.d.). Leveraging these advanced language 
processing capabilities, WaterCopilot aims to streamline data 
retrieval and enhance decision-making processes in a region 
critically dependent on its water resources.

The Limpopo River Basin (LRB), shown in Figure 1, is a vital 
watershed that spans four countries—Botswana, Mozambique, 
South Africa, and Zimbabwe. It serves as an essential source of 
water for millions of people, providing irrigation for 
agriculture, supporting biodiversity, and sustaining livelihoods 
(Sitoe & Qwist-Hoffman, 2013). However, the complexity and 

transboundary nature of the basin pose significant challenges 
for effective management and policymaking. Many 
applications in environment and water management require the 
capability to autonomously extract useful information from 
vast amounts of data in real time. The vast information related 
to the river's environmental conditions, rainfall patterns, and 
water management practices is often scattered across various 
platforms, documents, and databases, making it difficult to 
manage using traditional workflows. (Sun & Scanlon, 2019).

This AI virtual assistant, developed as part of the Digital Twin 
project, was designed to address these challenges by providing 
a centralized platform for accessing and processing essential 
information on the Limpopo River Basin. The Digital Twin 
concept integrates real-time data, simulation models, machine 
learning, and reasoning tools to create a virtual representation 
of the basin, enabling users to visualize scenarios and make 
informed decisions (Garcia Andarcia et al., 2024). Central to 
this project is the foundational hydrological model for the 
Limpopo River Basin, developed using the Soil and Water 
Assessment Tool Plus (SWAT+), which provides a robust 

Figure 1. Geographical Map of the Limpopo River Basin Highlighting Drainage Patterns and Key Locations [source: LIMCOM]


TECHNICAL REPORT

CGIAR Initiative on  Digital Innovation  |  on.cgiar.org/digital 3

framework for analyzing water availability and flow dynamics 
in the basin (Gurusinghe et al., 2024). This foundational model 
is enhanced through automation processes, such as the 
dynamic integration of real-time climatic data, including 
rainfall updates, which transition the model from a static 
framework to an operational tool. By enabling accurate 
predictions for rainfall trends, river flows, and water 
availability, this operationalized SWAT+ model supports 
sustainable water management across diverse climatic zones in 
the basin (Leitão et al., 2024).

The development of the Copilot was guided by Microsoft 
Research's prior experience with Farm Vibes, a platform 
designed to assist farmers in agricultural decision-making 
(Microsoft, n.d.). This expertise provided critical insights for 
adapting similar principles to water management challenges in 
the LRB.

Users can engage with the Copilot to inquire about current and 
historical rainfall, river flows, water availability, 
environmental flow (also referred to as E-Flows) alerts, and 
many other characteristics. The Copilot's interactive interface 
allows users to explore predefined topics, facilitating access to 
the information they need without navigating through 
cumbersome websites or lengthy documents, but it also allows 
a user to explore any topic related to water resources 
management in the LRB. By consolidating multiple data 
sources into a user-friendly format, WaterCopilot enhances the 
accessibility of essential information for researchers, water 
resource managers, and policymakers. 

NOTE: This paper describes the Digital Twin and the 
WaterCopilot in its prototype format.  While the essence of the 
Copilot has been completed, the addition of multiple streams 
of new data and information will greatly expand its capabilities 
over the years.   

PROJECT OBJECTIVES

The WaterCopilot project aims to address the critical need for 
accessible and efficient data retrieval concerning the LRB. As 
a collaborative effort with Microsoft Research, this initiative 
leverages advanced artificial intelligence technologies, 
particularly Large Language Models (LLMs), to facilitate user 
engagement and provide reliable information. The project 
objectives are as follows:

Primary Objectives

1. Enhance Accessibility of Information

One of the primary objectives of WaterCopilot is to streamline 
access to information regarding the LRB. Traditional data 
collection methods and existing web resources often present 
challenges, making it hard to navigate and time-consuming for 
users to find relevant information. By developing a Copilot that 
serves as a digital assistant, users can quickly and efficiently 
retrieve comprehensive data on rainfall, river flows, 

environmental flows and water availability, and can even 
generate alerts to pending transgression of targets.

WaterCopilot utilizes a pre-trained LLM model to summarize 
complex datasets and provide actionable insights, ensuring 
users receive relevant and concise information tailored to their 
queries. This allows for not only quicker access but also a 
holistic view of the available data, as users can obtain multiple 
insights in one place. Additionally, the Copilot enables 
retrieval of data from static documents, allowing users to 
gather additional information about the LRB. By consolidating 
multiple data sources and offering intelligent summaries, 
WaterCopilot significantly enhances the accessibility and 
usability of vital information, empowering stakeholders to 
make informed decisions with ease.

2. Provide Real-time Data and Analysis

Another key objective is to enable users to access real-time 
data through the Copilot interface. The WaterCopilot is 
designed to interact seamlessly with multiple databases, using 
APIs to deliver up-to-date information about the LRB. This 
feature ensures that users are equipped with the latest insights, 
allowing for informed decision-making regarding water 
management, agricultural practices, and environmental 
assessments. By providing a platform for real-time data 
analysis, WaterCopilot empowers stakeholders to respond 
promptly to changing conditions in the basin.

3. Support Decision-Making with Comprehensive Insights

The WaterCopilot aims to support decision-making processes 
by offering comprehensive insights derived from both live data 
and historical records. The Copilot can analyze trends, 
summarize findings, and provide graphical representations of 
data, thus enabling users to visualize complex information. 
This capability is particularly beneficial for policymakers and 
water resource managers who require a deep understanding of 
the basin's dynamics to formulate effective strategies for 
sustainable water use. By transforming raw data into 
meaningful insights, WaterCopilot has the potential to serve as 
an essential tool for informed decision-making.

Secondary Objectives

1. Foster Collaboration and Knowledge Sharing

Through its development, the WaterCopilot project 
emphasizes collaboration and knowledge sharing among 
different stakeholders involved in water resource management. 
The integration of Microsoft Research's expertise and 
technological advancements enhances the Copilot's 
capabilities, facilitating a platform where users can exchange 
ideas, share experiences, and collaborate on solutions for the 
LRB. This collaborative approach not only enriches the 
knowledge base but also strengthens partnerships among 
researchers, practitioners, and policymakers working towards 
sustainable water management.


TECHNICAL REPORT

CGIAR Initiative on  Digital Innovation  |  on.cgiar.org/digital 4

2. Promote User Engagement and Education

The WaterCopilot is also dedicated to promoting user 
engagement and education on issues concerning water 
management practices in the LRB. By providing an interactive 
platform that is user-friendly and accessible, the Copilot 
encourages users to explore and learn about the basin's 
challenges and opportunities. The integration of multilingual 
support further enhances engagement, allowing users from 
diverse linguistic backgrounds to access vital information. By 
fostering an educational environment, WaterCopilot aims to 
empower users, even those without the traditional skills used 
for water resources management, with the knowledge 
necessary for effective participation in water resource 
management initiatives.

3. Continuous Improvement and Adaptation

Lastly, the project objectives include a commitment to 
continuous improvement and adaptation of the WaterCopilot. 
User feedback will be integral in refining the Copilot's 
functionalities and expanding its knowledge base. By regularly 
updating the system with new data sources, features, and 
enhancements, WaterCopilot will become a relevant and 
effective tool for users in the ever-evolving landscape of water 
resource management. This dedication to improvement 
ensures that the Copilot will grow and increasingly meet the 
needs of its users and support sustainable practices in the LRB.

SYSTEM REQUIREMENTS

• The system shall enable users to query both forecast and
historical data using natural language.

• The system shall provide current information on rainfall,
river flows, water availability, and environmental flow
data relevant to the LRB.

• The system shall retrieve and summarize relevant
documents, reports, and static datasets through Azure AI
Services.

• The system shall connect seamlessly with APIs to retrieve
data from databases.

• The system shall generate graphical representations (e.g.,
charts and graphs) for environmental metrics such as
rainfall patterns and river flow statistics.

• The system shall provide users with real-time alerts and
visualizations for monitoring eflow thresholds and other
critical indicators in the LRB.

• The system shall summarize complex data and highlight
key insights to support decision-making for water
management stakeholders.

• The system shall offer contextual insights based on
historical trends and patterns, giving users a
comprehensive view of environmental changes in the

LRB.

• The system shall maintain previous interaction and
conversation knowledge to improve interactions within
the same conversation.

• The system shall enable interactions in multiple
languages, including English, Portuguese, and French, to
accommodate diverse user groups.

• The system shall ensure each response includes a
reference to the original data source, promoting
transparency and trustworthiness.

SYSTEM DESIGN
High-Level Design

The WaterCopilot system is designed to provide users with 
real-time and static information about the LRB by integrating 
various data sources, including static documents and dynamic 
databases. This interactive Copilot interface interprets user 
queries, retrieves relevant data, and presents responses in a 
user-friendly format.

As shown in Figure 2, the architecture consists of multiple 
components: the web interface for user interaction, the Copilot 
Agent for query orchestration, plugins for accessing both static 
and dynamic data, and the integration of Azure AI for 
enhanced natural language processing and search capabilities. 
Together, these components create a seamless and powerful 
platform that allows users to access both static and real-time 
data on the LRB.

This high-level design outlines the overall architecture of 
WaterCopilot and its approach to processing queries for both 
static and dynamic data.

Overall Architecture of WaterCopilot

This section outlines how the system components interact with 
one another and how each component contributes to delivering 
an integrated user experience at a high level (Figure 2).

1. WaterCopilot Web-Interface:

a. Purpose: This is the interface that interacts directly with
the user. It receives user queries and provides answers. It
acts as the user-facing component where the user inputs
their requests (queries) and views the responses. This
web interface facilitates conversational interaction
between the user and the Copilot system.

b. Function: It accepts the user’s query and sends it to the
Copilot Agent for processing. Once the agent retrieves
the necessary information from the appropriate plugin or
tool, the answer is displayed back to the user via this
interface.


TECHNICAL REPORT

CGIAR Initiative on  Digital Innovation  |  on.cgiar.org/digital 5

2. Copilot Agent:

a. Purpose: The core logic engine or orchestrator of the
system. It manages the flow of requests and determines
which plugins or tools should be utilized based on the
user's query.

b. Function: Upon receiving a user query from the
WaterCopilot Web-Interface, the Copilot Agent decides
whether to call the Document Retrieval Plugin or the
IWMI API Plugin. It then gathers the necessary
information and passes the response back to the web
interface.

3. Plugins:

a. Document Retrieval Plugin (iwmi-doc-plugin):

i. Purpose: This plugin is designed to interact with
Azure AI Search for querying static documents. It
facilitates the search and retrieval of relevant
documents stored in Azure based on the user's input.

ii. Function: When the Copilot Agent identifies that the
query requires static document retrieval, it delegates
this task to the Document Retrieval Plugin, which in
turn retrieves the needed information from Azure AI
Search.

b. IWMI API Plugin (iwmi-api-plugin):

i. Purpose: This plugin is developed to interact with
APIs that access databases and CSV files to provide
dynamic information and visual charts.

ii. Function: For user queries that require interaction
with live data from external APIs (e.g., water flow
data, visualizations), this plugin communicates with
these tools and returns the results to the Copilot Agent.

4. External Resources:

a. Azure AI Search: A cloud-based tool used for processing,
indexing and searching static documents. It provides
search capabilities that the Document Retrieval Plugin
utilizes.

b. External APIs: These include external APIs, which are
built to interact with databases and CSV-based data
sources, providing dynamic information such as real-
time environmental flow data.

c. Storage: This represents databases where dynamic data,
such as E-flow, water availability, and rainfall, are stored.

WaterCopilot Query Process for Static Documents

As shown in Figure 3, the query process for static documents 
in WaterCopilot ensures that user input is processed accurately 
and efficiently. It follows a series of steps to retrieve relevant 
documents from the indexed database. These steps include 
embedding the query, filtering metadata, interacting with the 
retrieval plugin, and presenting the most relevant results.

Figure 1. High-Level Overview of the WaterCopilot Agent Architecture and Its Integration with Plugins and Tools [Source: IWMI]


TECHNICAL REPORT

CGIAR Initiative on  Digital Innovation  |  on.cgiar.org/digital 6

1. User Input

The process initiates when a user submits a natural language
query through the WaterCopilot interface.

2. Embedding Process

The submitted query undergoes an embedding process,
where it is transformed into a vector representation. This
transformation utilizes the same embedding model applied
to the indexed static documents, ensuring that both the query
and documents reside in the same semantic space. This
alignment is critical for achieving accurate similarity
matching.

3. Metadata Filtering

Following the embedding process, the query vector
representation is combined with metadata filters proposed
by the agent LLM. These filters refine the search results by
applying specific criteria, such as document type and
relevance, to narrow down the scope of the retrieved
information. This step ensures that only the most applicable
static documents are considered in the retrieval process.

4. Retrieval Plugin Interaction

After filtering, the query is routed to the retrieval plugin.
This component interacts with the Azure index, leveraging
the query embedding and metadata filters to conduct a
search within the indexed static documents.

5. Result Retrieval and Presentation

The retrieval plugin identifies and returns the most relevant
snippets from the static documents that closely match the
user’s query. Through this workflow, the WaterCopilot
effectively presents contextually similar results, maximizing
response accuracy and relevance, and aligning with the
user’s intent.

WaterCopilot Query Process for Dynamic data

As shown in Figure 4 , WaterCopilot queries databases through 
an API call to retrieve real-time data. It follows a series of steps 
to process the user's request, ensuring accurate and relevant 
information is retrieved efficiently.

1. Query Analysis

The process begins when the user enters a query into
WaterCopilot, seeking specific information on topics like
water management, historical or forecasted data, or other
supported subjects related to LRB. WaterCopilot then uses a
Language Model (LLM) to analyze the query, interpreting
the user’s intent and identifying the specific data
requirements needed to respond accurately.

2. Determining the Tool and Extracting Parameters

After analyzing the query, the WaterCopilot system refers to
tool details, including each tool's description and argument
information, to identify the most appropriate tool for
handling the request. The LLM then extracts the necessary
parameters for the chosen tool from the query, such as
location, date, or data type, to ensure accurate data retrieval.

3. API Call Setup and Execution

With the extracted parameters, WaterCopilot calls the tool,
setting up the API request with the specific parameters
required. The tool then executes the API call.

4. Retrieving Data

After the API call is made, WaterCopilot fetches the relevant
information from multiple databases, including IWMI-DB,
INWARDS-DB, and FISHTRAC-DB. This step gathers data
from various sources, compiling the information needed to
respond to the user’s query. Once the data is retrieved,
WaterCopilot organizes it into a user-friendly response.
WaterCopilot then displays the answer to the user,
completing the process and delivering the requested
information.

Figure 3. High-Level Workflow of Copilot Querying Static 
Documents with Embedding and Retrieval Processes [Source: 
IWMI]

Figure 4. High-Level Workflow of Copilot Querying Databases 
via API Calls Using the IWMI API Plugin [Source: IWMI]


TECHNICAL REPORT

CGIAR Initiative on  Digital Innovation  |  on.cgiar.org/digital 7

Low-Level Design

This section outlines the workflow through which the 
WaterCopilot processes user prompts, identifies the 
appropriate tools, and determines when to invoke them. A 
plugin consists of related tools, each responsible for executing 
a specific task. Each tool is described in a way that allows the 
language model (LLM) to dynamically select and call the 
relevant tool based on the user's input. Tools also accept 
specific parameters to execute their tasks accurately. This 
workflow ensures seamless interaction between user inputs 
and the system, enabling the execution of tasks through 
integrated tools that deliver accurate and dynamic responses.

As shown in Figure 5, WaterCopilot’s tool calling architecture 
involves determining the appropriate tools based on the user’s 
query. The process includes retrieving a list of available tools, 
analyzing the query, and invoking the relevant tool to gather 
and present the required data. 

1. Tool Retrieval

Each time the user submits a query, the WaterCopilot calls a 
specific endpoint to retrieve a list of available tools along with 
their descriptions. This retrieval includes detailed metadata for 
each tool, outlining its capabilities, such as data retrieval, 
computations, or specific actions relevant to the LRB. The 
descriptions provide the WaterCopilot with information about 
what each tool does and specify the arguments that each tool 
accepts, ensuring the bot can utilize the available tools 

effectively.

2. User Input

Users interact with the WaterCopilot by providing input. This 
input could be a general inquiry or a specific request requiring 
one or more tools to process the information. The bot is 
designed to handle both types of queries seamlessly, directing 
the request either to its internal knowledge base or to the 
appropriate tool or function.

 The bot receives the user's input, which may request specific 
data or trigger an action. For example:

• General Inquiry: "Describe the countries in the Limpopo
basin?"

• Specific Data Request: "What is the current alert level for
e-flow?"

In both cases, the bot interprets the user's query, analyzing 
whether it can be answered directly or if external functions 
(such as APIs or data processing tools) are required to fulfill 
the request. This analysis is key to ensuring that the bot 
dynamically utilizes available resources to deliver accurate and 
timely responses.

3. Determining the Tool

Based on the user input, the bot evaluates whether the request 
can be handled using its built-in language processing 
capabilities or whether it requires the assistance of an external 

Figure 5. Detailed architecture of the tool calling mechanism in WaterCopilot, where the LLM calls the appropriate tool based on 
the user query and tool description. [Source: Serquex76.fr]


TECHNICAL REPORT

CGIAR Initiative on  Digital Innovation  |  on.cgiar.org/digital 8

tool or function. The bot compares the user input against the 
retrieved tool descriptions to determine if a tool is necessary.

• Query Analysis: The bot analyzes the user's query,
comparing it with the available tools and their capabilities
as defined by the tool descriptions.

• Tool Requirement Determination: If the query requires
specific data, calculations, or actions that correspond to
the functionality of one of the registered tools or
functions, the bot determines that the tool should be
invoked.

4. Invoking Tools

If the bot identifies that a tool or function is required, it 
proceeds to call the relevant tool with the necessary arguments 
(parameters) extracted from the user input. The bot 
communicates directly with the tool, sending any data required 
for the tool's operation.

• Parameter Passing: The bot extracts the relevant
arguments from the user input and passes them to the
identified tool or function.

• Tool Invocation: The bot invokes the registered tool and
waits for the response.

5. Receiving and Processing Tool Responses

Once the tool has been executed, the bot receives the response 
generated by the tool or function. The bot then processes and 
interprets this data, ensuring it is presented in a user-friendly 
and meaningful way for easy understanding. This may involve 
converting technical data into simpler terms, providing 
summaries, or creating visualizations to enhance clarity.

AAS AML Pipeline for Indexing Documents

As shown in Figure 6, WaterCopilot utilizes the AAS AML 
pipeline to generate an Azure index from the Limpopo PDF 
documents. This process involves extracting content, parsing 
and chunking the data, generating embeddings, and indexing 
the information for efficient retrieval.

1. Documents

The pipeline begins with source documents, such as PDF
files containing static information about the Limpopo basin.
These documents could include research reports, policy
documents, or environmental data.

2. Extraction and Parsing

a. Extracting: The first stage involves extracting raw content
from documents, including text, tables, and other
elements. This process is conducted using Azure
Document Intelligence services, which enable the
efficient extraction of both structured and unstructured
data.

b. Parsing: The extracted content is parsed to identify its
structure and categorize different types of information.
For example, paragraphs, headers, or tables are separated
and classified.

3. Chunking

a. Purpose: Large documents are divided into smaller,
manageable pieces or "chunks." This is crucial for
ensuring that each chunk is small enough for efficient
processing, especially during the embedding and
querying phases.

b. Benefits: Chunking improves the efficiency of the
indexing process by handling content at a granular level,
enabling more precise search results.

4. Embedding

a. Purpose: Each chunk is converted into a vector
representation (embedding) using a pre-trained deep
learning model. These embeddings capture the semantic
meaning of the text, allowing for more accurate search
and retrieval.

b. Method: Pre-trained models are used to create high-
dimensional vectors for each chunk, enabling efficient
comparisons and search operations.

Figure 6. Pipeline Architecture for Generating an Azure Index from Limpopo PDF Documents [Source: IWMI]


TECHNICAL REPORT

CGIAR Initiative on  Digital Innovation  |  on.cgiar.org/digital 9

5. Metadata Generation

a. Purpose: Metadata is created for each chunk. Metadata
might include the document’s title, author, date,
geographical scope and other relevant attributes.

b. Importance: Metadata enriches the indexing process,
providing additional context that improves the relevance
of search results.

6. Indexing

a. Purpose: The final step involves indexing the processed
chunks and their embeddings into a vector-based search
index.

b. Tool: Azure AI Search Vector Database is used for
indexing, ensuring the content is searchable based on its
semantic meaning and metadata.

7. Searchable Azure AI Index

The indexed documents are now stored in a format that
supports fast and accurate searches using Azure's AI-
powered search capabilities. This enables users to query the
Limpopo basin documents by meaning, not just keywords,
allowing for more relevant search results.

SYSTEM IMPLEMENTATION

This section provides an in-depth overview of the plugin 
architecture within WaterCopilot and highlights its essential 
role in enhancing the bot’s functionality. The use of plugins is 
foundational, enabling modular backend development that 
allows the integration of various data sources and specialized 
tools. This approach not only extends the bot’s capabilities but 
also provides flexibility to add or update components 
independently.

Two primary plugins have been developed for WaterCopilot, 
each serving a unique purpose and designed to provide specific 
data and services. Each plugin operates as a backend module, 
pulling relevant data and processing it as required. Detailed 
descriptions of each plugin, including key code sections and 
functionalities, are included, with full code accessible in their 
respective GitHub repositories.

Following the plugin development, the WaterCopilot 
application itself acts as the user interface and core 
orchestrator. It bridges user interactions with the backend 
plugins, managing the flow of data and determining which 
tools to invoke based on the user’s query. This orchestration 
ensures a cohesive and responsive user experience by 
dynamically engaging with the relevant plugins to gather and 
present information in real-time. Together, these components 
create a seamless interaction between users and the complex 
data ecosystem integrated within WaterCopilot.

Plugin Development

The development of plugins represents a fundamental aspect of 
the WaterCopilot, facilitating the integration of diverse data 
sources and enhancing the Bot's overall functionality. At its 
core, a plugin can be envisioned as a service that houses a 
collection of specialized tools. Each of these tools is designed 
to perform specific functions essential for data processing and 
interaction within the application.

Structure and Functionality of Plugins

Each plugin is implemented as a standalone Python file, within 
which multiple tools are defined. This structure allows 
developers to encapsulate related functionalities and establish 
clear boundaries for tool operations. Each tool is accompanied 
by a detailed description that articulates its purpose and 
operational parameters, including argument requirements and 
the expected types of input. This design is not for 
documentation purposes; it plays a critical role in how the 
underlying LLM model, specifically the GPT-4, interprets and 
utilizes the tools.

When a user poses a question to the WaterCopilot, the GPT-4 
model engages in an analytical process. It examines the 
descriptions of the available tools and their arguments to 
determine which tool is best suited to address the user's 
inquiry. This mechanism ensures that users receive accurate 
and contextually relevant responses, as the model is equipped 
to make informed decisions based on the structured metadata 
provided by the developers.

Basic Template for Creating a New Plugin

As shown in Figure 7, WaterCopilot uses a basic template for 
creating tools with the Industry Plugin Devkit Library. This 
template helps developers define the new tools and set up the 
necessary functions for seamless integration and functionality 
within WaterCopilot.

Steps to Implement Your Plugin

1. Define Each Tool’s Purpose and Functionality

a. Use the @tool decorator to designate a function as a tool.

b. Provide a brief description in quotes after @tool to
explain what the tool does. This helps the bot determine
when and how to use this tool based on user queries.

2. Set Up Arguments with Annotations

a. Use Annotated type hints for each argument to specify its
type and purpose. This makes it clear what type of input
the tool expects.

b. Clearly describe each argument’s role, acceptable values,
and data type, so the bot can interpret inputs accurately
when invoking the tool.


TECHNICAL REPORT

CGIAR Initiative on  Digital Innovation  |  on.cgiar.org/digital 10

3. Implement Tool Logic

a. Inside the function, write the code to retrieve or process
data (e.g., query a database, make an API call).

b. Ensure that the return value is formatted clearly and
concisely, providing a user-friendly response.

4. Testing and Deployment

a. Run thorough tests to verify that each tool functions
correctly with expected inputs and gracefully handles any
potential errors.

b. Once validated, register and deploy the plugin in the
WaterCopilot environment. This will allow the bot to
access and use the new tool seamlessly.

Tool Selection and User Interaction

The process of selecting tools based on descriptions and 
argument specifications is integral to the functionality of the 
WaterCopilot. The clarity of the tool descriptions allows the 
GPT model to map user queries to the appropriate tools, 
streamlining the interaction process. For instance, if a user 
inquiry about water availability data, the model can quickly 
identify the specific tool responsible for fetching that data 
based on the descriptions and argument requirements defined 
in the plugins.

This intelligent mapping significantly enhances the user 
experience by minimizing the need for users to navigate 
complex workflows. Instead, they can engage with the  
WaterCopilot using natural language, posing questions in a 
straightforward manner. The underlying model takes care of 
interpreting these queries, determining which tool to call, and 

executing the necessary operations to provide the user with 
relevant insights.

Ease of Tool Addition and Extensibility

An essential advantage of this plugin architecture is the ease 
with which developers can add new tools. Using tool 
annotations, developers can provide essential metadata that 
enriches the model's understanding of each tool's capabilities. 
This annotation process simplifies the integration of new 
functionalities, allowing developers to respond quickly to 
emerging data sources or user needs.

For example, if a new data source becomes available that 
relates to environmental metrics, a developer can create a new 
tool within an existing plugin or introduce a new plugin 
altogether, complete with annotated descriptions and argument 
details. This extensibility is vital for keeping the WaterCopilot 
relevant and responsive, as it allows for ongoing enhancements 
without overhauling the existing infrastructure.

Developed Plugins Overview

 During the development of the WaterCopilot, two key plugins 
were implemented using the Industry Plugin Devkit library. 
Industry Plugin Devkit, The Devkit is a comprehensive library 
that simplifies plugin development, providing essential 
templates, tools, and functionalities to create, build, and deploy 
plugins seamlessly. While currently a private library, it is 
expected to be published soon by Microsoft Research, making 
it accessible to a wider audience. This library was pivotal in 
building efficient and modular plugins, enabling easy 
management and integration of tools.

The two plugins developed are: iwmi-doc-plugin and iwmi-

Figure 7. Code Template for Developing Plugins with the Industry Plugin Devkit Library


TECHNICAL REPORT

CGIAR Initiative on  Digital Innovation  |  on.cgiar.org/digital 11

api-plugin. Each serves a distinct purpose, allowing the 
WaterCopilot to handle different types of data with precision. 
By utilizing the Industry Plugin Devkit library, these plugins 
are not only robust but also flexible, supporting the addition of 
new tools and functionalities with minimal effort.

• iwmi-doc-plugin: This plugin is designed to interact with
static documents, allowing the WaterCopilot to query data
from an Azure AI Search index and extract information
from PDFs. The iwmi-doc-plugin enables users to
seamlessly retrieve the most relevant data stored in index
format in Azure AI Services.

• iwmi-api-plugin: This plugin is designed to integrate
with external APIs and query MySQL databases. It
facilitates the WaterCopilot's ability to retrieve and
process dynamic data, such as Rainfall analysis, Eflow
alerts, Water availability, and SWAT information. The
iwmi-api-plugin ensures smooth interaction with external
data sources, providing accurate and up-to-date responses
to user queries.

IWMI-DOC-PLUGIN OVERVIEW

The iwmi-doc-plugin is a key component of the WaterCopilot, 
specifically designed to interact with static documents indexed 
in Azure AI Search. This plugin enhances the Copilot's ability 
to provide precise and relevant information by enabling 
semantic search capabilities, allowing users to query or 
retrieve information from documents related to the LRB using 
natural language. 

The full code for the iwmi-doc-plugin is available on GitHub. 
Developers can easily clone and use this plugin.

GitHub Repository: iwmi-doc-plugin

Key Features of the iwmi-doc-plugin

1. Semantic Search with Vector Embeddings:

• Unlike traditional keyword-based search, the plugin
utilizes a Sentence Transformer model to convert the
user’s query into a semantic vector. This approach allows
the plugin to understand the context and meaning behind
the query, rather than just matching keywords.

• The model used is sentence-transformers/all-mpnet-base-
v2, a state-of-the-art embedding model that effectively
captures the semantic essence of the query. This ensures
that even if the user’s query is phrased differently from
the content in the documents, relevant results will still be
returned.

2. Filter-Based Search:

• To refine searches, the plugin allows the use of filters
based on document metadata. Users can specify
categories, keywords, or other attributes to narrow down

their search results.

• This is particularly useful when the dataset is extensive,
and users need to locate specific pieces of information
quickly. The ability to filter by metadata ensures that users
get more focused and precise responses.

3. Ease of Integration and Expansion:

• Built using the Industry Plugin DevKit library, the plugin
is inherently modular. This means that developers can
easily add new tools and functionalities without needing
to overhaul the existing structure.

• The template-based approach provided by Industry
Plugin Devkit simplifies the process of adding new data
sources or modifying existing ones, enabling rapid
development and deployment cycles.

Explanation of the Code

1. Library Imports and Constants:

• The plugin begins by importing essential libraries for
handling Azure AI Search, embedding models, and plugin
development.

• Constants such as AAS_END_POINT, AAS_API_KEY,
and AAS_INDEX_NAME are defined to set up the
connection parameters for Azure AI Search. These
constants ensure that the plugin can authenticate and
interact with the correct index on Azure.

2. Sentence Embedding Model Setup:

• One of the standout features of the iwmi-doc-plugin is its
use of semantic embeddings for search. By utilizing the
sentence-transformers/all-mpnet-base-v2 model, the
plugin converts natural language queries into numerical
vectors.

• This conversion process allows the plugin to perform a
more nuanced search, retrieving content that is
semantically related to the query, even if the words used
are different.

3. Filter Formatting (format filter Function):

• This function processes the filters provided by the user
and converts them into a format that Azure AI Search can
understand.

• It ignores filters with None values, ensuring that only
relevant filters are applied. This flexibility allows users to
customize their search experience, specifying criteria
such as document type, category, or source.

4. AasIndex Class: Core Search Functionality:

• The AasIndex class is central to the plugin’s functionality.
It initializes a connection to the Azure Search service
using the provided endpoint and API key.

• Initialization: The class sets up a SearchClient instance,
which handles all interactions with Azure Search. This

http://iwmi-doc-plugin


TECHNICAL REPORT

CGIAR Initiative on  Digital Innovation  |  on.cgiar.org/digital 12

client is responsible for sending search queries and 
retrieving results.

• Filter Field Retrieval: The filter fields property
dynamically retrieves available fields for filtering. This is
done by inspecting the metadata in the Azure Search
index, ensuring that only valid fields are offered as filters.

• Search Execution: The search method takes an
embedding of the query and optional filters, constructs a
VectorizedQuery, and retrieves matching results. By
using vector embeddings, the plugin can perform
semantic searches, looking beyond mere keyword
matches.

5. Building Filter Type (build_filter_type):

• To allow users to apply filters easily, the build_filter_type
method creates a pydantic model that dynamically
includes the possible filter fields.

• This model is built based on the filterable fields identified
in the Azure index, ensuring that users can select or omit
filters without encountering errors.

6. Defining the query_aas Tool:

• Purpose: The query_aas function is the main interface
exposed by the plugin. It allows users to input a query and
apply filters, retrieving relevant information from the
indexed documents.

• Parameters:

• query: The user’s natural language query, which will be
converted into an embedding.

• filters: Optional filters that allow users to narrow down
the search results. Filters can be set to null if not
needed.

• Execution: The function encodes the query into a vector,
calls the search method from the AasIndex class, and
retrieves relevant snippets. Each snippet is returned with
its content and source, making it easy for the
WaterCopilot to display useful information.

Future Enhancements and Scalability

The architecture of the iwmi-doc-plugin is designed with 
scalability in mind, ensuring it can adapt to evolving user 
needs. Future enhancements may include the integration of 
additional data sources, allowing the plugin to be configured 
with new indices as more documents become available, 
thereby expanding the range of accessible information about 
LRB. Enhanced filtering options could also be introduced, 
enabling users to refine their search results more effectively 
and locate specific information with greater precision.

IWMI-API-PLUGIN OVERVIEW

The iwmi-api-plugin is designed to seamlessly integrate with 
multiple APIs, enabling the WaterCopilot to fetch and process 
real-time data. As shown in Figure 8, the iwmi-api-plugin 
serves as an essential component of the WaterCopilot, 
connecting it to various external APIs. Each tool within this 
plugin is specialized to handle a specific data request, 
facilitating real-time information retrieval from different 
databases.

The tools are designed to process specific user queries by 
communicating with these databases via APIs. Each tool's 
description and argument details are carefully defined, 
allowing the LLM model to understand their functionalities 
and decide which tool to call based on the user’s query.

The full code for the iwmi-api-plugin is available on GitHub. 
Developers can easily clone and use this plugin.

GitHub Repository: iwmi-api-plugin

1. Key Functionalities of the iwmi-api-pluginAuthentication:

The plugin begins its operation by retrieving an access token 
required for all API interactions. It sends a login request with 
predefined credentials (username and password) to the 
authentication endpoint. Upon successful login, the API 
responds with an access token, which is crucial for 
authenticating subsequent API requests. This process ensures 
that only authorized users can access sensitive data and 
functionalities, maintaining the security of the system.

2. Eflow Data Retrieval:

The plugin offers tools to interact with environmental flow 
data, allowing users to:

• List Metadata: Retrieve a comprehensive list of available
environmental flows, including river codes and
descriptive details. This metadata is essential for users to
understand which Eflow sites are currently accessible to
the WaterCopilot.

• Fetch Eflow Values: Retrieve specific Eflow values based
on user-defined parameters, such as river code and date
range. The plugin processes this information to provide
insights into flow rates, water availability, and
environmental conditions.

3. Rainfall Data Retrieval:

Rainfall data management is facilitated through several tools:

• Rainfall Station Information: Users can obtain
information about various rainfall stations, including their
geographic locations (latitude and longitude). This data is
crucial for analyzing rainfall patterns at each specific
station.

• Nearest Rainfall Station Identification: Based on
geographic coordinates provided by the user, the plugin


TECHNICAL REPORT

CGIAR Initiative on  Digital Innovation  |  on.cgiar.org/digital 13

can identify the closest rainfall station. This functionality 
enhances the user's ability to retrieve relevant rainfall data 
specific to their area of interest.

4. Alerts for Eflow Areas:

The plugin enables users to query alerts related to Eflow sites 
based on specific dates. It provides valuable insights into the 
health of river systems, including any alerts or warnings issued 
for environmental conditions. The alerts may indicate various 
status levels (e.g., healthy, warning, critical) and are crucial for 
stakeholders to take timely action to manage water resources 
effectively.

5. Data Visualization:

To enhance user engagement and comprehension, the plugin 
supports generating URLs for visual representations of Eflow 
charts and rainfall data. Users can visualize trends, patterns, 
and forecasts through graphical analyses, which are integrated 
into the WaterCopilot. This functionality aids in interpreting 
complex data in a more accessible format, enabling better 
decision-making.

6. Reservoir Information:

A dedicated tool retrieves detailed information about 
reservoirs or dams located within the Limpopo Basin. This tool 
provides users with comprehensive data on water storage 
facilities, including their capacities, current water levels, and 
operational statuses. Understanding reservoir conditions is 
vital for effective water management and planning.

Explanation of the Code

In this plugin, each tool is designed to be similar and accepts 
these kinds of parameters: the river code, the name of the 
channel, or the station name, along with a start and end date for 
the data range. By using these appropriate parameters, the tools 
set up the API call and then make the call. The API responds 
with the relevant data, which is returned to the WaterCopilot 
for user interaction. Therefore, all the tools within this plugin 
are responsible for making the API call and retrieving the 
relevant data.

The structure across all tools in this plugin is consistent, 
making it straightforward for developers to add new tools. To 
integrate a new tool, developers simply need to adjust the 

Figure 8. Comprehensive Overview of API Integrations in WaterCopilot's Development [Source: IWMI]


TECHNICAL REPORT

CGIAR Initiative on  Digital Innovation  |  on.cgiar.org/digital 14

specific parameters and provide an appropriate description, 
ensuring that this plugin remains highly reusable and 
modifiable for future enhancements.

Future Enhancements and Scalability

The iwmi-api-plugin is designed with a scalable and adaptable 
architecture, enabling continuous improvement and expanded 
functionality to meet evolving user requirements. One key 
enhancement is the seamless integration of additional APIs, 
which will allow the WaterCopilot to retrieve new dynamic 
data related to LRB, ensuring more specialized and relevant 
data access. Furthermore, future updates will introduce 
advanced visualization tools capable of generating dynamic 
charts, graphs, and other graphical representations directly 
from API data. These tools will provide users with intuitive 
and actionable insights, enhancing the overall user experience 
and decision-making process regarding LRB.

WATERCOPILOT DEVELOPMENT

WaterCopilot is an interactive Copilot application built with 
Streamlit as the frontend framework, integrating with Azure 
OpenAI for natural language processing. This Copilot assists 
users by providing insights and data related to the Limpopo 
River Basin (LRB), including rainfall data, environmental flow 
(eflow), alerts, and relevant document information.

WaterCopilot serves as the central logic orchestrator, 
determining and invoking the appropriate tools to retrieve 
information based on user queries. It interacts with integrated 
plugins to access various data sources, enhancing response 
accuracy and relevance. This architecture enables the Copilot 
to deliver a seamless and interactive user experience, 
efficiently pulling information through plugins and presenting 
it to users in real time.

The complete code for WaterCopilot is available on GitHub. 
To set it up, clone the repository, read the instructions, and 
modify the environment variables as needed. This allows 
developers to run the Copilot easily.

GitHub Repository: WaterCopilot

Architecture and Workflow

1. User Interface Design with Streamlit

• The application uses Streamlit to create a clean,
responsive user interface. Streamlit's ability to quickly
render elements like buttons, input fields, and
visualizations makes it an ideal choice for rapid
prototyping and deployment.

• A sidebar is integrated to guide users on the
functionalities available, such as "Rainfall Insights,"
"Eflow Alerts," "Limpopo Library," and "Eflow
Analysis," with a brief description of each option to assist
in navigation.

• Streamlit's flexible layout is utilized to style buttons,
headers, and interactions, providing a user-friendly and
visually appealing experience. Custom CSS is also
embedded to enhance the appearance of buttons and
overall layout.

2. Prompt-Based Interaction

• The Copilot operates using a prompt-based approach
where users initiate conversations by selecting predefined
queries or entering their own.

• Each user query is processed, and the bot analyzes the
intent to determine which tools or plugins to invoke. The
prompts are dynamically set, ensuring that the Copilot can
provide targeted responses based on user needs.

• The system uses a System Prompt that instructs the
assistant on how to handle user queries, ensuring
consistent and context-aware responses.

3. Azure OpenAI Integration

• Azure OpenAI powers the bot's natural language
understanding and processing capabilities. When users 
ask questions, the Copilot interprets these inputs through 
Azure OpenAI’s models (e.g., GPT-4).

• The Copilot is designed to not just respond with text, but
also decide if any tools (like for data retrieval or
visualization) need to be invoked based on the query. This
integration with Azure OpenAI enhances the Copilot’s
ability to handle complex, multi-step requests efficiently.

4. Plugin Integration and Tool Management

• The bot uses a modular system that connects to various
external plugins. Each plugin serves as a set of tools
specialized for specific data retrieval and analysis tasks,
such as accessing eflow data, querying rainfall
information, or fetching documents.

• The get_plugin_specs function dynamically loads these
tool specifications from the plugins’ OpenAPI
specifications, ensuring seamless integration and
scalability.

• The bot effectively uses the tool selection mechanism to
determine which external tool should be called, based on
user intent. This design ensures that only relevant plugins
are engaged, improving response efficiency.

5. Real-Time Data Handling

• The Copilot’s ability to handle real-time data makes it
highly effective for environmental monitoring. Users can
ask for current alerts, rainfall statistics, and environmental
flow data.

• The bot fetches data directly from APIs associated with
external plugins. For instance, if a user asks about current
water levels or eflow alerts, the bot calls specific plugins
designed to fetch this real-time data and deliver it in a

http://WaterCopilot


TECHNICAL REPORT

CGIAR Initiative on  Digital Innovation  |  on.cgiar.org/digital 15

user-friendly format, including charts and tabular data.

6. Visual Representation of Data

• The bot integrates visual elements directly into the
responses to enhance user comprehension. If a query is
related to data that benefits from graphical representation
(e.g., trends over time, forecasts), the bot generates URLs
for visualizations, which are rendered within the Streamlit
interface.

• Tools like visual_alert_api are used to provide graphical
representations of alerts, enhancing the way information
is consumed by users.

7. Efficient Conversation Management

• The bot maintains the conversation context using
Streamlit's session state, ensuring that interactions are
cohesive and context-aware. It keeps track of past
interactions, enabling users to continue conversations
without losing context.

• A function named truncate_messages is used to limit the
number of tokens processed by the model. This ensures
that even if a conversation becomes long, it remains
within the model's token limits by removing the oldest
messages while preserving the essential system prompt.

Development Features

1. Customizable System and User Prompts

• The Copilot begins each session with a defined system
prompt, which guides how it should interpret and respond
to user queries. This allows for flexibility in adjusting the
bot’s behavior based on specific user interactions or
datasets.

• Users are given a range of predefined query options or can
input their own questions, ensuring that the interaction
feels intuitive and guided without being overly restrictive.

2. Dynamic Plugin Loading

• The bot supports dynamic loading of tools through
plugins. This design allows for easy addition or removal
of functionalities. For example, if a new dataset or
analysis feature is needed, a corresponding plugin can be
integrated without altering the core Copilot logic.

• The get_tool_spec function allows the Copilot to
understand what each tool can do, ensuring accurate and
efficient tool selection during user interactions.

3. Tool Call Management

• When a user query requires data fetching or processing,
the bot handles the call through the call_tool and call_
tools_if_necessary functions. These ensure that external
tools are correctly invoked, and responses are fetched
seamlessly.

• The bot also manages the responses from tools to ensure
they are correctly formatted and interpreted, converting
them into user-friendly responses.

4. Error Handling and Feedback

• The bot has built-in error handling to manage cases where
external API calls fail or return unexpected results. When
an error occurs, a simple message is displayed to inform
the user, ensuring that the conversation remains smooth
and understandable.

• Continuous improvement is encouraged through feedback
mechanisms. Users can provide insights or ask follow-up
questions based on the responses, allowing the bot to
refine its answers and guidance.

User Experience Enhancements

1. Welcome Interface and Sidebar Guide

• Users are greeted with a welcome interface that
introduces the bot's capabilities. The sidebar provides
detailed descriptions of each feature, helping users
understand what they can ask and how to get the most out
of the bot.

• There are also direct links to additional resources, such as
the project website, ensuring users can explore more
information if needed.

2. Streamlined Conversation Flow

• The Copilot's design emphasizes smooth conversation
flow, enabling users to ask follow-up questions or switch
between different query types seamlessly. This is
especially useful for users who may need to gather
information from multiple data points or analyze various
aspects of the LRB data.

3. Interactive Elements and Styles

• Buttons, dropdowns, and interactive prompts make the
user experience engaging. For example, users can easily
switch between different data queries, request
visualizations, or drill down into more specific data
details, thanks to the intuitive interface elements designed
in Streamlit.

• Custom CSS is applied to enhance the visual appeal,
ensuring that the interface is not just functional but also
inviting and easy to navigate.

Scalability and Future Expansion

• The modular design of the WaterCopilot ensures it is
highly scalable. By utilizing a plugin-based architecture,
new data sources or analytical tools can be added without
needing significant changes to the existing codebase.

• Future enhancements could include more advanced
analytics, integration with additional APIs for broader
data coverage, and improvements in natural language


TECHNICAL REPORT

CGIAR Initiative on  Digital Innovation  |  on.cgiar.org/digital 16

understanding to provide even more accurate and context-
aware responses.

Deployment 

As shown in Figure 9, the deployment of WaterCopilot
utilizes AWS cloud infrastructure to ensure scalability and 
reliability. Docker containers are employed to package each 
component, including the core application and plugins, 
enabling consistent performance across different 
environments. The architecture also supports modular 
deployment, making it easier to scale and update individual 
components independently.

Deployment of Plugins Using Docker on AWS EC2

Each plugin of WaterCopilot is packaged and deployed as a 
Docker container on AWS EC2. This containerization 
approach ensures that all dependencies and configurations are 
encapsulated within each container, enabling the plugins to run 
consistently across different environments. Deploying Docker 
containers on EC2 also supports flexibility and scalability, 
allowing each plugin to operate independently.

Port Allocation: Each plugin is assigned a specific port on 
EC2 to avoid conflicts and ensure smooth communication:

Iwmi-doc-plugin: Deployed on port 7000

Iwmi-api-plugin: Deployed on port 7050

This clear separation of ports allows WaterCopilot to interact 
with each plugin via defined endpoints, streamlining access to 
data and services provided by each plugin. By running these 
plugins independently on EC2, the architecture promotes 
modularity, making it easier to add new plugins or update 
existing ones without disrupting the entire system.

Deployment of WaterCopilot Application on AWS EC2 
Managed with Supervisor

The core WaterCopilot, which serves as the central interface 
for interacting with the Copilot and its features, is deployed on 
an AWS EC2 instance. Supervisor, a process control system, is 
used to manage the application, ensuring that it runs 
continuously and is automatically restarted if it crashes. This 
setup guarantees high availability and reliability for users 
accessing the WaterCopilot.

The WaterCopilot is specifically deployed on port 8501, 
making it easily accessible to users through a web browser. 
The EC2 instance hosting the WaterCopilot also runs the 
Docker containers for the plugins, ensuring all components are 
centralized on a single virtual machine. This co-location 
reduces latency in communication between the Copilot and the 
plugins, leading to faster response times. Additionally, having 
all elements on the same EC2 instance simplifies monitoring 
and maintenance, as administrators can manage the entire 
system from a single access point.

Scalability Considerations

The deployment architecture of WaterCopilot is designed with 
scalability in mind. The use of Docker containers allows for 
rapid scaling of each plugin, as instances can be easily 
replicated and distributed across multiple servers if needed. 
This flexibility ensures that the system can handle increased 
loads by simply adding more container instances without 
requiring significant changes to the infrastructure.

For the WaterCopilot deployed on the EC2 instance, horizontal 
scaling can be achieved by adding more EC2 instances and 
setting up a load balancer to distribute incoming traffic. This 

Figure 9. WaterCopilot Deployment Architecture and Integration within the AWS Cloud Environment [Source: IWMI]


TECHNICAL REPORT

CGIAR Initiative on  Digital Innovation  |  on.cgiar.org/digital 17

ensures that as the user base grows, the system can scale 
efficiently to accommodate more concurrent users without 
sacrificing performance.

Benefits of the Deployment Strategy

• Modularity: Independent deployment of plugins means
each component can be developed, tested, and updated
separately. This modularity supports a more flexible and
maintainable system.

• Reliability: The use of Supervisor to manage the core
application on EC2 ensures that the WaterCopilot remains
available, with automatic restarts in case of failure.

• Scalability: Docker containers and AWS infrastructure
allow for easy scaling of individual components to handle
increased demand making the system robust and
adaptable to growing user needs.

• Ease of Maintenance: Centralizing the deployment on a
single EC2 instance simplifies the management of the
system, while Docker ensures consistent environments
across different stages of development and production.

Overall, the deployment and scalability strategy of 
WaterCopilot leverages modern cloud technologies and best 
practices to deliver a reliable, scalable, and easy-to-manage 
system that can grow alongside user demands.

Key Features of WaterCopilot

WaterCopilot is designed to provide an intuitive and robust 
user experience, offering a range of features that enhance 
interaction and deliver comprehensive information about the 
LRB. Below is a detailed overview of its key functionalities:

Interactive Data Retrieval

WaterCopilot excels in real-time data integration by 
seamlessly connecting to multiple databases through APIs, 
fetching the most current information on topics such as 
rainfall, river flow, e-flow management, and water availability. 
This integration allows it to merge data from various datasets, 
ensuring that users receive comprehensive and up-to-date 
insights for effective decision-making. Additionally, the bot 
can access static documents related to the LRB, which are 
indexed through a robust pipeline.  By combining dynamic 
real-time data with static documents, WaterCopilot provides a 
well-rounded perspective on the basin, supporting users with 
both immediate insights and contextual background 
information.

User-Friendly Interface

The design of WaterCopilot prioritizes ease of use, with an 
interface that is both intuitive and accessible. Upon initiating 
interaction with the bot, users are presented with a set of 
predefined topics to explore, such as "Rainfall Queries," "E-
Flow Alerts," or "Limpopo Library", as shown in Figure 10. 
This approach reduces the uncertainty of knowing how to start, 
guiding users smoothly into the interaction by clearly outlining 
what types of questions can be asked. By offering these initial 
options, the bot ensures that even new users can easily navigate 
the system, understand its capabilities, and begin exploring 
relevant data and insights without any steep learning curve. 
This user-centric design makes the bot an effective tool for 
both novice and experienced users.

Figure 10 Overview of initial user options in WaterCopilot, designed to facilitate intuitive navigation and ease of use. [Source: 
IWMI]


TECHNICAL REPORT

CGIAR Initiative on  Digital Innovation  |  on.cgiar.org/digital 18

Guided Interaction for Queries

One of the standout features of WaterCopilot, as shown in 
Figure 11, is its ability to simplify complex queries through 
guided interaction. When users begin exploring a topic, 
Copilot prompts them step by step, ensuring that all necessary 
parameters are specified for an accurate response. This 
structured approach helps users provide relevant information, 
leading to more precise and effective results. For example, if a 
user inquires about rainfall data, the bot might ask for a 
location, time period, and any additional details that could 
refine the search. If users are unsure about what inputs are 
needed, the Copilot provides helpful examples and 
suggestions, making the process smoother and less 
intimidating. This structured approach is especially useful for 
multi-part queries, where the bot walks users through each 
step, ensuring that no critical details are overlooked. By 
guiding users through the process, WaterCopilot helps them 
formulate effective questions, leading to more precise and 
relevant answers, thereby saving time and enhancing their 
understanding of the information they seek.

Reference Provision

Transparency and reliability are at the core of WaterCopilot’s 
design. For every answer provided, WaterCopilot includes 
detailed references to the sources used, as shown in Figure 12. 
This ensures that users can verify the information and trust its 
accuracy, allowing them to trace back to the original data and 
sources for further validation. When information is retrieved 
from static documents, the bot specifies which document was 

referenced, highlighting the exact snapshots or sections from 
which the data was drawn. Similarly, for real-time data fetched 
from APIs, the bot provides real-time references, ensuring that 
the information reflects the latest updates. This level of 
transparency not only builds trust but also facilitates further 
research by enabling users to trace back to the original source 
material. Users can rely on these detailed references to ensure 
the credibility of the information presented, making 
WaterCopilot a dependable tool for research and decision-
making.

Memory State

WaterCopilot’s memory feature significantly enhances the user 
experience by remembering past interactions and maintaining 
context throughout the conversation. This means that users 
don’t have to repeat information, as the bot can recall previous 
queries, preferences, and details, streamlining future 
interactions. For instance, if a user has previously asked about 
rainfall data for a specific region, the bot can automatically 
reference that context in follow-up queries, saving time and 
effort. Moreover, this feature enables the bot to offer 
personalized suggestions based on a user’s interaction history, 
making it easier to access relevant data quickly. By retaining 
important information, WaterCopilot ensures that 
conversations are more seamless, efficient, and tailored to each 
user’s needs, enhancing overall engagement.

Figure 11 Guide on how the Copilot assists users by providing detailed parameter information. [Source: IWMI]


TECHNICAL REPORT

CGIAR Initiative on  Digital Innovation  |  on.cgiar.org/digital 19

Multilingual Support

Recognizing the diverse linguistic backgrounds of its users, 
WaterCopilot offers robust multilingual support. Users can 
interact with the bot in multiple languages, including English, 
French, Portuguese, and others, ensuring that language is not a 
barrier to accessing vital information. Whether a user asks a 
question in English or another supported language, the bot can 
process the query accurately and provide responses in the same 
language, ensuring a smooth and accessible experience. This 
feature is particularly important in regions like the Limpopo 
Basin, where stakeholders and users may speak different 
languages. By enabling communication across multiple 
languages, WaterCopilot broadens its reach and inclusivity, 
making it a valuable tool for a wider audience.

Summary and Insights

WaterCopilot is equipped with the capability to generate 
concise summaries and insights, making it easier for users to 
digest complex information. When users ask a question, the 
bot doesn’t just provide raw data; it processes and combines 
relevant information from various sources to understand the 
overall context. By analyzing the data, the bot extracts key 
points and presents them in a clear, simplified summary, 
making complex topics easier to grasp. Additionally, the bot 
can identify patterns and highlight important details that might 
otherwise be overlooked, providing deeper insights into the 
subject matter. This feature ensures that users receive a 
comprehensive yet easy-to-understand response, enabling 
them to make informed decisions without being overwhelmed 
by too much data.

Calculation Capability

WaterCopilot, as shown in Figure 13, is designed to handle a 
variety of calculations, adding significant value for users 
involved in water management and environmental 
assessments. Whether users need to calculate the average 
rainfall over a specific period, compare precipitation or e-flow 
values across different regions, or sort data for in-depth 
analysis, the bot can efficiently perform these tasks. This 
calculation capability provides accurate and quick metrics, 
supporting data-driven decision-making. By seamlessly 
integrating these computations into its responses, 
WaterCopilot helps users analyze data effectively, identify 
trends, and gain deeper insights, all within a single, interactive 
platform. This makes it a powerful tool for users who need 
precise data analysis.

Graphical Representations

To make data more accessible and easier to interpret, 
WaterCopilot offers advanced visualization capabilities. As 
shown in Figure 14, the Copilot can generate tailored charts 
and graphs by interacting with APIs that access data from the 
Limpopo Basin. These visual representations help users see 
patterns, trends, and comparisons at a glance, making complex 
information more digestible. Whether a user needs to visualize 
rainfall patterns over a period, analyze e-flow statistics, or 
compare water levels across different sites, the bot provides 
relevant graphs that simplify the information. By offering these 
visualizations, WaterCopilot helps users quickly grasp critical 
insights and make informed decisions. This feature is 
especially valuable for those who prefer visual data analysis 
over text-based data interpretation.

Figure 12. Evidence or source references are provided by the Copilot when answering user queries. [Source: IWMI]


TECHNICAL REPORT

CGIAR Initiative on  Digital Innovation  |  on.cgiar.org/digital 20

Integration with the Digital Twin Platform

WaterCopilot is designed to interact with the main Digital 
Twin platform, enhancing its functionality by providing direct 
links and suggestions based on the ongoing conversation. 
Although still in its initial phase, this integration allows the bot 
to suggest relevant sections or features within the Digital Twin 
portal, guiding users to explore more detailed information 
directly on the platform. For example, if a user asks about 
water availability data, the bot might offer a link to the Digital 

Twin portal's section where users can view in-depth data 
visualizations or access further analytical tools. This 
interaction not only enhances the user's experience by 
providing a seamless transition between the Copilot and the 
platform but also ensures that users can delve deeper into data 
and insights when needed. As this feature continues to develop, 
it promises to create a more integrated and comprehensive 
ecosystem for accessing water management information.

Figure 13. Copilot's ability to perform calculations, including basic arithmetic and complex data computations. [Source: IWMI]

Figure 14. Copilot's ability to generate visual representations of data, such as rainfall charts. [Source: IWMI]


TECHNICAL REPORT

CGIAR Initiative on  Digital Innovation  |  on.cgiar.org/digital 21

FUTURE WORK

As WaterCopilot continues to evolve, several key 
enhancements are planned to improve its functionality, user 
experience, and overall performance. The following outlines 
the primary areas of focus for future development:

Expansion of Static Data Sources

One of the primary goals is to expand the repository of static 
data related to the LRB. By adding more documents and 
datasets, the Copilot will be equipped with comprehensive 
knowledge about historical patterns, environmental reports, 
and other relevant static information. This enhancement will 
provide users with a richer context for their queries, improving 
the overall depth of insights available through the bot.

Integration of Additional APIs

To enhance the dynamic information available to users, future 
work will include the integration of more APIs. This will allow 
WaterCopilot to access a broader range of data sources, 
providing users with real-time insights into various aspects of 
the LRB. By expanding the range of data and services 
accessible through the bot, users will benefit from more 
accurate and timely information, facilitating better decision-
making.

Interface Improvements

Improving the user interface will be a continuous focus. 
Enhancements will aim to make interactions more intuitive and 
engaging. This includes refining the layout, improving 
navigation, and implementing features such as sharing 
conversation transcripts. Users will have the option to share 
their interactions easily, facilitating collaboration and 
discussion with colleagues or stakeholders.

Support for Diverse Graph Types

The capability to support various types of graphs and 
visualizations will be expanded. By implementing additional 
visualization techniques, users will be able to choose from 
different chart types that best represent the data they are 
analyzing. This flexibility will enhance the interpretability of 
the data and provide users with better insights into trends and 
patterns over time.

Report Generation and Downloadable Formats

Future enhancements will also focus on the implementation of 
a reporting feature. Users will be able to generate customized 
reports based on current scenarios or specific queries, which 
can then be downloaded in various formats (e.g., PDF, CSV). 
This functionality will streamline the process of sharing 
insights and findings, making it easier for users to present data 
to stakeholders or include it in formal reports.

Performance Optimization and Latency Reduction

Ongoing efforts will be directed toward improving the overall 
performance of WaterCopilot. This includes optimizing 
algorithms, refining data retrieval processes, and implementing 
caching strategies to reduce latency. By enhancing 
performance, the bot will provide a smoother and more 
responsive user experience, ensuring that users receive timely 
and efficient responses to their queries.

Access to the Digital Twin Main Portal

In the future, WaterCopilot will implement a technique to 
enable direct access to the main Digital Twin portal from 
within the Copilot. This integration will allow users to 
seamlessly transition between the Copilot and the Digital Twin 
platform, providing a more cohesive experience. Users will be 
able to explore detailed data and features on the portal without 
needing to navigate away from the Copilot interface, further 
enhancing usability and accessibility.

CHALLENGES AND SOLUTIONS

The development of WaterCopilot encountered several 
significant technical challenges, particularly due to the 
evolving nature of Generative AI (GenAI) and Large Language 
Models (LLMs), as well as the complexity of the Digital Twin 
project itself. Below are the key technical challenges faced 
during development and the solutions implemented to address 
them.

Limited Resources on Generative AI and LLMs

Challenge: Generative AI and LLMs are relatively new 
technologies, resulting in a scarcity of comprehensive 
resources and documentation. This made it difficult to develop 
an end-to-end Copilot that effectively leverages these 
advanced models, as the lack of established best practices 
posed significant hurdles.

Solution: To overcome this challenge, the team collaborated 
with Microsoft Research. This partnership provided access to 
initial ideas, a well-defined pipeline, and sample code that 
served as a foundational framework. By building upon this 
groundwork, the team successfully developed a fully 
functional application that effectively utilized GenAI and LLM 
technologies.

Complexity of the Digital Twin Project

Challenge: The Digital Twin project is inherently complex, 
involving multiple components, datasets, and analytical tools. 
Understanding the integration of these elements and their 
interdependencies presented initial challenges for the 
development team.

Solution: The team organized knowledge-sharing sessions and 
workshops to facilitate cross-disciplinary collaboration. By 
bringing together software engineers, data scientists, and 


TECHNICAL REPORT

CGIAR Initiative on  Digital Innovation  |  on.cgiar.org/digital 22

environmental experts, the team fostered a deeper 
understanding of the project's goals and requirements. 
Additionally, detailed documentation and flowcharts were 
created to visualize the architecture and workflows, making it 
easier for team members to grasp the overall system structure.

Integration of Multiple Data Sources

Challenge: Integrating data from various sources—including 
APIs, static documents, and databases—was technically 
challenging. Ensuring data consistency, synchronization, and 
accuracy across these different formats required careful 
coordination.

Solution: The team implemented a modular plugin 
architecture, allowing each data source to be connected 
independently via APIs. This design ensured that updates to 
one data source did not affect the others. Robust data 
validation mechanisms were established to maintain 
consistency and accuracy across all sources, preserving the 
integrity of the information presented to users.

CONCLUSION

The development of WaterCopilot represents a major step 
forward in streamlining data access and analysis for 
stakeholders managing the LRB, with potential for expansion 
to any other river basin around the world. By integrating 
advanced AI technologies with near real-time data retrieval 
and interactive user interfaces, the project has succeeded in 
addressing the complexities of environmental monitoring and 
decision-making. The Copilot consolidates information from 
various sources, including near real-time APIs and static 
documents, making it a comprehensive tool for users seeking 
insights on rainfall, e-flow, water availability, and other critical 
metrics.

Despite the challenges associated with integrating Generative 
AI and managing a complex ecosystem like the Digital Twin 
project, the collaboration with Microsoft Research provided a 
strong foundation that enabled the development team to build 
a functional and robust application.  The deployment strategy, 
leveraging Docker and cloud services such as AWS, was 
designed to ensure scalability and reliability. Moreover, the 
modular plugin architecture offers flexibility for seamless 
integration of new features and data sources. Importantly, this 
architecture is cloud-agnostic, supporting not only AWS but 
also other cloud providers, enabling adaptability, resilience, 
and independence from a single platform.

Looking ahead, WaterCopilot will continue to evolve, with 
plans to expand its data repositories, enhance visualizations, 
and integrate more dynamic APIs. Improvements to user 
interface design and performance optimization will further 
enhance the user experience, and future features like report 
generation and integration with the main Digital Twin portal 
will provide even more value to users.

Ultimately, WaterCopilot serves as an essential tool for 
empowering users with timely, accurate, and actionable 
information, supporting sustainable water management and 
informed decision-making across the LRB. The project's 
success demonstrates the power of AI-driven solutions in 
tackling complex environmental challenges, and it sets a 
foundation for future innovations in the field.

ACKNOWLEDGMENTS

The development of the WaterCopilot was made possible 
through the technical expertise and support provided by the 
Microsoft Research team namely: Ranveer Chandra, Eduardo 
Rodrigues, and Leonardo Nunes. Special thanks are extended 
to our research team, particularly Matheswaran Karthikeyan, 
who initially expressed interest in the development of 
WaterCopilot. In addition, we would like to express our sincere 
gratitude to the LIMCOM Member States and the LIMCOM-
UNDP/GEF project team, supported by the Global Water 
Partnership Southern Africa (GWPSA), the United Nations 
Development Programme (UNDP) facilitated through the 
Global Environment Facility (GEF), for their vision, 
encouragement and support throughout this project. The 
invaluable contributions from all stakeholders and partners, 
whose experience, deep understanding of the system and 
commitment played a crucial role which cannot be overstated. 
Their involvement was instrumental in shaping the 
development process. Finally, we are deeply appreciative of 
the Leona M. and Harry B. Helmsley Charitable Trust for their 
generous grant under the DIWASA project, which not only 
enabled this project but also provided an opportunity to 
collaboratively develop tools that will have a lasting impact on 
the Limpopo River Basin.

ACCESSIBILITY AND USAGE 
POLICY

The complete code for the iwmi-doc-plugin, iwmi-api-plugin, 
and Water Copilot is available on GitHub through registration 
and is licensed under the Apache License, Version 2.0. This 
license allows developers to freely use, modify, and distribute 
the plugins with proper attribution. However, the 
accompanying documentation and data resources are licensed 
under the Creative Commons Attribution-NonCommercial 4.0 
(CC BY-NC 4.0) license. This permits non-commercial use, 
sharing, and adaptation with appropriate credit. Developers 
and users should note that any commercial use of these 
materials is strictly prohibited without explicit written 
permission. For further details or inquiries about licensing, 
please contact iwmi-digitaltwins@cgiar.org.


TECHNICAL REPORT

CGIAR Initiative on  Digital Innovation  |  on.cgiar.org/digital

This publication has been prepared as an output of the CGIAR Initiative on Digital Innovation, which researches pathways to 
accelerate the transformation towards sustainable and inclusive agrifood systems by generating research-based evidence and 
innovative digital solutions. This publication has not been independently peer reviewed. Responsibility for editing, proofreading, 
and layout, opinions expressed, and any possible errors lies with the authors and not the institutions involved. The boundaries and 
names shown and the designations used on maps do not imply official endorsement or acceptance by the International Water 
Management Institute (IWMI), CGIAR, our partner institutions, or donors. In line with principles defined in the CGIAR Open and 
FAIR Data Assets Policy, this publication is available under a CC BY 4.0 license. © The copyright of this publication is held by 
IWMI. We thank all funders who supported this research through their contributions to the CGIAR Trust Fund.

23

REFERENCES

Garcia Andarcia, M., Dickens, C., Silva, P., Matheswaran, K., 
& Koo, J. (2024). Digital Twin for management of water 
resources in the Limpopo River Basin: a concept. 
Colombo, Sri Lanka: International Water 
Management Institute (IWMI). CGIAR Initiative on 
Digital Innovation. 4p. https://hdl.handle.net/10568/151898

Gurusinghe, T., Muthuwatta, L., Matheswaran, K., & Dickens, 
C. (2024). Developing a foundational hydrological model 
for the Limpopo River Basin using the Soil and Water 
Assessment Tool Plus (SWAT+). Colombo, Sri Lanka: 
International Water Management Institute (IWMI). CGIAR 
Initiative on Digital Innovation. 14p. https://
hdl.handle.net/10568/151939

IBM. (n.d.). Large language models. IBM. Retrieved from 
https://www.ibm.com/topics/large-language-models

Leitão, P. C., Santos, F., Barreiros, D., Santos, H., Silva, P., 
Madushanka, T., Matheswaran, K., Mutuwatte, L., 
Vickneswaran, K., Retief, H., Dickens, C., & Garcia 
Andarcia, M. (2024). Operational SWAT+ model: 
Advancing seasonal forecasting in the Limpopo River 
Basin. Colombo, Sri Lanka: International Water 
Management Institute (IWMI). CGIAR Initiative on 
Digital Innovation. https://hdl.handle.net/10568/155533

Microsoft. (n.d.). Project FarmVibes. Redmond, WA: 
Microsoft Research. Retrieved from https://www.microsoft.
com/en-us/research/project/project-farmvibes/

Sitoe, S., & Qwist-Hoffman, P. (2013). Limpopo River Basin 
monograph. Aurecon AMEI (Pty) Ltd. Retrieved November 
18, 2024, from https://dsc.duq.edu/limpopo-policy/1/

Sun, A. Y., & Scanlon, B. R. (2019). How can Big Data and 
machine learning benefit environment and water 
management: A survey of methods, applications, and future 
directions. Environmental Research Letters, 14(7), 073001. 
https://doi.org/10.1088/1748-9326/ab1b7d

APPENDIX 1: TECHNOLOGY STACK 
FOR WATERCOPILOT

1. Development Language and Tools

• Python

A popular programming language known for its simplicity
and readability, which boasted a rich ecosystem of libraries
for data manipulation and machine learning.

• Visual Studio Code (VS Code)

A lightweight and versatile code editor that supported
Python programming with extensive features, powerful
debugging, and version control capabilities.

2. Front-End Development

• Streamlit

A framework used to build interactive user interfaces, which
simplified the development of web applications with built-in
state management and seamless integration with Python.

http://on.cgiar.org/digital
https://hdl.handle.net/10568/124807
https://hdl.handle.net/10568/124807
https://creativecommons.org/licenses/by/4.0/
https://creativecommons.org/licenses/by/4.0/
https://www.iwmi.cgiar.org
https://www.cgiar.org/funders/