Croppie: AI-powered information extraction from natural language
2024-01-18
García-Acosta, Sebastian

Contents
Auto-Regressive Language Model
Token
REST API
NLP Tasks
Problem identification
Transformers
Extractive models
BERT
DistilBERT
ALBERT
RoBERTa
Generative Models
GPT-3
GPT + RLHF
GPT-4
LLaMa 2
Claude 2.1
Comparison between Extractive and Generative Models
Quality
Scalability
Efficiency
Cost
Algorithmic complexity study
Generative models comparison
Solution selection
Costs
Methodology
Exploration of alternatives
Division by Model Types Based on Architecture
Model comparison
Results
Future work
Conclusions
Appendices
Bibliography
Abstract
This report addresses the fundamental challenge of extracting key fields from natural language text in the context of natural language processing (NLP). Focusing on Spanish text, our solution, integrated into Croppie, offers an efficient and precise method for extracting essential information. Emphasizing robustness to colloquial expressions and grammatical errors, the report discusses implementation alternatives and presents the implemented solution, with the code available in our GitHub repository. This functionality is crucial for automated decision-making, as it enhances the understanding of context in natural language conversations.
https://github.com/CIAT-DAPA/croppie_bot_model

Glossary

Auto-Regressive Language Model
An auto-regressive language model is a type of language model that generates output sequences one token at a time, where each token is predicted based on the preceding ones. Examples include the GPT (Generative Pre-trained Transformer) models.

Token
In natural language processing, a token is the smallest unit of a sequence, often representing a word or a subword. It is the basic building block that language models process and generate. Tokens can include words, punctuation marks, or subword units, depending on the tokenization method.

REST API
REST API stands for Representational State Transfer Application Programming Interface. It is a set of rules and conventions for building and interacting with web services. RESTful APIs use standard HTTP methods (GET, POST, PUT, DELETE) to perform operations on resources. REST APIs are widely used for communication between different software systems over the internet, providing a scalable and stateless approach.

Introduction
This report tackles the challenge of extracting key fields from natural language text in Spanish within the context of natural language processing (NLP), presenting a solution integrated into Croppie.
Emphasizing efficiency and precision while addressing colloquial expressions and grammatical errors, the report explores implementation alternatives, with the code available on GitHub. The proposed functionality enhances automated decision-making by improving contextual understanding in natural language conversations.

In a comprehensive model evaluation, GPT-3.5-turbo from OpenAI emerges as the optimal choice for extracting keywords from Spanish text. With a quality score of 5, it demonstrates exceptional language understanding and accurate responses. Although it does not achieve the maximum privacy score, GPT-3.5-turbo's competitive privacy measures, reinforced by OpenAI's commitment, make it the preferred choice. The model excels in scalability (score: 5) and cost efficiency (score: 4), providing quick responses and minimizing token usage for JSON-formatted outputs. Maintenance ease is also emphasized (score: 5), with improvements in structured output generation facilitating integration and maintenance. GPT-3.5-turbo stands out as the most comprehensive tool for developing the Croppie keyword extraction solution.

NLP Tasks
The extraction of key fields from natural language text is a fundamental challenge in natural language processing (NLP). In natural language conversations, the identification and extraction of relevant information are essential for understanding the context and making automated decisions. This report introduces a feature of Croppie designed to address this problem, providing an efficient and accurate way to extract key information from Spanish text that is robust to colloquial expressions and grammatical errors. The report discusses various implementation alternatives and presents the implemented solution. The code for the solution's implementation can be found in the GitHub repository linked above.

Problem identification
Given the conversation history in natural language, the system must extract user-relevant fields.
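To make the task concrete, the following sketch pairs a hypothetical exchange with the output the system is expected to produce. The field names follow the report's key-field list (Table 2); the conversation text itself is invented for this illustration.

```python
# Illustrative only: an invented exchange and the JSON-style result the
# extractor should return. Field names are taken from Table 2.
conversation = (
    "Bot: ¿Cuál es el nombre del productor?\n"
    "Usuario: María Pérez\n"
    "Bot: ¿Cuál es el número de cédula?\n"
    "Usuario: uno dos tres cuatro cinco seis siete ocho nueve cero"
)

# Desired output: fields normalized to the required format, even when the
# user writes them colloquially (e.g., digits spelled out as words).
expected_fields = {
    "nombre_productor": "María Pérez",
    "numero_cedula": "1234567890",  # normalized to 10 digits
    "consentimiento": None,         # not mentioned, so null by default
}

assert len(expected_fields["numero_cedula"]) == 10
```

Note how "uno dos tres cuatro..." must be normalized to digit form, one of the robustness requirements that drives the model choice later in the report.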
Table 1 (in the Appendices) illustrates the system's functional requirements.
https://github.com/CIAT-DAPA/croppie_bot_model

Key fields
Considering the functional requirements, a list of fields to extract from the conversation between the user and the WhatsApp chatbot was compiled, along with the data validation criterion for each. Table 2 (in the Appendices) shows the detailed list of key fields to extract.

Background and solution alternatives
This section evaluates the feasibility of different options to solve the problem. Different types of Machine Learning (ML) and Natural Language Processing (NLP) models are explored. The comparison is mainly divided between publicly and privately accessible models.

Transformers
In recent years, advances in natural language processing (NLP) have been notable, marked by the introduction of highly sophisticated models. One of the most significant milestones was the arrival of BERT, which revolutionized the field by introducing transformer architectures (Vaswani et al., 2017) capable of understanding broader contexts in text. This part of the analysis briefly describes each technique and contrasts its capabilities.

Figure 1. Transformer model architecture. The architecture is divided into two parts: encoder (left) and decoder (right).

Extractive models

BERT
BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2018) is a pre-trained model designed to understand the bidirectional context of words in a sentence. Its ability to capture semantic and contextual relationships allows it to excel in extractive tasks, where the goal is to identify relevant information in a text.

DistilBERT
DistilBERT (Sanh et al., 2019) is a lighter version of BERT that maintains surprisingly high performance with a fraction of the original parameters. By reducing the model's complexity, DistilBERT is more computationally efficient, making it suitable for applications with limited resources.
ALBERT
ALBERT (Lan et al., 2019) addresses the scalability of BERT by improving efficiency and reducing the number of parameters without sacrificing performance. It achieves this by implementing a structure of shared parameters between layers, allowing for more effective training of larger models.

RoBERTa
RoBERTa (Robustly optimized BERT approach) (Liu et al., 2019) is an enhancement of BERT that modifies key aspects of the model, such as tokenization and training dynamics. By adopting a more robust approach to bidirectional representation, RoBERTa outperforms BERT in various tasks, demonstrating a greater ability to understand complex contexts and capture semantic relationships. Its optimized structure and solid results make it a valuable option in the natural language processing model landscape.

Generative Models
Unlike extractive models based on the BERT architecture, which only use the Transformer encoder, generative models are auto-regressive, and their architecture employs a decoder to produce text. The most popular ones include:

GPT-3
GPT-3 (Generative Pre-trained Transformer) (Radford, 2018) is the third iteration of the GPT series developed by OpenAI. Introduced in 2020, GPT-3 is a pre-trained generative language model with a considerable training corpus and a large number of parameters (175 billion). Unlike extractive models like BERT, GPT-3 can generate complete text and is known for its impressive performance in various natural language processing (NLP) use cases. However, despite its impressive capability, GPT-3 and similar models may provide incorrect responses and, at times, "hallucinate" or invent information.

GPT + RLHF
GPT-3 is a pre-trained generative language model that has proven highly effective in various NLP tasks. However, as mentioned earlier, it can also produce incorrect or "hallucinated" responses.
To address this issue, OpenAI developed a technique called Reinforcement Learning from Human Feedback (RLHF). RLHF uses human feedback to improve the quality of responses generated by GPT-3.

GPT-4
GPT-4 (OpenAI, 2023) is the fourth iteration of the GPT series developed by OpenAI. It was released on March 14, 2023, and is available through the API and for ChatGPT Plus users. GPT-4 is significantly larger than GPT-3, although OpenAI has not disclosed its exact parameter count. This allows it to learn more complex patterns in data and generate more natural and coherent text. GPT-4 has also been trained on a larger dataset than GPT-3, including text from books, articles, websites, and social media. This enables it to generate more informative and relevant text for a variety of tasks. Overall, GPT-4 is a more advanced pre-trained generative language model than GPT-3, capable of producing more natural, coherent, and informative text.

LLaMa 2
LLaMa 2 (Touvron et al., 2023) is a publicly accessible pre-trained generative language model developed by Meta AI. It was launched in July 2023 and is a large-scale transformer model released in versions with 7, 13, and 70 billion parameters. LLaMa 2 is a significant improvement over its predecessor, LLaMa, in various aspects. Firstly, LLaMa 2 offers a significantly larger model size, allowing it to learn more complex language patterns. Secondly, LLaMa 2 uses a larger context window (4,096 tokens).

Claude 2.1
Claude 2.1, an enhanced version of Anthropic's natural language processing model, brings notable improvements in handling extensive contexts and an increase of over 50% in accuracy. It supports an expanded context window of 200,000 tokens.

Comparison between Extractive and Generative Models
The main difference between extractive and generative models is that extractive models can only respond with a subtext of the input, while generative models can produce text that is not present in the input.
This difference is crucial for one of the criteria to consider when deciding which type of model is better suited to implement a solution to the problem. The criteria for comparing extractive and generative models are as follows:
- Quality: the model's ability to extract key fields in Spanish in the desired format.
- Scalability: the responsiveness of the system with respect to the number of requests it receives in a given time period.
- Efficiency: the temporal complexity of input processing.
- Cost: the costs associated with running model inference.

For each criterion, a score is given in the range of 1 to 5, where closer to 1 is worse and closer to 5 is better.

Table 3. Selection criteria for language model types

Model type    Quality  Scalability  Efficiency  Cost  Total
Extractive    2        3            2           3     10
Generative    5        5            4           2     16

Generative models surpass extractive models because they better satisfy the needs of the solution:

Quality
Extractive models exhibit limited robustness to the varied ways users present information. Since these models can only predict portions of the original input text, they lack the capability to modify the representation of key fields. For instance, if a numeric field (123) is expressed textually (one two three) in the original text, the model can only extract the information in the presented format. In contrast, generative models, trained on extensive datasets and fine-tuned through human feedback, can transform information from key fields into the desired format without additional post-processing.

Scalability
Deployed in highly optimized environments built to handle large-scale requests from cloud servers, generative models demonstrate superior scalability compared to extractive models. While extractive models are smaller than their generative counterparts, deploying them requires configuring optimized servers with GPUs to guarantee the specified minimum response time, leading to additional configuration effort.
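The substring limitation described under Quality can be illustrated with a toy sketch. Both functions below are invented stand-ins, not real models: the point is only that an "extractive" component is confined to spans that literally appear in the input, while a "generative" one is free to rewrite them.

```python
import re

# Toy contrast of the two model families. The extractive stand-in can only
# select a literal substring of the input, so "uno dos tres" stays in word
# form; the generative stand-in may transform it into the desired format.
WORD_TO_DIGIT = {"uno": "1", "dos": "2", "tres": "3"}

def extractive_sketch(text: str) -> str:
    # Returns a span that appears verbatim in the input text.
    match = re.search(r"uno dos tres|\d+", text)
    return match.group(0) if match else ""

def generative_sketch(text: str) -> str:
    # Free to normalize the span, e.g. spelled-out digits to numerals.
    span = extractive_sketch(text)
    return "".join(WORD_TO_DIGIT.get(w, w) for w in span.split())

text = "mi número es uno dos tres"
print(extractive_sketch(text))   # "uno dos tres", stuck with the input form
print(generative_sketch(text))   # "123", normalized to the desired format
```

This is why the Quality row of Table 3 favors generative models: the normalization step comes for free, with no extra post-processing stage.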
Efficiency
Despite the larger size of generative models compared to extractive models, their algorithmic complexity with respect to the text size and the number of fields to extract is lower, as explained in the next section.

Cost
Generative models are consumed through APIs with variable costs based on input and output tokens (this cost scheme is elucidated later, in the Costs section). Consequently, they may be more expensive than deploying an extractive model, which requires fewer computational resources to operate.

Algorithmic complexity study
Consider the number of input text tokens n and the number of fields to extract m. Given that extractive models can only produce one answer at a time, they would need to process the n text tokens once for each of the m fields. Their algorithmic complexity is therefore O(n·m). On the other hand, since generative models can produce an output sequence that depends on the instructions they receive, they can extract all key fields while processing the n input tokens in a single pass. Therefore, their temporal complexity does not depend on m and is O(n).

Generative models comparison
Comparison factors:
• Quality
• Privacy
• Scalability
• Cost
• Maintainability

For each criterion, a score is assigned on a scale of 1 to 5, where closer to 1 is worse and closer to 5 is better.

Table 4. Selection criteria for generative models

Model       Quality  Privacy  Scalability  Cost  Maintainability  Total
LLaMa 2     3        5        1            5     1                15
GPT-3       5        2        5            4     5                21
GPT-4       5        2        4            2     5                18
Claude 2.1  3        2        4            4     5                18

Solution selection
Given that OpenAI's models have been trained on the largest amount of high-quality data in various languages, and considering the lack of labeled data for training processes, GPT-3 is the option that offers the best quality for the problem. Additionally, improvements announced on November 6th substantially reduce the costs of using the API and enhance output generation in structured formats like JSON.
This significantly improves code maintainability and the quality of extracted fields (New Models and Developer Products Announced at DevDay, n.d.).

Costs
The table below shows the costs (as of December 2023) per 1,000 tokens for each language model service option, together with its context window.

Table 5. Costs per generative model

Model               Price per 1,000 input tokens  Price per 1,000 output tokens  Context window (thousands of tokens)
gpt-4-1106-preview  $0.0100                       $0.0300                        128
gpt-4               $0.0300                       $0.0600                        32
gpt-3.5-turbo-1106  $0.0015                       $0.0020                        16
Claude 2.1          $0.0080                       $0.0240                        200

Methodology
The methodology used to select the appropriate solution for extracting key information fields from natural language text is based on a comprehensive evaluation of natural language processing (NLP) models. The following outlines the methodological approach:

Exploration of alternatives
Various publicly available and privately hosted Machine Learning (ML) and Natural Language Processing (NLP) models were analyzed.

Division by Model Types Based on Architecture
A distinction was made between extractive and generative models, considering the specific characteristics and applications of each type.

Extractive models
Models such as BERT, DistilBERT, ALBERT, and RoBERTa were evaluated, highlighting their capabilities and efficiency in the task of extracting key information from text.

Generative models
Generative models like GPT-3, GPT-4, LLaMa 2, and Claude 2.1 were explored, analyzing their abilities to generate complete text and their performance in NLP tasks.

Model comparison

Evaluation criteria
Fundamental criteria, such as quality, scalability, efficiency, and cost, were established for comparing the models.

Model scores
A score from 1 to 5 was assigned for each criterion, where 5 indicates the best performance and 1 the worst.
Comparison between extractive and generative models
A detailed comparison was made between extractive and generative models, highlighting key differences in quality, scalability, efficiency, and cost.

Study of temporal complexity and efficiency

Analysis of complexity
A study of temporal complexity was conducted for the extraction of key fields, considering the number of input tokens and the number of fields to extract.

Comparison of extractive and generative models
The algorithmic complexity of extractive and generative models was compared, emphasizing efficiency in extracting key information.

Detailed comparison of generative models

Additional comparison factors
Additional factors such as privacy and maintainability were evaluated, specifically for generative models.

Detailed scoring
A detailed score from 1 to 5 was assigned for each generative model in each additional criterion.

Solution selection

Key considerations
Crucial aspects like quality, scalability, efficiency, and cost were taken into account when making the final decision.

Selection of GPT-3.5-turbo
GPT-3.5-turbo was chosen as the preferred solution due to its quality, diverse training data in multiple languages, and recent improvements in costs and structured output generation through OpenAI Functions.

Cost analysis
A detailed analysis of the costs associated with each model was provided, considering the price of input and output tokens, as well as the context window.

Cost comparison
Costs per 1,000 input tokens were compared for each language model service option, considering their respective context windows.

This methodology provides a solid foundation for informed decision-making when selecting the most suitable solution for extracting key information from natural language text.
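The selected approach (a single prompt requesting all fields as JSON, followed by field-specific validation and a re-ask for any field that fails) can be sketched as follows. This is a minimal sketch, not the production implementation: the prompt wording is invented, the field names come from Table 2, and the model call is a stub standing in for the real gpt-3.5-turbo REST request.

```python
import json

# Sketch of the prompt -> parse -> validate -> retry loop. The LLM call is
# stubbed; the real service would issue a REST request to gpt-3.5-turbo
# with the temperature parameter at its minimum.
FIELDS = ["nombre_productor", "numero_cedula"]

def build_prompt(chat_history: str, fields: list[str]) -> str:
    # Hypothetical prompt wording; one prompt covers all fields at once.
    return (
        "Extract the following fields from the conversation and answer "
        f"only with JSON: {', '.join(fields)}\n\n{chat_history}"
    )

def validate(fields: dict) -> list[str]:
    # Key-specific checks, e.g. the ID number must be exactly 10 digits.
    failed = []
    cedula = fields.get("numero_cedula")
    if cedula is not None and not (cedula.isdigit() and len(cedula) == 10):
        failed.append("numero_cedula")
    return failed

def call_llm(prompt: str) -> str:
    # Stub standing in for the REST API call; returns a canned JSON reply.
    return json.dumps({"nombre_productor": "María Pérez",
                       "numero_cedula": "1234567890"})

history = "Usuario: soy María Pérez, cédula 1234567890"
reply = json.loads(call_llm(build_prompt(history, FIELDS)))
retry_fields = validate(reply)  # any failed fields would be re-asked
```

In the real workflow, a non-empty `retry_fields` list would trigger a second prompt covering only those fields, as described in the Architecture section.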
Results

Model training
As explained earlier, given the conditions for the development of this project and the lack of conversational text with the colloquial traits that real-world conversations have, training custom models is not a viable option. Instead, using Large Language Models (LLMs) that have been pre-trained on large amounts of data in different languages and situations offers a much more effective solution. In addition, given the deployment strategy of the service, a closed-source LLM callable through an API was preferred, due to the cost-effectiveness gained from the scalability of these LLM providers. Instead of training an NLP model from scratch and deploying it on proprietary servers, the best-suited solution uses few-shot prompting techniques to instruct a Large Language Model to detect fields in conversational text.

Architecture
The following sequence diagram summarizes the workflow of the microservice solution: the client sends the chat history to Croppie, and Croppie applies text-preprocessing techniques to the chat history. Once the text has been cleaned, Croppie forms a prompt requesting all the required fields to extract. This prompt is sent to the Large Language Model (LLM), which is accessed through a REST API. Once the LLM generates the answer in a parsable JSON format, Croppie applies key-specific validations to make sure that the extracted fields were detected correctly, e.g., that the ID number contains 10 digits. If there are fields that do not comply with the validation criteria, Croppie forms another prompt with these fields in order to attempt to extract them correctly. Otherwise, it returns (field, value) pairs formatted as JSON to the client.

Figure 2. Workflow of the application

Future work

Single Point of Failure
Relying solely on the OpenAI service introduces a potential single point of failure (SPOF) in the system.
If there is a dependency on a single language model provider, any service disruption or issue with OpenAI could lead to a complete halt of the language processing component. To mitigate this risk, especially as the number of service requests grows considerably, implementing a redundancy scheme with multiple Large Language Model (LLM) providers becomes a preferable strategy. This approach ensures system resilience by distributing the workload across multiple providers, reducing the impact of service outages or disruptions from any single provider. Redundancy not only enhances reliability but also allows for flexibility and adaptability in handling increased demand, contributing to a more robust and reliable overall system architecture.

Hallucinations
Hallucinations in generative language models pose a significant challenge, as these models may generate inaccurate or fabricated information. To address this issue, mitigating strategies are crucial. Temperature sampling is one such approach, influencing the randomness of the generated output. By setting the temperature parameter to its lowest value, Croppie controls the level of randomness, helping to produce more focused and reliable responses. Additionally, validating the extracted fields is another essential step. Through field-specific validation processes, the generated content can be cross-checked against predefined criteria or compared with ground-truth data, ensuring that the output aligns with accurate information. These combined strategies enhance the reliability of generative language models, minimizing the occurrence of hallucinations and improving the overall accuracy of the extracted values.

Privacy
Utilizing OpenAI's closed-source Large Language Models (LLMs) may pose a limitation on privacy, as the inner workings and training data of these models remain undisclosed.
This lack of transparency raises concerns about whether the information processed through these models might be used to further train and refine them. The uncertainty regarding data usage introduces a potential privacy risk for users and organizations relying on these models. However, OpenAI acknowledges and addresses these concerns with the OpenAI Enterprise plan, offering a solution that guarantees data privacy. With OpenAI Enterprise, there is a commitment not to use incoming data for training the models, providing a more secure and privacy-centric environment for users. This assurance is pivotal for organizations and users seeking a balance between leveraging advanced language models and safeguarding the privacy of the information processed through them. Another alternative to address these limitations is to fine-tune and deploy private models with manually collected data. This approach might be suitable for later stages of the project, because it requires more maintenance and domain-specific expertise.

Conclusions
After a thorough comparison of generative models, considering key criteria such as quality, privacy, scalability, costs, and maintainability, it is clear that GPT-3.5-turbo from OpenAI stands out as the top choice for implementing the solution for extracting keywords from natural language text. GPT-3.5-turbo achieves the highest score in quality, with a rating of 5. Its training on large amounts of data in various languages gives it an exceptional ability to understand Spanish text and generate accurate responses. While GPT-3.5-turbo does not reach the maximum score in privacy, its score of 2 remains competitive and acceptable. OpenAI has implemented optional measures to ensure that data sent to the API will not be used to train the model (Introducing ChatGPT Enterprise). GPT-3.5-turbo excels in scalability with a score of 5.
Its ability to handle a context window of 16 thousand tokens allows it to effectively address longer and more complex texts. Moreover, being a smaller model than counterparts like GPT-4 and Claude 2.1, it responds more quickly. One of the most critical factors is cost, and GPT-3.5-turbo proves to be highly efficient with a score of 4. Its recent improvements in API costs and structured output generation significantly contribute to its economic appeal, minimizing the number of tokens required in instructions to generate JSON-formatted responses. Maintenance ease is essential for long-term implementation, and GPT-3.5-turbo leads with a score of 5. Improvements in structured output generation facilitate efficient code integration and maintenance. In contrast to publicly accessible alternatives, GPT-3.5-turbo does not require specialized server configuration for model inference and does not need to be trained on Spanish texts. In summary, GPT-3.5-turbo not only excels in text generation quality but also offers a robust combination of scalability, cost efficiency, and ease of maintenance. These attributes make GPT-3.5-turbo the preferred choice and the most comprehensive tool for developing the Croppie keyword extraction solution.

Appendices

Table 1. Functional and non-functional requirements

Functional requirements
RF01 Ask the user if they consent to the application's usage of their data.
RF02 Inquire about the technician or technology the user is using.
RF03 Request the name of the producer.
RF04 Ask the user for the producer's ID or identification number.
RF05 Ask the user if they have a coffee ID and request the number if the answer is yes.
RF06 Ask the user about their gender.
RF07 Ask the user about their age range.
RF08 Ask the user about the type of cell phone they have.
RF09 Ask the user to select the department in which the property is located.
RF10 Ask the user to write the municipality where the property is located.
RF11 Ask the user to write the vereda (rural locality) where the property is located.
RF12 Ask the user to enter the total size of their productive areas.
RF13 Ask the user to enter the total size of their coffee production areas.
RF14 Ask the user for their yield estimate for the coffee on the plot.
RF15 Verify and ask for reasons if the estimate is less than 40 arrobas/ha.
RF16 Ask the user what type of market they produce for (Volume or Quality).
RF17 Ask the user about the variety of coffee crops on their plot.
RF18 Ask the user if the plot is technified.
RF19 Ask the user for the name of the plot.
RF20 Ask the user about the area of the plot where they grow coffee.
RF21 Ask the user about the age of the crop on the plot.
RF22 Ask the user about the total coffee production of the previous year on the plot.
RF23 Ask for reasons if the previous year's production was less than 40 arrobas/ha.
RF24 Offer a tutorial on how to calculate the distance between coffee trees and the spacing between rows.
RF25 Ask the user the distance covered by 5 consecutive furrows.
RF26 Ask the user the distance covered by 5 consecutive coffee trees.
RF27 Ask the user how many coffee trees they have on the plot.
RF28 Ask the user if all the trees are productive, with a yes or no option.
RF29 Ask the user to estimate the percentage of productive trees if the previous answer is no.
RF30 Provide a tutorial on how to carry out sampling of productive trees.
RF31 Ask the user about the height of the coffee tree.
RF32 Provide a best-practice guide for taking photographs of coffee trees.
RF33 Prompt the user to upload a full photo of tree number 1.
RF34 Ask the user for the number of productive branches of tree number 1.
RF35 Prompt the user to upload a photo of a productive branch of tree number 1.
RF36 Ask the user for the number of productive buds on the uploaded branch of tree number 1.
RF37  Repeat the previous steps for at least 5 trees.
RF38  Confirm the information uploaded by the user and request confirmation to perform the yield calculation.
RF39  Show the user the estimated yield results and request feedback.

Non-functional requirements
RNF01  Expose endpoints through which the service can be called to extract key information.
RNF03  Provide responses in real time, with a maximum response time of 5 seconds.

Source: Requirements specification document

Table 2. Key fields

Field | Description | Validation
consentimiento | True if the user explicitly allows the application to use their data, False if not; null (default) if no information is available. |
seleccion_tecnico | The technician or technology the user is currently using, chosen from a list of authorized technicians; null (default) if no information is available. |
tipo_mercado | Type of market for which the user produces. |
numero_cedula | ID or identification number of the producer. | Must be 10 digits.
nombre_productor | Name of the producer. |
cedula_cafetera | Coffee ID number of the producer; requested only if the producer answers 'Yes' to the question about the coffee ID. | Must be exactly 10 digits.
genero_productor | Gender of the producer. |
rango_edad_productor | Age range of the producer (in years). |
tipo_celular | Type of cell phone the user has. Options: 'Smartphone', 'Phone without internet capability', 'I don't have a cell phone', 'Other (specify which one)', or null if not mentioned. |
departamento_finca | Department of Colombia where the producer's farm is located. |
municipio_finca | Municipality of Colombia where the producer's farm is located. |
vereda_finca | Vereda (rural district) in Colombia where the producer's farm is located. |
tamanio_areas_productivas | Total size of the productive areas of coffee, corn, and other products, in hectares (ha). | Must be between 0.1 and 100 ha.
tamanio_areas_cafe | Total size of the coffee productive areas, in hectares (ha). | Must be between 0.1 and 100 ha and less than or equal to the total size of the productive areas.
nombre_parcela | Name of the plot. |
area_parcela_cafe | Area of the plot where coffee is grown. | Must be greater than 0.
edad_cultivo_parcela | Age of the crop on the plot. | Must be between 0 and 30 years.
produccion_total_2022 | Total coffee production on the plot during the 2022 coffee-growing period. | Must be between 0 and 240 arrobas/ha.
estimacion_rendimiento_cafe | Estimated coffee yield (parchment coffee) for the plot for the current year, in arrobas/ha. | Must be between 0 and 240 arrobas/ha.
variedad_cultivo_cafe | Variety of coffee crops on the plot. |
parcela_tecnificada | Whether the plot where the estimate will be made is technified. |

Source: Requirements specification document

Bibliography

Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional Transformers for language understanding. arXiv (Cornell University). https://arxiv.org/pdf/1810.04805v2
Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv (Cornell University). https://arxiv.org/pdf/1910.01108.pdf
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019).
ALBERT: A lite BERT for self-supervised learning of language representations. arXiv (Cornell University). https://doi.org/10.48550/arxiv.1909.11942
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv (Cornell University). https://doi.org/10.48550/arxiv.1907.11692
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. arXiv (Cornell University), 30, 5998–6008. https://arxiv.org/pdf/1706.03762v5
Thoppilan, R. (2022, January 20). LaMDA: Language models for dialog applications. arXiv (Cornell University). https://arxiv.org/abs/2201.08239
Radford, A. (2018). Improving language understanding by generative pre-training. https://www.semanticscholar.org/paper/Improving-Language-Understanding-by-Generative-Radford-Narasimhan/cd18800a0fe0b668a1cc19f2ec95b5003d0a5035
Aligning language models to follow instructions. (n.d.). https://openai.com/research/instruction-following
Touvron, H., Martin, L., Stone, K. H., Albert, P. J., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., . . . Scialom, T. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2307.09288
New models and developer products announced at DevDay. (n.d.). https://openai.com/blog/new-models-and-developer-products-announced-at-devday
OpenAI pricing. (n.d.). https://openai.com/pricing
Anthropic pricing. (n.d.). https://www-files.anthropic.com/production/images/model_pricing_dec2023.pdf
Introducing ChatGPT Enterprise. (n.d.).
https://openai.com/blog/introducing-chatgpt-enterprise

Sebastian, Croppie: AI powered key-field extraction from natural language text, segaracos@outlook.com

CGIAR is a global research partnership for a food-secure future. CGIAR science is dedicated to transforming food, land, and water systems in a climate crisis. Its research is carried out by 13 CGIAR Centers/Alliances in close collaboration with hundreds of partners, including national and regional research institutes, civil society organizations, academia, development organizations and the private sector. www.cgiar.org

We would like to thank all funders who support this research through their contributions to the CGIAR Trust Fund: www.cgiar.org/funders.

To learn more about this Initiative, please visit this webpage. To learn more about this and other Initiatives in the CGIAR Research Portfolio, please visit www.cgiar.org/cgiar-portfolio

© 2023 CGIAR System Organization. Some rights reserved. This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International Licence (CC BY-NC 4.0).
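Appendix note. As an illustrative sketch only (not the project's actual implementation), the key-field schema and validation rules of Table 2 could be expressed as a plain Python dataclass. The field names come from the table and the thresholds from its Validation column; the class name `KeyFields` and the restriction to a subset of fields are assumptions made for brevity. A check like this could run over the JSON the extraction model returns before the data is stored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KeyFields:
    """Illustrative subset of the Table 2 key fields.

    Unknown values default to None (null), matching the table's
    'null (default)' convention.
    """
    consentimiento: Optional[bool] = None
    cedula_cafetera: Optional[str] = None
    tamanio_areas_productivas: Optional[float] = None    # total productive area, ha
    tamanio_areas_cafe: Optional[float] = None           # coffee area, ha
    edad_cultivo_parcela: Optional[float] = None         # crop age, years
    estimacion_rendimiento_cafe: Optional[float] = None  # yield estimate, arrobas/ha

    def __post_init__(self) -> None:
        # Coffee ID must be exactly 10 digits when present.
        if self.cedula_cafetera is not None and not (
            self.cedula_cafetera.isdigit() and len(self.cedula_cafetera) == 10
        ):
            raise ValueError("cedula_cafetera must be exactly 10 digits")
        # Total productive area: 0.1-100 ha.
        if self.tamanio_areas_productivas is not None and not (
            0.1 <= self.tamanio_areas_productivas <= 100
        ):
            raise ValueError("tamanio_areas_productivas must be 0.1-100 ha")
        # Coffee area: 0.1-100 ha and no larger than the total productive area.
        if self.tamanio_areas_cafe is not None:
            if not 0.1 <= self.tamanio_areas_cafe <= 100:
                raise ValueError("tamanio_areas_cafe must be 0.1-100 ha")
            if (
                self.tamanio_areas_productivas is not None
                and self.tamanio_areas_cafe > self.tamanio_areas_productivas
            ):
                raise ValueError(
                    "tamanio_areas_cafe cannot exceed tamanio_areas_productivas"
                )
        # Crop age: 0-30 years.
        if self.edad_cultivo_parcela is not None and not (
            0 <= self.edad_cultivo_parcela <= 30
        ):
            raise ValueError("edad_cultivo_parcela must be 0-30 years")
        # Yield estimate: 0-240 arrobas/ha.
        if self.estimacion_rendimiento_cafe is not None and not (
            0 <= self.estimacion_rendimiento_cafe <= 240
        ):
            raise ValueError("estimacion_rendimiento_cafe must be 0-240 arrobas/ha")
```

Constructing an instance with out-of-range values raises a ValueError, so invalid extractions are rejected at the service boundary rather than persisted.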