This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the authors institution and sharing with colleagues. Other uses, including reproduction and distribution, or selling or licensing copies, or posting to personal, institutional or third party websites are prohibited. In most cases authors are permitted to post their version of the article (e.g. in Word or Tex form) to their personal website or institutional repository. Authors requiring further information regarding Elsevier’s archiving and manuscript policies are encouraged to visit: http://www.elsevier.com/copyright Author's personal copy Computers and Electronics in Agriculture 69 (2009) 198–208 Contents lists available at ScienceDirect Computers and Electronics in Agriculture journa l homepage: www.e lsev ier .com/ locate /compag Analysis of Andean blackberry (Rubus glaucus) production models obtained by means of artificial neural networks exploiting information collected by small-scale growers in Colombia and publicly available meteorological data Daniel Jiméneza,c,d,∗, James Cockd,e, Héctor F. Satizábalb,c,d, Miguel A. Barreto Sb,c,d, Andrés Pérez-Uribec, Andy Jarvise,f, Patrick Van Dammea a Ghent University, Faculty of BioScience Engineering: Agricultural Science, Laboratory of Tropical and Subtropical Agronomy and Ethnobotany, Coupure links 653-9000, Ghent, Belgium b Université de Lausanne, Hautes Etudes Commerciales (HEC), Institut des Systèmes d’Information (ISI), Switzerland c REDS Institute, University of Applied Sciences of Western Switzerland (HEIG-VD), Route de Cheseaux 1, CH 1401 Yverdon-les-bains, Switzerland d BIOTEC, Precision Agriculture and the Construction of Field-Crop Models for Tropical Fruit Species. Recta Cali Palmira km 18, Cali, Colombia e International Center for Tropical Agriculture (CIAT), Decision and Policy Analysis (DAPA), Recta Cali Palmira km 18, A.A. 6713, Cali, Colombia f Bioversity International, Recta Cali Palmira km 18, A.A. 6713 Cali, Colombia a r t i c l e i n f o Article history: Received 19 December 2008 Received in revised form 3 June 2009 Accepted 20 August 2009 Keywords: Andean blackberry Small-scale growers Artificial neural networks Multilayer perceptron Self-Organizing Maps Input relevance analysis Publicly available meteorological data a b s t r a c t The Andean blackberry (Rubus glaucus) is an important source of income in hillside regions of Colombia. However, growers have little reliable information on the factors that affect the development and yield of the crop, and therefore there is a dearth of information onhow to effectivelymanage the crop. Site specific information recordedby small-scale producers of theAndeanblackberry on their production systems and soils coupled with publicly available meteorological data was used to develop models of such production systems. Multilayer perceptrons and Self-Organizing Maps were used as computational models in the identification and visualization of the most important variables for modeling the production of Andean blackberry. Artificial neural networkswere trainedwith information from20 sites in Colombiawhere the Andean blackberry is cultivated. Multilayer perceptrons predicted with a reasonable degree of accuracy the production response of the crop. The soil depth, the average temperature, external drainage, and the accumulated precipitation of the first month before harvest were critical determinants of productivity. A proxy variable of location was used to describe overall differences in management between farmers groups. The use of this proxy indicated that, even under essentially similar environmental conditions, largedifferences inproduction couldbe assigned tomanagement effects. The informationobtained canbe used todetermine sites that are suitable forAndeanblackberryproduction, and to transferofmanagement practices from sites of high productivity to sites with similar environmental conditions which currently have lower levels of productivity. © 2009 Elsevier B.V. All rights reserved. 1. Introduction TheAndeanblackberry (Rubus glaucusBenth.), alsoknownas the Andes Berry or Mora de Castilla (Bioversity International, 2005) is a fruit native to an area ranging from the northern Andes to the southern highlands of Mexico (National Research Council, 1989). It is grown as a commercial crop in Colombia, Ecuador, Guatemala, Honduras, México and Panamá (Franco and Giraldo, 2002). It is an important source of income in hillside regions of Colombia (Sora et al., 2006). Productivity varies widely between regions and also ∗ Corresponding author at: 90 Rue de Javel, Paris 75015, France. Tel.: +33 1 457 99 038; fax: +32 9 264 62 41. E-mail address: danieljimenez.rodas@gmail.com (D. Jiménez). between farms. Furthermore, the crop is harvested continuously during the year and the productivity varies throughout the year. At the same time growers have little reliable information on the factors that effect thedevelopment andyield of the crop, and conse- quently there is a dearth of readily available information on where to grow the crop and how to effectively manage it. Research on the Andean blackberry is limited and with the cur- rent levels of research intensity it is unlikely that technological packages can be developed for use by growers based on traditional plot based experimentation varying individual factors that affect crop production. The heterogeneous growing conditions and the continuous production throughout the year of many tropical crops mean that a large number of experiments or treatments required to draw firm conclusions concerning the optimum management of the crop under diverse conditions. The situation of a tropical 0168-1699/$ – see front matter © 2009 Elsevier B.V. All rights reserved. doi:10.1016/j.compag.2009.08.008 Author's personal copy D. Jiménez et al. / Computers and Electronics in Agriculture 69 (2009) 198–208 199 crop such as the Andean blackberry contrasts strongly with that of, let us say, raspberries in a temperate climate. In the case of most temperate crops, there is a relatively short and well defined har- vest period and all management is geared to optimal production in that period. In tropical perennial crops that are harvested through- out the year, the number of possible combinations of management practices that need to be tested are enormous. Thus, for example Andean blackberry production during the dry season may require totally different water and pest management practices to those required for the same crop in the wet season. A direct consequence of these multiple management options is continual experimenta- tion by producers of crops like Andean blackberries. Every time a farmer harvests his crop, there is a unique event, an unreplicated experiment (Cock, 2007). Experience with sugarcane, which is also aperennial tropical crop thatmaybeharvested throughout theyear in the low latitude tropics, has shown that by collecting information on crop production produced with the naturally occurring varia- tion in management and the environment, the crops response can be modeled using statistical or best fit models (Isaacs et al., 2007). This approach has later been successfully applied to another peren- nial tropical crops, like coffee (Niederhauser et al., 2008). Given the scarce available information and the limited resources for field work research, and thehighdegree of heterogeneity in both growth and management, we opted for a data-driven modeling approach to provide information to growers on how to choose apposite sites for and to better manage their crops. Crop models are basically of two types which can roughly be describe as mechanistic simulation models and best fit or statisti- cal models. The mechanistic models have the great advantage, at least in theory, that they can be extrapolated out of the range of variation for which data exists as they are based on the basic phys- iological functions of the plant and their response to variation in individual parameters in the environment. Furthermore, variables that affect the observed variation in crop response to changes in the environment canbe identified in causal relationships.However, these mechanistic simulation models require detailed knowledge of the functional relationships between the multiple physiologi- cal and other processes involved in crop growth and development. This knowledge base simply does not exist, and would take years to develop, for a crop like the Andean blackberry that has received little attention from researchers in the past. Statistical or best fit models are generally simpler and rely upon relationships between variations in observed crop growth and development and varia- tions in the growing conditions. The best fit models, however, have the dual disadvantage that they can neither be used to extrapolate beyond the range of variation encompassed in the initial datasets used to develop the models, and secondly they are not able to determinewhether relationships are causal ormerely associations. The best fit models do, however, have the advantage that they can be constructed with a limited knowledge of the myriad individual processes and their interaction with variation in the environment that determines how a crop grows, develops and finally produces a useful product. Thus, with insufficient resources to obtain the knowledge required to developmechanisticmodels, and the obser- vation that best fit models have successfully been used in other crops, this approach was selected for Andean blackberry. Many of the best fitmodels used to predict crop yields are devel- oped using existing information on both crop production and the environment. In the case of small farm crops, such as the Andean blackberry, information on crop production is not readily available and certainly cannot readily be associated with the particular envi- ronmental conditions underwhich a particular cropwas harvested. However, as we previously observed, every harvest is effectively an unreplicated experiment. If it were possible to characterize the production system in terms ofmanagement and the environmental conditions, and if we were able to collect information on the har- vested product of a large number of harvesting events under varied conditions, it should be possible to develop best fit models for the production system. Hence, first step in developing these models was the acquisition of data on Andean blackberry production and the characterization of the production systems. Agricultural systems are difficult tomodel due to their complex- ity and their non-linear dynamic behavior. The evolution of such systems depends on a large number of ill-defined processes that vary in time, that interact with each other, and whose relationships are often highly non-linear and very often unknown (Jiménez et al., 2008). Moreover, the available information describing these sys- tems frequently includesbothqualitative andquantitativedata, the formeroftendifficult to include in traditionalmodelingapproaches. We surmised that bio-inspiredmodels, such as artificial neural net- works, are an appropriate alternative for developing models that can be used to improve production systems. Artificial neural networks have been successfully used to model agricultural systems (Hashimoto, 1997; Schultz andWieland, 1997; Schultz et al., 2000). According to Jiménez et al. (2008), these tech- niques are appropriate as an alternative to traditional statistical models and mechanistic models, when the input data is highly variable, noisy, incomplete, imprecise, and of a qualitative nature, as is the case of our Andean blackberry dataset. Artificial neural networks do not require prior assumptions concerning the data distribution or the form of the relationships between inputs and outputs (Sargent, 2001; Paul and Munkvold, 2005; Nagendra and Khare, 2006). They are capable of “learning” non-linearmodels that include both qualitative and quantitative information, and in gen- eral, they provide superior pattern recognition capabilities than traditional linear approaches (Murase, 2000; Schultz et al., 2000; Noble and Tribou, 2007). They have become a powerful technique to extract salient features from complex datasets (Chon et al., 1996; Giraudel and Lek, 2001). Furthermore, when dealing with multiple variables they can be used to produce easily comprehensible low- dimensional maps that improve the visualization of the data, and facilitate data interpretation (Barreto et al., 2007). Nevertheless, there are a number of disadvantages concerning the use of artificial neural networks, some of them are: its “black box” nature, which makes it difficult to interpret relations between the inputs and out- puts, the difficulty of directly including knowledge of a ecological processes, the tendency to overtrain, and the need for enough data to be properly trained (Schultz et al., 2000; Sargent, 2001; Paul and Munkvold, 2005). An important first step in developing models that explain vari- ation in yield is the identification of relevant variables that affect yield: identification of these variables guides the data collection required as inputs into the model. Several studies identify the most relevant variables, and explain given responses in agriculture through the use of multilayer pre- ceptrons. For instance, Miao et al. (2006) implemented a neural network for identifying the most important variables for corn yield and quality. Using soil and genetic data, and a sensitivity analy- sis for each variable, they demonstrated that the hybrid was the most important factor explaining variability of corn quality and yield. In another study, Jain (2003) reported that the best frost prediction was obtained from the relative humidity, solar activ- ity and wind speed from 2 to 6h before the frost event. Paul and Munkvold (2005) predicting severity of gray leaf spot ofmaize (Cer- cospora zeae-maydis) in corn (Zea mays L.), concluded that the best variables for predicting severity were hours of daily temperatures, hours of nightly relative humidity, and mean nightly temperature. More recently, Jiménez et al. (2007)modeling sugarcane yield, sug- gested that crop age andwater balancewere highly relevant for the modeling process. Self-Organizing Maps (SOM) have also been implemented to improve the visualization of input–input and input–output depen- Author's personal copy 200 D. Jiménez et al. / Computers and Electronics in Agriculture 69 (2009) 198–208 Fig. 1. Variables selected for the construction of the Andean blackberry yield model. dencies. Thus, for example Moshou et al. (2004) found that a waveband centered at 861nm was the variable which best dis- criminated healthy from diseased leaves with yellow rust (Puccinia striiformis f. sp. tritici) in wheat (Triticum spp. cv. Madrigal). As another example, Boishebert et al. (2006) pointed out that grow- ing year was an important factor in the differentiation of yield of strawberry varieties. Extension agents, expert crop advisers and growers of Andean blackberry have reached a general consensus that optimum condi- tions for the crop are: soils with high of organic matter content and a loamy texture, altitude between1800 and2400mabove sea level, average relative humidity between 70 and 80%, average tempera- ture between 11 and 18 degree Celsius (◦C), and 1500 and 2500mm of rainfall per year (Franco and Giraldo, 2002). The goal of this research was to demonstrate that collection of data frompoor small-scale commercial producers of Andean black- berry and its analysis by means of artificial neural networks can provide growers with useful information to increase their produc- tivity. 2. Materials and methods 2.1. Data collection and compilation Corporación Biotec together with local Andean blackberry pro- ducers developed a simple aid based on a calendar which was used by the farmers to record information on the production of each lot planted to blackberries on their farm. The soil characteristics were determined by the soil and terrain evaluation methodology knownasRASTA (Rapid Soil andTerrainAssessment) (Alvarez et al., 2004) for 20 different sites in the departments of Narin˜o and Cal- das in Colombia. The information collected by the farmers on the calendars and with RASTA was then transferred to the database of the site-specific agriculture for tropical fruits (AEPS) project. This database includes information on location, landrace varieties, yield, harvest time and data on soil characteristics. A total of 488 records of yield from the database were included in the analysis. These records or “events” provided farmers’ estimates of the quantity (in kilograms) of fruit harvested per plant for weekly periods (Fig. 1). Environmental information of landscape and climate was obtained for each site from a range of secondary data sources. Topography and landscape data was extracted from the Shuttle Radar Topography Mission (SRTM) (Farr and Kobrick, 2000), using theVersion3dataset available fromtheCSI-CGIAR. Long-termaver- ages for monthly temperature and precipitation were obtained from WORLDCLIM (Hijmans et al., 2005), and daily rainfall was extracted from the 3b42 product of the Tropical Rainfall Measuring Mission (TRMM) database (Bell, 1987). 2.2. Variable selection The informationcompiled in thedatabase forAndeanblackberry consisted of 28 variables (Table 1). This information included cat- egorical variables describing geographical position (large areas for departments, specific areas for particular localities within depart- ments) and variety (thorn blackberry or normal blackberry), and environmental variables based on landscape, soil and climate (Table 1). Each yield observation was associated with the environ- mental variables taking into account the date of harvest (Fig. 1). 2.3. Computational models 2.3.1. Multilayer perceptron A multilayer perceptron (Bishop, 1995) was used to model Andean blackberry yield, in such a manner that the output of the neural network, the continuous variable yield, is determined by the28variablesweused as inputs. TheBack-propagation algorithm (Bishop, 1995) was employed in order to train the neural networks. This algorithm is a gradient descent based optimizer which mini- mizes the difference between the desired output of the model (in the training dataset) and the actual output of the network, i.e. the mean square error (MSE). In order to provide a mechanism for testing the model perfor- mance and to compare different models or network topologies, a training and a validation dataset were created by random sam- pling without replacement from the whole dataset. In this way, each training step was performed using 80% of the whole dataset, and every testing procedure to assessmodel performance,was per- formed on the remaining 20%. This method, called “split-sample” or “hold-out” validation, may assess predictive model perfor- mance, but is not recommended in its simplest form for small datasets (Goutte, 1997). However, the split sample procedure can be improved for small dataset by repeating the “split-sample” procedure several times, and then calculating the resulting per- formance as the average of all the tests made over the different validation subsets. Different “flavors” of this method have been used with artificial neural networks (Efron, 1983). These include “cross-validation”, “leave-one-out validation”, and “bootstrap val- idation”. Author's personal copy D. Jiménez et al. / Computers and Electronics in Agriculture 69 (2009) 198–208 201 Table 1 Inputs used for development of Andean blackberry yield model. Input Variable Type Abbreviation Source 1 Thorn or Normal blackberry Cata AB Thorn N AEPS 2 Narin˜o–Caldas (Large geographic area) Cata Nar-Cal AEPS 3 Narin˜o, la unión, chical Alto (specific geographic area) Cata Na un chical AEPS 4 Narin˜o, la unión,cusillo alto (specific geographic area) Cata Na un cusal AEPS 5 Narin˜o, la union, cusillo bajo (specific geographic area) Cata Na un cusba AEPS 6 Narin˜o, la unión, la jacoba (specific geographic area) Cata Na un lajac AEPS 7 Caldas Riosucio zona rural (specific geographic area) Cata Cal riosu zr AEPS 8 Altitude Conb Srtm SRTM 9 Slope Conb Slope SRTM 10 Internal drainage Conb IntDrain AEPS 11 External drainage Conb ExtDrain AEPS 12 Effective soil depth Conb EffDepth AEPS 13 Precipitable water of the harvest month Conb Trmm 0 TRMM 14 Precipitable water of the first month before harvest Conb Trmm 1 TRMM 15 Precipitable water of the second month before harvest Conb Trmm 2 TRMM 16 Precipitable water of the third month before harvest Conb Trmm 3 TRMM 17 Average temperature of the harvest month Conb TempAvg 0 WORLDCLIM 18 Temperature range of the harvest month Conb TempRang 0 WORLDCLIM 19 Accumulated precipitation of the harvest month Conb PrecAcc 0 WORLDCLIM 20 Average temperature of the first month before harvest Conb TempAvg 1 WORLDCLIM 21 Temperature range of the first month before harvest Conb TempRang 1 WORLDCLIM 22 Accumulated precipitation of the first month before harvest Conb PrecAcc 1 WORLDCLIM 23 Average temperature of the second month before harvest Conb TempAvg 2 WORLDCLIM 24 Temperature range of the second month before harvest Conb TempRang 2 WORLDCLIM 25 Accumulated precipitation of the second month before harvest Conb PrecAcc 2 WORLDCLIM 26 Average temperature of the third month before harvest Conb TempAvg 3 WORLDCLIM 27 Temperature range of the third month before harvest Conb TempRang 3 WORLDCLIM 28 Accumulated precipitation of the third month before harvest Conb PrecAcc 3 WORLDCLIM a Categorical variables. b Continuous variables. Network topology is an important issue in training a neural network model. The selection of the number of neurons in the hidden layer was made by comparing neural networks having 1,2,3,4,5,6,7,8,9 and 10 hidden units. This comparison was carried out by simple implementation of a bootstrap validation scheme (Efron, 1983). Thus, each network was tested by performing “split- sample” validations 100 times, and then the different values of the averaged MSE were compared in order to determine the network having the best performance. The topology with the lowest MSE over the validation subset had 5 units in the hidden layer neural network (Fig. 2) and was selected for further development. An ensemble of 100 networks with the selected topology but withdifferent initializationwasbuilt and tested inorder to improve the generalization capabilities of the model (Dietterich, 2000; Brown et al., 2005). Neural networks ensembles are less affected by local minima, and have been shown to outperform their single components (Yao and Liu, 1998). In our case, the source of diversity Fig. 2. MSE of artificial neural networkswith different number of neurons in hidden layer. amongmodelswas the starting point of the Back-propagation algo- rithm (random initialization), and the resulting model output was calculated by averaging the outputs of the 100 individual networks. Finally, to identify the variables which contribute most to yield; an analysis was conducted by means of the relevance metric based on sensitivity described in Satizábal and Pérez-Uribe (2007). This method assesses input relevance by calculating the partial deriva- tive of the output of the neural network ensemble with respect to each one of the inputs. Input sensitivity should reflect input relevance because the Back-propagation algorithm finds higher connection weights to inputs having more relevant information and, in the same way, attenuates connections from noisy inputs. 2.3.2. Self-Organizing Maps The Self-Organizing Map or SOM (Kohonen, 1995) is a non- supervised algorithmwhich combines clustering and visualization. The SOM maps high-dimensional datasets can be in a low- dimensional output space (generally a grid of two dimensions) with the SOM technique: observations with similar characteristics appear clustered together in the low-dimensional map produced. Such a map facilitates exploratory visual analysis of the clusters and the relationships between the variables of a complex dataset. However, a SOM does not preserve distance information. In order to address this problem the topology is disregarded, and standard clustering methods are applied to the SOM prototype vectors, and then the clusters are displayed on a lattice (Vesanto and Ahola, 1999; Barreto and Pérez-Uribe, 2007). We chose the K-means algorithm to group the observations into agivennumberofK clusters.Oneof the limitationsof this technique is the a-priori definition of the number of clusters, which is fre- quently unknown. To tackle this drawback, different K values were tested and then different groups with different number of clus- ters were calculated. The optimal number of K was then derived using the Davies–Bouldin index (Davies and Bouldin, 1979). The coordinate axes of the lattice are not clearly interpreted in terms of the original variables. Instead, variables are typically visualized Author's personal copy 202 D. Jiménez et al. / Computers and Electronics in Agriculture 69 (2009) 198–208 by a “component plane” representation, where several lattices, one for each variable, are shown side by side. A lattice with a variable- specific coloring is called a component plane. The component plane representation is useful infindingdependencies betweenvariables. These dependencies are perceived as similar patterns in identical areas of different component planes (Figs. 7–13). The dependency search can be eased by organizing the component planes such that similarplanes arepositionednear to eachother (VesantoandAhola, 1999). In the present study, a SOM was used in order to facilitate the visualization of the relations among the productivity and the 28 environmental and geographical variables, and establish the values ranges of these variables associated with high, medium and low yield. 3. Results and discussion 3.1. Model performance The neural network model was evaluated to ensure that its performance was acceptable for our purpose of determining rela- tionship between the yield of the Andean blackberry and the characteristics of sites where it was grown. To evaluate the model’s performance we computed the coefficient of determination of the real Andean blackberry’s yield and the yield predicted by themodel using only the data from the “hold-out” validation dataset (Fig. 3). The coefficient of determination (0.89) indicates that the model explained close to 90% of the total variation, which we consid- ered sufficient to proceed to the next step of determining input relevance. The fit between the real yield values and the predicted values taken from the validation data was close at the low levels of pro- duction, but was poor over the range of 69–93 (Fig. 4). At the same time, the model accurately predicted the expected yields at high yield levels. This suggests that the model can be used to determine ex ante the conditions andmanagement associatedwith high yields andhence toprovide farmerswith guidelines onhow toobtainhigh yields. In addition, the model can also determine site characteris- tics that are inevitably associated with poor crop performance and Fig. 3. Scatterplot displaying multilayer perceptron predicted yield versus real Andean blackberry yield, using only the validation dataset. these can be used to indicate to farmers that a particular site and management combination is not a viable option. 3.2. Analysis of the variables relevance We assessed the yield response to changes in the 28 variables used in the model by obtaining the sensitivity of the model output with respect to eachoneof the inputs.Weused the relevancemetric based on sensitivity described in Satizábal and Pérez-Uribe (2007), which expresses the amount of change of the output with the vari- ations of the inputs. The nine most important variables identified by the sensitivity metric were: soil depth, the average temperature of the first month before harvest, the specific geographical areas Narin˜o–la union–chical alto and Narin˜o–la union–cusillo bajo, the average temperature of the harvest month, the average tempera- ture of the second month before harvest, the average temperature of the third month before harvest, external drainage and the accu- mulated precipitation of the first month before harvest (Fig. 5). There was a moderately sharp drop off of the sensitivity after the ninth variable (see Fig. 5). A Wilcoxon test at an alpha level of 5% Fig. 4. Line with markers displaying the fitness of the predicted and real Andean blackberry yield through the observations from the validation dataset (yield values upwardly ordered). Author's personal copy D. Jiménez et al. / Computers and Electronics in Agriculture 69 (2009) 198–208 203 Fig. 5. Sensitivity distribution of the model with respect to the inputs. (Table 2) indicated that the means of this group of nine variables were significantly different (p=0.0001) from the rest of the vari- ables. Hence, the nine most important variables were selected for further analysis. 3.3. Visualization of the relations between the variables found as relevant by the sensitivity metric and clusters with similar productivity of Andean blackberry To further analyze the effects of the nine variables, a Kohonen map was trained with the same observations we employed to train themultilayerperceptron. The resultingbidimensionalmap is com- posedof vector prototypeswhich associate topological information of the original 28 variables with Andean blackberry yield (Fig. 6a). These prototypes were clustered by using the K-means algorithm. According to the Davies–Bouldin index, the map was divided into 6 clusters exhibiting similar features that influence Andean black- berry productivity (Fig. 6b). 3.4. Component planes and variable dependencies In order to improve the visualization of the dependencies between the clusters shown in theKohonenmap (Fig. 6b) the “com- ponent planes” of Andean blackberry productivity (Fig. 7a), and the variables previously identified as the most relevant for modeling Andean blackberry yield: effective soil depth (Fig. 8), the average temperature of the harvest month, the average temperature of the first, second and third months before harvest (Fig. 9a–d), two spe- cific geographic areas (Figs. 10 and 11), external drainage (Fig. 12), and the accumulated precipitation of the firstmonth before harvest (Fig. 13), were separated from the Kohonen map and displayed as lattices. 3.4.1. Productivity plane Yields greater than 1.16kg/plant/week were associated with regions in cluster 2 on the Kohonen map (Fig. 7a and b). Yield val- ues between 0.0018 and 1.16kg/plant/week correspond to clusters 1, 3, 4, 5 and 6 in the Kohonen map. Being 3, 4, 6 the clusters with lowest yields. 3.4.2. Effective soil depth Values of soil depth greater than 70 centimeters (cm) are asso- ciated with clusters 3, 4 and 6 (Fig. 8) which are all associated with low yields. In contrast, a soil depth between 40 and 70 cm appears to be related to medium to high yield clusters (1, 2, 5). The cluster with thehighestyieldshadsoil depths in the rangeof60–70 cmsug- gesting that this level of soil depth is optimal, and that an effective soil depth greater than 70 cm is not necessary to obtain high yields. Franco and Giraldo (2002) stated that for optimal Andean black- berry development, soil depth should be deep enough to allow soil moisture retention without problems of water logging. We suggest that although soil depths above 70 cm were associated with low yields in this study this is probably due to other factors associated with the deeper soils. 3.4.3. Average temperature of the harvest month and average temperatures of the first, second and third months before harvest The Kohonnen maps for temperature of the first, second and third months before harvest were similar (Fig. 9). The multilayer perceptron showed that the average temperature of the firstmonth before harvest was more important than the others temperatures (that occurs due to small differences captured to better fit the out- put). However, the similarity of the components of temperature is probably due to the low monthly variation in temperature under the equatorial conditions of this study. The similarity of the temper- ature patterns induced us to analyze them as a group rather than Table 2 Wilcoxon test at an alpha level of 5% comparing means of relevance between the nine most important variables identified by the sensitivity metric and means of the rest of variables. T T (expected value) T (variance) Z (observed value) Z (critical value) Two-tailed p-value 171.000 85.500 527.250 3.724 1.960 0.0001 Author's personal copy 204 D. Jiménez et al. / Computers and Electronics in Agriculture 69 (2009) 198–208 Fig. 6. Kohonen map showing the resulting clusters. (a) U-matrix displaying the distance among prototypes. The scale bar (right) indicates the values of distance. The upper side exhibits high distances, whilst the lower displays low distances. (b) Kohonen map displaying the 6 clusters obtained after using the K-means algorithm and the Davies–Bouldin index. separately. It is immediately evident that cluster 6 with tempera- tures of about 24 ◦C is not suitable for high yields of blackberries (Fig. 9). Clusters 1, 2 and 5 with medium to high yields are related to temperatures between 16 and 18 ◦C (Fig. 9a–d) and low yields appear to be associatedwith temperatures in the range of 14–15 ◦C. Andean blackberry experts suggest the optimal temperature for a healthy growth of this crop is between 11 and 18 ◦C. We suggest a narrower temperature range with 16–18 ◦C associated with high yields and lower yields as the temperature moves above or below this range. 3.4.4. Geographic areas as proxy for crop management Proxies can be used to estimate the effect of either immeasur- able or unobservable variables on a given phenomenon (Thomas et al., 1990; Steckel, 1995; Goodman et al., 1996; Adami et al., 1999; Filmer and Pritchett, 1999; Montgomery et al., 1999). In our study, geographical areas were integrated into the model with the aim of capturing the effect of variables that were not measured. The geographical proxies were added to the analysis specifically to take into account management and social factors which were not captured by the data collection process and which are likely to be related to the geographic location of a site. For example, farmers from a given locality are likely to use similar management practices that will differ from those used by other communities living in distant localities. The localities Narin˜o–la union–chical alto (Fig. 10), and Narin˜o–la union–cusillo bajo (Fig. 11) were Fig. 7. (a) Component plane of Andean blackberry yield, the scale bar (right) indi- cates the range value of productivity in kg/plant/week .The upper side exhibits high values of yield, whereas the lower displays low values. (b) Kohonen map displaying the resultant 6 clusters and their labels according to yield values. associated with cluster 2 which is characterized by the highest yields. Whilst the association with high yields could be a conse- quence of specific local environmental conditions not accounted for by the environmental variables used in the model, we suggest that is more likely that they are due to particular crop manage- ment practices related to local knowledge and socio-economic circumstances. In sugarcane certain groups of farmers consistently obtain higher yields than others even in the same edapho-climatic conditions (Isaacs et al., 2007). The difference is due to better man- agement by certain groups which is related to socio-economic factors including access to knowledge on optimal production prac- tices. 3.4.5. External drainage and accumulated precipitation of the first month before harvest Scrutiny of the external drainage lattice (Fig. 12) gave no obvi- ous clues as to how drainage affects the yield of blackberries. In fact medium yield in cluster 5 is associated with poor external drainage and in cluster 2 with high yields the external drainage is highly variable. However, in all clusters with medium or high yields poor external drainage is associated with low precipita- tion of the first month before harvest (Fig. 13): not only does this appear to be true from the Kohonnen maps, but it also makes agronomic sense. Good external drainage is evidently more impor- tant when rainfall is greater. This example clearly indicates how the visual inspection of the Kohonnen maps can assist in under- standing how various factors effect the growth and development Author's personal copy D. Jiménez et al. / Computers and Electronics in Agriculture 69 (2009) 198–208 205 Fig. 8. Component plane of effective soil depth. The scale bar (right) indicates the range value in centimeters of soil depth, the upper side of the scale exhibits high values, whereas the lower displays low values. of the crop and the interactions between them. Further inspec- tion of Figs. 12 and 13 indicate that excellent external drainage is not sufficient to overcome the effects of high or moderate pre- cipitation with moderate external drainage in cluster 3. Overall there was a tendency for low rainfall to be advantageous but there were exceptions. However, when the two variables, precip- itation of the first month before harvest and external drainage are taken together it is clear that low rainfall accompanied with varied external drainage conditions can provide good yields, but that heavier precipitation of the first month before harvest with poor drainage is not conducive to high levels of productiv- ity. Fig. 9. Components planes of the averages temperature: (a) temperature of the harvest month, (b) average temperature of the first month before harvest, (c) average temperature of the second month before harvest, and (d) average temperature of the third month before harvest. In all figures, the scale bar (right) indicates the range value in ◦C of temperature. The upper side exhibits high values, whereas the lower displays low values. Author's personal copy 206 D. Jiménez et al. / Computers and Electronics in Agriculture 69 (2009) 198–208 Fig. 10. Component plane of the specific geographic area Narin˜o–la union–chical alto. The highest values indicate presence and the lowest absence as they are categorical variables. Fig. 11. Component plane of the specific geographic area Narin˜o–la union–cusillo bajo. The highest values indicate presence and the lowest absence as they are categorical variables. Fig. 12. Component plane of external drainage. In the scale bar (right), the highest value 3 indicates excellent or fast drainage, 2 moderate drainage, and 1 poor or slow drainage. Author's personal copy D. Jiménez et al. / Computers and Electronics in Agriculture 69 (2009) 198–208 207 Fig. 13. Component plane of the accumulated precipitation of the first month before harvest. The scale bar (right) indicates the range value in millimeters of rainfall, the upper side of the scale exhibits high values, whereas the lower displays low values. 4. Conclusions Data collected by small farmers in the Andes couple with information fromexisting data baseswas successfully used to char- acterize specific production events and to relate production to site and time specific events. The analysis approach focuses first on identifying those variables that explain most of the yield variabil- ity by means of artificial neural networks (multilayer perceptron), and then using the Self-Organizing Maps as a tool for dimension- ality reduction and visualization of input–input and input–output dependencies. Artificial neural networks were found to be an effective tool for managing the highly variable, noisy, and qualitative nature of agri- cultural information collected by farmers and linked to publicly available climate databases. Multilayer perceptrons were used to develop a model based on dataset with 28 variables. This model explained close to 90% of the variation in a validation set. Sensi- tivity analysis was used to identify the most important variables in determining variation in yield. Self-Organizing Maps were then used to group Andean blackberry yield from different sites accord- ing to similarity of growth conditions and management. Data was not available to directly evaluate management practices, so local- ities were used as a proxy for management. The SOM provided a straightforward manner to visualize the distribution of the vari- ables that affected yield. “Component planes” generated by SOM illustrated the association of these variables with yield and iden- tified two geographic areas as highly productive. The optimal conditions for high yields are an average temperature between 16 and 18 ◦C, an effective soil depth between 60 and 70 cm, and low rainfall during the first month before harvest in poor exter- nal drainage locations or moderate to low rainfall in better drained areas. The identification of geographic areas with higher yields than those that would be expected solely from the environmen- tal conditions suggests that the farmers in those geographical areas were managing their crops particularly effectively. How- ever, there was not sufficient information to precisely determine which management factors led to the high yields. At the same time the mere identification of areas with farmers that prop- erly manage their crops, offers the chance for these farmers to disseminate their knowledge to other farmers with sim- ilar environmental conditions so that they too can improve yields. Acknowledgements Thiswork is part of a cooperation project between BIOTEC, CIAT, and HEIG-VD (Switzerland) named “Precision agriculture and the construction of field-crop models for tropical fruits”. The econom- ical support is given by several institutions in Colombia (MADR, COLCIENCIAS, ACCI) and the State Secretariat for Education and Research (SER) in Switzerland. References Adami, J., Gridley, G., Nyren, O., Dosemeci, M., Linet, M., Glimelius, B., Ekbom, A., Zahm, S.H., 1999. Sunlight and non-Hodgkin’s lymphoma: a population-based cohort study in Sweden. Int. J. Cancer 80, 641–645. Alvarez, D.M., Estrada, M., Cock, J.H., 2004. RASTA (Rapid Soil and Terrain Assess- ment). Universidad Nacional De Colombia, Palmira, Colombia. Barreto, M., Jiménez, D.R., Pérez-Uribe, A., 2007. Tree-structured Self-Organizing Map component planes as a visualization tool for data exploration in agro- ecological modelling. In: Proceedings of the 6th European Conference on Ecological Modelling (ECEM’07), Trieste, Italy, pp. 55–56. Barreto,M., Pérez-Uribe, A., 2007. Improving the correlation hunting in a large quan- tity of SOM component planes. In: Proceedings of the International Conference on Artificial Neural Networks (ICANN 07), Porto, Portugal, pp. 379–388. Bell, T.L., 1987. Space-time stochastic model of rainfall for satellite remote-sensing studies. J. Geophys. Res.-Atmos. 92, 9631–9643. Bioversity International, 2005. Information Sheet on Rubus glaucus in New World Fruits Database. URL: http://www.bioversityinternational.org/ Information Sources/Species Databases/New World Fruits Database/. Accessed July 16, 2008. Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford. Boishebert, d.V., Giraudel, J.L., Montury, M., 2006. Characterization of strawberry varietiesbySPME–GC–MSandKohonenself-organizingmap.Chemometr. Intell. Lab. Syst. 80, 13–23. Brown, G., Wyatt, J.L., Harris, R., Yao, X., 2005. Diversity creation methods: a survey and categorisation. Inform. Fusion 6 (1), 5–20. Chon, T.S., Park, Y.S., Moon, K.Y., Cha, E.Y., 1996. Patternizing communities by using an artificial neural network. Ecol. Model. 90, 69–78. Cock, J., 2007. Sharing commercial information. In: Innovation Work- shop for the Agricultural Sector: Site Specific Agriculture based on Sharing Farmers Experiences, CIAT, Cali, Colombia, October, URL: http://biotec.univalle.edu.co/Memorias.htm. Davies, D.L., Bouldin, D.W., 1979. A cluster separationmeasure. IEEE. T. Pattern. Anal. 1, 95–104. Dietterich, T.J., 2000. Ensemble methods in machine learning. In: Multiple Classifier Systems First International Workshop (MCS 2000), Cagliari, Italy, pp. 1–15. Efron, B., 1983. Estimating the error rate of a prediction rule: improvement on cross- validation. J. Am. Stat. Assoc. 78 (382), 316–331. Farr, T.G., Kobrick, M., 2000. Radar topography mission produces a wealth of data American geophysical. Union Eos. 81, 583–585. Filmer, D., Pritchett, L., 1999. The effect of household wealth on educational attain- ment: evidence from 35 countries. Popul. Dev. Rev. 25, 85–120. Author's personal copy 208 D. Jiménez et al. / Computers and Electronics in Agriculture 69 (2009) 198–208 Franco, G., Giraldo, M., 2002. Condiciones ambientales del cultivo de la mora. In: Corporacion colombiana de investigacion agropecuaria, regional nueve (Eds.), El cultivo de la mora, CORPOICA, Manizales, pp. 1–3. Giraudel, J.L., Lek, S., 2001. A comparison of self-organizing map algorithm and some conventional statistical methods for ecological community ordination. Ecol. Model. 146, 329–339. Goodman, k., Correa, P., Tengana, H.J., Ramirez, H., DeLany, J.P., Pepinosa, O.G., Quin˜ones, M., Parra, T., 1996. Helicobacter pylori infection in the Colombian Andes: a population-based study of transmission pathways. Am. J. Epidemiol. 144, 290–299. Goutte, C., 1997. Note on free lunches and cross-validation. Neural. Comput. 9 (6), 1245–1249. Hashimoto, Y., 1997. Applications of artificial neural networks and genetic algo- rithms toagricultural systems.Comput. Electron.Agric. 18, 71–72 (special issue). Hijmans,R.J., Cameron, S.E., Parra, J.L., Jones, P.G., Jarvis,A., 2005.Veryhigh resolution interpolated climate surfaces for global land areas. Int. J. Clim. 25, 1965–1978. Isaacs, C.H., Carbonell, J.A., Amaya, A., Torres, J.S., Victoria, J.I., Quintero, R., Palma, A.E., Cock, J.H., 2007. Site specific agriculture and productivity in the Colombian sugar industry. In: Proceedings of the 26th congress International Society of Sugar Cane Technologists (ISSCT), Durban, South Africa. Jain, A., 2003. Predicting air temperature for frost warning using artificial neural networks. Thesis. Institute for Artificial Intelligence, The University of Georgia, USA. Jiménez, D.R., Pérez-Uribe, A., Satizabal, H.F., Barreto, M., Van Damme, P., Tomassini, M., 2008. A survey of artificial neural network-based. modeling in agroecol- ogy. In: Prasad, B. (Ed.), Softcomputing Applications in industry. Springer, Berlin, Heidelberg, pp. 247–269. Jiménez, D.R., Satizábal, H.F., Pérez-Uribe Andrés, 2007. Modelling sugar cane yield using artificial neural networks. In: Proceedings of the 6th European Conference on Ecological Modelling (ECEM’07), Trieste, Italy, pp. 244–245. Kohonen, T., 1995. Self-Organizing Maps. Springer, USA. Miao, Y., Mulla, D.J., Robert, P.C., 2006. Identifying important factors influencing corn yield and grain quality variability using artificial neural networks. Precision Agric. 7, 117–135. Montgomery, M.R., Gragnolati, M., Burke, K.A., Paredes, E., 1999. Measuring living standards with proxy variables. Demography 37, 155–174. Moshou, D., Bravo, C.,West, J.,Wahlen, S.,McCartney, A., Ramon,H., 2004. Automatic detection of ‘yellow rust’ in wheat using reflectance measurements and neural networks. Comput. Electron. Agric. 44, 173–188. Murase, H., 2000. Artificial intelligence in agriculture. Comput. Electron. Agric. 29, 1–2 (special issue). Nagendra, S.M.S., Khare, M., 2006. Artificial neural network approach for modelling nitrogen dioxide dispersion from vehicular exhaust emissions. Ecol. Model. 190, 99–115. National Research Council, 1989. Lost Crops of the Incas: Little Known Plants of the Andes with Promise for Worldwide Cultivation. National Academy Press, Washington, DC, USA, 415 pp. Niederhauser, N., Oberthür, T., Kattnig, S., Cock, J., 2008. Information and its man- agement for differentiation of agricultural products: the example of specialty coffee. Comput. Electron. Agric. 61 (2), 241–253. Noble, P.A., Tribou, E.H., 2007. Neuroet: an easy-to-use artificial neural network for ecological and biological modeling. Ecol. Model. 203, 87–98. Paul, P.A., Munkvold, G.P., 2005. Regression and artificial neural network mod- eling for the prediction of gray leaf spot of maize. Phytopathology 95, 388–396. Sargent, D.J., 2001. Comparison of artificial neural networks with other statistical approaches. Cancer Suppl. 91 (8), 1636–1642. Satizábal, H.F., Pérez-Uribe, A., 2007. Relevance metrics to reduce input dimensions. In: Proceedings of the International Conference on Artificial Neural Networks (ICANN 07), Porto, Portugal, pp. 39–48. Schultz, A., Wieland, R., 1997. The use of neural networks in agroecological mod- elling. Comput. Electron. Agric. 18, 73–90. Schultz, A., Wieland, R., Lutze, G., 2000. Neural networks in agroecological modeling—stylish application or helpful tool? Comput. Electron. Agric. 29, 73–97. Sora, D.S., Fischer, G., Florez, R., 2006. Refrigerated storage of mora de castilla (Rubus glaucus) fruits in modified atmosphere packaging. Agronomia Colombiana 24 (2), 306–316. Steckel, R.H., 1995. Stature and standard of living. J. Econ. Lit. 33, 1903–1940. Thomas, D., Strauss, J., Henriques, M., 1990. Child survival, height for age and house- hold characteristics in Brazil. J. Dev. Econ. 33, 197–234. Vesanto, J., Ahola, J., 1999. Hunting for correlations in data using the self-organizing map. In: Proceedings of the International ICSC Congress on Computational Intel- ligence Methods and Applications (CIMA), pp. 279–285. Yao, X., Liu, Y., 1998. Making use of population information in evolution- ary artificial neural networks. IEEE Trans. Syst. Man Cybern. B 28 (3), 417–425.