Predicting intercrop yield, from data gathering to modeling: a workflow
Abstract
Gathering the results of multiple, independently designed experiments is a necessary yet uncommon approach to gaining knowledge in agronomy. Observing crop responses to various soil and climate conditions is necessary to generalize results obtained at single locations and to enable the development of predictive approaches to crop performance. However, experimental results produced for different goals are heterogeneous, and combining them is challenging because of differences in experimental design, sampling, measured features, and data formats.
In crop science, growing two species together in the same field (intercropping) has attracted growing interest. Cereal-legume intercrops are a particularly promising type of mixture, and the number of field experiments involving them has increased in recent decades.
In this study, we describe the workflow we followed to address the challenges that arise between gathering heterogeneous experiments and building predictive models of intercrop yield.
We gathered results from 22 factorial experiments conducted in diverse environmental conditions (5 locations x 15 years, 8777 observations). The main crop features (yield, height, shoot biomass) were collected in both intercropping and sole cropping conditions. We first developed an R package to combine the experimental data, using a single data format and versioning to track the modifications introduced by data curation. To handle the heterogeneous sampling across experiments, we used smoothing splines to fit the growth dynamics of height and shoot biomass, from which we derived key features of plant growth dynamics (i.e. maximum growth rate, lag phase).
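The spline step can be illustrated with a minimal Python sketch. It is not the authors' R implementation: the function name `growth_features`, the use of SciPy's `UnivariateSpline`, and the definition of the lag phase (the time at which the tangent at the point of maximum growth rate crosses the initial value) are all assumptions made for illustration.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline


def growth_features(days, values, smoothing=0.0, n_grid=500):
    """Fit a cubic smoothing spline to one growth series and derive features.

    `smoothing` is passed to SciPy as `s`: 0 interpolates the data,
    larger values give a smoother curve. (Hypothetical helper.)
    """
    days = np.asarray(days, dtype=float)
    values = np.asarray(values, dtype=float)
    spline = UnivariateSpline(days, values, k=3, s=smoothing)

    # Evaluate the first derivative (growth rate) on a fine, regular grid
    # so that series sampled on different dates become comparable.
    grid = np.linspace(days.min(), days.max(), n_grid)
    rate = spline.derivative()(grid)
    i_max = int(np.argmax(rate))
    max_rate = float(rate[i_max])
    t_max = float(grid[i_max])

    # Assumed lag definition: where the tangent at maximum growth rate
    # intersects the fitted initial value.
    if max_rate > 0:
        gain = float(spline(t_max)) - float(spline(grid[0]))
        lag = t_max - gain / max_rate
    else:
        lag = float("nan")

    return {"max_rate": max_rate, "time_of_max_rate": t_max, "lag": lag}
```

Called with the sampling dates and measured heights of one plot, for example `growth_features(days, height)`, the function returns features that no longer depend on when each experiment happened to sample. With noisy field data, a positive `smoothing` value would normally be chosen.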
Our goal was to predict intercrop yield as a function of plant-plant interactions and environmental variables. For plant-related features, we focused on a selected set of predictors rather than using all available data, drawing on concepts from community ecology. For environmental variables, we used regression models designed for functional data to estimate the support of the functions linking environmental variables to crop yield, i.e. the time intervals over which these variables influence yield. This step made it possible to reduce the dimensionality of the climate and soil variables and to obtain a set of environment-related predictors that better reflect crop physiology.
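The idea of reducing a daily environmental series to a time interval can be sketched as follows. This is a deliberately simpler stand-in, an exhaustive window search, and not the functional regression models used in the study. The function name `best_window`, the minimum window length, and the use of squared correlation as the criterion are illustrative assumptions.

```python
import numpy as np


def best_window(climate, yields, min_len=5):
    """Find the time window whose mean climate value best explains yield.

    climate: array of shape (n_trials, n_days), one daily series per trial.
    yields:  array of shape (n_trials,).
    Returns (start, end, r2) with `end` exclusive. (Hypothetical helper.)
    """
    climate = np.asarray(climate, dtype=float)
    yields = np.asarray(yields, dtype=float)
    n_trials, n_days = climate.shape

    # Cumulative sums give the mean over any window in constant time.
    csum = np.concatenate(
        [np.zeros((n_trials, 1)), np.cumsum(climate, axis=1)], axis=1
    )

    best = (None, None, -1.0)
    for start in range(n_days - min_len + 1):
        for end in range(start + min_len, n_days + 1):
            window_mean = (csum[:, end] - csum[:, start]) / (end - start)
            if np.std(window_mean) == 0:
                continue  # constant predictor, correlation undefined
            r2 = np.corrcoef(window_mean, yields)[0, 1] ** 2
            if r2 > best[2]:
                best = (start, end, float(r2))
    return best
```

The selected window turns a full daily series into a single predictor per trial (the mean over that window). Because the search tests many windows, the resulting fit is optimistic and would need to be checked on held-out data.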
Finally, we combined these two sets of predictors to build machine learning models, and obtained encouraging results on both the training and the evaluation data sets.