Advancing multi-environment genomic prediction with explainable deep learning in apple
Abstract
Multi-environment genomic prediction is a useful tool in plant breeding for estimating the breeding values of genotypes across diverse environments. Accurate prediction requires methods that integrate phenotypic, genotypic, and environmental data effectively. The heterogeneous structure of this data makes its analysis challenging, but the modularity of deep learning methods makes them well suited to this complexity. Here, we present an explainable multimodal deep learning method for genomic prediction on a multi-year and multi-environment apple REFPOP dataset of eleven quantitative traits. To implement the modelling approach, genotypic data was subjected to feature selection to reduce its dimensionality and improve training performance, while environmental data was processed as daily mean values. To make effective use of the environmental time-series data, our model employed long short-term memory (LSTM) layers, alongside dense layers for the other data inputs. The different data types were processed through separate multi-layer streams within the architecture and concatenated just before the final regression output layer. In a five-fold cross-validation repeated five times, the proposed methodology outperformed its statistical counterparts for three of the eleven traits in the dataset. These traits were harvest date, titratable acidity and red over colour, for which predictive ability, measured as Pearson's correlation coefficient r, increased by 0.05, 0.08 and 0.09, respectively. For the remaining eight traits, performance was similar to that of the compared statistical models. We also incorporated an approach to explain the model predictions based on Shapley additive explanations, commonly referred to as SHAP values. Using this approach, we pinpointed the most important genetic variants as well as the time frames during which environmental variables influence trait predictions.
Given the increasing amount of data generated in every field, our results provide a framework for integrating differently structured data and producing accurate and interpretable predictions using deep learning-based multi-environment genomic prediction models.
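
The multi-stream design described in the abstract (a dense stream for selected genotypic features, an LSTM stream for daily environmental time series, and concatenation just before the regression output) can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: it uses plain NumPy with random, untrained weights, a single layer per stream instead of the multi-layer streams mentioned above, and synthetic inputs. All array sizes, the `params` dictionary, and the function names `dense_relu`, `lstm_last_state` and `predict` are hypothetical and chosen only to show how the data flows.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dense_relu(x, w, b):
    """One fully connected layer with ReLU activation."""
    return np.maximum(x @ w + b, 0.0)

def lstm_last_state(seq, w_x, w_h, b):
    """Run one LSTM layer over seq (batch, time, features); return the last hidden state."""
    batch, steps, _ = seq.shape
    units = w_h.shape[0]
    h = np.zeros((batch, units))
    c = np.zeros((batch, units))
    for t in range(steps):
        gates = seq[:, t, :] @ w_x + h @ w_h + b      # (batch, 4 * units)
        i, f, g, o = np.split(gates, 4, axis=1)       # input, forget, candidate, output
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # update cell state
        h = sigmoid(o) * np.tanh(c)                   # update hidden state
    return h

# Hypothetical sizes (assumptions, not taken from the study).
n_samples, n_snps, n_days, n_env_vars = 8, 50, 30, 4
geno_units, lstm_units = 16, 12

# Synthetic inputs: allele dosages (0, 1, 2) and daily mean environmental values.
snps = rng.integers(0, 3, size=(n_samples, n_snps)).astype(float)
weather = rng.normal(size=(n_samples, n_days, n_env_vars))

# Random, untrained weights; a real model would learn these from data.
params = {
    "geno_w": rng.normal(scale=0.1, size=(n_snps, geno_units)),
    "geno_b": np.zeros(geno_units),
    "lstm_wx": rng.normal(scale=0.1, size=(n_env_vars, 4 * lstm_units)),
    "lstm_wh": rng.normal(scale=0.1, size=(lstm_units, 4 * lstm_units)),
    "lstm_b": np.zeros(4 * lstm_units),
    "out_w": rng.normal(scale=0.1, size=(geno_units + lstm_units, 1)),
    "out_b": np.zeros(1),
}

def predict(snps, weather, p):
    geno_feat = dense_relu(snps, p["geno_w"], p["geno_b"])            # genotype stream
    env_feat = lstm_last_state(weather, p["lstm_wx"], p["lstm_wh"], p["lstm_b"])  # environment stream
    fused = np.concatenate([geno_feat, env_feat], axis=1)             # late fusion
    return (fused @ p["out_w"] + p["out_b"]).ravel()                  # linear regression output

preds = predict(snps, weather, params)
print(preds.shape)  # (8,)
```

Keeping each data type in its own stream until the final concatenation means that a stream can be changed or a new input type added without altering the others, which is the modularity the abstract refers to.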
