
Statistical learning for OTUs identification

Mohamed Anwar Abouabdallah 1, 2 Olivier Coulaud 3 Alain A. Franc 4, 2 Nathalie Peyrard 1 
2 PLEIADE - from patterns to models in computational biodiversity and biotechnology
LaBRI - Laboratoire Bordelais de Recherche en Informatique, Inria Bordeaux - Sud-Ouest, BioGeCo - Biodiversité, Gènes & Communautés
3 HiePACS - High-End Parallel Algorithms for Challenging Numerical Simulations
LaBRI - Laboratoire Bordelais de Recherche en Informatique, Inria Bordeaux - Sud-Ouest
Abstract : Molecular-based inventories are now made routinely with metabarcoding. However, comparisons with optical-based inventories are scarce for micro-organisms. Here, we study whether a morphology-based taxonomy and unsupervised clustering of amplicons on the same dataset provide the same picture of diversity. For OTU building, we implement both hierarchical agglomerative clustering (HAC) and a novel approach based on Stochastic Block Models (SBM). Plants are among the best-known organisms (both botanically and through molecular phylogenies). Therefore, we use a dataset of amplicons (trnH-psbA) from 1502 trees in an experimental plot in French Guiana, covering a large spectrum of botanical diversity and identified by field botanists. We study whether the convergence or divergence of the three classifications depends on the taxonomic level addressed (order, family, genus). For HAC, we test several aggregation methods. For SBM, we use a Poisson probability distribution to model the pattern of distances between sequences. Finally, we compare the three classifications by building contingency tables. Preliminary results show that the convergence of the three methods depends on the distribution of intra- and inter-class distances. For instance, in Magnoliales these distances are well differentiated and convergence is very good, whereas in Gentianales the distances are not well differentiated and convergence is poor. Moreover, the SBM provides a matrix of parameters that quantify the connections between classes. This matrix is an excellent candidate for a multivariate index of diversity, richer than a scalar one. Finally, we will discuss how this approach scales to metabarcoding.
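As an illustration of the HAC and contingency-table steps described in the abstract, the following is a minimal sketch in plain Python, not the authors' implementation. The toy sequences, the taxon names, the Hamming distance, and the helper names `hamming`, `hac` and `contingency` are all assumptions made for this example; the abstract does not state which distance or which aggregation methods were used, and the SBM step is not covered here.

```python
from collections import Counter
from itertools import combinations


def hamming(s, t):
    """Number of mismatching positions between two equal-length strings."""
    return sum(a != b for a, b in zip(s, t))


def hac(dist, n_clusters, linkage="average"):
    """Naive agglomerative clustering on a symmetric distance matrix.

    linkage is 'single', 'complete' or 'average' (three common
    aggregation methods, chosen here for illustration only).
    Returns one integer cluster label per item.
    """
    agg = {
        "single": min,
        "complete": max,
        "average": lambda d: sum(d) / len(d),
    }[linkage]
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest linkage distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: agg(
                [dist[i][j] for i in clusters[p[0]] for j in clusters[p[1]]]
            ),
        )
        # a < b always holds, so deleting index b leaves index a valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = [0] * len(dist)
    for k, members in enumerate(clusters):
        for i in members:
            labels[i] = k
    return labels


def contingency(labels_a, labels_b):
    """Count how often each pair (label in A, label in B) co-occurs."""
    return Counter(zip(labels_a, labels_b))


# Hypothetical toy data: six short sequences with a made-up taxonomy.
seqs = ["AAAAAA", "AAAAAT", "AAAATT", "GGGGGG", "GGGGGC", "GGGGCC"]
taxonomy = ["taxon_X"] * 3 + ["taxon_Y"] * 3
dist = [[hamming(s, t) for t in seqs] for s in seqs]

for method in ("single", "complete", "average"):
    table = contingency(taxonomy, hac(dist, 2, method))
    print(method, dict(table))
```

In this toy case the within-group distances are much smaller than the between-group distances, so every aggregation method recovers the two made-up taxa and the contingency table has no off-diagonal counts. When intra- and inter-class distances overlap, the table would spread across several cells.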
Document type :
Conference papers
Contributor : Nathalie Peyrard
Submitted on : Thursday, September 17, 2020 - 11:36:03 AM
Last modification on : Friday, September 23, 2022 - 4:48:06 PM


  • HAL Id : hal-02941708, version 1



Mohamed Anwar Abouabdallah, Olivier Coulaud, Alain A. Franc, Nathalie Peyrard. Statistical learning for OTUs identification. ISEC 2020 - International Statistical Ecology Conference, Jun 2020, Sydney / Virtual, Australia. ⟨hal-02941708⟩


