PanGBank: depicting microbial species diversity via PPanGGOLiN
Abstract
By collecting and comparing genomic sequences, many studies focus on the overall gene content of a species (i.e. the pangenome [1]) to understand its evolution in terms of core and accessory parts. The core genome is defined as the set of genes shared by all the organisms of a taxonomic unit (generally a species). The accessory part (variable regions) is crucial for understanding the adaptive potential of bacteria and contains genomic regions that are exchanged between strains by horizontal gene transfer (HGT). However, this dichotomy is not robust to poorly sampled data: whether a gene counts as core can hinge on its presence or absence in a single genome, and two categories cannot faithfully report the diverse range of gene frequencies in a pangenome. Moreover, this approach treats genomes as isolated gene sets and neglects their chromosomal organization, despite its major importance for studying HGT. Here, we introduce a compact model of multiple genomes: a graph built from genes clustered into gene families, coupled with a statistical partitioning method.
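The fragility of the strict core/accessory split can be illustrated with a minimal sketch. The genome names, family names, and the `core_and_accessory` helper are hypothetical and chosen only for illustration; they are not part of PPanGGOLiN.

```python
from collections import Counter

def core_and_accessory(genomes):
    """Split gene families into strict core (present in every genome) and accessory."""
    counts = Counter(fam for fams in genomes.values() for fam in set(fams))
    n_genomes = len(genomes)
    core = {fam for fam, c in counts.items() if c == n_genomes}
    accessory = set(counts) - core
    return core, accessory

# Toy data (assumed): each genome is reduced to its set of gene families.
genomes = {
    "g1": {"famA", "famB", "famC"},
    "g2": {"famA", "famB", "famD"},
    "g3": {"famA", "famB"},
}
core, accessory = core_and_accessory(genomes)

# Adding a single genome that lacks famB (e.g. an incomplete assembly)
# is enough to remove famB from the strict core.
genomes_plus = dict(genomes, g4={"famA", "famC"})
core_plus, _ = core_and_accessory(genomes_plus)

print(sorted(core), sorted(accessory), sorted(core_plus))
```

In this toy case one added genome halves the core, which is the sensitivity to single organisms that motivates a frequency-aware partitioning.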
The PPanGGOLiN method merges the chromosomal links between neighboring genes to build a neighborhood graph of gene families, in which each edge is weighted by the number of genomes covering it. In addition to the graph, the pangenome is modeled as a binary presence/absence matrix whose rows correspond to gene families and whose columns correspond to genomes (1 if the genome contains at least one gene of the family, 0 otherwise). The pangenome is then partitioned by estimating, through an Expectation-Maximization algorithm, the best parameters of a Bernoulli Mixture Model (BMM). This approach divides a pangenome into three partitions: (1) the persistent genome, equivalent to a relaxed core genome (genes conserved in all but a few genomes); (2) the shell genome, genes with intermediate frequencies, corresponding to moderately conserved genes potentially associated with environmental adaptation capabilities; (3) the cloud genome, genes found at very low frequency. Finally, the partitions are overlaid on the neighborhood graph to obtain a Partitioned Pangenome Graph (PPG).
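The pipeline described above can be sketched on toy data. This is a simplified illustration under stated assumptions, not the PPanGGOLiN implementation: contigs are treated as linear, each mixture component uses a single presence probability shared across genomes (a simplification of a full BMM), the starting values 0.9/0.5/0.1 are arbitrary, and all function and variable names are hypothetical.

```python
import math
from collections import defaultdict

LABELS = ("persistent", "shell", "cloud")

def build_graph_and_matrix(genomes):
    """genomes maps a genome name to its gene families in chromosomal order."""
    names = sorted(genomes)
    covering = defaultdict(set)  # family pair -> genomes where the pair is adjacent
    for name in names:
        order = genomes[name]
        for a, b in zip(order, order[1:]):
            if a != b:
                covering[frozenset((a, b))].add(name)
    weights = {edge: len(gs) for edge, gs in covering.items()}
    families = sorted({f for order in genomes.values() for f in order})
    matrix = {f: [int(f in genomes[n]) for n in names] for f in families}
    return weights, matrix

def clamp(x, eps=1e-6):
    """Keep probabilities away from 0 and 1 so that log() stays finite."""
    return min(max(x, eps), 1 - eps)

def partition(matrix, n_iter=50, init=(0.9, 0.5, 0.1)):
    """EM for a 3-component Bernoulli mixture, one presence probability per component."""
    rows = list(matrix.items())
    n = len(rows[0][1])
    p, pi = list(init), [1 / 3] * 3
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each gene family.
        resp = []
        for _, row in rows:
            s = sum(row)
            logs = [math.log(pi[k]) + s * math.log(p[k])
                    + (n - s) * math.log(1 - p[k]) for k in range(3)]
            top = max(logs)
            w = [math.exp(v - top) for v in logs]
            resp.append([x / sum(w) for x in w])
        # M-step: re-estimate mixing weights and presence probabilities.
        for k in range(3):
            nk = max(sum(r[k] for r in resp), 1e-12)
            present = sum(r[k] * sum(row) for r, (_, row) in zip(resp, rows))
            pi[k] = clamp(nk / len(rows))
            p[k] = clamp(present / (nk * n))
    return {fam: LABELS[max(range(3), key=lambda k: r[k])]
            for (fam, _), r in zip(rows, resp)}

# Toy pangenome (assumed): 20 genomes, 5 families everywhere, a 2-family
# island in every other genome, and 4 families each seen in one genome.
genomes = {}
for i in range(20):
    order = ["P1", "P2", "P3", "P4", "P5"]
    if i % 2 == 0:
        order[2:2] = ["S1", "S2"]  # island inserted between P2 and P3
    if i < 4:
        order.append(f"C{i}")
    genomes[f"g{i:02d}"] = order

weights, matrix = build_graph_and_matrix(genomes)
partitions = partition(matrix)

# Overlay: annotate each weighted edge with the partitions of its endpoints.
ppg = {edge: (w, sorted(partitions[f] for f in edge)) for edge, w in weights.items()}
```

Here the edge between P1 and P2 is covered by all 20 genomes, while the edges entering the S1/S2 island are covered by 10, so the overlaid graph shows both the frequency class of each family and where variable regions branch off the conserved backbone.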
This method was applied to all the genomes available in the GenBank database (encompassing ~600 species and ~200 000 genomes) to obtain a database of PPGs. This in-development resource, called PanGBank, provides a wide view of the range of gene frequencies and chromosomal topologies across the microbial world through an API and a web visualization tool dedicated to browsing PPGs. In the context of massive comparative genomics, drawing genomes on rails like a subway map may help biologists compare their genomes of interest to the overall pangenomic diversity.
References
[1] Tettelin, H., Masignani, V., Cieslewicz, M. J., Donati, C., Medini, D., Ward, N. L., ... & DeBoy, R. T. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proceedings of the National Academy of Sciences, 102(39), pages 13950-13955, 2005.
[2] Ambroise C., Dang M. and Govaert G. Clustering of spatial data by the EM algorithm. geoENV-I-Geostatistics for environmental applications, pages 493-504, 1997.