PPanGGOLiN: Depicting microbial diversity via a Partitioned Pangenome Graph

Guillaume Gautreau; Christophe Ambroise; Catherine Matias; Amandine Perrin; Rémi Planel; Marie Touchon; Claudine Médigue; Eduardo Rocha; Stéphane Cruveiller; David Vallenet

Abstract

Motivations: By collecting and comparing all the genomic sequences of a species, pangenomics studies focus on overall genomic content to understand genome evolution in terms of both core and accessory parts (Tettelin et al. 2005). The core genome is defined as the set of genes shared by all the organisms of a taxonomic unit (generally a species), whereas the accessory part (also called variable or peripheral regions) is crucial to understanding the adaptive potential of bacteria and contains genomic regions that can be exchanged between strains by horizontal transfer (i.e. the mobilome, Frost et al. 2005). Core genes are most often defined as the set of ubiquitous genes in a clade (Tettelin et al. 2005, Vieira et al. 2011). However, this definition has two major flaws: (i) it is not robust to poorly sampled data because it depends heavily on whether a single organism is included in or absent from the dataset; (ii) it misses many core genes (false negatives) because of the high probability of losing at least one of the core genes through sequencing, assembly or annotation artifacts. The presence in the dataset of variants that lack a gene because the associated function is socialized [sic] in a community (see the Black Queen Hypothesis, Morris et al. 2012) can also shrink the core genome. As pointed out by Acevedo-Rocha et al. (2013), "functional ubiquity cannot be equated to sequence/structural ubiquity"; the core genome definition has thus been criticized as too conservative to be useful (Tonder et al. 2014). As a consequence, Acevedo-Rocha et al. (2013) propose to focus instead on "persistent" genes, namely genes that are conserved in a majority of genomes. Equivalent terms for 'persistent' have also been introduced, such as 'soft core' (Contreras-Moreira et al. 2013), 'extended core' (Lapierre et al. 2009, Bolotin et al. 2017) and 'stabilome' (Vesth et al. 2010).
This definition calls for a threshold on the frequency of occurrence of a gene among the genomes of a clade, above which the gene is declared persistent (generally gene families present in at least 90% to 95% of the genomes). This approach gives an attractive answer to the issues raised by the original definition of the core genome, but it has its own disadvantage: an appropriate threshold has to be chosen. Moreover, the usual dichotomy between core and accessory genome does not faithfully reflect the diverse ranges of gene frequencies in a pangenome. The gene frequency distribution in pangenomes is extensively documented (Lapierre et al. 2009, Collins et al. 2012, Lobkovsky et al. 2013, Bolotin et al. 2015, Bolotin et al. 2017). These studies argue for the existence of an equilibrium between gene acquisition and gene loss, leading to an asymmetric U-shaped distribution of gene frequencies regardless of the phylogenetic level and the clade considered (with the exception of non-homogeneous species (Moldovan et al. 2018)). The left, bottom and right sides of the U correspond respectively to the rare, moderately present and highly frequent gene families. Thus, as proposed by Koonin (2008) and formally modeled by Collins et al. (2012), the pangenome can be split into three groups. This choice helps to shed light on genes putatively associated with positive environmental adaptations while avoiding confusing them with potentially randomly acquired ones. For that purpose, the partitioning approach that we propose here divides the pangenome into (1) the persistent genome, equivalent to a relaxed core genome (genes conserved in almost all genomes); (2) the shell genome, genes with intermediate frequencies corresponding to moderately conserved genes potentially associated with environmental adaptation capabilities; (3) the cloud genome, genes found at a very low frequency.
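To make the threshold-based definitions discussed above concrete, the following minimal sketch classifies gene families by their frequency across genomes. It illustrates the fixed-threshold baseline only, not the statistical partitioning introduced in this work. The function name `partition_by_frequency`, the toy data, the 0.90 persistent cutoff (taken from the low end of the 90% to 95% range quoted above) and the 0.15 cloud cutoff are all illustrative assumptions.

```python
def partition_by_frequency(presence, n_genomes, persistent_min=0.90, cloud_max=0.15):
    """Assign each gene family to a partition from its frequency alone.

    presence maps a gene family ID to the set of genomes carrying it.
    Both cutoffs are hypothetical values chosen for illustration.
    """
    partitions = {}
    for family, carriers in presence.items():
        freq = len(carriers) / n_genomes
        if freq >= persistent_min:
            partitions[family] = "persistent"
        elif freq <= cloud_max:
            partitions[family] = "cloud"
        else:
            partitions[family] = "shell"
    return partitions


# Toy dataset: 20 genomes and four gene families (hypothetical names).
genomes = [f"g{i}" for i in range(1, 21)]
presence = {
    "famA": set(genomes),       # present in 20/20 genomes
    "famB": set(genomes[:19]),  # present in 19/20 genomes (e.g. one annotation miss)
    "famC": set(genomes[:10]),  # present in 10/20 genomes
    "famD": set(genomes[:1]),   # present in 1/20 genomes
}

# Strict core: families found in every genome.
strict_core = [fam for fam, carriers in presence.items() if len(carriers) == len(genomes)]

partitions = partition_by_frequency(presence, len(genomes))
print(strict_core)
print(partitions)
```

In this toy case `famB` is excluded from the strict core because of a single missing genome, yet it is retained as persistent under the relaxed definition. The outcome still depends on the two hand-picked cutoffs, which is the limitation noted above.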
We tackle the challenge of threshold selection in the present work by first proposing a method to select this frequency threshold automatically. Beyond the partitioning approach, technological shifts in sequencing methods have made thousands of genomes available in databases for numerous bacterial species. Processing so many genomes poses a critical computational problem because it is no longer possible to conduct comparative genomics studies as in the 1990s, even with modern computing facilities. For instance, studying patterns of gene gains and losses in the evolution of a lineage is a basic question in comparative genomics, but this task becomes tremendously harder when thousands of genomes have to be analyzed. Nevertheless, the information encoded in these genomes is highly redundant, making it possible to design new compact ways of representing and manipulating it. As suggested by Chan et al. (2015) and Marshall et al. (2016), a consensus representation of multiple genomes would provide a better analytical framework than individual reference genomes. This proposition leads to a paradigm shift from the usual linear representation of reference genomes to a representation as pangenome graphs that bring together all the known variations as multiple alternative paths. Some approaches have been developed to factorize pangenomes at the sequence level (PanCake: Ernst et al. 2013, SplitMEM: Marcus et al. 2014). However, these approaches lack direct information about genes, which complicates functional analyses based on the graph. Here, we introduce an extension of the concept of pangenome graph, giving it a formal mathematical representation using a graph model in which nodes and edges represent gene families and chromosomal neighborhood information, respectively. The method introduced here can be considered as the missing link between the usual pangenomics approach (a set of unlinked gene families) and the pangenome graph at the sequence level.
A detailed comparison of these two approaches has been reviewed in Zekic et al. (2018). Coupled with our partitioning method, this representation could become a new standard for depicting all the genomic combinations of a bacterial species in a single figure.

Overview of the method: First, the genomes of the same species (or species cluster) are annotated, then homologous genes are grouped into gene families via an all-vs-all protein alignment. From these data, the PPanGGOLiN method merges the chromosomal links between neighboring genes to build a graph of the neighborhood between gene families, weighted by the number of genomes covering each edge. In parallel, the pangenome is modeled as a binary presence/absence matrix in which the rows correspond to gene families and the columns to organisms (1 if at least one gene belonging to the gene family is present, 0 otherwise). The pangenome is then partitioned into the persistent, shell and cloud partitions by estimating, through an Expectation-Maximisation algorithm, the best parameters of a Bernoulli Mixture Model (BMM) smoothed using a Markov Random Field (MRF) (Ambroise et al. 1997). For each partition, the BMM is composed of one mean vector of presence/absence (expected to be (11...11) for the persistent, (00...00) for the cloud and diversified for the shell) associated with a dispersion vector around the mean vector (low dispersion for the persistent and the cloud; high dispersion for the shell). Once the parameters are estimated, each gene family is assigned to its closest partition according to that partition's mean vector. As core gene families are known to share conserved genomic organizations along genomes (Fang et al. 2008), the MRF imposes that two neighboring gene families are more likely to belong to the same partition. Therefore, the MRF penalizes a partition assignment for a family that is unreliable compared with the partitions of its neighbors in the graph (the edge weights are taken into account in the process).
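The two inputs of the partitioning step, the weighted neighborhood graph and the presence/absence matrix, can be sketched as follows. This is a minimal illustration under simplifying assumptions rather than the actual implementation: each genome is given as a single linear, ordered list of gene family IDs, strand and circularity are ignored, and the names `build_presence_absence` and `build_neighborhood_graph` as well as the toy genomes are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy input: each genome is an ordered list of gene family IDs,
# as obtained after annotation and gene family clustering.
genomes = {
    "g1": ["A", "B", "C", "D"],
    "g2": ["A", "B", "D"],
    "g3": ["A", "B", "C", "E"],
}


def build_presence_absence(genomes):
    """Return (organisms, matrix): one row per gene family, one column per organism."""
    organisms = sorted(genomes)
    families = sorted({fam for genes in genomes.values() for fam in genes})
    matrix = {
        fam: [int(fam in genomes[org]) for org in organisms]
        for fam in families
    }
    return organisms, matrix


def build_neighborhood_graph(genomes):
    """Merge adjacent gene pairs into undirected edges between gene families.

    Each edge is weighted by the number of genomes in which the two
    families are chromosomal neighbors.
    """
    edge_genomes = defaultdict(set)
    for org, genes in genomes.items():
        for left, right in zip(genes, genes[1:]):
            if left != right:  # skip tandem copies of the same family
                edge_genomes[frozenset((left, right))].add(org)
    return {edge: len(orgs) for edge, orgs in edge_genomes.items()}


organisms, matrix = build_presence_absence(genomes)
edges = build_neighborhood_graph(genomes)

for fam, row in matrix.items():
    print(fam, row)
for edge, weight in edges.items():
    print(sorted(edge), weight)
```

Here the edge between families A and B is supported by all three genomes, whereas the B-D edge, created by the absence of C in `g2`, is supported by only one. This difference in support is the kind of information carried by the edge weights.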
The algorithm iterates between the BMM and the MRF until the overall likelihood is maximized. The strength of the topological smoothing is controlled by a parameter called β (if β = 0, the smoothing is disabled and the partitioning relies only on the presence/absence matrix). Finally, the partitions are overlaid on the neighborhood graph to obtain what we call the Partitioned Pangenome Graph. Thanks to this graphical structure and the associated statistical model, the pangenome is resilient to randomly distributed errors (e.g. an assembly gap in one genome can be offset by information from other genomes, thus maintaining the link in the graph).

Conclusion: Owing to the significantly decreasing cost of sequencing technologies, recent years have seen an explosion of whole-genome sequencing (WGS) projects, most notably for pathogenic bacteria. With portable sequencers such as the ONT MinION, it will soon be conceivable to obtain thousands of strains for each species, given how simple it is to sequence bacteria directly in the field. Therefore, capturing all the genomic variations of a species is no longer wishful thinking. Before the emergence of pangenomics, the emphasis was on identifying polymorphism information to draw a kind of epidemiological map of the lineage(s) of interest. While this has yielded remarkably detailed information on epidemic strains, it is rapidly showing its major weakness: the analysis of core genes provides very little information on adaptive changes, because most of them arise in the shell and cloud genomes. The approach presented here sheds light on these variations in order to focus on the gene gains and losses associated with these adaptive changes in a species. In the context of comparative genomics, drawing genomes on rails like a subway map may help biologists to compare genomes of interest with the overall pangenomic diversity.
This graph-based approach to representing and manipulating pangenomes provides an efficient basis for very large-scale comparative genomics. The method is available as a standalone tool (https://github.com/ggautreau/PPanGGOLiN) and, as mentioned in Vallenet et al. (2017), we are currently working on its integration into the MicroScope platform.
