metagWGS: a workflow to analyse short or long HiFi metagenomic reads

Whole DNA shotgun sequencing of environmental samples makes it possible to study their taxonomic composition and functional profiles. Recently, technological advances and associated cost reductions have made long high-fidelity (HiFi) reads usable for some metagenomic projects. We are developing a complete, scalable, easy-to-use and reproducible workflow, metagWGS, with Nextflow [1] and Singularity [21], that processes short Illumina or long PacBio HiFi reads from shotgun metagenomics data. It provides (i) contig assemblies, (ii) structural and functional annotations of genes, (iii) taxonomic affiliations and relative abundances of reads, contigs and Metagenome-Assembled Genomes (MAGs), (iv) a count table of reads per non-redundant gene and (v) MAGs obtained by contig binning. The workflow begins with preprocessing steps that remove adapters, low-quality reads and host reads [3]. Read quality is checked with FastQC [2]. Reads are taxonomically classified with Kaiju [13] to give a first overview of their composition. Assembly is performed with metaSPAdes [7] or MEGAHIT [8] for short reads and with Hifiasm [9] or metaFlye [10] for long reads, generating contigs for each sample. Assembly can be done per sample or as a co-assembly of several samples. The resulting contigs are structurally annotated for ORFs (with Prodigal [15]) and for tRNAs and rRNAs. ORFs are then clustered with CD-HIT [17] using a 95% sequence identity cutoff to remove redundancy and generate a unique gene catalog across samples. Genes are functionally annotated with eggNOG-mapper [19]. Reads are mapped back to contigs and featureCounts [18] is used to count the reads overlapping annotated genes. The raw count table gathers the number of reads aligned to each gene for each sample. DIAMOND [16] is used for the taxonomic affiliation of contigs, by aligning translated ORFs against the nr database.
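As an illustration of the raw count table described above, the following minimal Python sketch merges per-sample read counts into a single gene-by-sample table. It is not metagWGS code: the function name `build_count_table`, the input layout (one dictionary of gene counts per sample) and the example values are assumptions made for this sketch, and genes absent from a sample are simply assumed to have a count of zero.

```python
from typing import Dict, List, Tuple


def build_count_table(
    per_sample_counts: Dict[str, Dict[str, int]]
) -> Tuple[List[str], Dict[str, List[int]]]:
    """Merge per-sample gene counts into one table.

    Returns the sorted sample names (column order) and a mapping
    gene -> list of counts, one value per sample in that order.
    """
    samples = sorted(per_sample_counts)
    # Union of all genes seen in at least one sample.
    genes = sorted({g for counts in per_sample_counts.values() for g in counts})
    # Assumption: a gene missing from a sample has zero aligned reads.
    table = {
        gene: [per_sample_counts[s].get(gene, 0) for s in samples]
        for gene in genes
    }
    return samples, table


# Hypothetical counts, for illustration only.
example = {
    "sampleA": {"gene_1": 12, "gene_2": 3},
    "sampleB": {"gene_2": 7, "gene_3": 1},
}
samples, table = build_count_table(example)
print("gene", *samples, sep="\t")
for gene, row in table.items():
    print(gene, *row, sep="\t")
```

Each row is one gene of the non-redundant catalog and each column one sample, which matches the layout described in the abstract.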
The objective of the binning step is to group contigs belonging to the same species according to their compositional characteristics and relative abundances. MetaWRAP's bin_refinement module, which improves the bin sets generated by individual binning tools (CONCOCT, MetaBAT2 and MaxBin2 in our case) [22, 23, 24], has been modified to reduce the execution time of the script (Figure 1). Next, dRep [26] clusters the bins from all samples based on their average nucleotide identity (ANI) to obtain a common reference, with a default threshold of 95% ANI, which is appropriate for obtaining Species-level Representative Genomes (SRGs). Finally, GTDB-Tk [27] performs the taxonomic affiliation of the SRGs. We provide a matrix of the relative abundances (computed from mapped reads) of each MAG in each sample.
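The relative abundance matrix mentioned above can be pictured with the short Python sketch below. The abstract only states that abundances are computed from mapped reads, so the normalisation used here (reads mapped to a MAG divided by all reads mapped to MAGs in the same sample) is an assumption, as are the function name `relative_abundance` and the example values. A real implementation might also correct for genome length.

```python
from typing import Dict


def relative_abundance(
    mapped_reads: Dict[str, Dict[str, int]]
) -> Dict[str, Dict[str, float]]:
    """Convert {sample: {mag: mapped read count}} into per-sample fractions.

    Assumption: abundance = reads mapped to the MAG / reads mapped to
    all MAGs in that sample. Samples with no mapped reads get 0.0.
    """
    result: Dict[str, Dict[str, float]] = {}
    for sample, counts in mapped_reads.items():
        total = sum(counts.values())
        result[sample] = {
            mag: (n / total if total else 0.0) for mag, n in counts.items()
        }
    return result


# Hypothetical read counts, for illustration only.
reads = {
    "sampleA": {"MAG_1": 30, "MAG_2": 10},
    "sampleB": {"MAG_1": 0, "MAG_2": 50},
}
abundances = relative_abundance(reads)
for sample, row in abundances.items():
    print(sample, row)
```

With this normalisation the values of each sample sum to 1, which makes MAG profiles comparable across samples with different sequencing depths.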

Keywords

metagenomic reads

Domains

Bioinformatics [q-bio.QM]

Contributor: Géraldine Pascal

https://hal.inrae.fr/hal-03944382

Submitted on: Wednesday, January 18, 2023, 08:33:51

Last modified on: Friday, July 12, 2024, 03:30:38

Dates and versions

hal-03944382, version 1 (18-01-2023)

Identifiers

HAL Id: hal-03944382, version 1

Cite

Joanna Fourquet, Maïna Vienne, Jean Mainguy, Vincent Darbot, Pierre Martin, et al. metagWGS: a workflow to analyse short or long HiFi metagenomic reads. Bioinfo Biosta Genotoul day, Dec 2022, Castanet Tolosan, France. ⟨hal-03944382⟩


Collections

CNRS INRAE GENETIQUE_ANIMALE ANR GENPHYSE INRAEOCCITANIETOULOUSE PHASE TOULOUSE-INP UNIV-UT3 UT3-TOULOUSEINP MATHNUM MIAT FRANCE-GENOMIQUE GENOTOUL-BIOINFO GET-PLAGE
