Conference Papers Year : 2022

metagWGS : a workflow to analyse short or long HiFi metagenomic reads

Pierre Martin
Denis Milan
Géraldine Pascal


Whole-DNA shotgun sequencing of environmental samples makes it possible to study their taxonomic composition and functional profiles. Recently, technological advances and the associated cost reductions have made long high-fidelity (HiFi) reads usable for some metagenomic projects. We are developing metagWGS, a complete, scalable, easy-to-use and reproducible workflow built with Nextflow [1] and Singularity [21], which processes short Illumina or long PacBio HiFi reads from shotgun metagenomics data. It provides (i) contig assemblies, (ii) structural and functional annotation of genes, (iii) taxonomic affiliation and relative abundance of reads, contigs and Metagenome-Assembled Genomes (MAGs), (iv) a count table of reads per non-redundant gene and (v) MAGs obtained by contig binning.

The workflow begins with preprocessing steps that remove adapters, low-quality reads and host reads [3]. Read quality is checked with FastQC [2]. Reads are taxonomically classified with Kaiju [13] to give a first overview of the reads. Contigs are assembled for each sample with metaSPAdes [7] or MEGAHIT [8] for short reads, and with Hifiasm [9] or metaFlye [10] for long reads. Assembly can be done per sample or as a co-assembly of several samples. The resulting contigs are structurally annotated for ORFs (with Prodigal [15]) and for tRNA and rRNA genes. ORFs are then clustered with CD-HIT [17] at a 95% sequence identity cutoff to remove redundancy and generate a unique gene catalog across samples. Genes are functionally annotated with eggNOG-mapper [19]. Reads are mapped back to contigs, and featureCounts [18] counts the reads overlapping annotated genes. The raw count table gathers the number of reads aligned to each gene for each sample. DIAMOND [16] is used for the taxonomic affiliation of contigs, by aligning translated ORFs against the nr database.
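The raw count table described above can be pictured with a minimal Python sketch. It is not the workflow's own code: the function name `build_count_table`, the input layout (one dictionary of gene counts per sample) and the toy values are assumptions made for illustration only. It merges per-sample counts into one gene-by-sample table, filling in zero where a gene received no reads in a sample.

```python
def build_count_table(per_sample_counts):
    """Merge per-sample {gene_id: read_count} dicts into one table.

    Returns (samples, rows): `samples` is the sorted list of sample names
    and `rows` maps each gene_id to a list of counts in that sample order.
    Genes absent from a sample get a count of 0.
    """
    samples = sorted(per_sample_counts)
    # Union of all gene identifiers seen in any sample.
    genes = sorted({g for counts in per_sample_counts.values() for g in counts})
    rows = {
        g: [per_sample_counts[s].get(g, 0) for s in samples]
        for g in genes
    }
    return samples, rows


# Hypothetical toy input: read counts per gene, one dict per sample.
toy = {
    "sampleA": {"gene_1": 12, "gene_2": 3},
    "sampleB": {"gene_2": 7, "gene_3": 1},
}
samples, table = build_count_table(toy)

# Print the table as tab-separated text, one gene per line.
print("gene\t" + "\t".join(samples))
for gene, counts in table.items():
    print(gene + "\t" + "\t".join(str(c) for c in counts))
```

Keeping one row per gene of the non-redundant catalog and one column per sample is what makes the table directly comparable across samples.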
The objective of the binning step is to group contigs belonging to the same species according to their compositional characteristics and relative abundances. MetaWRAP's bin_refinement module, which improves the bin sets generated by individual binning tools (CONCOCT, MetaBAT2 and MaxBin2 in our case) [22, 23, 24], has been modified to reduce the execution time of the script (Figure 1). Next, dRep [26] clusters the bins from all samples on the basis of their average nucleotide identity (ANI) to obtain a common reference, with a default threshold of 95% ANI, which is appropriate for obtaining Species-level Representative Genomes (SRGs). Finally, GTDB-Tk [27] performs the taxonomic affiliation of the SRGs. We provide a matrix of the relative abundance (computed from mapped reads) of each MAG in each sample.
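The idea of grouping bins from several samples at a 95% ANI threshold can be illustrated with a short Python sketch. It is a deliberate simplification and not dRep's actual procedure: it uses plain single-linkage grouping with a union-find structure, and the function name `cluster_by_ani`, the bin names and the ANI values are all hypothetical.

```python
def cluster_by_ani(genomes, pairwise_ani, threshold=0.95):
    """Group genomes whose pairwise ANI is at or above `threshold`.

    Single-linkage grouping via union-find: two genomes end up in the
    same cluster if a chain of pairs with ANI >= threshold connects them.
    `pairwise_ani` maps (genome_a, genome_b) tuples to an ANI in [0, 1].
    """
    parent = {g: g for g in genomes}

    def find(g):
        # Follow parent links to the cluster root, halving the path as we go.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for (a, b), ani in pairwise_ani.items():
        if ani >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for g in genomes:
        clusters.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical bins from two samples, with made-up ANI values.
toy_bins = ["A_bin1", "B_bin4", "B_bin7"]
toy_ani = {
    ("A_bin1", "B_bin4"): 0.98,  # same species-level group
    ("A_bin1", "B_bin7"): 0.80,
    ("B_bin4", "B_bin7"): 0.79,
}
clusters = cluster_by_ani(toy_bins, toy_ani)
print(clusters)
```

In this toy case the two bins sharing 98% ANI fall into one cluster, from which a single representative genome would then be chosen, while the third bin stays on its own.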



Dates and versions

hal-03944382 , version 1 (18-01-2023)




Joanna Fourquet, Maïna Vienne, Jean Mainguy, Vincent Darbot, Pierre Martin, et al. metagWGS : a workflow to analyse short or long HiFi metagenomic reads. Bioinfo Biosta Genotoul day, Dec 2022, Castanet Tolosan, France. ⟨hal-03944382⟩

