New tools to optmise the analysis of a large RNA-Seq dataset from non model species: development of a hybrid assembly strategy and assessment of library complexity from raw sequencing output. - INRAE - Institut national de recherche pour l’agriculture, l’alimentation et l’environnement
Preprint, Working Paper. Year: 2014

New tools to optmise the analysis of a large RNA-Seq dataset from non model species: development of a hybrid assembly strategy and assessment of library complexity from raw sequencing output.

Abstract

With the advance of new sequencing technologies, major challenges have emerged at several levels, from library preparation to data analysis. Here, we present one tool that improves the quality and feasibility of the assembly process and another that assesses the complexity of the sequenced library. Currently, transcriptome assembly strategies are either reference-based or de novo, depending on the availability of high-quality reference sequences. For non-model species that lack a high-quality reference transcriptome or genome, the reference sequence of a closely related species can be used as a proxy to improve the quality of the reconstructed transcriptome and decrease the computational requirements. By bringing together the two complementary assembly strategies, we can take advantage of the high sensitivity of reference-based assemblers while leveraging the ability of de novo assemblers to detect novel transcripts. The procedure includes three steps. First, the RNA-seq reads of the target species are aligned to the reference genome or transcriptome of a closely related species, and reads that map within the same genomic location are grouped into clusters. Then, each cluster of aligned reads, combined with the remaining unaligned reads, serves as input to a de novo assembly process, with the assemblies running in parallel. Finally, all the resulting de novo assemblies are merged to form the final transcriptome. De novo assembly requires substantial computing resources, particularly memory. The proposed strategy solves the problem of the intensive memory requirements of de novo assembly and greatly reduces the computational time by parallelising both the first and second steps of the process. Here, we test this strategy on a simulated Drosophila group dataset, using reference model species across a wide range of divergence times to assess mapping success. Our pipeline led to great improvement when closely related species were used as a reference.
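The clustering and job-preparation steps described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Alignment` record, the `max_gap` parameter, and the function names are hypothetical, and the sketch assumes one reading of the abstract in which every cluster is paired with all unaligned reads. In practice the alignments would come from a read mapper and each job would be handed to a de novo assembler.

```python
from collections import defaultdict
from typing import Iterable, List, NamedTuple, Optional, Tuple


class Alignment(NamedTuple):
    """Hypothetical record for one read; chrom is None if the read did not align."""
    read_id: str
    chrom: Optional[str]
    start: int = 0  # 0-based, inclusive
    end: int = 0    # 0-based, exclusive


def cluster_by_location(
    alignments: Iterable[Alignment], max_gap: int = 0
) -> Tuple[List[List[str]], List[str]]:
    """Group aligned reads whose intervals overlap, abut, or lie within max_gap bases."""
    by_chrom = defaultdict(list)
    unaligned = []
    for aln in alignments:
        if aln.chrom is None:
            unaligned.append(aln.read_id)
        else:
            by_chrom[aln.chrom].append(aln)

    clusters = []
    for chrom in sorted(by_chrom):
        current, current_end = [], None
        for aln in sorted(by_chrom[chrom], key=lambda a: a.start):
            # Start a new cluster when this read begins beyond the current one.
            if current and aln.start > current_end + max_gap:
                clusters.append(current)
                current, current_end = [], None
            current.append(aln.read_id)
            current_end = aln.end if current_end is None else max(current_end, aln.end)
        if current:
            clusters.append(current)
    return clusters, unaligned


def assembly_inputs(clusters: List[List[str]], unaligned: List[str]) -> List[List[str]]:
    """Assumed reading of step two: each job gets one cluster plus all unaligned reads."""
    return [cluster + unaligned for cluster in clusters]
```

Because each job holds only one cluster plus the unaligned reads, the jobs are independent and could be dispatched in parallel, which is the property the abstract relies on to reduce memory use and run time.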
The strategy was further tested on an experimental dataset, where it decreased the required computational resources. Prior to the assembly process, sequencing experiments can be evaluated based on the success of library construction and sequencing. Often, technical failures lead to biased RNA representation and over-sequencing of particular molecules that do not represent the starting biological sample, which reduces library complexity. The level of complexity reflects the coverage of the transcriptome. We developed a new metric that infers the diversity of unique sequences and assesses the complexity of a given library using two clustering steps of identical reads. This metric can be used to compare library construction success and to evaluate multiple experiments. The presented tools facilitate the analysis of RNA-Seq data for non-model species by improving the feasibility of the assembly and allowing post-sequencing evaluation of the library construction.

Available from: https://www.researchgate.net/publication/319932729_New_tools_to_optmise_the_analysis_of_a_large_RNA-Seq_dataset_from_non_model_species_development_of_a_hybrid_assembly_strategy_and_assessment_of_library_complexity_from_raw_sequencing_output [accessed Feb 28 2018].
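The abstract does not define the complexity metric beyond saying that it uses two clustering steps of identical reads, so the sketch below does not reproduce it. As a stand-in, it shows a simpler and widely used proxy for library complexity: the fraction of distinct read sequences after collapsing identical reads once. The function name `distinct_read_fraction` is hypothetical.

```python
from collections import Counter
from typing import Iterable


def distinct_read_fraction(reads: Iterable[str]) -> float:
    """Return distinct sequences / total reads (a simple proxy, not the authors' metric).

    Values near 1.0 suggest a diverse library; values near 0.0 suggest that a
    few molecules were sequenced many times over.
    """
    # Collapse identical reads, ignoring case differences in the base calls.
    counts = Counter(seq.upper() for seq in reads)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads supplied")
    return len(counts) / total
```

For example, four reads of which three are the same sequence give two distinct sequences and a fraction of 0.5. Comparing this value across libraries sequenced to similar depth gives a rough view of relative library construction success.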
Main file
ECCB_2014_poster_v5_1.pdf (510.99 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-02799163, version 1 (05-06-2020)

Cite

Jacques Lagnel, Khalid Belkhir, Tereza Manousaki, Erick Desmarais, Anastasia Tsagkarakou, et al. New tools to optmise the analysis of a large RNA-Seq dataset from non model species: development of a hybrid assembly strategy and assessment of library complexity from raw sequencing output. 2014. ⟨hal-02799163⟩