Proteogenomics‐Guided Evaluation of RNA‐Seq Assembly and Protein Database Construction for Emergent Model Organisms
Abstract
Proteogenomics is gaining momentum now that genomics, transcriptomics, and proteomics can be readily performed on any new species. This approach allows key alterations to molecular pathways to be identified when comparing conditions. For animals and plants, RNA‐seq‐informed proteomics is the most popular means of interpreting tandem mass spectrometry (MS/MS) spectra acquired for species whose genome has not yet been sequenced. It relies on high‐performance de novo RNA‐seq assembly and optimized translation strategies. Here, several pre‐treatments of Illumina RNA‐seq reads before assembly are explored, with the aim of translating the resulting contigs into useful polypeptide sequences. Using experimental transcriptomics and proteomics datasets acquired for individual Gammarus fossarum freshwater crustaceans, the most relevant procedure is identified by the ratio of MS/MS spectra assigned to peptide sequences. Removing reads with a mean quality score of less than 17 (which represents a single probable nucleotide error on 150‐bp reads) prior to assembly improves the proteomics outcome. The best translation using TransDecoder is achieved with a minimal open reading frame (ORF) length of 50 amino acids and systematic selection of ORFs longer than 900 nucleotides. Using these parameters, transcriptome assembly and translation informed by proteomics pave the way to further improvements in proteogenomics.
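As an illustration of the read pre-treatment described above, the following is a minimal sketch of a mean-quality filter. It is not the authors' pipeline; the abstract does not name the tool used for this step. The sketch assumes Phred+33 encoded FASTQ input and takes the "mean quality score" to be the arithmetic mean of the per-base Phred scores. The function names (`mean_phred`, `parse_fastq`, `filter_by_mean_quality`) are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

# One FASTQ record: (header line, nucleotide sequence, quality string)
FastqRecord = Tuple[str, str, str]


def mean_phred(quality: str, offset: int = 33) -> float:
    """Arithmetic mean of per-base Phred scores (assumes Phred+33 by default)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from well-formed, four-line-per-record FASTQ text."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        seq = next(it, "").rstrip("\n")
        next(it, "")  # '+' separator line, ignored
        qual = next(it, "").rstrip("\n")
        yield header, seq, qual


def filter_by_mean_quality(
    records: Iterable[FastqRecord],
    min_mean: float = 17.0,  # threshold reported in the abstract
    offset: int = 33,
) -> Iterator[FastqRecord]:
    """Keep reads whose mean quality is at least min_mean (drop reads below it)."""
    for header, seq, qual in records:
        if mean_phred(qual, offset) >= min_mean:
            yield header, seq, qual


# Usage: 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
example = ["@read1", "ACGT", "+", "IIII", "@read2", "ACGT", "+", "####"]
kept = [header for header, _, _ in filter_by_mean_quality(parse_fastq(example))]
```

The sketch treats each read independently. For paired-end data, both mates would normally need to be kept or dropped together, which is not handled here.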