Separating significant matches from spurious matches in DNA sequences - INRAE - Institut national de recherche pour l’agriculture, l’alimentation et l’environnement
Journal article. Journal of Computational Biology. Year: 2012

Separating significant matches from spurious matches in DNA sequences

Abstract

Word matches are widely used to compare genomic sequences. Complete genome alignment methods often use matches as anchors for building their alignments, and various alignment-free approaches that characterize similarities between large sequences are based on word matches. Among the matches retrieved from the comparison of two genomic sequences, some may be spurious matches (SMs), that is, matches obtained by chance rather than through homologous relationships. The number of SMs depends on the minimal match length (l) that must be set in the algorithm used to retrieve them. If l is too small, many matches are recovered but most of them are SMs. Conversely, if l is too large, fewer matches are retrieved and many shorter significant matches are missed. To date, the choice of l has mostly relied on empirical threshold values rather than on robust statistical methods. To overcome this problem, we propose a statistical approach that uses a mixture model of geometric distributions to characterize the distribution of the lengths of matches obtained from the comparison of two genomic sequences.
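
To make the idea of a mixture of geometric distributions concrete, the following is a minimal sketch, not the authors' implementation. It assumes exactly two components (one for spurious matches, one for significant matches), assumes match lengths are shifted so that the minimal length l maps to 1, and fits the model with a standard EM algorithm. The abstract does not specify the number of components or the estimation procedure, so these choices, along with the names `fit_geometric_mixture`, `posterior_spurious`, `sample_geometric`, and `l_min`, are illustrative assumptions.

```python
import math
import random
from collections import Counter


def sample_geometric(p, rng):
    """Draw k >= 1 from a geometric distribution with success probability p."""
    u = 1.0 - rng.random()  # u lies in (0, 1], so log(u) is defined
    return int(math.log(u) / math.log(1.0 - p)) + 1


def fit_geometric_mixture(lengths, l_min, n_iter=200):
    """EM fit of a two-component geometric mixture to match lengths.

    Assumption: the "spurious" component decays faster (larger p) than the
    "significant" one; the initial values below encode that ordering.
    Returns (w, p_short, p_long), where w is the weight of the spurious part.
    """
    # Shift lengths so the shortest admissible match has k = 1,
    # and work on the histogram so each iteration is cheap.
    counts = Counter(x - l_min + 1 for x in lengths)
    n = sum(counts.values())
    w, p_short, p_long = 0.5, 0.5, 0.05  # arbitrary starting point
    for _ in range(n_iter):
        resp_sum = 0.0      # expected number of spurious matches
        resp_k_short = 0.0  # responsibility-weighted sum of k, spurious
        resp_k_long = 0.0   # responsibility-weighted sum of k, significant
        for k, c in counts.items():
            # E-step: posterior probability that a match of length k is spurious
            a = w * p_short * (1.0 - p_short) ** (k - 1)
            b = (1.0 - w) * p_long * (1.0 - p_long) ** (k - 1)
            r = a / (a + b)
            resp_sum += c * r
            resp_k_short += c * r * k
            resp_k_long += c * (1.0 - r) * k
        # M-step: weighted maximum-likelihood updates
        w = resp_sum / n
        p_short = resp_sum / resp_k_short
        p_long = (n - resp_sum) / resp_k_long
    return w, p_short, p_long


def posterior_spurious(length, l_min, w, p_short, p_long):
    """Posterior probability that a match of the given length is spurious."""
    k = length - l_min + 1
    a = w * p_short * (1.0 - p_short) ** (k - 1)
    b = (1.0 - w) * p_long * (1.0 - p_long) ** (k - 1)
    return a / (a + b)


if __name__ == "__main__":
    # Simulated data only: 80% spurious (p = 0.4), 20% significant (p = 0.05).
    rng = random.Random(0)
    l_min = 10
    data = [
        l_min - 1 + sample_geometric(0.4 if rng.random() < 0.8 else 0.05, rng)
        for _ in range(20000)
    ]
    w, p_s, p_l = fit_geometric_mixture(data, l_min)
    print("weight of spurious component:", round(w, 3))
    print("p (spurious), p (significant):", round(p_s, 3), round(p_l, 3))
    for length in (10, 15, 20, 30):
        print(length, round(posterior_spurious(length, l_min, w, p_s, p_l), 4))
```

Under these assumptions, one way to choose l would be to take the smallest length at which the posterior probability of being spurious falls below a chosen tolerance. Whether this matches the decision rule used in the article cannot be determined from the abstract alone.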

Dates and versions

hal-02644592, version 1 (28-05-2020)

Cite

Hugo Devillers, Sophie S. Schbath. Separating significant matches from spurious matches in DNA sequences. Journal of Computational Biology, 2012, 19 (1), pp.1-12. ⟨10.1089/cmb.2011.0070⟩. ⟨hal-02644592⟩