Separating significant matches from spurious matches in DNA sequences

Hugo Devillers; Sophie S. Schbath

doi:10.1089/cmb.2011.0070

Journal article, Journal of Computational Biology, Year: 2012

Hugo Devillers

Role: Author

Unité Mathématique Informatique et Génome

Université Montpellier 2 - Sciences et Techniques

Sophie S. Schbath

Role: Author
PersonId : 183444
IdHAL : sophie-schbath
ORCID : 0000-0003-3574-8222
IdRef : 07553424X

Unité Mathématique Informatique et Génome

Abstract

Word matches are widely used to compare genomic sequences. Complete genome alignment methods often use matches as anchors for building their alignments, and various alignment-free approaches that characterize similarities between large sequences are based on word matches. Among the matches retrieved from the comparison of two genomic sequences, some may be spurious matches (SMs): matches obtained by chance rather than through homologous relationships. The number of SMs depends on the minimal match length (l) that must be set in the algorithm used to retrieve them. If l is too small, many matches are recovered but most of them are SMs. Conversely, if l is too large, fewer matches are retrieved and many shorter significant matches are missed. To date, the choice of l has mostly relied on empirical threshold values rather than on robust statistical methods. To overcome this problem, we propose a statistical approach that uses a mixture model of geometric distributions to characterize the distribution of match lengths obtained from the comparison of two genomic sequences.
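
To make the idea in the abstract concrete, here is a minimal sketch of how a mixture of geometric distributions could be fitted to match lengths and used to pick a length threshold. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the number of components, the fitting algorithm, or the decision rule. The sketch assumes two components (spurious and significant), fits them with a standard EM algorithm, and uses a hypothetical 5% posterior cutoff. All function and parameter names are invented for this example.

```python
import math
import random


def geom_pmf(k, p):
    """P(K = k) for a geometric law on k = 0, 1, 2, ... with parameter p."""
    return p * (1.0 - p) ** k


def _clamp(p, eps=1e-9):
    """Keep a probability strictly inside (0, 1)."""
    return min(max(p, eps), 1.0 - eps)


def fit_two_geometric_mixture(lengths, min_len, n_iter=100):
    """EM fit of a two-component geometric mixture to match lengths >= min_len.

    Assumption: two components. Returns (w, p_spurious, p_significant), where
    w is the weight of the spurious component (the one that decays faster).
    """
    ks = [x - min_len for x in lengths]  # shift so the support starts at 0
    n = len(ks)
    w, p1, p2 = 0.5, 0.5, 0.05  # arbitrary starting values
    for _ in range(n_iter):
        # E-step: posterior probability that each match belongs to component 1
        resp = []
        for k in ks:
            a = w * geom_pmf(k, p1)
            b = (1.0 - w) * geom_pmf(k, p2)
            resp.append(a / (a + b) if a + b > 0.0 else 0.5)
        # M-step: weighted maximum-likelihood updates
        s1 = sum(resp)
        t1 = sum(r * (k + 1) for r, k in zip(resp, ks))
        t2 = sum((1.0 - r) * (k + 1) for r, k in zip(resp, ks))
        w = _clamp(s1 / n)
        p1 = _clamp(s1 / t1) if t1 > 0.0 else p1
        p2 = _clamp((n - s1) / t2) if t2 > 0.0 else p2
    if p1 < p2:  # relabel so that component 1 is the spurious one
        w, p1, p2 = 1.0 - w, p2, p1
    return w, p1, p2


def posterior_spurious(length, min_len, w, p_sp, p_sig):
    """Posterior probability that a match of the given length is spurious."""
    k = length - min_len
    a = w * geom_pmf(k, p_sp)
    b = (1.0 - w) * geom_pmf(k, p_sig)
    total = a + b
    # Both terms underflow only for extremely long matches; the spurious term
    # vanishes first, so such matches are treated as significant.
    return a / total if total > 0.0 else 0.0


def smallest_significant_length(min_len, w, p_sp, p_sig,
                                max_posterior=0.05, max_len=10_000):
    """Smallest length whose spurious posterior is <= max_posterior.

    The 5% cutoff is a hypothetical decision rule chosen for illustration.
    """
    for length in range(min_len, max_len + 1):
        if posterior_spurious(length, min_len, w, p_sp, p_sig) <= max_posterior:
            return length
    return None


def sample_lengths(n, p, min_len, rng):
    """Simulate n match lengths: min_len plus a geometric number of extensions."""
    log_q = math.log(1.0 - p)
    return [min_len + int(math.log(1.0 - rng.random()) / log_q) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Simulated data only: 80% short spurious matches, 20% longer significant ones.
    data = sample_lengths(4000, 0.4, 10, rng) + sample_lengths(1000, 0.02, 10, rng)
    w, p_sp, p_sig = fit_two_geometric_mixture(data, min_len=10)
    print(w, p_sp, p_sig, smallest_significant_length(10, w, p_sp, p_sig))
```

The simulated lengths stand in for the maximal exact matches a real comparison would produce. In this sketch, the threshold follows from the fitted parameters instead of being set by hand, which is the motivation the abstract gives for a statistical choice of l.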

Keywords

match length

maximal exact matches

mixture model

comparative genomics

Domains

Mathematics [math] Computer Science [cs] Life Sciences [q-bio]

https://hal.inrae.fr/hal-02644592

Submitted on: Thursday, May 28, 2020, 23:01:35

Last modified on: Thursday, March 14, 2024, 03:13:29

Dates and versions

hal-02644592 , version 1 (28-05-2020)

Identifiers

HAL Id : hal-02644592 , version 1
DOI : 10.1089/cmb.2011.0070
PRODINRA : 48401
PUBMED : 22149632
PUBMEDCENTRAL : PMC3244807
WOS : 000298969900001

Cite

Hugo Devillers, Sophie S. Schbath. Separating significant matches from spurious matches in DNA sequences. Journal of Computational Biology, 2012, 19 (1), pp.1-12. ⟨10.1089/cmb.2011.0070⟩. ⟨hal-02644592⟩

Collections

INRA UNIV-MONTPELLIER INRAE MATHNUM