Etude asymptotique du nombre d'occurrences d'un mot dans une chaîne de Markov et application à la recherche de mots de fréquence exceptionnelle dans les séquences d'ADN

Sophie S. Schbath

Thesis, Year: 1995

Role: Author
PersonId : 183444
IdHAL : sophie-schbath
ORCID : 0000-0003-3574-8222
IdRef : 07553424X

Unité Mathématique Informatique et Génome

Abstract

Owing to many large sequencing projects, biologists now have large sets of DNA sequences from many different organisms. They need quantitative tools and statistical methods to help them analyse these sequences. Identifying words whose observed frequency deviates markedly from the frequency predicted by a given model is an important way to extract information from DNA sequences. We consider different Markov chain models, with either stationary or 3-periodic transition probabilities. The latter class of models is well suited to coding sequences, which are naturally split into 3-letter words. Two different approximations for the number of occurrences of a word are used, depending on the asymptotic framework: the expected count either tends to infinity or remains bounded as the length of the sequence increases. In the first part, we propose asymptotically Gaussian statistics consisting of the normalised difference between the observed count and its estimated expectation, for two estimators. The main difficulty is the normalisation of this difference, which requires calculating the asymptotic variance. When the expected count is estimated by the compensator of the count with respect to the natural filtration of the sequence of letters, the classical central limit theorem for martingales gives the asymptotic variance, whereas the standard deviation of the count given the sufficient statistic is a good normalisation when the expected count is estimated by the conditional expectation. In the second part, we use the Chen-Stein method to prove that the number of occurrences of a rare word can be approximated by a Poisson or a compound Poisson variable. The Poisson approximation is valid for non-overlapping words, but it is not satisfactory when the word can overlap itself in the sequence. A careful study of the periodic structure of the words is needed to take all the overlaps into account.
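
As a rough illustration of the quantities the abstract refers to, the following Python sketch counts the (possibly overlapping) occurrences of a word, estimates its expected count under a stationary first-order Markov model, and lists the periods at which the word can overlap itself. It is only a minimal sketch under stated assumptions: the function names are hypothetical, the plug-in estimator shown (product of 2-letter counts divided by the product of inner letter counts) is one common choice and is not claimed to be either of the two estimators studied in the thesis, and no variance normalisation or Poisson approximation is implemented.

```python
from collections import Counter


def count_occurrences(seq, word):
    """Count occurrences of `word` in `seq`, overlapping ones included."""
    m = len(word)
    return sum(1 for i in range(len(seq) - m + 1) if seq[i:i + m] == word)


def expected_count_m1(seq, word):
    """Plug-in estimate of the expected count of `word` (length >= 2) under
    a stationary order-1 Markov model (an assumed, commonly used estimator):
    product of the counts of the word's 2-letter subwords, divided by the
    product of the counts of its inner letters."""
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    letters = Counter(seq)
    numerator = 1.0
    for i in range(len(word) - 1):
        numerator *= pairs[word[i:i + 2]]
    denominator = 1.0
    for letter in word[1:-1]:
        denominator *= letters[letter]
    return numerator / denominator if denominator else 0.0


def periods(word):
    """Shifts p (0 < p < len(word)) at which `word` overlaps itself,
    i.e. the suffix starting at p equals the prefix of the same length."""
    m = len(word)
    return [p for p in range(1, m) if word[p:] == word[:m - p]]


if __name__ == "__main__":
    sequence = "ATATATGCGCATAT"  # toy sequence, for illustration only
    for w in ("ATAT", "GCGC", "ATGC"):
        print(w, count_occurrences(sequence, w),
              expected_count_m1(sequence, w), periods(w))
```

A word with an empty list of periods cannot overlap itself, which is the case in which the abstract says the plain Poisson approximation is valid; a non-empty list signals the self-overlapping case in which a compound Poisson variable is used instead.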

Keywords

thesis

Fields

Mathematics [math], Computer Science [cs], Life Sciences [q-bio]

Contributor: Migration ProdInra

https://hal.inrae.fr/tel-02850575

Submitted on: Sunday, 7 June 2020, 21:22:36

Last modified on: Thursday, 14 March 2024, 03:13:58

Dates and versions

tel-02850575 , version 1 (07-06-2020)

Identifiers

HAL Id : tel-02850575 , version 1
PRODINRA : 265880

Cite

Sophie S. Schbath. Etude asymptotique du nombre d'occurrences d'un mot dans une chaîne de Markov et application à la recherche de mots de fréquence exceptionnelle dans les séquences d'ADN. Mathematics [math]. Université Paris Descartes - Paris 5, 1995. English. ⟨tel-02850575⟩

Collections

INRA INRAE MATHNUM
