Thesis Year: 1995

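Etude asymptotique du nombre d'occurrences d'un mot dans une chaîne de Markov et application à la recherche de mots de fréquence exceptionnelle dans les séquences d'ADN [Asymptotic study of the number of occurrences of a word in a Markov chain, and application to the search for words of exceptional frequency in DNA sequences]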

Abstract

Thanks to many large sequencing projects, biologists now have large sets of DNA sequences from many different organisms. They need quantitative tools and statistical methods to help them analyse these sequences. Identifying words whose observed frequency deviates markedly from the frequency predicted by a given model is an important way to extract information from DNA sequences. We consider different Markov chain models, with either stationary or 3-periodic transition probabilities. The latter class of models is well suited to coding sequences, which are naturally split into 3-letter words. Two different approximations for the number of occurrences of a word are used, depending on the asymptotic framework: the expected count either tends to infinity or remains bounded as the length of the sequence increases. In the first part, we propose asymptotically Gaussian statistics consisting of the normalised difference between the observed count and its estimated expectation, for two estimators. The main difficulty is the normalisation of this difference, which requires calculating the asymptotic variance. When the expected count is estimated by the compensator of the count with respect to the natural filtration of the sequence of letters, the classical central limit theorem for martingales gives the asymptotic variance. When instead the expected count is estimated by the conditional expectation, the standard deviation of the count given the sufficient statistic is a good normalisation. In the second part, we use the Chen-Stein method to prove that the number of occurrences of a rare word can be approximated by a Poisson or a compound Poisson variable. The Poisson approximation is valid for non-overlapping words, but it is not satisfactory when the word can overlap itself in the sequence. A careful study of the periodic structure of the words is needed to take all the overlaps into account.
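The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function names are hypothetical, and the expected count uses a common plug-in estimator for a first-order stationary Markov model (product of observed 2-letter counts divided by the product of interior letter counts), chosen here as an assumption. The sketch counts overlapping occurrences of a word, lists the shifts at which the word overlaps itself (its periods), and computes the estimated expected count. It does not compute the asymptotic variance or the compound Poisson parameters.

```python
from collections import Counter


def count_occurrences(seq, word):
    """Count occurrences of `word` in `seq`, overlapping ones included."""
    k = len(word)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == word)


def periods(word):
    """Return the shifts p (0 < p < len(word)) at which `word` overlaps itself.

    A word with no such shift is non-overlapping, the case in which the
    abstract says a plain Poisson approximation is valid.
    """
    k = len(word)
    return [p for p in range(1, k) if word[p:] == word[:k - p]]


def expected_count_markov1(seq, word):
    """Plug-in estimate of the expected count under an order-1 Markov model.

    Assumption (not from the thesis): the estimator is
    N(w1w2) * N(w2w3) * ... * N(w_{k-1}w_k) / (N(w2) * ... * N(w_{k-1})),
    where N(.) are counts observed in `seq`.
    """
    if len(word) < 2:
        raise ValueError("word must have at least 2 letters")
    pair_counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    letter_counts = Counter(seq)
    numerator = 1.0
    for i in range(len(word) - 1):
        numerator *= pair_counts[word[i:i + 2]]
    denominator = 1.0
    for letter in word[1:-1]:  # interior letters only
        denominator *= letter_counts[letter]
    return numerator / denominator if denominator > 0 else 0.0


if __name__ == "__main__":
    seq = "ACGACGACG"
    print(count_occurrences(seq, "ACG"), expected_count_markov1(seq, "ACG"))
    print(periods("ATAT"), periods("ACGT"))
```

A word such as ATAT overlaps itself at a shift of 2, so its occurrences tend to come in clumps, whereas ACGT has no period and cannot overlap itself.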

File not deposited

Dates and versions

tel-02850575 , version 1 (07-06-2020)

Identifiers

  • HAL Id : tel-02850575 , version 1
  • PRODINRA : 265880

Cite

Sophie S. Schbath. Etude asymptotique du nombre d'occurrences d'un mot dans une chaîne de Markov et application à la recherche de mots de fréquence exceptionnelle dans les séquences d'ADN. Mathematics [math]. Université Paris Descartes - Paris 5, 1995. English. ⟨NNT : ⟩. ⟨tel-02850575⟩

Collections

INRA INRAE MATHNUM