The statistical world of motif occurrences along DNA sequences
Abstract
The statistics of motifs have been widely revisited over the last 15 years, owing to the increasing availability of genomic sequences. Identifying DNA motifs with biological functions remains a major challenge in genome analysis. Many functional and essential motifs are distinctive in being very frequent all along the chromosome, concentrated in particular regions (e.g. in front of genes), or co-oriented with the direction of replication. The prediction of functional motifs is therefore mostly based on the statistical properties of pattern occurrences in Markovian sequences. This lecture will be mostly devoted to such properties, with a special focus on pattern frequency. The questions addressed include the following. How can the count distribution be computed or approximated to assess motif exceptionality? How can one test whether a motif is significantly unbalanced between two (sets of) sequences? How should more complex motifs be handled? What is the distribution of the waiting time between occurrences? How can motif occurrences be modelled to find regions significantly enriched with a given pattern? Examples of functional motifs will illustrate all these questions, and we will see how the Chi motif was identified in Staphylococcus aureus thanks to its statistical properties.
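As a rough illustration of the first question, the sketch below counts the overlapping occurrences of a word, estimates its expected count under a first-order Markov model, and computes a Poisson tail probability for the observed count. Several things here are assumptions rather than the lecture's own method: the function names are hypothetical, the expected count uses the standard plug-in estimator built from dinucleotide and nucleotide counts, and the Poisson tail is only a crude approximation that is reasonable for rare words that do not overlap themselves (a compound Poisson or Gaussian approximation may be more appropriate otherwise).

```python
from math import exp


def count_overlapping(seq, word):
    """Count occurrences of `word` in `seq`, allowing overlaps."""
    count = 0
    start = seq.find(word)
    while start != -1:
        count += 1
        start = seq.find(word, start + 1)
    return count


def expected_count_m1(seq, word):
    """Plug-in estimate of the expected count of `word` under a
    first-order Markov model fitted on `seq` (assumed estimator):
    product of the counts of the word's dinucleotides, divided by
    the product of the counts of its internal letters."""
    if len(word) < 2:
        raise ValueError("word must have at least two letters")
    numerator = 1.0
    for i in range(len(word) - 1):
        numerator *= count_overlapping(seq, word[i:i + 2])
    denominator = 1.0
    for i in range(1, len(word) - 1):
        denominator *= count_overlapping(seq, word[i])
    return numerator / denominator if denominator > 0 else 0.0


def poisson_upper_tail(observed, mean):
    """P(X >= observed) for X ~ Poisson(mean), by direct summation."""
    term = exp(-mean)  # P(X = 0)
    cdf = 0.0
    for k in range(observed):
        cdf += term
        term *= mean / (k + 1)  # P(X = k + 1) from P(X = k)
    return max(0.0, 1.0 - cdf)


if __name__ == "__main__":
    # Toy sequence and word, chosen only for illustration.
    seq = "ACGTACGTTTGACGTAACGT"
    word = "ACGT"
    observed = count_overlapping(seq, word)
    expected = expected_count_m1(seq, word)
    print(observed, expected, poisson_upper_tail(observed, expected))
```

A small tail probability suggests that the word is over-represented relative to what its dinucleotide composition alone would predict. On real genomes the same computation would be repeated for every word of a given length, which then calls for a correction for multiple testing.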