R'MES: Finding exceptional motifs in sequences
Abstract
The R’MES project started in 1995 and is now in its third version. The main question R’MES addresses is “does this motif occur in that biological sequence with an expected frequency?” In other words, can we observe it so many times, or so few times, just by chance? When the answer is no, the motif is a candidate for having a particular biological meaning. To answer this question, we calculate an exceptionality score for each word of a given length (or for each given set of words); this score is a one-to-one transformation of the corresponding p-value. The p-value is the probability that a random sequence having the same 1- to (m + 1)-letter word composition as the biological sequence (that is, a sequence drawn from a Markov model of order m) contains at least as many occurrences of the given word. This probability is computed using rigorous approximations of the word count distribution, namely either a Gaussian distribution (for frequent words) or a compound Poisson distribution (for rare words). Details about the statistical results on word counts in random sequences can be found in [1]. R’MES continues to be extended in response to new questions from biologists. For instance, R’MES can now compute an exceptionality score related to the skew of an oligonucleotide; the typical question is “does this motif occur significantly more often on the leading strand than on the lagging strand?” At the moment, we are implementing the statistical tests proposed by [2] to compare motif exceptionalities between two different sequences. In the talk, we will illustrate how we identified the Chi site of Staphylococcus aureus [3] and the matS site of Escherichia coli [4] using R’MES.
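To make the idea of an exceptionality score concrete, here is a minimal Python sketch of the Gaussian case only. It is an illustration under stated assumptions, not the R’MES implementation: it assumes the maximal Markov order (m = word length minus 2), estimates the expected count from the counts of the word's prefix, suffix and core, and uses a variance estimate that is commonly quoted for that maximal-order model. The function names `count_overlapping` and `gaussian_score` are hypothetical, and the compound Poisson approximation used for rare words is not covered.

```python
import math


def count_overlapping(seq, word):
    """Count occurrences of `word` in `seq`, allowing overlaps."""
    count = 0
    start = seq.find(word)
    while start != -1:
        count += 1
        start = seq.find(word, start + 1)
    return count


def gaussian_score(seq, word):
    """Sketch of a Gaussian exceptionality score (hypothetical helper).

    Assumption: Markov model of maximal order m = len(word) - 2, so the
    expected count depends only on the prefix, suffix and core counts.
    """
    if len(word) < 3:
        raise ValueError("this sketch needs words of length >= 3")

    n_word = count_overlapping(seq, word)
    n_prefix = count_overlapping(seq, word[:-1])   # word without last letter
    n_suffix = count_overlapping(seq, word[1:])    # word without first letter
    n_core = count_overlapping(seq, word[1:-1])    # word without both ends
    if n_core == 0:
        raise ValueError("core of the word never occurs in the sequence")

    # Estimated expected count under the maximal-order Markov model.
    expected = n_prefix * n_suffix / n_core
    # Assumed variance estimate for the same model (not taken from the source).
    variance = expected * (n_core - n_prefix) * (n_core - n_suffix) / n_core ** 2

    if variance <= 0:
        # Degenerate case: the Gaussian approximation is not usable here.
        return {"observed": n_word, "expected": expected,
                "score": None, "p_upper": None}

    score = (n_word - expected) / math.sqrt(variance)
    # Upper-tail probability of a standard normal: P(Z >= score).
    p_upper = 0.5 * math.erfc(score / math.sqrt(2))
    return {"observed": n_word, "expected": expected,
            "score": score, "p_upper": p_upper}


if __name__ == "__main__":
    result = gaussian_score("ACGACTTCGACG", "ACG")
    print(result)
```

A large positive score suggests an over-represented word and a large negative score an under-represented one. On a sequence as short as the toy example, the Gaussian approximation is not meaningful; the example only shows how the quantities fit together.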
Domains
Applications [stat.AP]
Origin: Files produced by the author(s)