Data Mining Using Hidden Markov Models (HMM2) to Detect Heterogeneities in Bacterial Genomes
Abstract
The Streptococcus genus contains both pathogenic bacteria and bacteria used in the food-processing industry. We are developing a statistical segmentation method to identify heterogeneous sequences, such as sequences acquired through recent horizontal transfer or genes that are weakly or strongly expressed. The method is based on second-order hidden Markov models (HMM2). After automatic, unsupervised training, it can demarcate particular regions within a genome. After checking the efficiency of such models on various controls and on chimeric sequences generated in silico, we chose an HMM2 (3-mer, 5 states) to analyse the complete genome sequence of S. thermophilus CNRZ1066 (1.8 Mb). More than 80 atypical segments were extracted and are currently being analysed further.
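To make the idea of segmentation with a second-order model concrete, the following is a minimal sketch of Viterbi decoding in which each transition depends on the two previous hidden states. It is an illustration under stated assumptions, not the authors' implementation: it uses 2 states and single-nucleotide emissions instead of the 5-state, 3-mer model, its parameters are set by hand rather than learned by unsupervised training, and the names `viterbi_hmm2`, `toy_model` and `NUC` are hypothetical.

```python
import math

# Hypothetical nucleotide encoding used only in this sketch.
NUC = {"A": 0, "C": 1, "G": 2, "T": 3}


def viterbi_hmm2(obs, log_pi, log_a1, log_a2, log_b):
    """Most likely state path under a second-order HMM (log-space Viterbi).

    log_a1[i][j]    : log P(s1 = j | s0 = i), used for the first transition only
    log_a2[i][j][k] : log P(s_t = k | s_{t-2} = i, s_{t-1} = j)
    log_b[k][o]     : log P(observation o | state k)
    """
    states = range(len(log_pi))
    if not obs:
        return []
    if len(obs) == 1:
        return [max(states, key=lambda i: log_pi[i] + log_b[i][obs[0]])]
    # delta[(j, k)]: best log-score of any path ending in the state pair (j, k)
    delta = {
        (i, j): log_pi[i] + log_b[i][obs[0]] + log_a1[i][j] + log_b[j][obs[1]]
        for i in states for j in states
    }
    back = []  # one dict per position t >= 2: (j, k) -> best state at t - 2
    for t in range(2, len(obs)):
        new_delta, pointers = {}, {}
        for j in states:
            for k in states:
                best_i = max(states, key=lambda i: delta[(i, j)] + log_a2[i][j][k])
                new_delta[(j, k)] = (delta[(best_i, j)] + log_a2[best_i][j][k]
                                     + log_b[k][obs[t]])
                pointers[(j, k)] = best_i
        delta = new_delta
        back.append(pointers)
    # Backtrack from the best final state pair.
    path = list(max(delta, key=delta.get))
    for pointers in reversed(back):
        path.insert(0, pointers[(path[0], path[1])])
    return path


def toy_model():
    """Hand-set 2-state model (assumption): state 0 is AT-rich, state 1 GC-rich."""
    def to_log(x):
        return [to_log(v) for v in x] if isinstance(x, list) else math.log(x)

    pi = [0.5, 0.5]
    emit = [[0.45, 0.05, 0.05, 0.45],   # state 0, columns A, C, G, T
            [0.05, 0.45, 0.45, 0.05]]   # state 1
    a1 = [[0.95, 0.05], [0.05, 0.95]]
    # Second-order effect: right after a switch (i != j) the chain is even
    # more likely to stay, which discourages one-base segments.
    a2 = [[[0.0, 0.0] for _ in range(2)] for _ in range(2)]
    for i in range(2):
        for j in range(2):
            stay = 0.95 if i == j else 0.99
            for k in range(2):
                a2[i][j][k] = stay if k == j else 1.0 - stay
    return to_log(pi), to_log(a1), to_log(a2), to_log(emit)


if __name__ == "__main__":
    # A small chimeric sequence built by hand: AT-rich, GC-rich, AT-rich.
    seq = "ATATATAT" + "GCGCGCGC" + "ATATATAT"
    path = viterbi_hmm2([NUC[c] for c in seq], *toy_model())
    print("".join(map(str, path)))  # 000000001111111100000000
```

In this toy setting the decoded path switches state exactly at the junctions of the chimeric sequence, which is the kind of behaviour that controls on in silico chimeras are meant to check. A real analysis would estimate the parameters from the genome itself and use a richer state space.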