Constrained Dynamic Programming and Supervised Penalty Learning Algorithms for Peak Detection in Genomic Data - INRAE - Institut national de recherche pour l’agriculture, l’alimentation et l’environnement Accéder directement au contenu
Article Dans Une Revue Journal of Machine Learning Research Année : 2020

Constrained Dynamic Programming and Supervised Penalty Learning Algorithms for Peak Detection in Genomic Data

Toby Dylan Hocking
  • Fonction : Auteur
  • PersonId : 1078823
Paul Fearnhead
  • Fonction : Auteur
  • PersonId : 1078824
Guillaume Bourque
  • Fonction : Auteur
  • PersonId : 1078825

Résumé

Peak detection in genomic data involves segmenting counts of DNA sequence reads aligned to different locations of a chromosome. The goal is to detect peaks with higher counts, and filter out background noise with lower counts. Most existing algorithms for this problem are unsupervised heuristics tailored to patterns in specific data types. We propose a supervised framework for this problem, using optimal changepoint detection models with learned penalty functions. We propose the first dynamic programming algorithm that is guaranteed to compute the optimal solution to changepoint detection problems with constraints between adjacent segment mean parameters. Implementing this algorithm requires the choice of penalty parameter that determines the number of segments that are estimated. We show how the supervised learning ideas of Rigaill et al. (2013) can be used to choose this penalty. We compare the resulting implementation of our algorithm to several baselines in a benchmark of labeled ChIP-seq data sets with two different patterns (broad H3K36me3 data and sharp H3K4me3 data). Whereas baseline unsupervised methods only provide accurate peak detection for a single pattern, our supervised method achieves state-of-the-art accuracy in all data sets. The log-linear timings of our proposed dynamic programming algorithm make it scalable to the large genomic data sets that are now common. Our implementation is available in the PeakSegOptimal R package on CRAN.
Fichier principal
Vignette du fichier
HockingEtal_JMLR.pdf (760.34 Ko) Télécharger le fichier
Origine : Fichiers produits par l'(les) auteur(s)

Dates et versions

hal-02961069 , version 1 (08-10-2020)

Identifiants

  • HAL Id : hal-02961069 , version 1

Citer

Toby Dylan Hocking, Guillem Rigaill, Paul Fearnhead, Guillaume Bourque. Constrained Dynamic Programming and Supervised Penalty Learning Algorithms for Peak Detection in Genomic Data. Journal of Machine Learning Research, 2020, 21, pp.1 - 40. ⟨hal-02961069⟩
15 Consultations
14 Téléchargements

Partager

Gmail Facebook X LinkedIn More