Adjacency-constrained hierarchical clustering of a band similarity matrix with application to genomics

Genetic information is coded in long strings of DNA organized in chromosomes. High-throughput sequencing now allow to study biological phenomena along the whole genome at a very high resolution (for example using RNAseq, DNAseq, ChipSeq, HiC...). In most cases, we expect neighboring positions to behave similarly and using this a priori information is a way to tackle the complexity of genome-wide analyses. It is common practice to partition each chromosome into relevant regions, because the obtained regions hopefully correspond to biological relevant or interpretable units (genes, binding site, TADs, ...) and also because the obtained partition can be used as a dimensionality reduction method and allows statistical methods to be used more easily at the region level than at the position level. In most cases, regions of interest are however unknown and should be discovered using the data. A popular way to do so is to aggregate neighboring and similar positions based on a measure of similarity between pairs of positions, in a hierarchical way that is close to the biological organization of the genome. This article focuses on a modification of the classical hierarchical agglomerative clustering (HAC), where only adjacent clusters (according to the ordering of positions within a chromosome) can be merged. Adjacency-constrained HAC is implemented in the R package rioja. Our main contribution with respect to existing works is an efficient implementation of adjacency-constrained HAC in the case where the similarity between genomically distant objects can be considered as negligible. We propose an algorithm that is almost linear in time and space with respect to the number of objects to be clustered. It uses a sparse band strategy based on pre-computations of certain cumulative sums of similarities, combined with a min-heap approach to efficiently store and maintain a list of candidate merges. This algorithm is implemented in the R package adjclust, which is available at https://CRAN.R-project.org/package=adjclust. We provide applications to SNP and Hi-C datasets.

Domaines

Mathématiques [math] Informatique [cs] Sciences du Vivant [q-bio] Biologie végétale

Migration ProdInra : Connectez-vous pour contacter le contributeur

https://hal.inrae.fr/hal-02788245

Soumis le : vendredi 5 juin 2020-03:13:07

Dernière modification le : mercredi 18 décembre 2024-09:48:45

Dates et versions

hal-02788245 , version 1 (05-06-2020)

Identifiants

HAL Id : hal-02788245 , version 1
PRODINRA : 466209

Citer

Christophe Ambroise, Alia Dehman, Michel Koskas, Pierre Neuvial, Guillem Rigaill, et al.. Adjacency-constrained hierarchical clustering of a band similarity matrix with application to genomics. Journée Régionale de Bioinformatique et Biostatistique, Génopole Toulouse, Dec 2018, Toulouse, France. ⟨hal-02788245⟩

Exporter

BibTeX XML-TEI Dublin Core DC Terms EndNote DataCite

Collections

UNIV-PARIS7 AGROPARISTECH UNIV-TLSE2 CNRS UNIV-EVRY INSA-TOULOUSE INRA INSMI IMT UT1-CAPITOLE MIA-PARIS USPC LAMME IPS2 UNIV-PARIS-SACLAY INSA-GROUPE INRAE UP-SCIENCES GS-ENGINEERING GS-COMPUTER-SCIENCE INRAEOCCITANIETOULOUSE UNIV-UT3 UT3-TOULOUSEINP MATHNUM MIAT BIOLOGIE_ET_AMELIORATION_DES_PLANTES INTERDISCIPLINARITES MITI

93 Consultations

0 Téléchargements