A Random Forests Framework for Modeling Haplotypes as Mosaics of Reference Haplotypes
Abstract
Many genomic data analyses, such as phasing, genotype imputation, or local ancestry inference, share a common core task: matching pairs of haplotypes at any position along the chromosome, thereby inferring a target haplotype as a succession of pieces from reference haplotypes, commonly called a mosaic of reference haplotypes. For that purpose, these analyses combine information provided by linkage disequilibrium, linkage and/or genealogy through a set of heuristic rules or, most often, by a hidden Markov model. Here, we develop an extremely randomized trees framework to address the issue of local haplotype matching. In our approach, a supervised classifier using extra-trees (a particular type of random forest) learns how to identify the best local matches
between haplotypes using a collection of observed examples. For each example, various
features related to the different sources of information are observed, such as the
length of a segment shared between haplotypes, or estimates of relationships between
individuals, gametes, and haplotypes. The random forests framework was fed with
30 relevant features for local haplotype matching. Repeated cross-validation allowed
these features to be ranked by their importance for local haplotype matching. The
distance to the edge of a segment shared by both haplotypes being matched was
found to be the most important feature. Similarity comparisons between predicted and
true whole-genome sequence haplotypes showed that the random forests framework
was more efficient than a hidden Markov model in reconstructing a target haplotype as
a mosaic of reference haplotypes. To further evaluate its efficiency, the random forests
framework was applied to imputation of whole-genome sequence from 50k genotypes
and it yielded average reliabilities similar to or slightly better than those of IMPUTE2. Through this
exploratory study, we lay the foundations of a new framework to automatically learn
local haplotype matching and we show that extra-trees are a promising approach for
such purposes. The use of this new technique also reveals some useful lessons on
the relevant features for the purpose of haplotype matching. We also discuss potential
improvements for routine implementation.
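
As an illustration of the kind of feature mentioned above, the following minimal sketch computes, for a given marker position, the length of the segment shared by a target and a reference haplotype and the distance from that position to the nearest edge of the segment. The abstract does not define these features precisely, so the definitions here are assumptions: haplotypes are coded as equal-length allele sequences, a shared segment is a maximal run of identical alleles, and distances are counted in markers. The function names `shared_segment` and `segment_features` are hypothetical and are not taken from the paper.

```python
from typing import Dict, Optional, Sequence, Tuple


def shared_segment(
    target: Sequence[int], reference: Sequence[int], pos: int
) -> Optional[Tuple[int, int]]:
    """Return (start, end), inclusive, of the maximal identical run around pos.

    Returns None when the two haplotypes carry different alleles at pos.
    Assumption: identity by state over consecutive markers stands in for
    a shared segment; the paper may define sharing differently.
    """
    if len(target) != len(reference):
        raise ValueError("haplotypes must have the same number of markers")
    if not 0 <= pos < len(target):
        raise IndexError("pos is outside the haplotype")
    if target[pos] != reference[pos]:
        return None

    # Extend to the left while alleles keep matching.
    start = pos
    while start > 0 and target[start - 1] == reference[start - 1]:
        start -= 1

    # Extend to the right while alleles keep matching.
    end = pos
    while end < len(target) - 1 and target[end + 1] == reference[end + 1]:
        end += 1

    return start, end


def segment_features(
    target: Sequence[int], reference: Sequence[int], pos: int
) -> Dict[str, int]:
    """Compute two illustrative features (in markers) for one position."""
    segment = shared_segment(target, reference, pos)
    if segment is None:
        # No sharing at this position: both features are set to zero.
        return {"shared_length": 0, "distance_to_edge": 0}
    start, end = segment
    return {
        "shared_length": end - start + 1,
        "distance_to_edge": min(pos - start, end - pos),
    }


if __name__ == "__main__":
    target = [0, 1, 1, 0, 1, 0, 0, 1]
    reference = [1, 1, 1, 0, 1, 0, 1, 1]
    print(segment_features(target, reference, 3))
```

In a supervised setting such as the one described in the abstract, values like these would be computed for each candidate reference haplotype at each position and passed to the classifier alongside the other features. Measuring distance in base pairs or in genetic distance rather than in markers would be an equally plausible choice.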
Main file: Faux et al. - 2019 - A Random Forests Framework for Modeling Haplotypes.pdf (2.15 MB)