Old wine in new bottles: Factorial analyses in the age of multi-omics
Abstract
Factorial analyses are used across a wide variety of modern scientific disciplines, yet they have a long and rich history. Principal Component Analysis (PCA), arguably the best known among them, can be traced back over a century to work published by Karl Pearson in 1901. However, the potential of these methods could only be fully explored more recently, with the advent and rapid rise in power of computers.
Roughly speaking, factorial analyses aim at reducing the dimensionality of a dataset while retaining as much of the variation present in the dataset as possible. This is achieved by performing a singular value decomposition of the dataset, occasionally following an appropriately chosen transformation.
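As a minimal sketch of the decomposition described above, the following Python example performs a PCA through a singular value decomposition of the column-centered data table. The function name `pca_svd`, the choice of centering as the only transformation, and the simulated data are assumptions made for illustration; other factorial methods differ mainly in the transformation and weighting applied before the SVD.

```python
import numpy as np

def pca_svd(X, n_components=2):
    """PCA of an (n_observations, n_variables) array via SVD of the centered table."""
    Xc = X - X.mean(axis=0)                       # transformation: column centering
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # observation coordinates
    loadings = Vt[:n_components].T                    # variable directions (unit norm)
    explained = (s**2 / np.sum(s**2))[:n_components]  # share of total variation per axis
    return scores, loadings, explained

# Simulated table (illustrative only): 10 observations, 500 variables
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 500))
scores, loadings, explained = pca_svd(X)
```

The scores are the projections of the centered observations onto the retained axes, and `explained` gives the fraction of total variation each axis retains.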
Initially, the use of these methods was restricted to the analysis of small single data tables with only a few variables. However, from the 1960s onward, advances in both data collection and computing power led to a renewed interest in these methods, with the so-called “Exploratory Data Analysis” approach (“Let the Data speak”), championed by John Tukey, and the development of multivariate descriptive analysis tools, mainly in France with Jean-Pierre Benzécri, but also in other countries including Japan and Italy.
Their scope has gradually been extended to the analysis of two or more linked data tables (multi-table analyses, e.g., Multiple Factor Analysis), as well as to approaches that explicitly model covariates (e.g., Redundancy Analysis).
More recently, technological advances have enabled the generation of high-throughput biological data, corresponding to large amounts of heterogeneous, complex and high-dimensional data at multiple molecular levels (so-called multi-omics data, including the genome, transcriptome, methylome, etc.). Comprehensive exploratory, descriptive and predictive analyses of these multi-omics data are critical for realizing their full potential.
The framework of factorial analyses offers a range of flexible methods for this purpose, as it can integrate several types of data (binary, qualitative, quantitative) and diverse data structures. For instance, multi-block analyses (e.g., Multiple Factor Analysis and Co-inertia Analysis) give a synthetic view of the relative influence of each data block and provide a consensus representation of the full dataset. Partial Triadic Analysis is a special case of multi-block analysis that deals with so-called data cubes, where omics data are repeated (e.g., across tissues or across time). Interestingly, unlike predictive methods, these descriptive methods do not include a matrix inversion step, implying that they can be used without constraint on the number of samples (observations) relative to the number of variables, thus obviating the need for regularization. This represents a considerable advantage in the context of multi-omics data, where data blocks (e.g., transcriptome-wide expression in a given tissue) are often available for a limited number of observations (e.g., animals) measured on a huge number of variables (e.g., genes).
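To make the multi-block idea concrete, here is a hedged sketch of a Multiple Factor Analysis on quantitative blocks: each block is centered, divided by its first singular value so that no block dominates, and the concatenated table is then decomposed by SVD. The function name `mfa`, the simulated blocks, and the choice to center without scaling variables to unit variance are assumptions for illustration. Note that no matrix is inverted, so the sketch runs with far more variables than observations.

```python
import numpy as np

def mfa(blocks, n_components=2):
    """Sketch of MFA for quantitative blocks sharing the same observations (rows)."""
    weighted = []
    for B in blocks:
        Bc = B - B.mean(axis=0)                           # center each variable
        s1 = np.linalg.svd(Bc, compute_uv=False)[0]       # first singular value of the block
        weighted.append(Bc / s1)                          # balance the influence of blocks
    Z = np.hstack(weighted)                               # concatenated, weighted table
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)      # global analysis
    scores = U[:, :n_components] * s[:n_components]       # consensus coordinates
    # Share of each retained axis attributable to each block (columns sum to 1)
    edges = np.cumsum([0] + [B.shape[1] for B in blocks])
    contrib = np.array([(Vt[:n_components, a:b] ** 2).sum(axis=1)
                        for a, b in zip(edges[:-1], edges[1:])])
    return scores, contrib

# Simulated blocks (illustrative only): 8 observations, very different sizes and scales
rng = np.random.default_rng(1)
block_a = 100.0 * rng.normal(size=(8, 300))   # e.g., a large, high-variance block
block_b = rng.normal(size=(8, 50))            # e.g., a small, low-variance block
scores, contrib = mfa([block_a, block_b])
```

Here `scores` plays the role of the consensus representation of the observations, and each column of `contrib` splits one axis between the blocks, which is one way to read the relative influence of each block.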
Finally, Redundancy Analysis (RDA), which corresponds to the supervised (constrained) version of factorial analysis methods, can be used to model omics data with respect to a set of covariates, such as sanitary status, geography, climate, or herd system. One limitation is that such an approach can only account for a number of covariates that is small relative to the number of observations.
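The constrained analysis described above can be sketched as a two-step procedure: a least-squares regression of the centered omics table on the centered covariates, followed by an SVD of the fitted values. The function name `rda`, the simulated data, and the use of `numpy.linalg.lstsq` for the regression step are assumptions for illustration, not a reference implementation. The last lines illustrate the stated limitation: with as many covariates as the data can support (here 29 for 30 observations), the covariates trivially explain all of the variation.

```python
import numpy as np

def rda(Y, X, n_components=2):
    """Sketch of RDA: regress Y (n x p) on covariates X (n x q), then PCA of the fit."""
    Yc = Y - Y.mean(axis=0)
    Xc = X - X.mean(axis=0)
    coef, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)     # least-squares fit, one column per variable
    Y_fit = Xc @ coef                                  # part of Y explained by the covariates
    U, s, Vt = np.linalg.svd(Y_fit, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]    # constrained coordinates
    constrained = (Y_fit ** 2).sum() / (Yc ** 2).sum() # share of variation explained by X
    return scores, constrained

# Simulated data (illustrative only): 30 observations, 200 variables, 2 covariates
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))
Y = X @ rng.normal(size=(2, 200)) + rng.normal(size=(30, 200))
scores, constrained = rda(Y, X)

# With 29 unrelated covariates for 30 observations, the fit is perfect by construction
X_many = rng.normal(size=(30, 29))
_, constrained_full = rda(Y, X_many)
```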
We will illustrate the broad interest of different factorial analysis methods in the context of multi-omics data through three specific examples:
a) Quantification of the structuring impact of geography on genetic diversity in cattle and goats: an RDA on Italian bovine and ovine genomic data was used to quantify and compare the effect of geography on the genetic structuring of both species.
b) Identification of covarying gene expression and metabolite levels in layer chickens: a Co-inertia Analysis identified significant covariation between metabolomic and transcriptomic data of layer chickens subjected to different abiotic stresses.
c) Identification and characterization of individuals with atypical multi-omic profiles in a large-scale human cancer study: a Multiple Factor Analysis was extended in the padma approach to characterize global sources of variability from multi-omic data, identify individuals with atypical profiles within a population, and highlight genes and omics with a strong contribution to these profiles.
These examples illustrate the versatility of factorial analysis methods in the multi-omics era, and notably their strengths for tackling a variety of questions and structures and dealing with high-dimensional, complex, and heterogeneous data.