Chi-square processes for gene mapping in a population with family structure

Detecting a quantitative trait locus, so-called QTL (a gene influencing a quantitative trait which is able to be measured), on a given chromosome is a major problem in Genetics. We study a population structured in families and we assume that the QTL location is the same for all the families. We consider the likelihood ratio test (LRT) process related to the test of the absence of QTL on the interval [0, T] representing a chromosome. We give the asymptotic distribution of the LRT process under the null hypothesis that there is no QTL in any families and under local alternative with a QTL at t⋆∈[0,T]\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$t^{\star }\in [0, T]$$\end{document} in at least one family. We show that the LRT is asymptotically the supremum of the sum of the square of independent interpolated Gaussian processes. The number of processes corresponds to the number of families. We propose several new methods to compute critical values for QTL detection. Since all these methods rely on asymptotic results, the validity of the asymptotic assumption is checked using simulated data. Finally we show how to optimize the QTL detecting process.


Introduction
Detecting a quantitative trait locus, so-called QTL (a gene influencing a quantitative trait which is able to be measured), on a given chromosome is a major problem in Genetics. For example, Li et al. (2006) detected QTL responsible for reduction of grain shattering in cultivated rice, Frary et al. (2000) highlighted the presence of a QTL responsible for tomato fruit size, Silva et al. (2011a, b) looked for QTL affecting lactose in Brazilian Gir dairy cattle. In this paper, we study a population structured in families and we assume that the QTL location is the same for all the families. Each family is a set of offsprings from one sire. The problem is that a QTL can be detected in one family if and only if the sire is heterozygous at the QTL. As a result, geneticists focus on a few families. The individual belongs to a family labeled by i ∈ {1, . . . , I } and its random label C ∈ {1, . . . , I } is distributed according to a multivariate distribution, i.e. P(C = i) = π i , i = 1, . . . , I ; The chromosome will be represented by the segment [0, T ]. The distance on [0, T ] is called the genetic distance, it is measured in Morgans. A so-called "genome information" at location t is denoted X (t) which takes values in {−1, 1}. The admitted model for the stochastic structure of X (.) is due to Haldane (1919) which states that: where for any a ∈ R, δ a denotes the point mass at a and N (.) is a standard Poisson process on the interval [0, T ]. In a more practical point of view, this model assumes no crossover interference and the Poisson process represents the number of crossovers on [0, T ] which happen during meiosis. Let us denote by Y , the so-called Quantitative Trait random variable. The stochastic model, associated to Y is defined by where μ i and q i are respectively the polygenic and QTL effects within family i, and ε is a standard normal random variable. t is the true location of the QTL. Recall that the location t of the QTL is the same for all the families.
In fact the "genome information" will be available only at certain fixed locations called "markers" t 1 = 0 < t 2 < · · · < t K = T and the observation will be (Y, X (t 1 ), . . . , X (t K ), C) .
Our dataset Y j , X j (t 1 ), . . . , X j (t K ), C j=1,...,n is supposed to be obtained by collecting n independent and identically distributed observations (i.i.d.) copies of the random vector (Y, X (t 1 ), . . . , X (t K ), C). A so-called Haldane's function denoted by r is considered. This function, from [0, T ] 2 into [0, 1/2] is defined as follows: with the conventionr (t, t ) = 1 − r (t, t ). For each (t, t ) ∈ [0, T ] 2 the quantity r (t, t ) represents the probability of recombination of two loci located at t and t . It can be proved that, conditionally on X (t 1 ), . . . , X (t K ) and C, Y obeys to the following mixture model with known weights: where f (m,σ ) is the Gaussian density with parameters (m, σ ) and where the function p(t ) is the probability that X (t ) = 1 conditionally on the flanking markers. It can be expressed from the functions r and r , see Sects. 2 and 3. n (t) will denote the likelihood ratio test (LRT) statistic, at location t (see Sect. 2 for a precise definition) of the null hypothesis {q i = 0, i = 1, . . . , I } (i.e. no QTL in any family). The challenge is that the true location t is not known. As a result, at each location t ∈ [0, T ], the presence of a QTL is tested and considering the maximum of n (.) gives the LRT of {q i = 0, i = 1, . . . , I } on the full chromosome. Note that arg sup n (t) is a natural estimator for the QTL location.
Some theoretical results about the LRT process and using approximations, are present, in Rebaï et al. (1994Rebaï et al. ( , 1995, Cierco (1998), Azaïs and Cierco-Ayrolles (2002), Azaïs and Wschebor (2009), Chang et al. (2009). In Azaïs et al. (2014), the focus is on the exact model. However these papers deal with only one family (I = 1). In practice, geneticists look for the QTL not in one family but simultaneously in several families, each of them defined by a different sire. This design is called daughter design (Weller et al. 1990). Since a QTL can be detected in one family if and only if the sire is heterozygous at the QTL, considering a few families increases the chances to study families whose sires are heterozygous at the QTL. As a result, in this paper, we address the problem of the asymptotic distribution of the LRT process when a few families are considered (I ≥ 1). Our main result (Theorems 1 and 2) is that the distribution of the LRT statistic is asymptotically that of the maximum of the square of I independent and "non linear interpolated Gaussian processes". Then, using our theoretical results, we are able to propose methods, as a function of the genetic map, to compute thresholds (i.e. critical values) for QTL detection. Since all these methods rely on asymptotic results, the validity of the asymptotic assumption is checked using simulated data. Moreover, we show how to optimize the detecting process by comparing performances of a global test and a multiple testing procedure. Our methods are available in a Matlab package with graphical user interface: "imfamily.zip". It can be downloaded at http://charles-elie.rabier.pagesperso-orange.fr/doc/articles.html. These methods are alternatives to permutation methods (e.g. Jung et al. (2007)), generally used in genetics, that enable to compute empirically the distribution of the maximum of the process (Churchill and Doerge 1994). Our methods present the advantage to be largely faster than permutation methods. We will also show on simulated data that the mixture model approach is more rewarding than the linearized likelihood approach Haley (1992) which is very popular in our research field. We refer to the book of Van der Vaart (1998) for elements of asymptotic statistics used in proofs. We also refer to Weller et al. (1990), Siegmund and Yakir (2007), Wu et al. (2007) for some genetic background and to Ron et al. (2001), Chen et al. (2006), Weller et al. (2008) for the application field of our study. Typically, the study of Silva et al. (2011a, b) on QTL affecting lactose in Brazilian Gir dairy cattle is an example of application (K = 27, I = 14, n = 657, max π i = 0.19), based on a permutation threshold and on a linearized likelihood.
Our paper ends with an illustration inspired from human real data (Phase 3 release 2 of the HapMap Project). Although the daughter design is not realistic in humans, it is always interesting to use patterns from real data. For instance, as in animal data, human data present the problem of informativeness of genetic markers. These human data, which present a high density of markers (K = 75, 245), can be analyzed easily, with the help of our interpolation described in Theorems 1 and 2. Besides, we were able to recover, on simulated data, the QTL linked to human height on chromosome 7, highlighted by Gudbjartsson (2008) in a Genome-Wide Association Study (GWAS).

Main results: two genetic markers
To begin, we suppose that there are only two markers (K = 2) located at 0 and T : and It is clear that p(t ) is the probability appearing in (2). An application of the rule of total probabilities leads to We can remark that we have Let θ = (q 1 , . . . , q I , μ 1 , . . . , μ I , σ) be the parameter of the model at t fixed and θ 0 = (0, . . . , 0, μ 1 , . . . , μ I , σ) the true value of the parameter under H 0 . The likelihood of the triplet (Y, X (t 1 ), X (t 2 ), C) with respect to the measure λ ⊗ N ⊗ N ⊗ N , λ being the Lebesgue measure, N the county measure on N, is at a position t: Note that the notation g i (t) will be useful in the generalization to several markers (Sect. 3). In what follows, l t (θ ) will be the loglikelihood. We first compute the Fisher information at a point θ 0 that belongs to H 0 : After some calculations, we find Before introducing our main theorem, let us define the LRT statistic and the alternative hypothesis. The LRT at t, for n independent observations, will be defined as where θ is the maximum likelihood estimator (MLE), and θ |H 0 the MLE under H 0 .
On the other hand, in order to define the alternative hypothesis (so-called H λt ), the location t of the QTL has to be added in the definition. The alternative hypothesis will be the following: H λt : "there is a QTL at the position t in at least one family".
Besides, in order to deal with Le Cam (1986)'s theory, we will consider local alternatives.
Theorem 1 Suppose that the parameters (q 1 , . . . , q I , μ 1 , . . . , μ I , σ) vary in a compact and that σ is bounded away from zero. Let H 0 be the null hypothesis {q i = 0, i = 1, . . . , I } and define the following local alternative H λt : "there is at least one q i = λ i / √ n, with λ i ∈ R , at the position t .
With the previous defined notations, as n tends to infinity, under H 0 and H λt where: is the continuous and the non linear interpolated process such as: where σ .
The proof of Theorem 1 is given in Appendix 1. Let us recall the definition of a Chi-square process from Davies (1987).
Definition 1 A Chi-square process W (.) with d degrees of freedom is a process such as: where the V i (t) are independent for each t and distributed as a standardized Normal under the null hypothesis.
As a consequence, the limiting process I i=1 Z i (.) 2 of Theorem 1 is a Chi-square process with I degrees of freedom where the Z i (.) are independent and identically distributed (a particular case of formula 7). Note that Theorem 1 could easily be generalized to selective genotyping experiments, that allow to reduce genotyping costs, by genotyping only extreme individuals. In other words, under selective genotyping, the genome information at markers X (t 1 ), . . . , X (t K ) is available for one individual, if and only if Y ≥ S + or Y ≤ S − , where S + and S − denote two real thresholds (constant). Typically, an additional factor would appear in the mean function m i t (.) of the processes Z i (.) introduced in Theorem 1 (see Rabier (2015) for more details).
On the other hand, Theorem 1 could also be adapted to the interference model, where contrary to Haldane, crossovers do not occur independently from each others. In that case, the functions α(.) and β(.) would be linear functions (see Rabier (2014)).

Several markers: the "Interval Mapping" of Lander and Botstein (1989)
In that case suppose that there are K markers 0 = t 1 < t 2 < · · · < t K = T . We consider values t, t or t of the parameters that are distinct of the markers positions, and the result will be prolonged by continuity at the markers positions. For t ∈ [t 1 , t K ]\T K where T K = {t 1 , . . . , t K }, we define t and t r as: In other words, t belongs to the "Marker interval" (t , t r ).

Theorem 2
We have the same result as in Theorem 1, provided that we make some adjustments and that we redefine each process Z i (.) in the following way: • in the definition of α(t) and β(t), t 1 becomes t and t 2 becomes t r • under the null hypothesis, the process Z i (.) considered at marker positions is the "squeleton" of an Ornstein-Uhlenbeck process: the stationary Gaussian process with covariance ρ(t k , t k ) = exp(−2|t k − t k |) • at the other positions, Z i (.) is obtained from Z i (t ) and Z i (t r ) by interpolation and normalization using the functions α(t) and β(t) • at the marker positions, the expectation is such as m i • at other positions, the exception is obtained from m i t (t ) and m i t (t r ) by interpolation and normalization using the functions α(t) and β(t).
A proof is given in Appendix 2. Note that when the number of genetic markers is infinite, each process Z i (.) is an Ornstein-Uhlenbeck process. As a consequence, when the number of genetic markers is infinite, I i=1 Z i (.) 2 is an Ornstein-Uhlenbeck Chi-square process with I degrees of freedom (OUCS(I)) since the processes Z i (.) are independent. In Fig. 1, we consider a chromosome of length T = 60 cM and 3 families (i.e. I = 3). We focus on two genetic maps: • an infinite number of genetic markers • only 4 markers located every 20 cM.
One path of each asymptotic process is presented in this figure. We can notice that the path of the OUCS(3) is very jerky whereas the path of the process corresponding to the sparse map is smooth due to the interpolation between markers.

Introducing the methods
We propose several new methods, as a function of the map considered, to compute thresholds for the supremum of the LRT process under H 0 . In particular, two kinds of maps are studied: • a sparse map: a few markers covering the chromosome • a dense map: a high density of markers pretty close to each other.
We will assume that when the map is dense, tests are performed only on markers, whereas when the map is sparse, tests are also performed between markers (cf. Example 11.3, p. 248 of Wu et al. (2007)). Under a sparse map, thresholds can be obtained according to the most appropriate method which depends on the number of families I : • For I = 1, the problem is the same as computing the distribution of the maximum, i-absolute, value of a Gaussian vector. This can be done by a Discrete Monte-Carlo Quasi Monte-Carlo method (DMCQMC). In particular, the method for numerical computation of a multivariate normal probability (Genz 1992) can be considered. It uses a transformation that simplifies the problem and places it into a form that allows efficient calculation using MCQMC methods. A simple MC method using N points has errors that are typically O(1/ √ N ) whereas a MCQMC method has errors approximatively O(1/N ), that's why the focus here is on MCQMC.
• For I > 1, a Discrete Monte-Carlo (DMC) method can be performed. According to Theorem 2, when we test only on markers, the asymptotic process is a Discrete Ornstein-Uhlenbeck Chi-square process with I degrees of freedoms (DOUCS(I)). In this case, the processes Z i (.) are simply AR(1) processes. Then, in order to obtain values between markers, we can complete by interpolation using formula (6). As a result, the threshold is easily obtained by a DMC method based on a large number of sample paths (denoted nspaths) of the asymptotic process.
Under a dense map, we propose theoretical methods to obtain the thresholds. As mentioned previously, when the number of genetic markers is infinite, the LRT process is asymptotically an OUCS(I) process. In Rabier and Genz (2014), we propose an approximative formula (named DF here) for the threshold of the supremum of the OUCS(I) process. It is based on Delong (1981)'s work on Brownian motion. This formula is suitable when I and the threshold are large. Besides, statistical tables given by Estrella (2003), for the threshold of the supremum of the OUCS(I), are also available. Note that, in order to obtain its exact tables, Estrella improved Delong's work on hypergeometrics functions. In the following, Estrella's method will be denoted ET. Table 1 is a summary of the different methods.

Applications under the null hypothesis
In this section, the focus is on thresholds corresponding to the 95 % quantile of the supremum of the LRT process under H 0 . In order to illustrate the different methods, a sparse map and a dense map are considered. Since all the methods are based on asymptotic results (cf. Theorems 1 and 2), populations of different sizes have been simulated in order to check when the asymptotic regime is reached. In what follows, npop will denote the number of populations whereas n is the size of a population. The map consists of 4 genetic markers equally spaced every 20 cM (T = 60 cM). A test is performed every 5 cM The map consists of 4 genetic markers equally spaced every 20 cM (T = 60 cM). A test is performed every 5 cM (σ = 1, μ 1 = −0.37, μ 2 = 0.03, μ 3 = 0.06, μ 4 = −0.26, μ 5 = 0.27, npop = 40,000)

Sparse map
The sparse map consists of a chromosome of length T = 60 cM with 4 genetic markers equally spaced every 20 cM. The presence of a QTL is tested every 5 cM.
In Table 2, thresholds are presented as a function of I . In Table 3, the focus is on the number of false positives (NFP) as a function of the number of individuals n (thresholds given in Table 2). Using the Binomial distribution, a 95 % confidence interval is computed (in brackets in the tables) for the true percentage of the number of false positives. According to Table 3, when there are on average 200 individuals per family (i.e. n = 200 I ), NFP is not significantly different from 5 %. When n = 50 I , we can consider that NFP is still fair (even if it is significantly different from 5 %) whereas when n = 30 I , NFP is not so nominal.

Dense map
The dense map consists of a chromosome of length T = 50 cM with 501 genetic markers equally spaced every 0.1 cM.
The thresholds and the NFP are respectively compared in Tables 4 and 5. This aspect suggests fast convergence to the asymptotic regime.

Remark
ET is not appropriate for the sparse map for two reasons: • ET is based on Ornstein-Uhlenbeck (OU) process which is much more irregular than the process Z 1 (.) (OU can be viewed as a stationary version of the Brownian motion). When I = 1, this can be formalized by the use of Slepian type inequalities, specially Lemma 2.1 in Azaïs and Wschebor (2009) which comes from Plackett (1984). It can be proved that the covariances are smaller in the case of OU process than for the process Z 1 (.). It implies that the maximum of OU is stochastically greater than the maximum of Z 1 (.). Since P sup Z 1 (.) > u ≈ 2P sup Z 1 (.) > u , this argument can be approximately extended to the absolute value. • For the sparse map, the focus is not on the continuous process but on the dicrete process: the maximum of a continuous process is always greater than the discrete one.
To sum up, ET will give too large thresholds.

Motivation
A few sires are heterozygous at the QTL and others are homozygous. As mentioned previously, a QTL can only be detected in a family defined by an heterozygous sire. Thus, two questions arise: • Is it always profitable to include all the families in the analysis?
• Do we have to analyze families all together or separately?
We consider here, the sparse map of Sect. 4.2. As previously, tests are performed every 5 cM. We will consider tests at the 5 % significance level.

About the QTL effects
When we deal with I families, since the total number of individuals is n, the expected number of individuals in family i is only nπ i . Hence, in order to study the evolution of the power of the Interval Mapping with the number of families, we will consider λ i = λ √ π i (note that when I = 1, we have λ 1 = λ since π 1 = 1). As a consequence, the mean function, m i t (t), of the asymptotic process Z i (.), is proportional to λ and does not depend on i (cf. Theorem 1 and 2).

How to optimize the QTL detecting process
Only asymptotic results are studied here (cf. Theorems 1 and 2). Figures 2, 3, and 4 are related to question 1 whereas Figs. 5, 6, and 7 are related to question 2.
In Figs. 2, 3, and 4, the power is represented as a function of the QTL location, the number of families and the QTL effects. In Figs. 5, 6, and 7, we compare the power of the approach which consists in analyzing all families together (as previously), and the power of the approach which consists in analyzing families separately. Note that we used a discretization for the QTL location t (every 5 cM).
As expected, when all the sires are heterozygous, the power increases with the number of families (cf. Fig. 2 for λ = 2). Besides, for a given number of families, the power increases with the proportion of heterozygous sires (cf. Fig. 3 for I = 5, λ = 2 and various number number nz of non zeros λ i 's).
According to Fig. 4, it is almost as powerful to consider only one family whose sire is heterozygous (cf. curve I = 1 with nz = 1), as to consider 5 families with only two heterozygous sires (cf. curve I = 5 with nz = 2). As a result, it is much more powerful to consider one family of an heterozygous sire (cf. curve I = 1 with nz = 1) as to consider 5 families with only one heterozygous sire (cf. curve I = 5 with nz = 1). Hence, if the families could be sorted in advance, it would be more powerful to concentrate the analysis on the families with a segregating QTL (i.e. families of heterozygous sires). Furthermore, once the families targeted, it would be more powerful to remove the families with very small QTL effects (not illustrated here). Indeed, these families are a source of noise to our model.
In practice, the segregating families are not known before the statistical analysis and the true question is: do we have to analyze all the families together (so-called "global approach") or analyze families separately (so-called "Bonferroni approach") ? Indeed, since our results are asymptotic, the variance is not better estimated when the global approach is considered. Figures 5, 6, and 7 are related to these two approaches. When the global approach is considered and when H 0 is rejected, it only comes out that there is a QTL in at least one family (i.e. at least one sire is heterozygous), but this family is not known. As a result, in order to answer the same question, we define, for the Bonferroni approach, the test statistic U and the critical region C R, which results from a Bonferroni correction: The Bonferroni correction allows to have P H 0 (U ∈ C R) ≤ 0.05. Obviously, the power of the Bonferroni approach is P H λt (U ∈ C R). Note that we could have considered other multiple testing procedures (e.g. Benjamini and Hochberg (1995), Didelez et al. (2006)).
In Fig. 5, the focus is on the particular case for which there is only a QTL in family 1. The power of the two approaches is represented as a function of the QTL location t and the number of families. We can notice that the Bonferroni approach is more powerful than the global approach. In Fig. 6, the focus is on the particular Crosses refer to I = 12, rectangles to I = 7 and stars to I = 5 (λ = 2, σ = 1, nspaths = 100,000). The map consists of 4 genetic markers equally spaced every 20 cM (T = 60 cM). A test is performed every 5 cM case for which there is a QTL in each family. In that case, the Bonferroni approach is outperformed by the global approach. Figure 7 represents the mean power of the two approaches. Every alternative hypotheses have been considered (i.e. for a given I , we have considered nz = 1, . . . , I ). Equiprobability concerning all these hypotheses has been supposed. According to the figure, for a given number of families, there is a mean increase in terms of power of at least 15 % when the global approach is considered.

Conclusion
It comes out from this study that in order to optimize the QTL detecting process, it is required: • To target, whenever possible, families with the largest QTL effects and then, to analyze all these families together. • When it is not possible to target families, to analyze all the families together.

Comparison between different global tests 6.1 Mixture model vs linearized likelihood
Tables 6 and 7 compare on the sparse map, two approaches regarding the data analysis. Note that different number of families have been considered assuming that all the sires were heterozygous (λ = 2, π i = 1/I ). The first approach is the one theoretically studied in this paper. It relies on the mixture model and MLE are computed every 5 cM, with the help of the EM algorithm. The thresholds used are the same as in Table  2. The second approach is the linearized likelihood method (Haley 1992). It consists in approximating the mixture of formula (2) by only one distribution: As a consequence, when this approximation is used, analytical formulas (as in a linear model) are available regarding the different estimators. Besides, thresholds are usually obtained from permutation tests, that enable to compute empirically the distribution of the maximum of the process (Churchill and Doerge 1994).
According to Tables 6 and 7, the method based on the mixture model is more powerful in all cases. The largest difference of power is observed for n = 30 I (approximately 6 %).
Note that for n = 200 I , the approach relying on the mixture model is still slightly more interesting. We will show in Sect. 8 that a major drawback of the permutation method is that it requires a large amount of time to get computed, which is not the case of our proposed methods.
Last, let us focus on the method based on mixture model: we can notice in Tables 6 and 7 that the theoretical power is always located, whatever the value of n, in the 95 % confidence interval for the true value of the power. It validates our asymptotic results.

Interpolated process vs discrete process
We propose to compare here the powers of two statistical tests. The first one is the LRT studied in details in this paper. Recall that it is based on the test statistic: The map consists of 4 genetic markers equally spaced every 20 cM (T = 60 cM). A test is performed every 5 cM (λ = 2, t = 25 cM, nspaths = 100, 000 for the theoretical power and npop = 10, 000 for the Empirical Power, μ 1 = −0.37, μ 2 = 0.03, μ 3 = 0.06, μ 4 = −0.26, μ 5 = 0.27, σ = 1, nz = I ). Thresholds are given between parentheses, and confidence intervals for the true value of the power are given between brackets where Z i (.) is the interpolated Gaussian process obtained in Theorems 1 and 2. The second one relies on the test statistic:  where t 1 , · · · , t K are the markers positions. The aim of this comparison is to quantify the usefulness of the interpolated process in the QTL detection.
We consider a chromosome of size T = 1 M and two different genetic maps: • map 1 consists in 6 markers equally spaced every 20 cM • map 2 consists in 51 markers equally spaced every 2 cM.
One QTL is present on the chromosome at t = 0.4 M (on a marker) or t = 0.5 M (between two markers) for map 1 and at t = 0.4 M (on a marker) or t = 0.51 M (between two markers) for map 2. For each test statistic, threshold and power are based on 100,000 paths of the corresponding process under the null hypothesis and under the alternative of one QTL located at t in all families (λ = 2, π i = 1/I ). The power gain is defined as the difference between the power of the LRT and the power of the test relying only on marker locations. A confidence interval for the power gain is given based on 30 simulations of 100,000 paths. According to Table 8, when the QTL is located on a marker, there is no need to use the interpolation results to test for a QTL: the power gain is too low or even slightly negative in some cases. When the QTL is located between the markers, the power gain is all the more important as the map is more sparse; furthermore, when we increase the intensity of the Poisson process (modelling the number of recombinations), we decorrelate the markers and the power gain is more important using the interpolation results.

Behavior of the LRT when the main assumptions are violated
The proposed LRT is based on two main assumptions. First, we consider that the QTL location is the same for all the families. Secondly, we assume that there is only one QTL located on [0, T ]. In this section, we investigate the behavior of the LRT when these assumptions are violated.

QTL locations are different across families
Let us consider the case where the true QTL location is different across families. In what follows, t i will denote the QTL location for family i. In this context, the mean function of the process Z i (.) is still an interpolated function relying on the functions α(t) and β(t), except that the quantities m i t (t 1 ) and m i t (t 2 ), from Theorem 1, are now replaced by the quantities m i t i (t 1 ) and m i t i (t 2 ) defined in the following way: The proof is the same as the proof of Theorem 1, provided that we replace X (t ) by X (t i ) in formula (14). In presence of several markers, it can be seen easily that the formula becomes Tables 9 and 10 focus on the cases I = 3 and I = 5 respectively. The genetic map considered is the sparse map (introduced in Sect. 4.2). Table 9 investigates two different scenarios: 1. the QTLs present in the first two families, are located on the same genetic marker (t 1 = t 2 = 0.2) 2. the QTLs present in the first two families, are located at the middle of a marker interval (t 1 = t 2 = 0.5) Note that for both scenarios, the location t 3 of the QTL in the third family is allowed to vary along the genome. According to Table 9, for scenario 1, the theoretical power reaches its maximum when t 3 takes the same value as t 1 and t 2 . In other words, the QTLs have to be located on the same genetic marker, for the three families. However, under scenario 2, the maximum power is reached when t 3 is equal to 0.4, which is a different location from the QTL locations in the first two families. This surprising I = 3, T = 0.60, K = 4, ∀k = 1, . . . , 4 t k = 0.20(k − 1), ∀i λ = λ i √ π i = 2, σ = 1, 100,000 paths for the theoretical power  result is due to the fact that the signal is maximum when the QTL is located on a genetic marker, whereas there is a loss of power when the QTL is not located on a genetic marker. Figure 8 illustrates these two scenarios in a noiseless setting (i.e. ideal situation). As expected, for scenario 1, the signal is maximum when t 1 = t 2 = t 3 = 0.2. Under scenario 2, we can observe that it is maximum for t 3 = 0.4 and t 3 = 0.6. As a result, it is more rewarding to have a QTL located on one genetic marker in one family, than all QTLs located at the middle of a marker interval. Recall that the Theoretical power, computed in Table 9, was obtained by Monte Carlo and in presence of noise in the model. It was maximum for t 3 = 0.4 under scenario 2. Last, we obtain the same kind of conclusions when we deal with 5 families (cf . Table 10). Table 11 Theoretical power and empirical power as a function of the number m of QTLs, their effects, and the number I of families

Several QTLs are lying on the genome
Let us consider now that m QTLs lie on the genome, at locations t (1) < t (2) < . . . < t (m) . In order to make the reading easier, we will assume that the QTLs are located at the same location across families. Let q s,i denote the effect of the s-th QTL in family i and let λ s,i denote the constant such as q s,i = λ s,i / √ n. In this context, the mean function of the process Z i (.) is still an interpolated function relying on the functions α(t) and β(t), except that m i t (t 1 ) and m i t (t 2 ) are now replaced by the quantities m it (t 1 ) and m it (t 2 ) defined in the following way: The proof relies heavily on the proof of Theorem 1, provided that we replace formula (13) by the following formula: where S 0 n (., i) is the process obtained under H 0 .
In presence of several markers, it can be seen easily that formula (8) becomes σ . Table 11 focuses on the sparse map and investigates the power of the LRT when either 1, either 2 or 3 QTLs lie on the genome. We studied in particular the following configurations: Note also that the power is reported as a function of the QTL effect signs. For all cases, the absolute value of the constant linked to the QTL effect was equal to 2 (i.e. ∀(s, i), |λ s,i √ π i | = 2), in order to deal with small QTL effects. The theoretical power was obtained by generating 100,000 paths of the asymptotic process, whereas 1000 samples of size n were considered regarding the empirical power. We studied the cases I = 1, I = 2, I = 3, and a sample of size n = 200 I , n = 50 I , or n = 30 I . According to Table 11, there is a good agreement between the empirical power and the theoretical power for n = 200 I . Besides, the power increases with the number of families whatever the scenario studied. As expected, the power associated to the usual case m = 1 is fair in all cases, in particular for I = 3, 5. However, when the number of QTLs increases (m = 2 or m = 3), the power highly depends on the QTL effect signs and the distance between QTLs. For instance, for m = 2, we observe a large decrease when the 2 QTLs of opposite signs become closer from each other. Last, we can notice that using the LRT under a 3 QTLs scenario can be more powerful than when only 1 QTL lie on the genome. To conclude, the use of the test statistic sup n (.) is appropriate for testing and localizing one QTL on [0, T ], but it is not so reliable when more than one QTL (i.e. m > 1) lie on [0, T ]: it highly depends on the parameter values.

Illustration on human data
To conclude this study, we propose to give an illustration inspired from human real data. Although the daughter design is not realistic in humans, it is always interesting to use patterns from real data.
According to its dedicated website at http://hapmap.ncbi.nlm.nih.gov, "the International HapMap Project is a multi-country effort to identify and catalog genetic similarities and differences in human beings. The Project is a collaboration among scientists and funding agencies from Japan, the United Kingdom, Canada, China, Nigeria, and the United States. The goal of the International HapMap Project is to compare the genetic sequences of different individuals to identify chromosomal regions where genetic variants are shared." In this context, we downloaded data from Phase 3 release 2 of the HapMap Project. We focused on phased haplotype data of the following populations: We investigated the presence of a QTL linked to human height on chromosome 7 in humans. In a previous GWAS, Gudbjartsson (2008) highlighted the presence of a QTL located on a marker called "rs798544". As a consequence, our goal here is to check, with the help of simulated data, if we are able to recover this QTL using our global test.
To begin with, let us introduce the notion of marker informativity. At a given marker k, in order to know the genome information X (t k ) of the offspring of one sire, the sire has to be heterozygous at this genetic marker, otherwise the marker is considered as uninformative. As a consequence, the "genome information" is available only at certain fixed locations called "informative markers", instead of "markers". These "informative markers" depend obviously on the family (i.e. on the sire): one given marker can be informative in one given family, and uninformative in other families. In what follows, 0 ≤ t i 1 < t i 2 < · · · < t i K i ≤ T will denote the locations of informative markers in family i. Note that K i refers to the number of informative markers in this family. As a result, an observation is now Let T i K i be the quantity such as T i K i = {t i 1 , . . . , t i K i }. In this context, it is straightforward to show that the asymptotic process Z i (.) verifies now ∀t ∈ [t i 1 , t i K i ]\T i K i : and the α i (t) and β i (t) are the analogue of α(t) and β(t), relying now on the informative markers of family i. A proof is given in Sect. 1. The mean function of Z i (.) is still an interpolated function, based now on α i (t) and β i (t). Last, the limiting process 2 is a Chi-square process with I degrees of freedom where the Z i (.) are independent and not identically distributed. Note that the cases t i 1 = 0 and t i K i = T are discussed in the proof in Sect. 1 (see also below).
According to Hapmap data, chromosome 7 is of length T = 1.86M. A total of 75,245 markers are available and the locations of these markers are perfectly known. We considered a maximum of 14 families, and only 63,112 markers were found to be informative in at least one family. Table 12 gives the number of informative markers in each family. We can notice that the number of informative markers varies from 19,016 to 22,137: a large decrease in terms of marker density is observed, after filtering.
In the same way as what has been done before, we focused on different number of families: I = 6, 10, 14. The set of chosen families was {1, 4, 6, 8, 10, 12} for I = 6, and {1, 2, 4, 6, 8, 9, 10, 11, 12, 13} for I = 10 (cf . Table 12), ensuring the presence of all the different kinds of populations (CEU, CHB, ...). The number of informative markers in at least one family was equal to 54,420 for I = 6, and equal to 60,819 for I = 10. In what follows, we will describe our simulation framework only for the case I = 14. Other cases can be deduced easily. The genome of each sire was created by considering two haplotypes from the phased data of the chosen population. Note that once the sire's genome is built, we can extract easily the informative markers (i.e. the heterozygous markers). 50 offsprings were generated by family, and the LRT was computed over the grid defined by the 63,112 markers informative in at least one family. When the location did not match an informative marker in a given family, the value of Z i () was obtained by interpolation with the help of the functions α i (.) and β i (.). Besides, when the first informative marker did not match the first extremity of the chromosome (i.e. t i 1 = 0), we considered (cf. proof in Sect. 1),  Table 13 compares thresholds obtained according to three methods available under a dense map: DF, DMC, and the permutation method. Note that ET is not reported since statistical tables given by Estrella (2003) do not cover the case T = 1.86. The DMC method was computed with the help of I independent AR(1) processes: we generated paths of a DOUCS(I) with a constant discretization step equal to 1.86/63,111. In other words, it assumes that the informative markers are the same across families, and equally spaced. Last, the permutation threshold was obtained by generating one population: it handles the fact that informative markers change across families. According to Table  13, as expected, DF gives the largest threshold. Recall that it relies on the continuous OUCS(I). We can also notice that, for I = 10 and I = 6 the DMC threshold is greater than the one obtained by permutation. It is not the case for I = 14.
Then, the true level associated to each method was computed by generating 1000 populations. For instance, for I = 10, it was found respectively equal to 3.3, 3.6, and 4.1 % for DF, DMC and the permutation method. In all cases, DF was the most conservative method, and the shuffling method, which handles the informativity correctly, seemed less conservative than DMC.
However, the permutation method has a major drawback. It requires a large amount of time in order to compute the threshold (112h24 for I = 14, 78h10 for I = 10, 42h30 for I = 6). In opposite, DF can be obtained instantaneously, and DMC computation time remains reasonable (4h06 for I = 14, 2h40 for I = 10, 1h29 for I = 6).
Recall that our analysis relies only on one chromosome (number 7) in humans. In this context, our proposed methods seem to be the most appropriate for a whole genome study (23 chromosomes). Another advantage of our theoretical study is the following. In order to obtain the values of the process Z i (.) at each uninformative marker of family i, it is now possible to complete by interpolations with the functions α i (.) and β i (.). Usually, geneticists perform either an EM algorithm to compute the MLE, or they choose a linearized likelihood method (cf. Sect. 6). The EM algorithm is time consuming, and performing a linearized likelihood method at each uninformative marker can also be challenging, specially when the number of such markers is large. In contrast, completing by interpolation is very fast.
To conclude, Table 14 investigates the power. In most cases, the permutation method is slightly more powerful than DF and DMC. Besides, as expected, considering families without QTL, decreases the power of the global test.
To conclude, in most cases, we were able, with our simulated data, to detect the QTL linked to human height on chromosome 7, highlighted by Gudbjartsson (2008).
Let S n (., i) be the following process, for n observations: According to Lemma 1, We will call Z i (.) the limiting process of S n (., i).

Study under H 0
Without loss of generality, let us assume n = 1 and let us consider the process S(., i) defined in the following way: where h(t) = x(t)/ E x 2 (t) . h(.) is a random process, independent of Y and C. It is easy to see that E {S(t, i)} = 0, V {S(t, i)} = E {h(t)} 2 = 1.

So, we have
E Z i (t) = 0, V Z i (t) = 1 and Cov Z i (t 1 ), Z i (t 2 ) = ρ(t 1 , t 2 ). A direct application of central limit theorem implies that Z (t 1 ) and Z (t 2 ) have a limit distribution which is a Gaussian distribution. According to formula (9), we have n (t) = I i=1 S 2 n (t, i) + o P θ 0 (1). As a result, n (.) Due to the interpolation, we have

Study of the supremum of the LRT process
Since the model with t fixed is regular, we have the relationship (cf. section "Study under H 0 ") under the null hypothesis. Our goal is now to prove that the rest above is uniform in t.
Let us consider now t as an extra parameter. Let t * , θ * be the true parameter that will be assumed to belong to H 0 . Note that t * makes no sense for θ belonging to H 0 . It is easy to check that at H 0 the Fisher information relative to t is zero so that the model is not regular.
It can be proved that Assumptions 1, 2 and 3 of  holds. So, we can apply Theorem 1 of where the observation X j stands for Y j , X j (t 1 ), X j (t 2 ), C j and where D is the set of scores defined in , see also Gassiat (2002). A similar result is true under H 0 with a set D 0 . Let us precise the sets of scores D and D 0 . This sets are defined at the sets of scores of one parameter families that converge to the true model p t * ,θ * and that are differentiable in quadratic mean. It is easy to see that where l is the gradient with respect to θ . In the same manner where now the gradient is taken with respect to μ 1 , …, μ I and σ only. Obviously, this gradient does not depend on t.
Using the transform U → −U in the expressions of the sets of score, we see that the indicator function can be removed in formula (15). Then, since the Fisher information matrix is diagonal (see formula (5)) , it is easy to see that This is exactly the desired result. Since the model with t * fixed is differentiable in quadratic mean, the alternative defines a contiguous sequence of alternatives. By Le Cam's first lemma, relation (15) remains true under the alternative.

Lemma 2
We have the following relationship: Let S n (., i) be the following process, for n observations: According to Lemma 2 We will call Z i (.) the limiting process of S n (., i). Let us consider now the case where the first informative marker does not lie at the beginning of the chromosome (0 < t i 1 ). Let t ∈ [0, t i 1 [, we have Recall that in the classical situation, when t have two flanking markers: x i (t) = 2 P X (t) = 1 | X (t ,i ), X (t r,i ), C = i − 1. In our case,x As a result, ∀t ∈ [0, t i 1 [ S n (t, i) = S n (t i 1 , i).
By symmetry, when t i K i < T , we have ∀t ∈]t i K i , T ] S n (t i K i , i) = S n (t, i).
To conclude, we just have to use same kind of arguments as in formula (9) in order to prove that the LRT process converges asymptotically to the process I i=1 Z i (.) 2 .