Parametric versus nonparametric: the fitness coefficient

The fitness coefficient, introduced in this paper, results from a competition between parametric and nonparametric density estimators within the likelihood of the data. As illustrated on several real datasets, the fitness coefficient generally agrees with p-values but is easier to compute and interpret. Namely, the fitness coefficient can be interpreted as the proportion of data coming from the parametric model. Moreover, it can be used to build a semiparametric compromise that improves inference over both the parametric and nonparametric approaches. From a theoretical perspective, the fitness coefficient is shown to converge in probability to one if the model is true and to zero if the model is false. From a practical perspective, its utility is illustrated on real and simulated datasets.


Introduction
A challenge of data analysis is to assess the quality of a model. The traditional approach relies on goodness-of-fit tests where, loosely speaking, the ability of a model to fit the data is measured through distances between the observed values and the values expected under the model. Examples include the classical Pearson chi-squared test [2], the Kolmogorov and Cramér-von Mises goodness-of-fit tests [3,8,10], and likelihood-ratio based statistics [5,6,50] (see [5] for an empirical likelihood approach).
In this context, p-values have emerged as natural instruments to measure the amount of evidence in favor of the model. However, the use of p-values is subject to several difficulties: (i) their calculation might require computationally intensive strategies such as the bootstrap [21,24,45]; (ii) their interpretation is notoriously difficult [29,38], as emphasized again in a recent ASA statement [52]; (iii) whenever some evidence has been found against the model, no information is delivered to improve inference.
In this paper, we introduce the fitness coefficient, a new criterion for simultaneously measuring the amount of evidence of a model and improving inference. Our goal is to provide an alternative approach to the use of p-values in goodness-of-fit testing that is no longer sensitive to the difficulties (i)-(iii).
Let X_1, . . . , X_n be independent d-variate observations with common density f_0. Let P = {f_θ : θ ∈ Θ} be a family of probability density functions representing the model. Given the maximum likelihood estimator f_θ̂_n (based on the model) and the standard kernel density estimator f̂_n (free from the model) with kernel K : R^d → R_+ and bandwidth h_n > 0, define the fitness coefficient α̂_n as
$$\hat\alpha_n \in \operatorname*{argmax}_{\alpha\in[0,1]} \sum_{i=1}^n \log\Bigl(\alpha f_{\hat\theta_n}(X_i) + (1-\alpha)\,\hat f^{LR}_{i,n}\Bigr), \qquad (1)$$
where f̂^{LR}_{i,n} is called the leave-and-repair (LR) kernel estimate of f_0(X_i) and is given by
$$\hat f^{LR}_{i,n} = \frac{1}{(n-1)h_n^d}\sum_{j\neq i} K\Bigl(\frac{X_i - X_j}{h_n}\Bigr) + \Delta_n\, q(X_i), \qquad (2)$$
with Δ_n ≥ 0 and q : R^d → R_+. The LR estimate is a modification of the well-known leave-one-out (LOO) estimate usually employed in cross-validation procedures [18] and semiparametric estimation [9]. The fitness coefficient has the following advantages.
(i) The fitness coefficient is easy to compute. It is the maximizer of a simple one-dimensional concave function.
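To illustrate point (i), the optimization in (1) is a one-dimensional concave maximization over [0, 1] and can be solved with any bounded scalar search. A minimal sketch in Python (the paper's experiments use R; the function name here is illustrative):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fitness_coefficient(par_dens, npar_dens):
    """Maximize sum_i log(a * par_dens[i] + (1 - a) * npar_dens[i]) over a in [0, 1].

    par_dens  : parametric density f_{theta_hat}(X_i) at each observation
    npar_dens : leave-and-repair estimates (must be strictly positive)
    """
    par_dens = np.asarray(par_dens, float)
    npar_dens = np.asarray(npar_dens, float)

    def neg_log_lik(a):
        return -np.sum(np.log(a * par_dens + (1.0 - a) * npar_dens))

    # The objective is concave in a, so a bounded scalar search suffices.
    res = minimize_scalar(neg_log_lik, bounds=(0.0, 1.0), method="bounded")
    return res.x
```

When the parametric densities dominate at every observation the maximizer sits at the boundary α ≈ 1, and conversely; interior solutions arise when neither estimator dominates everywhere.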
(ii) The fitness coefficient is a measure of model quality. As seen in (1), the fitness coefficient α̂_n follows from a competition between the parametric and the nonparametric approach so as to maximize the likelihood of the observations. Hence, whenever the model is sufficiently true, we expect a value of α̂_n relatively close to one, because the parametric estimator is then likely to be more accurate than the nonparametric one. Conversely, whenever the model is wrong, we expect a value of α̂_n close to zero. Because α̂_n f_θ̂_n + (1 − α̂_n) f̂_n is a mixture distribution between the parametric and the nonparametric estimates, the fitness coefficient is interpreted as the proportion of data distributed under the model. For instance, if one draws a bootstrap sample from the combination α̂_n f_θ̂_n + (1 − α̂_n) f̂_n, then the fitness coefficient α̂_n is the expected proportion of data drawn from the fitted model f_θ̂_n. Therefore, the smaller the value of α̂_n, the more the bootstrap sample will be "contaminated" by the nonparametric part of the combination.
To show the capability of the fitness coefficient, we compare it to p-values on a real data example. Consider the problem of testing whether a given sample comes from a normal distribution. Specifically, we have 38 samples, each consisting of n = 409 financial returns of a company from the French stock market CAC40, and we wish to measure the quality of the normal model for each of these samples. On the one hand, a goodness-of-fit test based on the Cramér-von Mises statistic [8] is carried out. On the other hand, the fitness coefficient defined by (1) is computed with a Gaussian kernel K, Δ_n = 1/n, q(x) = t_ν(x/100) where t_ν is the density of a Student-t distribution with ν = 3 degrees of freedom, and h_n given by [42] (p. 48). In Figure 1, we plot the values of the fitness coefficient against the p-values on a logarithmic scale. We can see a clear positive dependence: large values of the fitness coefficient correspond to large p-values. This suggests that, where one would have used p-values to assess the fit of the normal model, one could have done so with the fitness coefficient.
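A sketch of such a normality check in Python (NumPy/SciPy rather than the paper's R code; the helper name is illustrative): it computes the fitness coefficient of the normal model with a Gaussian kernel, Δ_n = 1/n, q(x) = t_3(x/100), and a Silverman-type bandwidth, here on simulated rather than CAC40 data.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

def normal_fitness(x):
    """Fitness coefficient of the normal model for a univariate sample x."""
    x = np.asarray(x, float)
    n = len(x)
    # Parametric part: Gaussian maximum likelihood fit.
    mu, sigma = x.mean(), x.std()
    par = stats.norm.pdf(x, mu, sigma)
    # Nonparametric part: leave-and-repair estimate (Gaussian kernel,
    # Silverman-type bandwidth, Delta_n = 1/n, q(x) = t_3(x / 100)).
    h = 1.06 * sigma * n ** (-1 / 5)
    kmat = stats.norm.pdf((x[:, None] - x[None, :]) / h)
    np.fill_diagonal(kmat, 0.0)                      # leave one out
    lr = kmat.sum(axis=1) / ((n - 1) * h) + stats.t.pdf(x / 100.0, df=3) / n
    # Concave one-dimensional likelihood maximization over [0, 1].
    obj = lambda a: -np.sum(np.log(a * par + (1 - a) * lr))
    return minimize_scalar(obj, bounds=(0.0, 1.0), method="bounded").x

rng = np.random.default_rng(0)
alpha_normal = normal_fitness(rng.normal(0.0, 1.0, 409))  # model true
alpha_heavy = normal_fitness(rng.standard_t(2, 409))      # heavy-tailed truth
```

Heavy-tailed samples produce extreme observations at which the normal density is vanishingly small while the repaired kernel estimate stays bounded below, which pulls the coefficient away from one.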
The quality criterion induced by the fitness coefficient is different from that of information criteria [4,7], such as the Akaike information criterion [1] or the Bayesian information criterion [40], which focus on the relative performances between models. Note that convex parametric combinations have recently been proposed in the Bayesian literature [23] to assess the fitness of one parametric model against another.
(iii) The fitness coefficient is useful to get robust semiparametric estimators. The fitness coefficient offers a natural semiparametric alternative α̂_n f_θ̂_n + (1 − α̂_n) f̂_n for estimating the probability density function f_0 of the observations. The idea of forming such a convex combination, to get an estimator robust to misspecification while retaining a performance comparable to parametric estimators when the true density is close to the model, was originally developed by Olkin and Spiegelman [33]. Their method, referred to as the OS method, consists of computing
$$\hat\alpha^{OS}_n \in \operatorname*{argmax}_{\alpha\in[0,1]} \sum_{i=1}^n \log\Bigl(\alpha f_{\hat\theta_n}(X_i) + (1-\alpha)\,\hat f_n(X_i)\Bigr), \qquad (3)$$
where f̂_n is the standard kernel density estimator. The OS method and the LR method given in (1) differ in the choice of the nonparametric estimator in the combination. The OS method was noticed to be sensitive to the choice of the bandwidth [12,36]. Rather than considering the likelihood of the observations, some authors [25,36,43] investigate strategies based on the mean squared error between the combination α f_θ̂_n + (1 − α) f̂_n and the true density f_0, but then the solution depends on the unknown distribution and hence heavy bootstrap methods need to be employed.
To improve inference, approaches other than forming a convex combination between the parametric and nonparametric estimators also exist. Locally parametric nonparametric estimation is developed for instance in [19,20,46], but these methods are less appealing from the point of view of model quality assessment because they do not provide any "fitness coefficient".
Main contributions. By introducing the fitness coefficient, we provide a new measure for assessing the quality of a model and an alternative to the OS method for obtaining robust semiparametric estimators. Under mild conditions, the fitness coefficient α̂_n is shown to converge in probability to one if the model is true and to zero otherwise, a property called consistency. Even though the fitness coefficient maximizes an objective function (over α ∈ [0, 1]), classical results from M-estimation theory do not apply because, when the model is true, the limiting objective function is independent of α. The proposed approach follows from a fine comparison between the rates of convergence of f_θ̂_n and f̂_n. We moreover provide examples of densities f_0 that satisfy our set of assumptions. Using real data as well as extensive simulations, we observed that the LR approach is more stable than the OS approach and leads to more accurate inference. This is in agreement with our theoretical analysis, which cannot include the OS method as an example.
Outline. In Section 2, we introduce some quantities of interest and motivate the use of the LR estimatorf LR i,n to compute the fitness coefficientα n . The consistency of the fitness coefficient is stated in Section 3 where some examples are given. In Section 4, numerical experiments are designed to measure the robustness of the fitness coefficient and the performance of the corresponding density estimators. All the proofs are postponed to the Appendix.

The leave-and-repair estimator
The aim of this section is to define and motivate the use of the leave-and-repair (LR) estimator f̂^{LR}_{n,i} given in (2) and appearing in the definition (1) of the fitness coefficient α̂_n. Compared with the OS method given in (3), the use of the LR estimator might seem irrelevant at first sight, but it in fact plays an important role in ensuring the good behavior of the fitness coefficient.
The kernel density estimator of f_0 at x ∈ R^d is given by
$$\hat f_n(x) = \frac{1}{n h_n^d} \sum_{i=1}^n K\Bigl(\frac{x - X_i}{h_n}\Bigr).$$
For any h > 0, define the function f_h as the convolution product between f_0 and the rescaled kernel K_h = h^{−d} K(·/h). Since E[f̂_n(X_i) | X_i] = ((n−1)/n) f_{h_n}(X_i) + K(0)/(n h_n^d), we see that f̂_n(X_i) has a positive h_n-dependent bias when estimating f_{h_n}(X_i), conditionally on X_i. When studying the estimator decomposition, this bias term spreads to the diagonal terms of some U-statistics and gives rise, in the end, to some non-negligible terms. This phenomenon is common in semiparametric statistics, and has been noticed for instance in Remark 4 in [35].
To overcome the undesirable effects caused by this bias term, the leave-one-out (LOO) estimator of f_{h_n}(X_i), given by
$$\hat f^{LOO}_{n,i} = \frac{1}{(n-1)h_n^d} \sum_{j\neq i} K\Bigl(\frac{X_i - X_j}{h_n}\Bigr),$$
has been successfully used in several cross-validation procedures aiming at selecting the bandwidth, either based on the likelihood [17,26,18] or on the mean squared error [37,44] (see [27] for a comparison). Since then, LOO estimators have been frequently used in semiparametric studies [9]. The LR estimator proposed in this paper is inspired by, but different from, the LOO estimator. In view of (2), the LR estimator satisfies
$$\hat f^{LR}_{n,i} = \hat f^{LOO}_{n,i} + \Delta_n\, q(X_i).$$
If Δ_n = 0, the LR estimator is equal to the LOO estimator. If q ≡ K(0) and Δ_n = 1/((n − 1)h_n^d), the LR estimator is equal to (n/(n − 1)) f̂_n(X_i). In general, one can think of Δ_n as of order 1/n and of q as a density, so that q(X_1) > 0 with probability 1.
The heuristic for using the LR estimator f̂^{LR}_{n,i} instead of the LOO estimator f̂^{LOO}_{n,i} is as follows. It is well known that the Kullback-Leibler divergence of kernel density estimates depends crucially on the tails of the true distribution f_0 [18,39]. As shown in [18], if the tail is too heavy and the kernel K(x) vanishes too quickly as x becomes large, then the Kullback-Leibler divergence associated with the kernel density estimate goes to minus infinity. This is because some of the f̂^{LOO}_{n,i}, i = 1, . . . , n, might take very small values (possibly zero), leading to very large values (possibly infinite) for some of the log(f̂^{LOO}_{n,i}). These values are involved in the computation of the Kullback-Leibler divergence and play an important role in our proofs when dealing with the likelihood of the nonparametric estimate. We built the LR estimator to overcome this issue by simply adding Δ_n q(X_i) to the LOO estimator; we coined the term leave-and-repair because the term Δ_n q(X_i) repairs the LOO estimator. Since f̂^{LR}_{n,i} ≥ Δ_n q(X_i), the LR estimator is not subject to the difficulties of the LOO estimator. By adding the term Δ_n q(X_i) in (2), however, a bias is introduced: now one has E[f̂^{LR}_{n,i} | X_i] − f_{h_n}(X_i) = Δ_n q(X_i). Thus, there is a bias-variance tradeoff controlled by the sequence Δ_n, which must go to zero slowly enough to keep f̂^{LR}_{n,i} away from zero but also fast enough to keep the bias as small as possible. The right compromise is given in the conditions of Theorem 2 (for instance, Δ_n = 1/n is one possibility).
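A direct transcription of (2) as a Python sketch (the product Gaussian kernel is an illustrative choice; K and q are up to the user):

```python
import numpy as np

def lr_estimates(x, h, delta, q):
    """Leave-and-repair estimates of f_0 at each observation, as in (2).

    x     : (n, d) array of observations
    h     : bandwidth h_n
    delta : repair weight Delta_n (e.g. 1/n)
    q     : repair density, mapping an (n, d) array to (n,) positive values
    """
    x = np.atleast_2d(np.asarray(x, float))
    n, d = x.shape
    u = (x[:, None, :] - x[None, :, :]) / h            # pairwise scaled differences
    kmat = np.exp(-0.5 * (u ** 2).sum(axis=-1)) / (2 * np.pi) ** (d / 2)
    np.fill_diagonal(kmat, 0.0)                        # leave one out
    loo = kmat.sum(axis=1) / ((n - 1) * h ** d)
    return loo + delta * q(x)                          # repair keeps estimates positive
```

Even at an isolated observation, where the leave-one-out part is numerically zero, the LR estimate stays at least delta * q(X_i), which is exactly what keeps the log-likelihood of the nonparametric part finite.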
Concerning the parametric estimator f_θ̂_n, we follow [33] by considering the maximum likelihood estimator. Let P = {f_θ : θ ∈ Θ} be the parametric model, where Θ ⊂ R^p is such that, for each θ ∈ Θ, f_θ : R^d → R_+ is a measurable function satisfying ∫ f_θ(x) dx = 1. The maximum likelihood estimator of f_0 based on P and X_1, . . . , X_n is f_θ̂_n, where θ̂_n (when it exists; this is assumed in what follows) is defined as
$$\hat\theta_n \in \operatorname*{argmax}_{\theta\in\Theta} \sum_{i=1}^n \log f_\theta(X_i).$$
The good behaviour of the maximum likelihood estimator rests on the assumption that f_0 ∈ P, that is, that there exists θ_0 ∈ Θ such that f_0 = f_{θ_0}.
To conclude the section, we consider existence and uniqueness of the fitness coefficient α̂_n. Existence follows from the use of the LR estimator f̂^{LR}_{n,i}. Uniqueness of α̂_n is obtained under the mild requirement that the parametric and nonparametric estimators are distinguishable on the observed data. Proposition 1. Suppose that q(X_1) > 0 a.s. and that f_θ̂_n(X_i) ≠ f̂^{LR}_{n,i} for at least one i ∈ {1, . . . , n}. Then the fitness coefficient exists and is unique.
The proof is given in Appendix A.

Consistency of the fitness coefficient
Recall that consistency of the fitness coefficient α̂_n means α̂_n → 1 in probability if f_0 ∈ P and α̂_n → 0 in probability if f_0 ∉ P, where f_0 is the true underlying density and P is the parametric model. Section 3.1 and Section 3.2 contain the main consistency theorem and some examples satisfying our set of assumptions, respectively.

Assumptions and main result
Let ‖·‖_2 be the Euclidean norm and, for any set S ⊂ R^d and any function f : S → R, define the sup-norm as ‖f‖_S = sup_{x∈S} |f(x)|. Denote by λ the Lebesgue measure on R^d. Introduce the density level sets S_t = {x ∈ R^d : f_0(x) > t}, t > 0. We shall assume the following.
(H1) The density f_0 is bounded and continuous on R^d, and the gradient ∇f_0 of f_0 is bounded on R^d and satisfies, for every x ∈ R^d and u ∈ [−1, 1]^d, ‖∇f_0(x + u)‖_2 ≤ g(x), where g is positive, bounded, integrable, and ∫ g(x)²/f_0(x) dx < ∞.
(H2) The kernel function K : R^d → R_+ integrates to 1 and takes one of the two following forms, (a) or (b). Whereas (H1) and (H2) are rather classical in the kernel smoothing literature (see the remarks just below Theorem 2), the following assumption is specific to our approach. We shall see in Section 3.2 that it is satisfied for densities with classical tails.
(H3) There exist β ∈ (0, 1] and c > 0 such that, for every t > 0 small enough,
$$\int_{S_t^c} f_0 \, d\lambda \;=\; P\bigl(f_0(X_1) \le t\bigr) \;\le\; c\, t^{\beta}.$$
For the sake of clarity, the (classical) assumptions dealing with the parametric model, (A1) and (A2), are postponed to the appendix. They are taken from the monographs [48] and [31], and they mainly ensure the asymptotic normality of θ̂_n whenever f_0 ∈ P.
Appendix B is dedicated to the proof of Theorem 2. We did not follow the approach used in [33], which we believe is unsatisfactory because it does not consider the case where α̂_n lies on the boundary of [0, 1]. This is not straightforwardly remedied, as the event α̂_n = 0 or α̂_n = 1 has a non-negligible probability (as illustrated in the numerical experiments in Section 4.1). The smoothness assumption stated in (H1) and the symmetries of the kernel function ensure a control of order h_n² of the bias f_h(x) − f_0(x), uniformly in x ∈ R^d (see Lemma B.7 stated in Appendix B.5). Such a rate could be improved by using higher-order kernels, but this is not necessary here. Assumptions (a) and (b) in (H2) are borrowed from the empirical process literature; see among others [32,15,11]. They make it possible to bound, uniformly in x ∈ R^d, the variance term f̂_n(x) − f_h(x). The assumption that the kernel has compact support can be relaxed, at the price of additional technicalities in the proof and of assuming that the tails of the kernel are light enough. We did not include this analysis in the paper for reasons of clarity.
For any dimension d ≥ 1, there exists a couple of sequences (h_n, Δ_n)_{n≥1} that fulfills the restrictions (i) and (ii) of Theorem 2 and (H2). For instance, the optimal bandwidth h_n ∝ n^{−1/(d+4)}, which minimizes the asymptotic mean integrated squared error [51, equation (2.5)], together with Δ_n = 1/n, is one such choice. This means that, in practice, one can choose the bandwidth according to the various methods of the literature, see e.g. [42].
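A quick numeric sanity check of this choice (Python, standalone snippet): the two rate quantities appearing in the analysis, n h_n^d Δ_n for the well-specified case and |log Δ_n|^{1/β}(|log h_n|/(n h_n^d) + h_n²) for the misspecified case, both vanish along h_n = n^{−1/(d+4)}, Δ_n = 1/n, here with d = 1 and β = 1.

```python
import numpy as np

d, beta = 1, 1.0
n = np.array([1e2, 1e4, 1e6, 1e8])
h = n ** (-1.0 / (d + 4))          # AMISE-optimal bandwidth rate
delta = 1.0 / n                    # repair weight Delta_n

cond_i = n * h ** d * delta                                    # must -> 0 (f_0 in P)
cond_ii = np.abs(np.log(delta)) ** (1 / beta) * (
    np.abs(np.log(h)) / (n * h ** d) + h ** 2)                 # must -> 0 (f_0 not in P)
```

Both sequences decrease along the grid of sample sizes, in agreement with the fact that the polynomial rates beat the logarithmic factors.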
An interesting point in Theorem 2 is the two opposite roles played by the sequence Δ_n in (i) and (ii), respectively. Consistency when f_0 ∈ P requires Δ_n to be as small as possible, whereas when f_0 ∉ P, Δ_n must not be too close to 0. In the proof, the case Δ_n = 0 (leave-one-out) as well as Δ_n q(X_i) = K(0)/(n h_n^d) (OS method) need to be excluded, suggesting that these other options are not consistent under our set of assumptions.

Distributions and bandwidth sequences satisfying (H3)
For densities f_0 with unbounded support, the verification of Assumption (H3) only depends on a tail function g_0 associated with the density f_0. The meaning of this is made precise in the following proposition.

Proposition 3. Suppose that a decay condition on g_0 holds as t → 0, and that a corresponding rate condition on h_n holds as n → ∞; then (H3) is valid for f_0 with the same value of β.
The proof of Proposition 3 is given in Appendix A. The function g_0 in Proposition 3, which is not necessarily a proper density function, represents the rate of decrease of f_0(x) as ‖x‖_2 → ∞.
Hence the verification of (H3) by f 0 only depends on the component f 2 .
Putting g_0 ∝ f_0 (the symbol ∝ stands for proportionality) in Proposition 3 amounts to checking (H3) directly, which is done in the following examples.
Therefore, a sufficient condition on h_n guaranteeing (4) is that h_n² log(n) → 0, which is satisfied under (H2).
The computations are very similar to those presented in the Gaussian case. We find β = 1 and the condition on h_n becomes h_n log(n) → 0, which always holds under (H2). Hence, as for Gaussian tails, when the tails are exponential, (H3) is automatically satisfied under (H2).
For simplicity, as in the Gaussian example, we focus on this setting. The three examples considered above are informative on the interplay between the tails of f_0 and the choice of h_n. For distributions with light enough tails, including Gaussian, exponential, and polynomial tails with k ≥ 6, the conditions on h_n required by (H3) are already fulfilled under (H2). Consequently, the optimal bandwidth, which is of order n^{−1/5}, is covered by our set of assumptions. In contrast, as soon as k < 6 in the polynomial case, we have the additional condition that n h_n^k → 0.

Numerical illustrations
In all the simulation experiments, we have set Δ_n = 1/n and q(x) = t_ν((x − μ_q)/σ_q), where t_ν is the density of a Student-t distribution with ν = 3 degrees of freedom, μ_q = 0 and σ_q = 100. With such a large variance and heavy tails, this choice of q is non-informative. We made μ_q and σ_q vary, but the results were very similar, suggesting that the choice of q has little effect in practice (at least for light-tailed distributions). In all the experiments but those in Section 4.1, the bandwidth was chosen according to the well-known rule of thumb given in [42] (p. 48, equation (3.31)). The choice of the bandwidth is discussed in Section 4.1. In Section 4.2, we study the behavior of the fitness coefficient and the performance of the estimators with respect to the amount of evidence in favor of the model. In Section 4.3, we use the LR method for protection against misspecification. All the numerical experiments were carried out with the R software.

Sensitivity to the bandwidth: comparison of the fitness coefficient and the OS coefficient
In this section, we study how a change in the bandwidth affects the fitness coefficient and the OS coefficient. We reanalyze the wind speed data of Olkin and Spiegelman [33] under a Gumbel model with θ = (μ, σ), where μ is a real location parameter and σ > 0 a dispersion parameter obeying Var_{f_θ} = π²σ²/6 and E_{f_θ} = μ − σγ, where γ ≈ 0.58 is the Euler-Mascheroni constant. The maximum likelihood estimator is given by θ̂_n ≈ (62.1, 5.4).
Let h denote the bandwidth. In [33], the value h = 0.7s was arbitrarily chosen, where s is the standard deviation of the data. For other choices, e.g. h = 0.21s, one gets α̂^{OS}_n ≈ 0. All the above values of h are grounded in well-known bandwidth selection methods; see the textbook [42] (p. 47, eqn (3.30) and p. 48, eqn (3.31)) and [41]. By contrast, the fitness coefficient yields α̂_n ≈ 1. These findings are summarized in Figure 2 (a), where the coefficients are represented as functions of h. We see that the OS coefficient is sensitive to the choice of the bandwidth: a slight difference in h can yield a large difference in α̂^{OS}_n = α̂^{OS}_n(h), especially in the range 0.4 ≤ h ≤ 0.8. By contrast, the fitness coefficient is more robust: the estimated value α̂_n(h) remains close to one over a large range of h. In Figure 2 (a), the fitness coefficient and the OS coefficient contradict each other, and no further credit can be given to either of them because the ground truth is unknown.
To observe the behavior of the coefficients when the model is known to be true, we simulated n = 400 observations according to a Gumbel distribution with mean and standard deviation equal to those of the wind speed data, that is, 59.1 and 6.55 respectively. The results are shown in Figure 2 (b): the fitness coefficient stays close to one for all h chosen by the bandwidth selection methods of the literature, whereas the OS coefficient does not. These results tend to indicate that the fitness coefficient is consistent but the OS coefficient is not. Let us note that, when the model is known to be false, for all values of h (selected from the literature as above), the values of the coefficients are close to zero, as expected. This is illustrated in Figure 2 (c): the model is still Gumbel, but the n = 400 data points were generated according to a Gaussian distribution with mean 59.1 and standard deviation 6.55.
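This sensitivity study can be sketched as follows (Python; `scipy.stats.gumbel_l` implements the minimum-Gumbel family matching E = μ − σγ, and the helper name and bandwidth grid are illustrative):

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

def alpha_hat(x, h):
    """Fitness coefficient for a Gumbel(min) model at bandwidth h (1-d sketch)."""
    x = np.asarray(x, float)
    n = len(x)
    loc, scale = stats.gumbel_l.fit(x)                 # Gumbel maximum likelihood fit
    par = stats.gumbel_l.pdf(x, loc, scale)
    # Leave-and-repair estimate: Gaussian kernel, Delta_n = 1/n, q = t_3(x/100).
    kmat = stats.norm.pdf((x[:, None] - x[None, :]) / h)
    np.fill_diagonal(kmat, 0.0)
    lr = kmat.sum(axis=1) / ((n - 1) * h) + stats.t.pdf(x / 100.0, df=3) / n
    obj = lambda a: -np.sum(np.log(a * par + (1 - a) * lr))
    return minimize_scalar(obj, bounds=(0.0, 1.0), method="bounded").x

rng = np.random.default_rng(1)
x = stats.gumbel_l.rvs(loc=62.1, scale=5.4, size=400, random_state=rng)
alphas = [alpha_hat(x, h) for h in (0.5, 1.0, 2.0, 4.0)]
```

Scanning `alphas` over a grid of bandwidths is the computational core of Figure 2; the same loop with the standard (non-LR) kernel estimate in place of `lr` would produce the OS curve.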

Performance of the methods when the model and the truth intertwine
Parametric estimators perform better than kernel density estimators when the model is approximately true, but worse otherwise. Can the semiparametric combination be uniformly best? Does the fitness coefficient go to unity as the model approaches the truth?
To get some insight, the following numerical experiment was conducted. We generated samples of size n = 400 according to a density f_t, for several values of t in a certain index set, where t represents the "distance" between f_t and the model. Two settings have been tested.

Setting 1
The parametric model is given by f_θ ∼ N(θ, 1). The intersection between the model {f_θ} and the family of true distributions {f_t} is given by f_0 ∼ N(0, 1); that is, θ = t = 0.

Setting 2
The parametric model is given by f_θ ∼ N(0, θ²) and the curve of true distributions is given by f_t ∼ N(t, 1). The intersection between the model {f_θ} and the family {f_t} is again the N(0, 1) density.
For each t, we compute the maximum likelihood estimator, the standard kernel density estimator, the fitness coefficient, the OS coefficient, and the semiparametric density estimator. The semiparametric density estimator is the combination of the maximum likelihood estimator and the kernel density estimator, where the mixing coefficient is either the fitness coefficient (LR method) or the OS coefficient (OS method). To assess the performance of the estimators, we compute the L2-distance to f_t. The above procedure is repeated 500 times and the errors are averaged over the repetitions. Figure 3 summarizes the results for the first setting. The errors for the parametric estimator, shown in Figure 3 (a), shrink sharply as the model and the truth intersect. The error for the nonparametric estimator is approximately constant. We see that the OS method performs poorly: it fails to give accurate estimates near the truth. This behavior is explained in Figure 3 (b), where we see that the values of the OS coefficient barely exceed 0.1. This is not the case for the fitness coefficient: its values span the entire range [0, 1] and are consistent with the proximity between the truth and the parametric model. As a consequence, coming back to Figure 3 (a), the error of the LR method is near the minimum of the parametric and nonparametric errors. This means that, in practice, however close our parametric model is to the truth, we never lose by choosing the LR method. Even more interesting is the fact that, in the region where the parametric and the nonparametric estimators perform similarly, the LR method performs better: this corresponds to the values t ≈ −0.10 and t ≈ 0.15. This is clearly seen in Figure 3 (c), which plots the averaged error integrated over the interval [−t, t]: the LR method always has the lowest curve.
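Setting 2, which is fully specified above, can be reproduced in miniature as follows (Python sketch with grid-based L2 distances; helper names and the grid are illustrative):

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

def l2_errors(t, n=400, seed=0):
    """Setting 2 sketch: model N(0, theta^2), truth N(t, 1). Returns the grid-based
    L2-distances of the parametric, nonparametric and LR estimates to f_t, and alpha."""
    rng = np.random.default_rng(seed)
    x = rng.normal(t, 1.0, n)
    grid = np.linspace(-6 + t, 6 + t, 801)             # covers the bulk of f_t
    true = stats.norm.pdf(grid, t, 1.0)
    # Parametric MLE for N(0, theta^2): theta_hat^2 = mean of squares.
    theta = np.sqrt(np.mean(x ** 2))
    par_g = stats.norm.pdf(grid, 0.0, theta)
    # Kernel estimate (Gaussian kernel, rule-of-thumb bandwidth).
    h = 1.06 * x.std() * n ** (-1 / 5)
    ker_g = stats.norm.pdf((grid[:, None] - x[None, :]) / h).sum(axis=1) / (n * h)
    # Fitness coefficient from the LR likelihood (Delta_n = 1/n, q = t_3(x/100)).
    par_x = stats.norm.pdf(x, 0.0, theta)
    kmat = stats.norm.pdf((x[:, None] - x[None, :]) / h)
    np.fill_diagonal(kmat, 0.0)
    lr_x = kmat.sum(axis=1) / ((n - 1) * h) + stats.t.pdf(x / 100.0, df=3) / n
    a = minimize_scalar(lambda a: -np.sum(np.log(a * par_x + (1 - a) * lr_x)),
                        bounds=(0.0, 1.0), method="bounded").x
    semi_g = a * par_g + (1 - a) * ker_g
    dx = grid[1] - grid[0]
    err = lambda g: np.sqrt(np.sum((g - true) ** 2) * dx)
    return err(par_g), err(ker_g), err(semi_g), a

errors_true = l2_errors(0.0)    # model contains the truth
errors_false = l2_errors(2.0)   # model misspecified
```

At t = 2 the parametric estimator is badly off, the fitness coefficient is pulled toward zero, and the semiparametric combination inherits an error close to the nonparametric one.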
The results for n = 50, 100, 200 and for setting 2 are similar and not shown here to limit the length of the paper.

Application to multivariate density estimation
It is well known that building accurate multivariate parametric models is an uncertain and difficult task. One way of addressing this problem consists of decomposing the target density f_0 into a copula density c and the marginal densities f_1, . . . , f_d, that is,
$$f_0(x) = c\bigl(F_1(x_1), \ldots, F_d(x_d)\bigr)\, \prod_{j=1}^d f_j(x_j)$$
(here the {F_j} stand for the marginal distribution functions). This decomposition, which follows from Sklar's theorem, is unique provided that the {F_j} are continuous; for more details about copulas, see e.g. [13] or the books [30,22]. The copula is assumed to belong to a parametric model {c_ξ, ξ ∈ Ξ} and the true underlying parameter ξ is estimated [14] by
$$\hat\xi_n \in \operatorname*{argmax}_{\xi\in\Xi} \sum_{i=1}^n \log c_\xi\Bigl(\frac{R_{i,1}}{n+1}, \ldots, \frac{R_{i,d}}{n+1}\Bigr),$$
where R_{i,j} is the rank of X_{i,j} among (X_{1,j}, . . . , X_{n,j}) and X_{i,j} stands for the j-th coordinate of the i-th observation. The marginals are estimated in a separate step. If one of the marginals is misspecified, the estimation of the joint distribution is biased. In the following, a computer experiment illustrates that the LR method can help to reduce this bias by avoiding misspecification.
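The rank-based pseudo-observations R_{i,j}/(n+1) are easy to form. As a lighter stand-in for the pseudo-maximum-likelihood estimator of [14], the Python sketch below uses the moment-type inversion of Kendall's tau, which for the Gumbel copula satisfies τ = 1 − 1/ξ (the function names are illustrative):

```python
import numpy as np
from scipy import stats

def pseudo_observations(x):
    """Rank transform R_{i,j} / (n + 1) of an (n, d) sample (no ties assumed)."""
    x = np.asarray(x, float)
    n = x.shape[0]
    ranks = np.argsort(np.argsort(x, axis=0), axis=0) + 1
    return ranks / (n + 1.0)

def gumbel_xi_from_tau(x1, x2):
    """Moment-type estimate of the Gumbel copula parameter:
    tau = 1 - 1/xi for the Gumbel family, hence xi = 1/(1 - tau)."""
    tau = stats.kendalltau(x1, x2)[0]
    tau = min(max(tau, 0.0), 0.999)     # Gumbel requires xi >= 1; keep xi finite
    return 1.0 / (1.0 - tau)
```

The relation τ = 1 − 1/ξ is specific to the Gumbel family; the pseudo-MLE of [14] instead plugs the pseudo-observations into the copula density and maximizes over ξ.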
We have generated datasets of size n = 25, 50, 100, 150, . . . , 500 with a copula of the form
$$C_\xi(u, v) = \exp\Bigl(-\bigl[(-\log u)^{\xi} + (-\log v)^{\xi}\bigr]^{1/\xi}\Bigr)$$
(a so-called Gumbel copula) with ξ = 3, and marginals f_1 ∼ E(2), f_2 ∼ W(2, 1/2), where E(λ) is an exponential distribution with mean 1/λ and W(a, b) is a Weibull distribution with shape a > 0 and scale b > 0, that is, with density
$$f(x) = \frac{a}{b}\Bigl(\frac{x}{b}\Bigr)^{a-1} \exp\bigl(-(x/b)^a\bigr), \qquad x > 0.$$
For each of the simulated datasets, the copula parameter ξ was estimated as mentioned above and the marginals were estimated under three scenarios. In the first scenario, we estimate them nonparametrically with the standard kernel density estimator. In the second scenario, we act as if both marginals were exponentially distributed and compute the maximum likelihood estimator.
In the third scenario we form the convex combination with the maximum likelihood estimator and the standard kernel density estimators, where the mixing coefficient is the fitness coefficient (LR method).
The results for n = 200 and marginal estimation are shown in Figure 4. Figure 4 (a) corresponds to the first marginal, that is, the case where the parametric model is well specified. We see only two lines because the parametric, semiparametric and true densities are very similar, indicating that α̂ ≈ 1. Figure 4 (b) corresponds to the misspecified second marginal. Here it is the nonparametric and semiparametric estimates that are nearly identical, indicating that α̂ ≈ 0. Figure 5 shows the estimation of the bivariate joint density. In Figure 5 (b), we see that a single misspecified marginal leads to a poor estimation of the joint density, especially in the joint tails. Figure 5 (c) shows the estimated joint density with the nonparametric strategy for the marginals. The drawbacks of nonparametric estimation are easily spotted: the estimated density is multimodal and takes positive values where it should vanish. Visually, the best performance is achieved with the semiparametric strategy in Figure 5 (d). The figures for n = 50, 100, 500 are similar and not shown to limit the length of the paper.
The squared L2-distances between the true joint density and the estimators are shown in Figure 6. The semiparametric strategy performs best for all sample sizes.

Appendix A Proofs of the propositions
We define the mixture likelihood function L_n : [0, 1] → [−∞, +∞) as
$$L_n(\alpha) = \sum_{i=1}^n \log\Bigl(\alpha f_{\hat\theta_n}(X_i) + (1-\alpha)\,\hat f^{LR}_{n,i}\Bigr).$$
The fitness coefficient α̂_n in (1) is then defined as a maximizer of L_n(α) over [0, 1].

Appendix B Proof of Theorem 2
Theorem 2 follows from the application of two high-level results, corresponding respectively to the well-specified and the misspecified case. Both high-level results take place in the following general framework: given a triangular sequence of non-negative real numbers ξ_{n,i}, i = 1, . . . , n, n ≥ 1, we consider the mixture likelihood function given by
$$L_n(\alpha) = \sum_{i=1}^n \log\Bigl(\alpha f_{\hat\theta_n}(X_i) + (1-\alpha)\,\xi_{n,i}\Bigr).$$
Here the sequence (ξ_{n,i}) is left unspecified in order to highlight the assumptions that we need on the nonparametric part. This random sequence could be the nonparametric estimator evaluated at X_i, i.e., f̂_n(X_i), the LOO estimate f̂^{LOO}_{n,i}, or the LR estimate f̂^{LR}_{n,i} with Δ_n > 0. In this slightly more general context, we define α̂_n as α̂_n ∈ argmax_{α∈[0,1]} L_n(α).
In both cases, respectively the misspecified and the well-specified one, the approach is similar: we compare the empirical likelihood of the mixture to that of the parametric estimate (in the well-specified case) or of the nonparametric estimate (in the misspecified case).
In the proofs below, it is convenient to introduce the normalized version of L_n(α), given by
$$\tilde L_n(\alpha) = \frac{1}{n} \sum_{i=1}^n \log\Bigl(\frac{\alpha f_{\hat\theta_n, i} + (1-\alpha)\,\xi_{n,i}}{f_{0,i}}\Bigr),$$
where, for any real-valued function f, we have introduced the shortcut notation f_i for f(X_i).

B.1 Case (i) : the model is well-specified
Our analysis relies on a restricted mean quadratic error and an averaged linear error. The proof of the following theorem is given in Section B.3.1.
Theorem B.1. Suppose that f_0 ∈ P and let S ⊂ R^d and b > 0 be such that f_0(x) > b for all x ∈ S. If the convergences in (6) and (7) hold in probability as n → ∞, then α̂_n → 1 in probability as n → ∞.
We now verify the conditions of the previous theorem when ξ_{n,i} is the LR sequence f̂^{LR}_{n,i} and when (H1), (H2), (H3), (A1), (A2) and n h_n^d Δ_n → 0 are fulfilled.
Condition (6). The first convergence in (6) holds in virtue of (21), established in Section C. For the second one, applying the first statement of Proposition C.1 in [35] (which is a consequence of Theorem 2.1 in [15]), we obtain, under (H1) and (H2), a uniform bound which, together with Lemma B.9, goes to 0 in probability as n → ∞.
Condition (7). We proceed with ξ_{n,i} = f̂^{LR}_{n,i}; combining the resulting bounds implies that (7) holds true.
We already argued that (20) is implied by (A1). The continuity of f_θ is deduced from the continuity of log(f_θ) provided by (A1). The bound given in (8), together with |log(Δ_n)|^{1/β} (|log(h_n)|/(n h_n^d) + h_n²) → 0, implies the stated convergence with ξ_{n,i} = f̂^{LR}_{n,i}.
A useful notation in the following is the ratio x̂_{i,n}. A useful technical detail is that there exists a sequence ε_n → 0 such that the event in question has probability going to 1 as n → ∞; this is a consequence of (6). As we are establishing a result in probability, we can further suppose that this event is realized.
A key step in our approach is the following inequality, reminiscent of the Taylor expansion of the logarithm around 1, which can be derived by studying the function concerned. This kind of inequality is commonly used in the study of likelihood methods [16,28], and we apply it to x̂_{i,n}. Note that whenever X_i ∈ S, because f_{0,i} |x̂_{i,n} − 1| ≤ ε_n, we have (for ε_n small enough) that |x̂_{i,n} − 1| < 1/2. This means that, for all i = 1, . . . , n, 1_{X_i ∈ S} ≤ 1_{|x̂_{i,n} − 1| < 1/2}, and the claimed bound follows. Bounding the right-hand side with respect to α ∈ [0, 1 − ε] gives the desired supremum bound.
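For concreteness, one inequality of this kind (a standard second-order control of the logarithm around 1, stated here as a plausible instance rather than as the exact display of the proof) is:

```latex
% valid for every x > 0 with |x - 1| <= 1/2
x - 1 - (x - 1)^2 \;\le\; \log x \;\le\; x - 1 .
```

The upper bound holds for all x > 0; the lower bound follows by checking that the function u ↦ log(1 + u) − u + u² is nonnegative on [−1/2, 1/2].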
By assumption, we have a control on Q^{(np)}_n(S). The term between brackets goes to 1 in probability, implying that, for every δ > 0, the desired bound holds with probability going to 1. Hence it remains to note that, by (7), with probability going to 1, Q^{(np)}_n(S) > 0.

B.3.2 Proof of Theorem B.2
Note that α̂_n ∈ argmax_{α∈[0,1]} L̃_n(α) exists because ξ_{n,i} > 0 for all i, as explained in the proof of Proposition 1. Let ε > 0. The proof requires showing that, with probability going to 1, α̂_n < ε. This event is realized as soon as max_{α∈[ε,1]} L̃_n(α) < L̃_n(0). We analyse both terms separately: first we establish a convergence in probability for L̃_n(0), and then show that there exists δ > 0 such that, with probability going to 1, the maximum over [ε, 1] stays below. Let η > 0, b_n = (η/|log(Δ_n)|)^{1/β} and c_n = max_{i=1,...,n} |ξ_{n,i} − f_{0,i}|. We assume further that b_n + c_n < 1 and Δ_n < 1. The expectation of the middle term is bounded by |log(Δ_n)| P(f_0(X_1) ≤ b_n), of order |log(Δ_n)| b_n^β = η by assumption; the corresponding term goes to 0, as η is arbitrarily small. The expectation of the right term is smaller than E[(|log(q_1)| + |log(f_{0,1})|) 1_{f_{0,1} ≤ b_n}], which goes to 0 because |log(q_1)| and |log(f_{0,1})| are integrable. Hence it remains to show that the left term goes to 0, which follows from the mean-value theorem. Now we establish (9) by obtaining one-sided inequalities. Take b_n = (1/|log(Δ_n)|)^{1/β}, suppose that b_n + c_n < 1, and use the monotonicity of the logarithm. Taking the expectation, we find a bound which goes to 0 as n → ∞ in virtue of the Lebesgue dominated convergence theorem. Then, taking 0 < η < 1, the first term in the right-hand side is decomposed into two parts. By the mean value theorem, the left part is bounded by a quantity which goes to 0, by assumption. For the right part, notice that {α(f_θ + η f_0) + (1 − α) f_0 : α ∈ [ε, 1], θ ∈ Θ} is Glivenko-Cantelli with envelope F_Θ + 2f_0. Then, applying Theorem 3 in [47], the class formed by log(α(f_θ + η f_0) + (1 − α) f_0) is still Glivenko-Cantelli, since for all θ ∈ Θ, the function |log(η f_0)| + log(F_Θ + 2f_0 + 1) is an integrable envelope.
Using again Theorem 3 in [47], the class formed by log(α(f_θ + η f_0) + (1 − α) f_0) remains Glivenko-Cantelli. The integrability of the envelope and the fact that b_n → 0 imply that the corresponding remainder vanishes. It remains to use the logarithmic inequality together with standard results about the Hellinger distance ([34], Chapter 3) to obtain the required lower bound. Since f_0 ∉ P and by the continuity assumption on f_θ, it holds that inf_{θ∈Θ} ∫ |f_θ − f_0| dλ > 0. Then, as η is arbitrary, the proof of (9) is complete.
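The conclusion of Theorem B.2, namely that α̂_n → 0 in probability when f_0 ∉ P, can be illustrated numerically. The sketch below is illustrative only and not the paper's exact construction: it assumes a Gaussian parametric family fitted by maximum likelihood, a leave-one-out Gaussian kernel density estimate for the ξ_{n,i}, and a grid maximization of the mixture log-likelihood L̂_n(α) = Σ_i log(α f_θ̂(X_i) + (1 − α) ξ_{n,i}).

```python
# Illustrative sketch (assumed estimator choices, not the paper's exact ones):
# under misspecification the maximizer alpha_hat of the mixture log-likelihood
# should be close to 0, as Theorem B.2 asserts.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
# True density: a bimodal two-component normal mixture, far from any single
# Gaussian, so the parametric model N(mu, sigma^2) is misspecified.
comp = rng.random(n) < 0.5
x = np.where(comp, rng.normal(-2.0, 0.5, n), rng.normal(2.0, 0.5, n))

# Parametric part: Gaussian maximum-likelihood fit evaluated at the data.
mu, sigma = x.mean(), x.std()
f_par = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Nonparametric part: leave-one-out Gaussian kernel density estimate xi_i.
h = 1.06 * sigma * n ** (-1 / 5)          # Silverman-type bandwidth
diff = (x[:, None] - x[None, :]) / h
K = np.exp(-0.5 * diff ** 2) / np.sqrt(2 * np.pi)
np.fill_diagonal(K, 0.0)                   # leave-one-out: drop the i = j term
xi = K.sum(axis=1) / ((n - 1) * h)

# Maximize the mixture log-likelihood over a grid of alpha in [0, 1].
alphas = np.linspace(0.0, 1.0, 1001)
L = np.array([np.log(a * f_par + (1 - a) * xi).sum() for a in alphas])
alpha_hat = alphas[np.argmax(L)]
print(alpha_hat)
```

Replacing the bimodal mixture by a single Gaussian sample sends α̂ toward 1 instead, consistent with Theorem B.1's counterpart statement for a well-specified model.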

B.4 Linear and quadratic errors of the parametric and nonparametric estimates
Important tools for dealing with the terms involving f̂^{LR}_{n,i} come from U-statistic theory. We call a U-statistic of order p with kernel w : R^p → R any average of w(X_{i_1}, ..., X_{i_p}) where the summation is taken over the subset D formed by the (i_1, ..., i_p) ∈ {1, ..., n}^p such that i_k ≠ i_ℓ for all k ≠ ℓ. The number of terms in the summation is then n(n − 1)⋯(n − p + 1). When the kernel w is such that, for every k ∈ {1, ..., p}, E[w(X_1, ..., X_p) | X_1, ..., X_{k−1}, X_{k+1}, ..., X_p] = 0, it is called a degenerate U-statistic. In the proofs, we shall rely on the so-called Hajek decomposition [48, Lemma 11.11].
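A minimal numerical sketch of the definitions above (not taken from the paper): with centred data and the kernel w(a, b) = ab we have E[w(X_1, y)] = 0 for every y, so the order-2 U-statistic is degenerate and its variance decays like n^{-2} rather than the usual n^{-1}. This faster rate for the degenerate part is exactly what the Hajek decomposition exploits.

```python
# Degenerate order-2 U-statistic: kernel w(a, b) = a * b with E[X] = 0.
import numpy as np

def u_statistic(x):
    """Order-2 U-statistic with kernel w(a, b) = a * b, averaged over i != j."""
    n = len(x)
    s = np.outer(x, x)
    np.fill_diagonal(s, 0.0)          # keep only the off-diagonal pairs
    return s.sum() / (n * (n - 1))

rng = np.random.default_rng(1)
var_by_n = {}
for n in (50, 200):
    reps = np.array([u_statistic(rng.normal(size=n)) for _ in range(2000)])
    var_by_n[n] = reps.var()

# For a degenerate kernel, Var ~ 2 sigma^4 / (n (n - 1)): quadrupling n
# should shrink the variance by roughly a factor of 16, not 4.
print(var_by_n[50] / var_by_n[200])
```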
To establish the two following lemmas, Lemma B.3 and Lemma B.4, we rely on (H1), (H2) and (H3). One might note that conditions (a) and (b) in (H2) on the kernel are not used in either of these lemmas.

Proof.
The proof follows from a decomposition into three terms, A_n, B_n and C_n. We will show that h_n^d A_n → v_K λ(S) in probability and that h_n^d C_n → 0 in probability. This will be enough as B_n ≥ 0 almost surely.
Proof that h_n^d A_n → v_K λ(S) in probability. Introduce the notation, for any h > 0, u_h(x, y, z) = a_h(x, y) a_h(x, z) 1_{x ∈ S}. Developing, we find an expression in which u_{h_n}(i, j, k) is a shorthand for u_{h_n}(X_i, X_j, X_k). We treat A_n by relying on the Hajek projection of U-statistics. Up to a centering term, E[u_{h_n}(i, j, k) | X_j, X_k], the U-statistic A_n is a degenerate U-statistic. In the following, we deliberately introduce this centering term in the summation so as to handle separately a degenerate U-statistic and another summation with fewer indices; this yields the decomposition (10).

Treatment of the first term in (10). Note that w_{h_n}(i, j, k) defines a degenerate U-statistic, where w̄_h denotes the symmetrized version of w_h, i.e., for any triplet (x_1, x_2, x_3), w̄_h(x_1, x_2, x_3) = (3!)^{-1} Σ_σ w_h(x_{σ(1)}, x_{σ(2)}, x_{σ(3)}), the sum being over all 3! permutations σ of the set {1, 2, 3}. Using that the U-statistic with kernel w̄_{h_n} is degenerate, some algebra, together with Minkowski's inequality and the definition of the conditional expectation, yields a second-moment bound. Consequently, by virtue of (15) in Lemma B.7, the resulting rate, multiplied by h_n^{2d}, goes to 0; hence this term is negligible.

Treatment of the second term in (10). We continue the study of A_n. The first term is a degenerate U-statistic of order 2 whose second-order moments are controlled by following exactly the same lines as in the treatment of the U-statistic with kernel w̄_n and using (18) in Lemma B.7. As n^{-2} h_n^{-3d} × h_n^{2d} → 0, this term is negligible. The second term is a sum of centred independent random variables whose variance is, by virtue of (14) in Lemma B.7, of the same order as the rate obtained for the (negligible) U-statistic of order 3 with kernel w̄_n.
Treatment of the third term in (10). The term associated with the double summation over j and k is a degenerate U-statistic. Consequently, following the same lines as in the treatment of the first term of A_n, and using (19) in Lemma B.7, we get a bound which goes to 0 when multiplied by h_n^{2d}. The remaining term is a sum of independent and identically distributed random variables; computing the variance of the centred average, the first term, using (17) in Lemma B.7, is O(n^{-1/2} h_n^{-d}), which goes to 0 when multiplied by h_n^d. The dominating term is in fact the last one, which, by Lemma B.8, yields the stated limit.

Proof that h_n^d C_n → 0 in probability. We rely on decompositions similar to those used for A_n, involving U-statistics. Let ρ_{h_n}(x) = (f_{h_n}(x) − f_0(x) + ∆_n q(x))/f_0(x) and note that, by virtue of Lemma B.9, the supremum of ρ_{h_n} over S is controlled. The term on the left is a degenerate U-statistic: using (13) in Lemma B.7 and the previous bound for ρ_{h_n} on S, we find a bound involving V_h, defined in (12), which, multiplied by h_n^{2d}, goes to 0. Using that E[a_{h_n}(1, 2) ρ_{h_n,1} 1_{X_1 ∈ S}] = 0 and (16), the variance of the term on the right in C_n is smaller than a quantity which, multiplied by h_n^{2d}, goes to 0 by hypothesis. Hence h_n^d C_n → 0 in probability and the proof is complete.
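The limit h_n^d A_n → v_K λ(S) is driven by the second moment of the kernel, and its pointwise analogue is easy to check numerically. The sketch below (with assumed choices of f_0, K and h, in dimension d = 1) illustrates the classical fact underlying the moment bounds of Lemma B.7: n h Var(f̂_h(x)) → f_0(x) v_K, where v_K = ∫ K(u)^2 du and f̂_h is the kernel density estimator.

```python
# Monte Carlo check of the pointwise kernel-variance asymptotics
#   n h Var(f_hat_h(x0)) -> f_0(x0) * v_K,   v_K = \int K(u)^2 du,
# for f_0 = N(0, 1) and a Gaussian kernel (illustrative choices).
import numpy as np

rng = np.random.default_rng(3)
x0, n, h = 0.0, 2000, 0.05
K = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)  # Gaussian kernel

# Variance of the kernel estimator at x0 over repeated samples from N(0, 1).
vals = [K((x0 - rng.normal(size=n)) / h).mean() / h for _ in range(2000)]
est = n * h * np.var(vals)

v_K = 1.0 / (2.0 * np.sqrt(np.pi))   # \int K(u)^2 du for the Gaussian kernel
target = v_K * K(0.0)                # f_0(x0) * v_K, the asymptotic value
print(est, target)
```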
Proof. The decomposition is given in (11). The expectation of the last term is n h_n^d ∆_n ∫ q(x) dx, which goes to 0 by assumption. We can now focus on the first and second terms of the decomposition.

Treatment of the second term in (11). Using that ∫ K(u) du = 1, the considered term is a centred empirical sum. Using Lemma B.9, its variance is bounded by a quantity which goes to 0.
Treatment of the first term in (11). Using that ∫ K(u) du = 1, one can verify that it is a degenerate U-statistic. Here the variance cannot be computed directly because the expectation appearing in the leading term is not necessarily finite. Hence we decompose according to whether the X_i belong to S_{b_n} or not, with b_n = (ε/(n h_n^d))^{1/β}, where β is given in (H3) and ε > 0. We introduce the kernel k and define the linear operator Q_P : L_2(P) → L_2(P) accordingly. Because E[k(X_1, y)] = E[k(X_1, X_2)] = 1 for all y ∈ R^d, one sees that the projected kernel is centred. Because the summation over Q_P(k 1_{S_{b_n}}) is a degenerate U-statistic, we get the stated bound,
where the first term is O_P(1).

B.5 Auxiliary results
Recall some definitions: for any h > 0, u_h(x, y, z) = a_h(x, y) a_h(x, z) 1_{x ∈ S}; we also use the shorthand g(i, j, k) for g(X_i, X_j, X_k).
Lemma B.7. Under (H1) and (H2), if S ⊂ R^d is such that f_0(x) > b > 0 for all x ∈ S, then the bounds (13)-(19) hold for any h > 0, where the constants C_k, k = 0, ..., 7, depend on K and f_0 only.
Proof. Remark that, because K is bounded and ∫ |K(u)| du < ∞, we have ∫ |K(u)|^k du < ∞ for any k ≥ 1, and this holds for every h > 0. We then obtain (14), (15), (16) and (17) by direct computations along these lines, and (18) follows similarly.

Whenever f_0 ∈ P, it holds that θ̂_n → θ_0 in probability ([31], Theorem 2.1, or [48], Lemma 5.35). For now, asking the above Lipschitz condition to guarantee the Glivenko-Cantelli property might seem a bit restrictive ([31], Lemma 2.4), but this condition will also be required to derive the asymptotic normality of θ̂_n as well as to obtain uniform convergence (over x ∈ R^d) of f_{θ̂_n}(x) to f_{θ_0}(x). Indeed, for any δ > 0, with probability going to 1, θ̂_n ∈ B(θ_0, δ). Hence, using the mean-value theorem, we find, for every x ∈ R^d, that |f_{θ̂_n}(x) − f_{θ_0}(x)| is bounded by sup_{θ ∈ B(θ_0, δ)} |ḟ_θ(x)| × ‖θ̂_n − θ_0‖. Conclude using that sup_{θ ∈ B(θ_0, δ)} ‖ḟ_θ‖_∞ is bounded and the convergence in probability of θ̂_n to θ_0.
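The final mean-value argument can be illustrated on an assumed example, the Gaussian location family f_θ = N(θ, 1): the uniform (in x) deviation of f_{θ̂} from f_{θ_0} is controlled by a Lipschitz constant, sup over θ and x of |∂f_θ(x)/∂θ|, times the parameter error.

```python
# Sketch of the mean-value bound
#   sup_x |f_{theta_hat}(x) - f_{theta_0}(x)| <= L * |theta_hat - theta_0|,
# for the Gaussian location family (an assumed, illustrative choice of P).
import numpy as np

rng = np.random.default_rng(2)
theta0 = 0.0
theta_hat = rng.normal(theta0, 1.0, 5000).mean()   # MLE of the location

xs = np.linspace(-6.0, 6.0, 2001)                  # grid approximating sup_x
phi = lambda t: np.exp(-0.5 * (xs - t) ** 2) / np.sqrt(2.0 * np.pi)
lhs = np.abs(phi(theta_hat) - phi(theta0)).max()

# For this family, sup_{u} |u| * phi_std(u) is attained at u = 1, which
# gives the Lipschitz constant L = exp(-1/2) / sqrt(2 pi).
lip = np.exp(-0.5) / np.sqrt(2.0 * np.pi)
rhs = lip * abs(theta_hat - theta0)
print(lhs <= rhs + 1e-12)
```

As θ̂_n → θ_0 in probability, the right-hand side vanishes, giving the uniform convergence of f_{θ̂_n} to f_{θ_0} claimed above.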