Consistency of the k-Nearest Neighbor Classifier for Spatially Dependent Data

The purpose of this paper is to investigate the k-nearest neighbor classification rule for spatially dependent data. Some spatial mixing conditions are considered, and under such spatial structures the well-known k-nearest neighbor rule is proposed to classify spatial data. We establish consistency and strong consistency of the classifier under mild assumptions. Our main results extend the consistency results of the i.i.d. case to the spatial case.

One of the major issues in spatial analysis is classification and pattern recognition. For example, in remote sensing technology or digital geographic information, we need somehow to classify spatial data into patterns, or images into types. Recently, Ref. [22] proposed a novel probabilistic model for classification that incorporates a network's structure into the classical logistic regression model. This model is mostly used to classify data produced by social network analysis, taking into account the connections between nodes but without any influence of the spatial coordinates. References [17][18][19][20][21] deal with kernel-based rules to classify temporally and spatially dependent data, and study asymptotic properties of the classifiers. The aim of the present paper is to investigate whether the classical k-nearest neighbor classifier can be extended to classify spatial data. To the best of our knowledge, this work is the first one dealing with k-nearest neighbor classification of spatial data. The k-nearest neighbor method for density estimation, regression, or data classification has been widely used and studied for many years in the i.i.d. case. Key references on this topic are Refs. [2][3][4][5][6]. The use of the k-nearest neighbor method in the spatial case is due to [15], for density estimation. The real interest of the k-nearest neighbor method comes from the nature of the smoothing parameter. Indeed, in the traditional kernel method, the smoothing parameter is the bandwidth, which is a positive real number. Here, the number of neighbors k is the smoothing parameter, and it takes its values in a discrete set. As we said previously, the other very important aspect of this method is that it allows the construction of a neighborhood adapted to the local structure of the data. The main difficulties with the kernel method appear when data are sparse; choosing the number of neighbors allows one to avoid this problem and adapts to the concentration of the data.
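This adaptivity of the k-nearest neighbor neighborhood can be illustrated with a short sketch (in Python; the helper name and toy data are ours, for illustration only): the distance to the k-th nearest neighbor shrinks where the data are dense and grows where they are sparse, whereas a fixed kernel bandwidth does not adapt.

```python
import numpy as np

rng = np.random.default_rng(0)

# A 1-D sample that is dense near 0 and sparse near 10.
dense = rng.normal(0.0, 0.5, size=500)
sparse = rng.normal(10.0, 0.5, size=20)
sample = np.concatenate([dense, sparse])

def knn_radius(x, data, k):
    """Distance from x to its k-th nearest neighbor in `data`:
    the data-driven analogue of a kernel bandwidth."""
    return np.sort(np.abs(data - x))[k - 1]

k = 10
r_dense = knn_radius(0.0, sample, k)    # tight neighborhood in the dense region
r_sparse = knn_radius(10.0, sample, k)  # much wider neighborhood in the sparse region
print(r_dense < r_sparse)  # True
```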
Consistency of kernel-based rules for temporally or spatially dependent data has recently been investigated by [17][18][19][20][21] in finite- and infinite-dimensional spaces. In this paper, we establish the (strong) consistency of the k-nearest neighbor classifier for spatially dependent data. Let {(X_i, Y_i)}_{i ∈ Z^N} be a random field defined on some probability space (Ω, F, P) and taking values in R^d × {0, 1}. In the classification problem, for each i ∈ Z^N, X_i is a vector of features and Y_i is the label (class) of X_i. A point i = (i_1, . . . , i_N) ∈ Z^N will be referred to as a site. For n = (n_1, . . . , n_N) ∈ (N*)^N, we define the rectangular region I_n by I_n = {i ∈ Z^N : 1 ≤ i_l ≤ n_l, ∀ l = 1, . . . , N}. We write n → ∞ if min_{1≤l≤N} n_l → ∞. Define n̂ = n_1 × · · · × n_N = card(I_n). We wish to predict the label Y_j of a new observation X_j. The pair (X_j, Y_j) may be described by μ, the probability measure of X_j, and by η(x) = E[Y_j | X_j = x], the regression of Y_j on X_j = x. Assume that for each i ∈ Z^N, (X_i, Y_i) has the same distribution as the pair (X, Y). We construct a classifier g : R^d → {0, 1} mapping X_j into the predicted label of X_j. The error rate, or risk, of a rule g is L(g) = P{g(X_j) ≠ Y_j}. It is minimized by the rule

g*(x) = 0 if P{Y_j = 0 | X_j = x} ≥ P{Y_j = 1 | X_j = x}, and g*(x) = 1 otherwise,

whose error rate L* = L(g*) is called the Bayes risk, and g* is called the Bayes rule. This optimal rule depends on the distribution of (X_j, Y_j), which is generally unknown, so we use the data D_n = {(X_i, Y_i) : i ∈ I_n} to construct a classifier g_n(x). The set D_n is called the training sample. The spatial version of the classical k-nearest neighbor rule is given by

g_n(x) = 0 if Σ_{i ∈ I_n} w_ni 1I{Y_i = 0} ≥ Σ_{i ∈ I_n} w_ni 1I{Y_i = 1}, and g_n(x) = 1 otherwise, (1.1)

where w_ni = 1/k if X_i is among the k nearest neighbors of x and w_ni = 0 otherwise. Observe that the distance between two observations in R^d, or between two sites in Z^N, is computed with the Euclidean distance.
We assume that μ is absolutely continuous with respect to the Lebesgue measure λ on R^d; in other words, X has a density f with respect to λ, so that we can avoid the messy technicalities needed to handle distance ties. If we let η_n(x) = Σ_{i ∈ I_n} w_ni Y_i be the k-nearest neighbor estimator of η(x), then (1.1) can be rewritten as

g_n(x) = 0 if η_n(x) ≤ 1/2, and g_n(x) = 1 otherwise.

The best we can expect from g_n(x) is to achieve the Bayes risk. Denote by L_n = L(g_n) the error rate of g_n. The classifier g_n(x) is called consistent if E L_n → L* as n → ∞, and strongly consistent if L_n → L* as n → ∞ with probability one. In this paper, we investigate both the consistency and the strong consistency of g_n under classical conditions.
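For concreteness, the plug-in rule based on η_n can be sketched as follows (a minimal Python illustration with uniform weights w_ni = 1/k on the k nearest neighbors; the function names and toy data are ours, not the paper's):

```python
import numpy as np

def eta_n(x, X_train, Y_train, k):
    """k-NN estimate of eta(x) = E[Y | X = x]: the average label of the
    k training features closest to x in Euclidean distance."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]    # indices of the k nearest neighbors
    return Y_train[nearest].mean()     # sum_i w_ni * Y_i with w_ni = 1/k

def g_n(x, X_train, Y_train, k):
    """Plug-in k-NN classifier: predict 1 iff eta_n(x) > 1/2."""
    return 1 if eta_n(x, X_train, Y_train, k) > 0.5 else 0

# Toy training sample: label 1 iff the first coordinate is positive.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 2))
Y_train = (X_train[:, 0] > 0).astype(int)

print(g_n(np.array([2.0, 0.0]), X_train, Y_train, k=5))   # 1
print(g_n(np.array([-2.0, 0.0]), X_train, Y_train, k=5))  # 0
```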

Mixing Conditions
Let us first recall the definitions of the mixing coefficients α, introduced by [14], and β, introduced by [13]. Let A and C be two sub-σ-algebras of F. The α-mixing coefficient between A and C is defined by

α(A, C) = sup{ |P(A ∩ C) − P(A)P(C)| : A ∈ A, C ∈ C },

and the β-mixing coefficient by

β(A, C) = E[ sup_{C ∈ C} |P(C | A) − P(C)| ].

For the proof of Lemma 3.2 (Berbee's lemma), we refer the reader to [1]. Denote by S_{x,r} the closed ball centered at x with radius r > 0.
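The statement of Lemma 3.2 invoked below is the classical coupling result of Berbee; in its standard form (see [1]), with notation that is ours, it reads:

```latex
% Berbee's coupling lemma in its standard form; notation ours.
\textbf{Lemma (Berbee).} Let $X$ and $Y$ be random variables defined on
$(\Omega, \mathcal{F}, \mathbb{P})$, with $Y$ taking values in a Polish
space. Then, after enlarging the probability space if necessary, there
exists a random variable $Y^{*}$ such that
\[
  Y^{*} \stackrel{d}{=} Y, \qquad
  Y^{*} \ \text{is independent of}\ X, \qquad
  \mathbb{P}\{Y \neq Y^{*}\} = \beta\big(\sigma(X), \sigma(Y)\big).
\]
```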
where γ_d is the minimal number of cones of angle π/6, centered at the origin, needed to cover R^d.
We refer the reader to [6] for the proof of Lemma 3.3. The number γ_d defined in Lemma 3.3 exists according to [6, Lemma 5.5]. Now, we state the main results of this paper. In the following theorem, we investigate the consistency of the k-nearest neighbor rule.
Theorem 3.4 Suppose that D_n are observations of an α-mixing random field such that (3.1) holds. Suppose in addition that (5.13) is satisfied and that k → ∞ and k/n̂ → 0 as n → ∞. Then, as n → ∞, E L_n → L*.

Theorem 3.4 extends Stone's consistency theorem (see [11]) to the spatial case, when the probability measure μ is absolutely continuous, under a slight modification of Stone's condition on the smoothing parameter k. Condition (3.1) is weaker than that used by [3, Theorem II.3] in the i.i.d. case (see also [4, Theorem 1]). In the following theorem, we investigate the strong consistency of the k-nearest neighbor rule.

Theorem 3.5 Suppose that D_n are observations of a β-mixing random field such that (3.2) and (3.3) hold. Then, as n → ∞, L_n → L* with probability one.
Theorem 3.5 extends the strong consistency result of [6, Theorem 11.1] to the spatial case under some mild additional conditions on the smoothing parameter k. Observe that if α(t) ≤ C t^{−θ} and β(t) ≤ C t^{−θ} for some constant C > 0, then α(t) and β(t) tend to zero as t → ∞ at a polynomial rate. In addition, if we take for example p = n̂^{1/(2N)}, then (5) and (6) are satisfied for some θ > 4N.

Numerical Results
In this section, some numerical results obtained by simulation are presented. We consider a two-dimensional space (N = 2), with the random field simulated on a rectangular region of n_1 × n_2 sites. Without loss of generality, we take n_1 = n_2 = n. We focus on the case where X_(i,j) = (X_1,(i,j), X_2,(i,j)), where the X_1,(i,j) are dependent normal variables with mean 0, variance 0.5 and covariance function c(u) = 0.5 exp(−‖u‖) for all u ∈ R^2 with u ≠ 0, and the X_2,(i,j) are independent normal variables with mean 0 and variance 0.5. We let Y_(i,j) = 1 if sin(X_1,(i,j) − X_2,(i,j)) > sin(X_1,(i,j) + X_2,(i,j)), and Y_(i,j) = 0 otherwise. The R statistical programming environment is used to run the simulations. First, we give a typical example of the above scenario for n = 25; Figure 1 shows the labels of the 625 feature vectors X_(i,j) at their sites on the region I_(25,25). Now, for each n ∈ {20, 30, 40, 50}, we simulate a sample of size n^2 on the rectangular region I_(n,n). Each sample is then split into two sets: the first contains n^2 − 100 elements of the sample for training, and the other contains 100 elements for testing. Figure 2 displays the labeled samples for n = 20, 30, 40, 50. We apply the cross-validation (CV) criterion to the training samples to choose the smoothing parameter k, varying k over several values and selecting the one minimizing

CV(k) = (1/(n^2 − 100)) Σ_i 1I{g^{−i}_(n,n)(X_i) ≠ Y_i},

where g^{−i}_(n,n)(X_i) denotes the k-nearest neighbor rule based on leaving out the pair (X_i, Y_i), and the summation is taken over all sites of a training sample. It is desirable for k to be odd, to make ties less likely. Then, for each n, we estimate the misclassification error rate (ER) using the associated test sample, i.e.,

ER = (1/100) Σ_i 1I{g_(n,n)(X_i) ≠ Y_i},

where the summation is taken over all sites of a test sample and 1I_A denotes the indicator of A.
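For readers who wish to reproduce the scenario outside R, a minimal Python sketch is given below. The grid size, seed, k, and train/test split here are ours and smaller than in the study above; the dependent component X_1 is drawn from a multivariate normal whose covariance between sites u and v is 0.5·exp(−‖u − v‖).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10  # small illustrative grid (the study above uses n up to 50)
sites = np.array([(i, j) for i in range(1, n + 1)
                         for j in range(1, n + 1)], dtype=float)

# Covariance of the dependent component: c(u) = 0.5 * exp(-||u||),
# so the variance at lag 0 is 0.5, matching the text.
diff = sites[:, None, :] - sites[None, :, :]
cov = 0.5 * np.exp(-np.linalg.norm(diff, axis=2))

X1 = rng.multivariate_normal(np.zeros(len(sites)), cov)  # dependent field
X2 = rng.normal(0.0, np.sqrt(0.5), size=len(sites))      # independent N(0, 0.5)
Y = (np.sin(X1 - X2) > np.sin(X1 + X2)).astype(int)      # labels

X = np.column_stack([X1, X2])

def knn_predict(x, X_train, Y_train, k):
    """Majority vote among the k nearest training features."""
    nearest = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
    return int(Y_train[nearest].mean() > 0.5)

# Hold out the last 20 sites as a test set and estimate the error rate.
X_tr, Y_tr, X_te, Y_te = X[:-20], Y[:-20], X[-20:], Y[-20:]
preds = np.array([knn_predict(x, X_tr, Y_tr, k=5) for x in X_te])
err = float(np.mean(preds != Y_te))
print("estimated test error rate:", err)
```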
Table 1 includes the optimal chosen values of k together with the corresponding estimated misclassification error rates for one replication of each n.
To check the robustness of the proposed classifier, the above simulation is replicated 100 times, and the average error rate (AER) is obtained by averaging the error rates over the corresponding 100 test samples for each value of n. For each replication, we keep the chosen values of k listed in Table 1. Table 2 displays the average misclassification error rates corresponding to n ∈ {20, 30, 40, 50}. It shows that the AER decreases as the size of the training sample increases, which makes the results of this simulation study consistent with the theoretical results.

Proofs
Define ρ_n = ρ_n(x) as the solution of the equation

μ(S_{x, ρ_n(x)}) = k / n̂. (5.1)

Note that the solution always exists, since X has a density by assumption.

Proof of Theorem 3.4 By Theorem 2.2 in [6], we have

L_n − L* ≤ 2 ∫_{R^d} |η_n(x) − η(x)| μ(dx). (5.2)

Hence, it suffices to prove that, as n → ∞,

E ∫_{R^d} |η_n(x) − η(x)| μ(dx) → 0. (5.3)

Clearly, by (5.1), condition (1.2) implies that ρ_n → 0 as n → ∞. By Lebesgue's density theorem together with (5.1), we obtain the convergence (5.4) as n → ∞, for μ-almost all x ∈ R^d. Since |Y| ≤ 1, the dominated convergence theorem then yields (5.5) as n → ∞. Therefore, by (5.4)-(5.5), it suffices to prove the convergence (5.6) as n → ∞. We have the inequality (5.7), and we thus prove that the two terms on the right-hand side of (5.7) tend to zero as n → ∞.
For the first term, by the Cauchy-Schwarz inequality, we get (5.8). On the one hand, by (5.1), we have (5.9). On the other hand, by Lemma 3.1, we have (5.10) for some generic constant C > 0. Therefore, by (3.1) and (5.8)-(5.10) together with the dominated convergence theorem, we get (5.11). It remains to prove that the second term on the right-hand side of (5.7) tends to zero as n → ∞. To do that, let X_(k)(x) be the k-th nearest neighbor of x, and denote by η̃_n the corresponding estimator. Hence, we prove that, as n → ∞, (5.13) holds. Observe that η̃_n(x) = η_n(x) if we let Y_i = 1 for all i ∈ I_n. Consequently, the proof of (5.13) is the same as that of (5.11). Finally, combining (5.4)-(5.7) and (5.11)-(5.13), we get (5.3), and the proof is completed.

Proof of Theorem 3.5 By (5.2), the proof is established if we prove the convergence (5.14) as n → ∞. By (5.4)-(5.5), it suffices to prove (5.15), and the proof of (5.15) is established if we prove (5.16) and (5.17). We first prove (5.16). To this aim, we use the block decomposition introduced by [7] (see also [16]), which will be useful afterwards. Without loss of generality, suppose that for each l = 1, . . . , N, n_l = 2 p q_l, where p = p(n) and q_l = q_l(n) are strictly positive integers with p(n) ∈ [1, min_{1≤l≤N} n_l / 2] such that (3.2) and (3.3) hold. Let J_q be the corresponding set of block indices. We have card(J_q) = ∏_{l=1}^{N} q_l =: r. We define the blocks S_j^{(i)}, i = 1, . . . , 2^N, for each j ∈ J_q. One can easily prove that, for all j ∈ J_q, card(S_j^{(i)}) = p^N, and that the blocks associated with j ≠ j′ are disjoint. For each i = 1, . . . , 2^N and j ∈ J_q, define the vectors W_j^{(i)}, and let ψ : {1, . . . , r} → J_q be a bijection. We can define a lexicographic order relation ≤_lex on J_q as follows: ψ(m) ≤_lex ψ(m′) if m ≤ m′. For any j ∈ J_q, we can find m ∈ {1, . . . , r} with ψ(m) = j. Now, we use Lemma 3.2 together with a block decomposition similar to that introduced by [7] (see also [16]) on the family of vectors W^{(i)}_{ψ(m)}, m = 1, . . . , r, to generate copies W̃^{(i)}_{ψ(m)}, m = 1, . . . , r, such that they are mutually independent and, for all m ∈ {1, . . . , r}, W̃^{(i)}_{ψ(m)} has the same distribution as W^{(i)}_{ψ(m)}. Then, for any ε > 0, we obtain the corresponding probability bound, and we first find an upper bound for A_n.
By Markov's inequality, we obtain the corresponding bound; hence, it suffices to find an upper bound, for example, for the first term. To do that, we reconsider the block decomposition and the lexicographic order defined above. Denote, for each m = 1, . . . , r, the corresponding quantity. With the same method as was used to prove (5.11), we can easily prove that E F(W̃_1, . . . , W̃_r) → 0.
For the proof of (5.17), we let Y_i = 1 for all i ∈ I_n in η_n(x) = (1/k) Σ_{i ∈ I_n} Y_i 1I{X_i ∈ S_{x, ρ_n(x)}}. Consequently, if we proceed similarly to (5.12), we can easily show that the proof of (5.17) is the same as that of (5.16), and the proof is completed.