Theoretical analysis of cross-validation for estimating the risk of the k-nearest neighbor classifier
Abstract
The present work aims at deriving theoretical guarantees on the behavior of some cross-validation procedures applied to the k-nearest neighbors (kNN) rule in the context of binary classification. Here we focus on leave-p-out cross-validation (LpO), used to assess the performance of the kNN classifier. Remarkably, the LpO estimator can be computed efficiently in this context using closed-form formulas derived by Celisse and Mary-Huard (2011). We describe a general strategy for deriving moment and exponential concentration inequalities for the LpO estimator applied to the kNN classifier. These results are obtained first by exploiting the connection between the LpO estimator and U-statistics, and second by making intensive use of the generalized Efron-Stein inequality applied to the leave-one-out (L1O) estimator. Another important contribution is the derivation of new quantifications of the discrepancy between the LpO estimator and the classification error/risk of the kNN classifier. The optimality of these bounds is discussed by means of several lower bounds as well as simulation experiments.
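For concreteness, one standard way to write the LpO estimator is given below; the notation is assumed for illustration and is not taken verbatim from the paper. Given a sample $(X_1, Y_1), \dots, (X_n, Y_n)$ and a classification rule $\widehat{f}$, LpO averages the held-out error over all $\binom{n}{p}$ ways of removing $p$ observations:

$$
\widehat{R}_{p} \;=\; \binom{n}{p}^{-1} \sum_{e \in \mathcal{E}_p} \frac{1}{p} \sum_{i \in e} \mathbf{1}\!\left\{ \widehat{f}_{\bar{e}}(X_i) \neq Y_i \right\},
$$

where $\mathcal{E}_p$ is the collection of all size-$p$ subsets of $\{1, \dots, n\}$, $\bar{e}$ is the complement of $e$ used for training, and $\widehat{f}_{\bar{e}}$ denotes the kNN classifier built on the $n - p$ remaining observations.

The sum has $\binom{n}{p}$ terms, which is why naive enumeration is intractable beyond small $n$ and $p$, and why the closed-form formulas of Celisse and Mary-Huard (2011) matter in practice. The following minimal Python sketch of the naive computation assumes binary labels in $\{0, 1\}$ and Euclidean kNN with majority vote; the function names are hypothetical and do not come from the paper's code.

```python
# Naive leave-p-out (LpO) estimate of the kNN classification error.
# Illustrative sketch only: it enumerates all C(n, p) held-out subsets,
# which is exactly the combinatorial cost the closed-form formulas avoid.
from itertools import combinations
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Majority vote among the k nearest training points (Euclidean)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return int(np.round(y_train[nearest].mean()))  # binary labels in {0, 1}

def lpo_error(X, y, k, p):
    """Average held-out error over all C(n, p) leave-p-out splits."""
    n = len(y)
    splits = list(combinations(range(n), p))
    total = 0.0
    for held_out in splits:
        train = [i for i in range(n) if i not in held_out]
        X_tr, y_tr = X[train], y[train]
        errs = sum(knn_predict(X_tr, y_tr, X[i], k) != y[i] for i in held_out)
        total += errs / p
    return total / len(splits)

# Toy usage: 20 points in 2D with binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = (X[:, 0] + 0.3 * rng.normal(size=20) > 0).astype(int)
print(lpo_error(X, y, k=3, p=2))
```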