ProTraS: A probabilistic traversing sampling algorithm
Abstract
In knowledge discovery on big data, sampling is a building block that can be embedded in a more general framework to speed up existing algorithms and help them scale. As complexity grows, two challenging and connected problems arise: tuning and timing. ProTraS is a new algorithm that addresses both. It is driven by a single parameter, the sampling cost. The cost is overestimated using the maximum within-group distance and the group cardinality. The algorithm is iterative: at each step a new representative is added, chosen by farthest-first traversal as the item farthest from the representative in the group with the highest probability of cost reduction. The algorithm is robust to noise and optimized for running time. A detailed comparison with alternative algorithms, conducted on various synthetic and real-world data sets, shows that the proposal yields competitive results in terms of quality of representation for clustering, sampling size and sampling time.
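
The iteration described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions that the abstract does not fix: the function name `protras_sketch`, the Euclidean distance, the arbitrary choice of the first point as initial representative, the per-group bound taken as cardinality times maximum within-group distance, the selection of the group with the largest bound as a stand-in for "highest probability of cost reduction", and the stopping rule that compares the summed bounds divided by the number of points against a threshold `epsilon`.

```python
import math


def protras_sketch(points, epsilon):
    """Return indices of representatives chosen by a farthest-first style loop.

    Hypothetical sketch: the cost bound, group selection rule and stopping
    rule are assumptions, not the published ProTraS definitions.
    """
    n = len(points)
    if n == 0:
        return []
    reps = [0]                 # assumption: start from the first point
    assign = [0] * n           # group (position in reps) of every point
    dist = [math.dist(p, points[0]) for p in points]  # distance to own rep

    while True:
        k = len(reps)
        size = [0] * k         # group cardinalities
        far_d = [0.0] * k      # maximum within-group distance to the rep
        far_i = list(reps)     # index of the farthest member of each group
        for i in range(n):
            g = assign[i]
            size[g] += 1
            if dist[i] > far_d[g]:
                far_d[g], far_i[g] = dist[i], i

        # Assumed overestimate of the cost: cardinality * max distance.
        bounds = [size[g] * far_d[g] for g in range(k)]
        total = sum(bounds)
        if total == 0 or total / n <= epsilon:
            return reps

        # Assumed selection rule: the group with the largest bound.
        best = max(range(k), key=lambda g: bounds[g])
        new_rep = far_i[best]  # farthest item from that group's rep
        reps.append(new_rep)

        # Reassign every point that is strictly closer to the new rep.
        for i in range(n):
            d = math.dist(points[i], points[new_rep])
            if d < dist[i]:
                dist[i], assign[i] = d, k
```

Under these assumptions, calling `protras_sketch(points, epsilon)` on a list of coordinate tuples returns the indices of the selected representatives; a smaller `epsilon` yields a larger sample. Each pass rescans all points, so this sketch favors clarity over the time optimizations the paper claims.
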
Origin | Files produced by the author(s) |
---|