Avoiding Boosting Overfitting by Removing "Confusing Samples" | Computer Graphics and Multimedia Laboratory

Avoiding Boosting Overfitting by Removing "Confusing Samples"

Introduction

Boosting methods are known to exhibit noticeable overfitting on some real-world datasets while being immune to overfitting on others. In this project we consider the case of overlapping class distributions and show that standard boosting algorithms based on minimization of average loss are not appropriate in this case, and that this inadequacy is the major source of boosting overfitting on real-world data. To verify this conclusion, we exploit the fact that a task with overlapping classes can be reduced to a deterministic task with the same Bayesian separating surface by removing "confusing samples" – samples that would be misclassified by a "perfect" Bayesian classifier. We propose an algorithm for removing confusing samples and experimentally study the behavior of boosting trained on the pruned data. Experiments confirm that removing confusing samples helps boosting achieve a lower generalization error and avoid overfitting on both synthetic and real-world data. The removal process also yields a prediction of the test error from the training set alone, which experiments show to be quite accurate.
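The pruning idea can be sketched as follows. This is an illustrative approximation only, not the authors' published algorithm: since the true Bayes classifier is unknown, out-of-fold ensemble predictions are used here as a stand-in for it, and the function name `remove_confusing_samples` and its parameters are hypothetical.

```python
# Illustrative sketch (NOT the authors' exact procedure): approximate the
# unknown Bayes classifier with cross-validated ensemble predictions, then
# drop "confusing samples" -- points whose out-of-fold prediction disagrees
# with their given label -- before the final boosting run.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_predict

def remove_confusing_samples(X, y, n_folds=5):
    """Return (X, y) with likely confusing samples removed."""
    clf = AdaBoostClassifier(n_estimators=100)  # depth-1 trees (stumps) by default
    # Out-of-fold predictions serve as a proxy for the Bayes classifier:
    # each sample is predicted by a model that never saw it during training.
    y_oof = cross_val_predict(clf, X, y, cv=n_folds)
    keep = y_oof == y
    return X[keep], y[keep]
```

A final booster would then be trained on the pruned `(X, y)`; for overlapping class distributions a noticeable fraction of samples is typically removed.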
Experiments

To study the behavior of boosting combined with the removal of confusing samples, we conducted a set of experiments on both synthetic and real-world data (9 UCI datasets). We compare the performance of boosted stumps trained on the full training set with that of boosted stumps trained on the pruned dataset. We use the boosting algorithm described by Schapire, R., & Singer, Y. (1999). Stumps were chosen as base learners to avoid possible issues with base-learner complexity pointed out by Reyzin, L., & Schapire, R. (2006).
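For reference, the boosted-stumps setup can be sketched as a minimal discrete AdaBoost with decision stumps. This is a simplified sketch, not the confidence-rated variant of Schapire & Singer (1999) that the experiments use, and the function names are ours:

```python
# Minimal discrete AdaBoost with decision stumps (threshold on one feature).
# Labels y must be in {-1, +1}.
import numpy as np

def train_stump(X, y, w):
    """Exhaustively pick the stump minimizing the weighted error under w."""
    best = (1.0, 0, 0.0, 1)  # (error, feature, threshold, polarity)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, j] - thr) > 0, 1, -1)
                err = np.sum(w[pred != y])
                if err < best[0]:
                    best = (err, j, thr, pol)
    return best

def adaboost(X, y, n_rounds=50):
    """Return a list of weighted stumps fitted by reweighting the data."""
    w = np.full(len(y), 1.0 / len(y))
    ensemble = []
    for _ in range(n_rounds):
        err, j, thr, pol = train_stump(X, y, w)
        err = max(err, 1e-10)                     # guard against log(0)
        alpha = 0.5 * np.log((1 - err) / err)     # stump weight
        pred = np.where(pol * (X[:, j] - thr) > 0, 1, -1)
        w *= np.exp(-alpha * y * pred)            # up-weight mistakes
        w /= w.sum()
        ensemble.append((alpha, j, thr, pol))
    return ensemble

def predict(ensemble, X):
    """Sign of the weighted vote of all stumps."""
    score = np.zeros(len(X))
    for alpha, j, thr, pol in ensemble:
        score += alpha * np.where(pol * (X[:, j] - thr) > 0, 1, -1)
    return np.sign(score)
```

Note how the average-loss minimization shows up in the weight update: samples on the wrong side of the Bayes boundary keep accumulating weight, which is exactly the mechanism the project identifies as the source of overfitting on overlapping classes.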

Publications

Vezhnevets A., Barinova O., 2007. Avoiding boosting overfitting by removing "confusing" samples. Accepted to the 18th European Conference on Machine Learning (ECML 2007). (presentation)

Download Implementation

You can download an implementation of our algorithm, which uses the GML AdaBoost Matlab Toolbox.

Download GML AdaBoost Matlab Toolbox 0.3

Download Algorithm Implementation

The project team

Postgraduate students:

  • Alexander Vezhnevets
  • Olga Barinova

Contacts

Vezhnevets Alexander
avezhnevets@graphics.cs.msu.ru