Avoiding Boosting Overfitting by Removing "Confusing Samples"
Alexander Vezhnevets, Olga Barinova
Moscow State University, Dept. of Computational Mathematics and Cybernetics, Graphics and Media Lab


Motivation: Boosting overfitting "mystery"
Boosting is known to sometimes reduce test error even after training error reaches zero
  Basically no overfitting

Or to overfit heavily
  Reported to sometimes be worse than a weak learner after a single iteration

*No noise was added to the data!


Problem posing and definitions
Let T = {(x_i, y_i)}, i = 1, ..., n, be realizations of random variables sampled i.i.d. from an unknown distribution P(x, y).
Note that we consider the general case of overlapping classes, which means that for some instances the probability of both labels is positive: ∃x : p(x, +1) > 0, p(x, -1) > 0

No noise is added (malicious noise, label noise, etc.)

Goals:
  Find the source of overfitting on real-world data
  Explain why sometimes no overfitting occurs
  Find a way to counter overfitting


Related work
Regularization: penalize behavior
  MadaBoost: bounding of training sample weights [Domingo et al.]
  Hypotheses weight shrinkage [Friedman]
  Soft margins: avoid fitting samples with large negative margins [Ratsch]
  ...

Loss function bounding
  Sigmoid cost function [Mason et al.]
  Robustifying the loss function by weight decay for observations [Rosset]
  ...

Noise tolerance: explicit model of noise
  EM algorithm [Krause & Singer]
  ...


Average empirical loss minimization
Given a convex loss function $C : \{-1, +1\} \times \mathbb{R} \to \mathbb{R}$, minimize the average empirical loss:

$\frac{1}{n} \sum_{i=1}^{n} C\big(y_i, F(x_i)\big) \;\to\; \min_F$

In Boosting: $C\big(y, F(x)\big) = \exp\big(-y\,F(x)\big)$, with $F(x) = \sum_{t=1}^{T} \alpha_t f_t(x)$, and at step t

$(\alpha_t, f_t) = \arg\min_{\alpha, f} \; \frac{1}{n} \sum_{i=1}^{n} \exp\Big(-y_i \Big(\sum_{k=1}^{t-1} \alpha_k f_k(x_i) + \alpha f(x_i)\Big)\Big)$
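To make the stagewise minimization concrete, here is a minimal sketch (not the authors' implementation) of discrete AdaBoost with decision stumps, which greedily minimizes this average exponential loss; the weight update is the closed-form solution of the per-round minimization, and all names are illustrative.

```python
# Minimal sketch of stagewise exponential-loss minimization:
# discrete AdaBoost with decision stumps. Illustrative only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_stumps(X, y, n_rounds=100):
    """y must take values in {-1, +1}. Returns the stumps f_t and weights alpha_t."""
    n = len(y)
    w = np.full(n, 1.0 / n)                     # normalized weights ~ exp(-y_i F(x_i))
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # closed-form minimizer for this round
        w *= np.exp(-alpha * y * pred)          # misclassified samples gain weight
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def decision_function(stumps, alphas, X):
    """F(x) = sum_t alpha_t f_t(x); sign(F(x)) is the predicted label."""
    return sum(a * s.predict(X) for s, a in zip(stumps, alphas))
```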






"True" empirical risk
Since there is no separability assumption:
$E\big[C(y, F(x)) \,\big|\, x\big] = C\big(+1, F(x)\big)\,P(+1 \mid x) + C\big(-1, F(x)\big)\,P(-1 \mid x)$

In case of a finite training set, the score of a fixed instance x in the average loss is

$\mathrm{Score}(x) = \frac{n_{x,+1}}{n_x}\, C\big(+1, F(x)\big) + \frac{n_{x,-1}}{n_x}\, C\big(-1, F(x)\big)$

where $n_{x,y}$ is the number of training copies of x with label y and $n_x = n_{x,+1} + n_{x,-1}$

Converges to the conditional expectation of the risk
Often in practice $n_x = 1$
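A tiny numeric sketch (made-up values) of how the score of a repeated instance approaches the conditional expectation of the loss as the number of copies n_x grows:

```python
# Illustrative check: the score of a repeated instance x under the exponential
# loss approaches E[C(y, F(x)) | x] as copies accumulate. Values are made up.
import numpy as np

def exp_loss(y, Fx):
    return np.exp(-y * Fx)

p_plus = 0.7                  # assumed true P(+1 | x)
Fx = 0.4                      # assumed classifier output F(x)
rng = np.random.default_rng(0)

for n_x in (1, 10, 10000):
    labels = rng.choice([+1, -1], size=n_x, p=[p_plus, 1 - p_plus])
    n_plus = np.sum(labels == +1)
    score = (n_plus / n_x) * exp_loss(+1, Fx) + ((n_x - n_plus) / n_x) * exp_loss(-1, Fx)
    expected = p_plus * exp_loss(+1, Fx) + (1 - p_plus) * exp_loss(-1, Fx)
    print(f"n_x = {n_x:5d}   Score(x) = {score:.3f}   E[C | x] = {expected:.3f}")
```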


Definitions
Confusing samples:
$\big\{ (x_i, y_i) \in T : P(-y_i \mid x_i) > 0.5 > P(y_i \mid x_i) \big\}$

Regular samples:
$\big\{ (x_i, y_i) \in T : P(-y_i \mid x_i) < 0.5 < P(y_i \mid x_i) \big\}$

The portion of confusing samples in the training set is governed by the Bayes error rate
Minimizing the risk at confusing samples leads to overfitting!
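For intuition, a minimal sketch on synthetic two-Gaussian data (all parameters assumed for illustration) that marks the confusing samples, i.e. training points whose own label has true posterior below 0.5:

```python
# Mark confusing samples in a synthetic two-Gaussian mixture where the true
# posterior P(y | x) is known. All distribution parameters are illustrative.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 1000
y = rng.choice([+1, -1], size=n)                              # equal class priors
x = rng.normal(loc=np.where(y == +1, 1.0, -1.0), scale=1.5)   # overlapping classes

def posterior_plus(x):
    """P(+1 | x) via Bayes' rule for the two-Gaussian model above."""
    lik_plus = norm.pdf(x, loc=1.0, scale=1.5)
    lik_minus = norm.pdf(x, loc=-1.0, scale=1.5)
    return lik_plus / (lik_plus + lik_minus)

p_own_label = np.where(y == +1, posterior_plus(x), 1.0 - posterior_plus(x))
confusing = p_own_label < 0.5    # misclassified even by the perfect Bayes classifier
print(f"confusing samples: {confusing.mean():.1%} of the training set")
```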


Removing confusing samples
Removing confusing samples reduces the task to a deterministic one with the same Bayes separating surface
In the deterministic case (no confusing samples), average loss minimization is adequate


Algorithm for removing confusing samples
Relation to robust estimation
Confusing samples are outliers of the "perfect" Bayesian model

RANSAC (RANdom SAmple Consensus) idea:
  Build many models from portions of the data
  Choose the model that fits the data best
  Prune outliers: samples that do not fit the best model

Not directly applicable to learning
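For reference, a minimal RANSAC sketch for robust line fitting (illustrative data and thresholds), showing the build-many-models / keep-the-consensus / prune-outliers pattern that the proposed algorithm adapts to learning:

```python
# Minimal RANSAC-style robust line fit; data and thresholds are illustrative.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.2, 100)
y[::10] += rng.uniform(5, 10, 10)                   # inject gross outliers

best_inliers, best_count = None, -1
for _ in range(200):                                # build many models from data portions
    idx = rng.choice(len(x), size=2, replace=False)
    coeffs = np.polyfit(x[idx], y[idx], deg=1)      # line through two random points
    residuals = np.abs(np.polyval(coeffs, x) - y)
    inliers = residuals < 0.5                       # consensus set of this model
    if inliers.sum() > best_count:
        best_inliers, best_count = inliers, inliers.sum()

# "Prune outliers": refit using only the consensus set of the best model
slope, intercept = np.polyfit(x[best_inliers], y[best_inliers], deg=1)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```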


Proposed algorithm
Iterate:
  Divide the data into three parts
  Use the first part for learning a model
  Use the second part for calibrating the probabilistic output (Platt scaling)
  Estimate conditional probabilities for the third part

Then remove samples for which the mean predicted probability of their own label is < ½


Proposed algorithm
Input: a training set $T = \{(x_i, y_i)\}_{i=1}^{n}$; number of epochs K

1. for k = 1 to K
   1. Divide the training set into three subsets: $T = \bigcup_{i=1}^{3} T_k^i$
   2. Train boosted weak learners on $T_k^1$ and obtain classifier $F(x)$
   3. Estimate calibration parameters A and B for the posterior output on $T_k^2$ using Platt scaling (or a logistic transform)
   4. Estimate the posterior probability of labels for the samples in $T_k^3$: $p_k(y \mid x_i)$
   end for
2. Calculate the average posterior probability: $\bar{p}(y \mid x_i) = \operatorname*{mean}_{k\,:\, x_i \in T_k^3} \big( p_k(y \mid x_i) \big)$
3. Construct the reduced training set $T' = \big\{ (x_i, y_i) : \arg\max_y \bar{p}(y \mid x_i) = y_i \big\}$

return $T'$
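A compact sketch of this procedure (not the authors' code): it assumes labels in {-1, +1} stored in NumPy arrays, uses scikit-learn AdaBoost with stumps as the boosted learner, and fits a logistic regression on the classifier scores as a stand-in for Platt scaling; all parameter values are illustrative.

```python
# Sketch of the pruning procedure: K epochs of random 3-way splits,
# train / calibrate / estimate posteriors, then drop samples whose own
# label has mean estimated posterior below 0.5. Illustrative only.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

def remove_confusing_samples(X, y, n_epochs=10, n_estimators=100, seed=0):
    """X, y are NumPy arrays; y takes values in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n = len(y)
    post_sum = np.zeros(n)        # running sum of estimated P(y_i | x_i)
    post_cnt = np.zeros(n)        # number of epochs in which sample i was held out

    for k in range(n_epochs):
        # 1. divide the training set into three random subsets
        train_idx, calib_idx, est_idx = np.array_split(rng.permutation(n), 3)

        # 2. train boosted stumps on the first subset
        #    (keyword is base_estimator in older scikit-learn versions)
        clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                                 n_estimators=n_estimators, random_state=k)
        clf.fit(X[train_idx], y[train_idx])

        # 3. calibrate the real-valued scores on the second subset
        #    (logistic regression on the decision function, a stand-in for Platt scaling)
        calib = LogisticRegression()
        calib.fit(clf.decision_function(X[calib_idx]).reshape(-1, 1), y[calib_idx])

        # 4. estimate P(y_i | x_i) for the held-out third subset
        scores = clf.decision_function(X[est_idx]).reshape(-1, 1)
        p_plus = calib.predict_proba(scores)[:, list(calib.classes_).index(1)]
        p_own = np.where(y[est_idx] == 1, p_plus, 1.0 - p_plus)
        post_sum[est_idx] += p_own
        post_cnt[est_idx] += 1

    # average over the epochs where each sample was held out; keep a sample if its
    # own label gets mean posterior >= 0.5 (samples never held out are kept as well)
    seen = post_cnt > 0
    keep = ~seen | (post_sum / np.maximum(post_cnt, 1) >= 0.5)
    return X[keep], y[keep]
```

After pruning, a single boosted classifier would be trained on the reduced set as usual, e.g. clf.fit(*remove_confusing_samples(X, y)).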


Synthetic data experiments: Two Gaussians


Synthetic data experiments: Boosted stumps decision surface


Synthetic data experiments: Pruned dataset


Synthetic data experiments: Training on pruned dataset


Real-world data experiments
9 datasets from the UCI repository
50-2 cross-validation
  Gives smooth results
  Allows approximate measurement of the pruning quality

Stumps as weak learners
  To avoid problems with complexity [Reyzin and Schapire]

Algorithms compared (a sketch of the comparison protocol follows below):
  Real AdaBoost
  Real AdaBoost on pruned data
  MadaBoost (regularized method)
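A sketch of the comparison protocol, under the assumption that "50-2 cross-validation" means 50 repetitions of a 2-fold split; it reuses the hypothetical remove_confusing_samples sketch above, and dataset loading is left out:

```python
# Hypothetical evaluation loop: repeated 2-fold CV comparing AdaBoost trained
# on the full training folds vs. on folds pruned by remove_confusing_samples
# (see the sketch above). Assumes "50-2 CV" = 50 repetitions of 2-fold CV.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RepeatedStratifiedKFold

def compare_full_vs_reduced(X, y, n_estimators=100):
    cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=50, random_state=0)
    err_full, err_reduced = [], []
    for train_idx, test_idx in cv.split(X, y):
        X_tr, y_tr, X_te, y_te = X[train_idx], y[train_idx], X[test_idx], y[test_idx]
        for errs, (X_fit, y_fit) in ((err_full, (X_tr, y_tr)),
                                     (err_reduced, remove_confusing_samples(X_tr, y_tr))):
            clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                                     n_estimators=n_estimators, random_state=0)
            clf.fit(X_fit, y_fit)
            errs.append(np.mean(clf.predict(X_te) != y_te))
    return 100 * np.mean(err_full), 100 * np.mean(err_reduced)   # error in percent
```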


UCI Datasets experiments
Error, % (± standard error over cross-validation runs); the last two columns are precision, %.

| Dataset | Estimated | Full, 100 it. | Reduced, 100 it. | Mada, 100 it. | Full, 1000 it. | Reduced, 1000 it. | Mada, 1000 it. | Regular samples precision | Conf. samples precision |
| Breast | 3.73 | 4.6±0.08 | 3.65±0.09 | 4.01±0.09 | 4.77±0.1 | 3.67±0.09 | 4.15±0.22 | 98.83 | 72.91 |
| Australian | 13.06 | 15.2±0.17 | 13.8±0.17 | 14.93±0.23 | 17.68±0.17 | 13.88±0.16 | 16.23±0.64 | 96.78 | 74.43 |
| German | 24.66 | 25.72±0.16 | 25.35±0.14 | 23.4±0.38 | 28.4±0.18 | 25.05±0.16 | 24.9±0.03 | 91.22 | 71.54 |
| Haberman | 25.96 | 29.67±0.29 | 26.33±0.31 | 27.78±0.31 | 34.50±0.36 | 26.37±0.32 | 34.97±0.1 | 92.14 | 77.14 |
| Heart | 18.34 | 21.41±0.36 | 18.36±0.30 | 16.66±0.58 | 23.13±0.36 | 18.07±0.30 | 20.37±0.35 | 92.60 | 68.26 |
| Pima | 24.03 | 25.58±0.20 | 23.99±0.16 | 25.26±0.17 | 28.07±0.20 | 24.10±0.17 | 28.26±0.12 | 93.38 | 79.29 |
| Spam | 5.79 | 6.19±0.04 | 6.02±0.04 | 5.59±0.03 | 6.35±0.04 | 5.97±0.04 | 6.26±0.03 | 98.83 | 78.01 |
| Tic-tac-toe | 6.49 | 8.47±0.20 | 13.59±0.29 | 12.84±0.03 | 2.04±0.05 | 2.12±0.08 | 1.67±0.07 | 97.83 | 35.70 |
| Vote | 4.51 | 4.75±0.13 | 4.61±0.1 | 5.51±0.43 | 5.90±0.14 | 4.63±0.10 | 7.35±0.29 | 99.51 | 88.84 |


UCI Datasets experiments

"Classical" overfitting ­ error is increasing after complexity optimum is passed

Error is not increasing with iterations, but removing confusing samples still helps a lot




Margin theory

$P\big( y\,f(x) < 0 \big) \;\le\; P_T\big( \mathrm{margin}(x, y) \le \theta \big) + \tilde{O}\!\left( \sqrt{ \frac{d}{n\,\theta^{2}} } \right)$
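For reference, the normalized margin in this bound is presumably the standard one of Schapire et al. (not spelled out on the slide):

```latex
\mathrm{margin}(x, y) \;=\; \frac{y \sum_{t=1}^{T} \alpha_t f_t(x)}{\sum_{t=1}^{T} \lvert \alpha_t \rvert}
```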




Discussion: classifier compression
Another view of the algorithm:
Build a large committee of classifiers trained on randomly subsampled data
Prune the samples that are misclassified by the "big committee"
Compress the classifier by training a single classifier on the reduced dataset only


Conclusion
Source of overfitting
Inadequacy of average loss minimization in the case of overlapping classes
"Confusing samples": samples that lie in the class overlap region and are misclassified even by the perfect Bayesian classifier

Contribution
An algorithm for removing confusing samples
Error prediction


Thank you for your attention!
Presentation is available at
research.graphicon.ru/...

Source code available at
http://research.graphicon.ru/machine-learning/avoiding-boostingoverfitting-by-removing-confusing-sam-3.html

Our contacts:
Alexander Vezhnevets: avezhnevets@graphics.cs.msu.ru
Olga Barinova: obarinova@graphics.cs.msu.ru