New Bayesian methods for model comparison in astrostatistics

How many components in a finite mixture?
Murray Aitkin
murray.aitkin@unimelb.edu.au

School of Mathematics and Statistics University of Melbourne Australia



Acknowledgements
Work supported by the US National Center for Education Statistics and the Australian Research Council.

Aim: to evaluate and develop general Bayesian model comparisons for arbitrary models through posterior likelihood ratios/posterior deviance differences. Two aspects:

§ Non-nested models - compute the distribution of the ratio of posterior likelihoods
§ Nested models - compute the posterior distribution of the likelihood ratio

Book treatment: Murray Aitkin (2010) Statistical Inference: an Integrated Bayesian/Likelihood Approach. Chapman and Hall/CRC


The galaxy recession velocity study
§ The data are the recession velocities of 82 galaxies from 6 well-separated sections of the Corona Borealis region (Postman, Huchra and Geller, The Astronomical Journal 1986).
§ Do these velocities clump into groups or clusters, or does the velocity density increase initially and then gradually tail off? This had implications for theories of evolution of the universe. If the velocities clump, the velocity distribution should be multi-modal.
§ Investigated by fitting mixtures of normal distributions to the velocity data; the number of mixture components necessary to represent the data - or the number of modes - is the parameter of particular interest.



Recession velocities in km/sec (/1000) of 82 galaxies
[Data display: the 82 velocities arranged over a velocity/1000 axis running from 9 to 34.]



Could these come from a single normal distribution?
[Figure: cdf vs. velocity (10-35) for the single-normal fit to the 82 velocities.]

Clearly not!


Clumping by mixtures
Mixture distributions are widely used to represent heterogeneity; mixtures of normals are the most common for data on a continuous scale. The general model for a K-component normal mixture has different means μ_k and variances σ_k² in component k, which makes up a proportion π_k of the population:

    f(y) = Σ_{k=1}^{K} π_k f(y | μ_k, σ_k),

where

    f(y | μ_k, σ_k) = (1 / (√(2π) σ_k)) · exp[-(y - μ_k)² / (2σ_k²)],

and the π_k are positive with Σ_{k=1}^{K} π_k = 1.

Fitting by ML is straightforward with an EM algorithm.
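As an illustration of this step (a minimal sketch, not code from the talk), the EM iteration for a K-component normal mixture can be written in a few lines of Python/NumPy; the function name, initialisation and stopping rule here are assumptions of the sketch.

    import numpy as np

    def em_normal_mixture(y, K, n_iter=500, tol=1e-8):
        # Minimal EM for a K-component univariate normal mixture (illustrative sketch).
        # Note: with component-specific variances the likelihood is unbounded, so a
        # degenerate component (a variance collapsing to zero) is a known hazard.
        y = np.asarray(y, dtype=float)
        n = len(y)
        pi = np.full(K, 1.0 / K)                              # mixing proportions
        mu = np.quantile(y, np.linspace(0.1, 0.9, K))         # crude spread of starting means
        sigma2 = np.full(K, y.var())                          # common starting variance
        old_ll = -np.inf
        for _ in range(n_iter):
            # E-step: responsibilities r[i, k] = P(component k | y_i, current parameters)
            dens = (np.exp(-0.5 * (y[:, None] - mu) ** 2 / sigma2)
                    / np.sqrt(2 * np.pi * sigma2))            # n x K component densities
            weighted = pi * dens
            ll = np.log(weighted.sum(axis=1)).sum()           # mixture log-likelihood
            r = weighted / weighted.sum(axis=1, keepdims=True)
            # M-step: weighted proportions, means and variances
            nk = r.sum(axis=0)
            pi = nk / n
            mu = (r * y[:, None]).sum(axis=0) / nk
            sigma2 = (r * (y[:, None] - mu) ** 2).sum(axis=0) / nk
            if abs(ll - old_ll) < tol:
                break
            old_ll = ll
        return pi, mu, np.sqrt(sigma2), ll

    # e.g. pi_hat, mu_hat, sigma_hat, ll = em_normal_mixture(velocities, K=3)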



Two normals
[Figure: cdf vs. velocity (10-35) for the two-component fit.]

μ̂_1 = 21.35, σ̂_1 = 1.88, π̂_1 = 0.740; μ̂_2 = 19.36, σ̂_2 = 8.15, π̂_2 = 0.260


Three normals
[Figure: cdf vs. velocity (10-35) for the three-component fit.]

A good fit?


Four normals
[Figure: cdf vs. velocity (10-35) for the four-component fit.]

A very close fit!


Six normals
[Figure: cdf vs. velocity (10-35) for the six-component fit.]

Is this really a better fit?


Roeder's analysis
Roeder computed a nonparametric estimate of the velocity density by optimizing a goodness-of-fit criterion based on spacings between the ordered observations. A simple kernel density estimate suggested three or more components; histograms gave inconsistent impressions depending on the bandwidth and limits. A surface generated by bandwidth variations suggested three or more components.
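A small sketch of the bandwidth point (illustration only, with stand-in data rather than Roeder's estimator or the real velocities): kernel density estimates at a few bandwidths, counting the modes that appear.

    import numpy as np
    from scipy.stats import gaussian_kde

    # Stand-in for the 82 recession velocities/1000 (the real data would be used here).
    rng = np.random.default_rng(0)
    velocities = np.concatenate([rng.normal(20, 2, 60), rng.normal(23, 1, 15),
                                 rng.normal(10, 0.5, 7)])

    grid = np.linspace(5, 40, 500)
    for bw in (0.5, 1.0, 2.0):                       # kernel standard deviations (assumed values)
        kde = gaussian_kde(velocities, bw_method=bw / velocities.std(ddof=1))
        d = kde(grid)
        n_modes = int(((d[1:-1] > d[:-2]) & (d[1:-1] > d[2:])).sum())
        print(bw, n_modes)                           # the apparent number of modes changes with bw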



Roeder kernel density graph



Roeder histograms graph



Roeder density surface graph



Model assessment and comparison
How do we decide whether four components are needed, more than four, or fewer than four? The most common frequentist methods are AIC, BIC and the bootstrap likelihood ratio test. These don't work well (see the references for explanation). The most common Bayesian method uses the integrated likelihood. This doesn't work well either. The DIC is used but is widely criticised. We consider first the simplest case of completely specified models.



Model comparisons - completely specified models

We have a random sample of data y = (y_1, ..., y_n) from a population which may be either of two completely specified distributions: Model 1 - f_1(y | θ_1) and Model 2 - f_2(y | θ_2), where θ_1 and θ_2 are known. The likelihoods and priors for the two models are:

§ L_1 = Π_i f_1(y_i | θ_1) and π_1,
§ L_2 = Π_i f_2(y_i | θ_2) and π_2 = 1 - π_1.

Then by Bayes's theorem, the posterior odds (ratio of posterior probabilities) for Model 1 over Model 2 is

    π_{1|y} / π_{2|y} = (L_1 / L_2) · (π_1 / π_2).

For equal prior probabilities π_1 = π_2, the RHS is the likelihood ratio, so a likelihood ratio of 9 gives a posterior probability of 9/(9+1) = 0.9 for Model 1.
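A small numerical sketch of this calculation (the models, data and prior probabilities below are assumed example values, not anything from the slides), working on the log scale to avoid underflow:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    y = rng.normal(loc=0.5, scale=1.0, size=50)        # example data

    # Two completely specified models: Model 1 = N(0.5, 1), Model 2 = N(0, 1)
    logL1 = norm.logpdf(y, loc=0.5, scale=1.0).sum()
    logL2 = norm.logpdf(y, loc=0.0, scale=1.0).sum()

    pi1, pi2 = 0.5, 0.5                                # equal prior probabilities
    log_post_odds = (logL1 - logL2) + np.log(pi1 / pi2)
    post_prob_model1 = 1.0 / (1.0 + np.exp(-log_post_odds))
    print(post_prob_model1)                            # posterior probability of Model 1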


Model comparisons - general models

The models are Model 1 - f_1(y | θ_1) and Model 2 - f_2(y | θ_2) as before, but now θ_1 and θ_2 are unspecified except for priors π_1(θ_1) and π_2(θ_2). The likelihood ratio now depends on θ_1 and θ_2. We eliminate this dependence by integrating out the unknown parameters with respect to their priors, to give integrated likelihoods:

    L̄_1 = ∫ L_1(θ_1) π_1(θ_1) dθ_1,    L̄_2 = ∫ L_2(θ_2) π_2(θ_2) dθ_2.

The Bayes factor (for the relative support for Model 1 over Model 2) is defined to be L̄_1/L̄_2, and the posterior odds on Model 1 over Model 2 is defined to be

    (L̄_1 / L̄_2) · (π_1 / π_2),

as though the integrated likelihoods are from completely specified models.
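For illustration, an integrated likelihood can be approximated by simple Monte Carlo over prior draws (a sketch under assumed models and priors; it is not the method advocated in these slides, and prior sampling is inefficient when the likelihood is sharp relative to the prior):

    import numpy as np
    from scipy.stats import norm
    from scipy.special import logsumexp

    rng = np.random.default_rng(2)
    y = rng.normal(0.5, 1.0, size=50)                  # assumed example data

    def log_integrated_likelihood(y, prior_mean, prior_sd, n_draws=100_000):
        # log Lbar = log of the prior-weighted average of L(theta), by Monte Carlo
        theta = rng.normal(prior_mean, prior_sd, size=n_draws)       # draws from the prior
        loglik = norm.logpdf(y[:, None], loc=theta, scale=1.0).sum(axis=0)
        return logsumexp(loglik) - np.log(n_draws)

    # The value moves with the prior parameters - the difficulty listed on the next slide.
    for prior_sd in (1.0, 10.0, 100.0):
        print(prior_sd, log_integrated_likelihood(y, 0.0, prior_sd))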


Integrated likelihood difficulties
This approach to model comparisons is not restricted to two models: it can handle any number of competing models in the same way. However, it has well-known difficulties:
§ Improper priors cannot be used, as the integration leaves an arbitrary constant in the integrated likelihood.
§ Proper priors are informative, depending on prior parameters.
§ Then the integrated likelihood L̄ depends explicitly on the prior parameters -
§ a change in the value of the prior parameter will change the value of the integrated likelihood.



Bayes analysis of mixtures
Bayesian and ML analyses are greatly simplified by the introduction of a set of latent Bernoulli variables {Z_ik} for membership of observation i in component k. This allows the complete-data representation (with the Z counterfactually observed)

    f(y_i, {Z_ik}) = Π_{k=1}^{K} f(y_i | μ_k, σ_k)^{Z_ik} · π_k^{Z_ik}.

We can then write the complete-data likelihood as

    L(Z, π, μ, σ) = Π_{i=1}^{n} Π_{k=1}^{K} f(y_i | μ_k, σ_k)^{Z_ik} · π_k^{Z_ik}.

This allows simple conditional distributions in the MCMC algorithm, as in the EM algorithm.
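A minimal Gibbs-sampling sketch of this data augmentation for the K-component normal mixture, assuming conjugate priors (Dirichlet on the proportions, normal on the means, inverse-gamma on the variances); the prior settings, initial values and function name are assumptions of the sketch, not the samplers used in the cited analyses.

    import numpy as np

    def gibbs_normal_mixture(y, K, n_sweeps=5000, seed=0,
                             alpha=1.0, m0=None, s0=None, a0=2.0, b0=1.0):
        # Data-augmentation Gibbs sampler for a K-component normal mixture.
        # (Label switching across sweeps is ignored in this sketch.)
        rng = np.random.default_rng(seed)
        y = np.asarray(y, dtype=float)
        n = len(y)
        m0 = y.mean() if m0 is None else m0              # prior mean for each mu_k
        s0 = 2.0 * y.std() if s0 is None else s0         # prior sd for each mu_k
        pi = np.full(K, 1.0 / K)
        mu = np.quantile(y, np.linspace(0.1, 0.9, K))
        sigma2 = np.full(K, y.var())
        draws = []
        for _ in range(n_sweeps):
            # Z_i | rest: categorical, probability proportional to pi_k f(y_i | mu_k, sigma_k)
            logp = (np.log(pi) - 0.5 * np.log(2 * np.pi * sigma2)
                    - 0.5 * (y[:, None] - mu) ** 2 / sigma2)
            p = np.exp(logp - logp.max(axis=1, keepdims=True))
            p /= p.sum(axis=1, keepdims=True)
            z = np.array([rng.choice(K, p=p[i]) for i in range(n)])
            nk = np.bincount(z, minlength=K)
            # pi | Z: Dirichlet(alpha + n_k)
            pi = rng.dirichlet(alpha + nk)
            for k in range(K):
                yk = y[z == k]
                # mu_k | rest: normal, combining the N(m0, s0^2) prior with the component data
                prec = 1.0 / s0**2 + nk[k] / sigma2[k]
                mean = (m0 / s0**2 + yk.sum() / sigma2[k]) / prec
                mu[k] = rng.normal(mean, 1.0 / np.sqrt(prec))
                # sigma_k^2 | rest: inverse-gamma(a0 + n_k/2, b0 + sum of squared deviations/2)
                ssq = ((yk - mu[k]) ** 2).sum()
                sigma2[k] = 1.0 / rng.gamma(a0 + nk[k] / 2.0, 1.0 / (b0 + ssq / 2.0))
            draws.append((pi.copy(), mu.copy(), np.sqrt(sigma2)))
        return draws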



Posterior model probabilities for the galaxy data
§ All the Bayes analyses used some form of Data Augmentation or Markov chain Monte Carlo analysis, with updating of the successive conditional distributions of the set of parameters and the set of latent component membership variables, given the other set and the data y.
§ Most of the analyses took K initially as fixed, obtained an integrated likelihood over the other parameters for each K, depending on the settings of the prior parameters, and used Bayes's theorem to obtain the posterior probabilities of each value of K.



Bayes analysis
More complex analyses (Richardson and Green 1997) used Reversible Jump MCMC (RJMCMC), in which K is included directly in the parameter space; the parameter space changes dimension as jumps are made across different values of K. The choice of prior distributions (including for K) varied among the Bayes analyses of the galaxy data by

§ Escobar and West (1995)
§ Carlin and Chib (1995)
§ Phillips and Smith (1996)
§ Roeder and Wasserman (1997)
§ Richardson and Green (1997) and
§ Nobile (2004).



Prior distributions for K

[Table: prior probabilities on the number of components K (K = 1, 2, 3, ...) under each analysis - EW, CC, PS, RW, RG and N; for example, RW places a uniform prior of .10 on K = 1-10 and RG a uniform .03 on each K in its range.]



Posterior distributions for K

[Table: posterior probabilities on K (K = 3-13) under each analysis - EW1, EW2, CC1, CC2, PS, RW, RG and N; for example, RW places .999 on K = 3, CC1 places .64 on K = 3 and .36 on K = 4, and CC2 places .996 on K = 4.]



Conclusions?
Most posteriors were diffuse, with modes at 6 or 7 components. Carlin and Chib found 3 or 4 depending on their priors. Roeder and Wasserman found 3 with probability almost 1. So how many components are there? "Explanation: the results are different because of the different priors." This is not the solution: it is the problem.



Solution: posterior likelihoods/deviances
We give the general approach, originally due to Dempster. The model likelihoods are uncertain, because of our uncertainty about the parameters in these models. The parameter uncertainty is expressed through the posterior distributions of each model's parameters θ_k, given the data and priors. The model k likelihood L_k(θ_k) is a functional - a function of both θ_k and the observed data - so we map the posterior distribution of θ_k into that of L_k(θ_k). This is very simply done by simulation, making random draws from the posteriors and substituting them in the likelihoods.



Non-informative priors and MCMC for finite mixtures
For each K = 1, 2, ... we use a diffuse Dirichlet prior on the component proportions π_k and diffuse conjugate priors on the means μ_k and inverse variances 1/σ_k², for k = 1, 2, ..., K. For computational MCMC analysis these have to be proper, so slightly informative.

§ The MCMC sampler is run till convergence of the joint posterior distribution of the parameter set for each K.
§ Then we sample M = 10,000 values θ_k^[m] for each component from this posterior distribution, and
§ compute the K-component mixture likelihood L_K^[m] = L_K(θ_1^[m], ..., θ_K^[m]) for each parameter set.

This study was done for the galaxy data by Celeux et al (2006), in an evaluation of various rules for penalizing the posterior mean deviance in the DIC of Spiegelhalter et al (2002).
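A sketch of the likelihood computation for the retained draws, assuming posterior draws (π, μ, σ) for a given K are available (for example from a data-augmentation sampler such as the earlier sketch); the helper name is an assumption.

    import numpy as np
    from scipy.special import logsumexp

    def mixture_loglik(y, pi, mu, sigma):
        # log L_K = sum_i log sum_k pi_k N(y_i | mu_k, sigma_k^2), computed stably
        y = np.asarray(y, dtype=float)
        logcomp = (np.log(pi) - 0.5 * np.log(2 * np.pi * sigma**2)
                   - 0.5 * (y[:, None] - mu) ** 2 / sigma**2)      # n x K
        return logsumexp(logcomp, axis=1).sum()

    # One value per retained posterior draw:
    # logL = np.array([mixture_loglik(y, pi_m, mu_m, sigma_m) for pi_m, mu_m, sigma_m in draws])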


Asymptotics for likelihoods and deviances
We generally work with posterior deviances rather than posterior likelihoods - their asymptotics are much better behaved.

For regular models f(y | θ) with flat priors, giving an MLE θ̂ internal to the parameter space, the second-order Taylor expansion of the deviance D(θ) = -2 log L(θ) = -2ℓ(θ) about θ̂ gives:

    -2ℓ(θ) ≈ -2ℓ(θ̂) - 2(θ - θ̂)' ℓ'(θ̂) - (θ - θ̂)' ℓ''(θ̂)(θ - θ̂)
            = -2ℓ(θ̂) + (θ - θ̂)' I(θ̂)(θ - θ̂)          (quadratic log-likelihood),

since ℓ'(θ̂) = 0 at the MLE and I(θ̂) = -ℓ''(θ̂); hence

    L(θ) ≈ L(θ̂) · exp[-(θ - θ̂)' I(θ̂)(θ - θ̂)/2]        (normal likelihood),
    π(θ | y) ≈ c · exp[-(θ - θ̂)' I(θ̂)(θ - θ̂)/2]        (normal posterior).

The quadratic form (θ - θ̂)' I(θ̂)(θ - θ̂) is (asymptotically) a pivotal (a function of data and parameters) which has (asymptotically) a known (χ²_p) distribution, Bayesian or frequentist.


Asymptotic distributions
So asymptotically, given the data y and a flat prior on θ, we have the posterior distributions:

    θ ~ N(θ̂, I(θ̂)^(-1)),
    (θ - θ̂)' I(θ̂)(θ - θ̂) ~ χ²_p,
    D(θ) ~ D(θ̂) + χ²_p,
    L(θ) ~ L(θ̂) · exp(-χ²_p / 2).

§ The deviance D(θ) has a shifted χ²_p distribution, shifted by the frequentist deviance D(θ̂), where p is the dimension of θ.
§ The likelihood L(θ) has a scaled exp(-χ²_p/2) distribution.

With even moderate data sets, likelihoods become extremely small and cause underflow in computations - we evaluate deviances instead.
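A small simulation check of the shifted chi-squared result in the simplest regular case (a normal mean with known variance and a flat prior, where it holds exactly); the settings are assumed for illustration.

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(3)
    n, sigma = 100, 1.0
    y = rng.normal(2.0, sigma, size=n)
    ybar = y.mean()                                    # MLE of the mean

    def deviance(theta):
        return ((y - theta) ** 2).sum() / sigma**2 + n * np.log(2 * np.pi * sigma**2)

    theta_draws = rng.normal(ybar, sigma / np.sqrt(n), size=20_000)   # posterior under a flat prior
    excess = np.array([deviance(t) for t in theta_draws]) - deviance(ybar)

    # The excess deviance D(theta) - D(theta_hat) should behave as chi-squared with p = 1
    print(excess.mean(), chi2(df=1).mean())            # both close to 1
    print(np.quantile(excess, 0.95), chi2(df=1).ppf(0.95))   # both close to 3.84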


Posterior deviances
The frequentist deviance is an origin parameter for the posterior deviance distribution: no random draw can give a smaller deviance value than the frequentist deviance. For the comparison of models with different numbers of components, we compute the sets of posterior deviance draws for K = 1, 2, ..., 7:

    D_K^[m] = -2 log L_K^[m].

The M values for each K define the posterior distributions: we order them to give the empirical cdfs.
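A sketch of this step, assuming logL_draws is a dict mapping each K to its array of M posterior log-likelihood values (as produced by the earlier sketch):

    import numpy as np

    def deviance_cdfs(logL_draws):
        # Ordered posterior deviance draws and empirical cdf heights for each K
        cdfs = {}
        for K, logL in logL_draws.items():
            dev = np.sort(-2.0 * np.asarray(logL))               # posterior deviance draws, ordered
            probs = np.arange(1, len(dev) + 1) / len(dev)        # empirical cdf heights
            cdfs[K] = (dev, probs)
        return cdfs

    # A model whose deviance cdf lies to the left of another's has the stochastically
    # smaller deviance, i.e. the better-supported model - cf. the figure on the next slide.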



Deviance distributions
[Figure: empirical cdfs of the posterior deviance draws for K = 1-7; deviance (420-500) on the x-axis, cdf on the y-axis, curves labelled by K.]



Interpretation

§ The deviance distribution for K = 1 is far to the right of the others - the 1-component mixture (single normal distribution) is a very bad fit relative to the others.
§ Its deviance distribution is stochastically larger than all the others.
§ The fit improves substantially from K = 1 to 2.
§ The improvement continues from K = 2 to 3.
§ The distribution for K = 3 is the stochastically smallest -
§ as the number of components increases beyond 3, the deviance distributions move steadily to the right, to larger deviance values (lower likelihoods).
§ They also become more diffuse, with decreasing slope.

With more parameters there is less information about each parameter, and therefore about the likelihood and deviance, so their posterior distributions become more diffuse.


Consistency with data

The conclusions from the deviance distributions are consistent with
§ the appearance of the data empirical cdfs;
§ the results of Roeder and Wasserman;
§ the DIC results of Celeux et al (whatever the choice of function of the deviance draws, or the penalty, DIC always chose 3 components for the galaxy data).



Graphs

[Figure: four panels, each plotting cdf vs. velocity (10-35) for the galaxy data.]



Does this prove anything?
Bayesian methods for determining the number of mixture components have not previously been supported by simulation studies. Aitkin, Vu and Francis (2015) report studies based on galaxy-like data sets, and compare model choice by posterior deviances with choice by DIC (which has a penalty on the mean deviance for each model - Spiegelhalter et al 2002). The deviance distribution comparison was uniformly more accurate than the DIC comparison, especially for large numbers of components. The 5- and 7-component mixtures were particularly difficult to identify.



Simulations
              K =    1     2     3     4     5     6     7
  n =  82    DIC   100    85    51     3     0     2     0
             Dev   100    98    99     9    18     9     1
  n = 164    DIC   100   100    98    11     0     0     0
             Dev    99   100    99    67     9    10    15
  n = 328    DIC   100   100   100    30     0    56     4
             Dev   100   100    99    99    37   100     3
  n = 656    DIC   100   100   100    17     1    78     4
             Dev   100    97    99    99    89   100    32

Percentages of correct model identification in 100 data sets of size n, using DIC and posterior deviance.



References for this work
Postman, M.J., Huchra, J.P. and Geller, M.J. (1986) Probes of large-scale structures in the Corona Borealis region. The Astronomical Journal 92, 1238-1247.

Roeder, K. (1990) Density estimation with confidence sets exemplified by superclusters and voids in the galaxies. JASA 85, 617-624.

Carlin, B.P. and Chib, S. (1995) Bayesian model choice via Markov chain Monte Carlo methods. JRSSB 57, 473-484.

Escobar, M.D. and West, M. (1995) Bayesian density estimation and inference using mixtures. JASA 90, 577-588.

Phillips, D.B. and Smith, A.F.M. (1996) Bayesian model comparison via jump diffusions. In Markov Chain Monte Carlo in Practice, eds. W.R. Gilks, S. Richardson and D.J. Spiegelhalter. London: Chapman and Hall.


Richardson, S. and Green, P.J. (1997) On Bayesian analysis of mixtures with an unknown number of components (with discussion). JRSSB 59, 731-792.

Roeder, K. and Wasserman, L. (1997) Practical Bayesian density estimation using mixtures of normals. JASA 92, 894-902.

Aitkin, M. (2001) Likelihood and Bayesian analysis of mixtures. Statistical Modelling 1, 287-304.

Spiegelhalter, D.J., Best, N.G., Carlin, B.P. and van der Linde, A. (2002) Bayesian measures of model complexity and fit (with discussion). JRSSB 64, 583-639.

Nobile, A. (2004) On the posterior distribution of the number of components in a finite mixture. Annals of Statistics 32, 2044-2073.


Celeux, G., Forbes, F., Robert, C.P. and Titterington, D.M. (2006) Deviance information criteria for missing data models. Bayesian Analysis 1, 651-674.

Aitkin, M. (2010) Statistical Inference: an Integrated Bayesian/Likelihood Approach. CRC Press, Boca Raton.

Aitkin, M. (2011) How many components in a finite mixture? pp. 277-292 in Mixtures: Estimation and Applications, eds. K.L. Mengersen, C.P. Robert and D.M. Titterington. Wiley, Chichester.

Gelman, A., Robert, C.P. and Rousseau, J. (2013) Inherent difficulties of non-Bayesian likelihood inference, as revealed by an examination of a recent book by Aitkin. Statistics and Risk Modeling 30, 105-120.

Aitkin, M. (2013) Comments on the review of Statistical Inference. Statistics and Risk Modeling 30, 121-132.


Aitkin, M., Vu, D. and Francis, B. (2015) A new Bayesian approach for determining the number of components in a finite mixture. Metron, revision under review.
