Документ взят из кэша поисковой машины. Адрес оригинального документа : http://hea-www.harvard.edu/AstroStat/Stat310_1011/pb+is_20110712.pdf
Дата изменения: Tue Jul 12 22:19:13 2011
Дата индексирования: Tue Oct 2 05:15:07 2012
Кодировка:
Поисковые слова: http astrokuban.info astrokuban

Bayesian Estimation of log N - log S
Paul Baines, Irina Udaltsova
Department of Statistics University of California, Davis

July 12th, 2011

Introduction
What is `log N - log S '? Cumulative number of sources detectable at a given sensitivity Defined as: N (> S ) =
i

I

{Si >S }

i.e., the number of sources brighter than a threshold. Considering the distribution of sources, this is related to the survival function i.e., N(>S)=Nв(1-F(S)) `log N - log S ' refers to the relationship beween (or plot of ) log10 N (> S ) and log10 S . Why do we care? Constrains evolutionary models, dark matter distribution etc.

Inferential Process

To infer the log N - log S relationship there are a few steps: 1. Collect raw data images 2. Run a detection algorithm to extract `sources' 3. Produce a dataset describing the `sources' (and uncertainty about them) 4. Infer the log N - log S distribution from this dataset Our analysis is focused on the final step accounting for some (but not all) of the detector-induced uncertainties. . .

Inferential Process

To infer the log N - log S relationship there are a few steps: 1. Collect raw data images 2. Run a detection algorithm to extract `sources' 3. Produce a dataset describing the `sources' (and uncertainty about them) 4. Infer the log N - log S distribution from this dataset Our analysis is focused on the final step accounting for some (but not all) of the detector-induced uncertainties. . . Adding further layers to the analysis to start with raw images is possible but that is for a later time. . .

The Data

The data is essentially just a list of photon counts with some extra information about the background and detector properties.
Src_ID 2 3 7 18 19 Count 270 117 33 7 12 Src_area 1720 96 396 128 604 Bkg 3.16 0.19 0.61 0.22 0.96 Off_axis 4.98 5.72 6.17 6.34 4.51 Eff_area 734813.1074 670916.3154 670916.3154 319483.9597 670916.3154

Problems

The photon counts do not directly correspond to the source fluxes: 1. Background contamination 2. Natural (Poisson) variability 3. Detector efficiencies (PSF etc.) Not all sources in the population will be detected: 1. Low intensity sources 2. Close to the limit: background, natural variability and detection probabilities are important.

Key Ideas

Our goals: Provide a complete analysis, accounting for all detector effects (especially those leading to unobserved sources) Allow for the incorporation of prior information Investigate parametric forms (testing) for log N - log S (e.g., broken power-laws) Investigate the data-prior inferential limit (e.g., for which Smin does the information come primarily from the model and not the data)

Missing Data Overview

There are many potential causes of missing data in astronomical data: Background contamination (e.g., total=source+background) Low-count sources (below detection threshold) Detector schedules (source not within detector range) Foreground contamination (objects between the source and detector) etc. Some are more problematic than others. . .

Missing Data Mechanisms

In the nicest possible case, if the particular data that is missing does not depend on any unobserved values then we can essentially ignore the missing data. In this context, whether a source is observed is a function of its source count (intensity) which is unobserved for unobserved sources. This missing data mechanism is non-ignorable, and needs to be carefully accounted for in the analysis.

The Model

N NegBinom ( , ) , Si |Smin , Pareto (, Smin ) , Gamma(a, b ), Yisrc |Si , Li , Ei Pois ((Si , Li , Ei )) , Yibkg |Li , Ei Pois (k (Li , Ei )) , Ii Bernoulli (g (Si , Li , Ei )) .
iid iid iid

i = 1, . . . , N ,

The Model
It turns out that in many contexts there is strong theory that expects the log N - log S to obey a Power law:
N

N (> S ) =
i =1

I{S

i

>S }

S

-

, S > Smin

Taking the logarithm gives the linear log(N ) - log(S ) relationship. The power-law relationship defines the marginal survival function of the population, and the marginal distribution of flux can be seen to be a Pareto distribution: Si |Smin , Pareto (, Smin ) ,
iid

i = 1, . . . , N .

The analyst must specify Smin , a threshold above which we seek to estimate .

The Model cont. . .

The total number of sources (unobserved and observed), denoted by N , is modeled as: N NegBinom ( , ) , We observe photon counts contaminated with background noise and other detector effects, Yitot = Yisrc + Yibkg , Yisrc |Si , Li , Ei Pois ((Si , Li , Ei )) ,
iid

Yibkg |Li , Ei Pois (k (Li , Ei )) .

iid

The functions and k represent the intensity of source and background, respectively, for a given flux Si , location Li and effective exposure time Ei .

The Model cont. . .

The probability of a source being detected, g (Si , Li , Ei ), is determined by the detector sensitivity, background and detection method. The marginal detection probability as a function of is defined as: ( ) = g (Si , Li , Ei ) · p (Si |) · p (Li , Ei ) dSi dEi dLi .

The prior on is assumed to be: Gamma(a, b ).

The Model cont. . .

The posterior distribution, marginalizing over the unobserved fluxes, can be shown to be:
src tot p N , , Sobs Yobs |n, Yobs

src src tot tot src tot p N , , Sobs , Smis , Yobs , Ymis , Ymis |n, Yobs dYmis dYmis dSmis

p (N ) · p () · p (n|N , ) · p (Sobs |n, )
tot src tot · p Yobs |n, Sobs · p Yobs |n, Yobs , Sobs .

The Model

N NegBinom ( , ) , Si |Smin , Pareto (, Smin ) , Gamma(a, b ), Yisrc |Si , Li , Ei Pois ((Si , Li , Ei )) , Yibkg |Li , Ei Pois (k (Li , Ei )) , Ii Bernoulli (g (Si , Li , Ei )) .
iid iid iid

i = 1, . . . , N ,

Computational Details
The Gibbs sampler consists of four steps:
src tot Yobs |n, Yobs , Sobs , tot src Sobs |n, Yobs , Yobs , ,

[|n, N , Sobs ] ,

[N |n, ] .

Sample the observed photon counts:
src tot tot Yobs ,i |n, Yobs ,i , Sobs ,i Binom Yobs ,i ,

(Sobs ,i , Lobs ,i , Eobs ,i ) (Sobs ,i , Lobs ,i , Eobs ,i ) + k

,

for i = 1, . . . , n. Sample the fluxes Sobs ,i , i = 1, . . . , n (MH using a t -proposal). Sample the power-law slope (MH using a t -proposal). Compute the posterior distribution for the total number of sources, N , using numerical integration: p (N |n, ) (N + ) (N - n + 1) 1 - () +1
N

I{N n}

Note: The (prior) marginal detection probability () is pre-computed via the numerical integration.

Computational Notes

Some important things to note: Computation is fast (secs), and insensitive to the number of missing sources The fluxes of the missing sources are never imputed (only the number of missing sources) Most steps are not in closed form changing (some) assumptions has little computation impact Broken power law (or other forms) can be implemented by changing only one of the steps Fluxes of missing sources can (optionally) be imputed to produce posterior draws of a `corrected' log N - log S

RED = Missing sources, BLACK = Observed sources.

Future Work

We currently do not include: 1. False sources (allowing that `observed' sources might actually be background/artificial) 2. Spatially varying detection probabilities (straightforward, needs implementing)

Simulated Example
Assume parameter setting: N NegBinom(, ), where = 200 = shape, = 2 = scale Gamma(a, b ), where a = 20 = shape, b = 1/20 = scale Si | Pareto (, Smin ), where Smin = 10-13 Yisrc |Si , Li , Ei Pois ((Si , Li , Ei )) Yibkg |Si , Li , Ei Pois (k (Li , Ei ))
· = SiEi , where effective area Ei (1, 000, 100, 000), and the energy per photon = 1.6 в 10-9

ki = z · Ei , where the rate of background photon count intensity per million seconds z = 0.0005 niter = 21, 000, Burnin = 1000

Simulated Example cont. . .
Detection probability: g (, k ) = 1.0 - a0 · ( + k )a1 · e a2 ·( a0 = 11.12, a1 = -0.83, a2 = -0.43 Marginal detection probability:
+k )

, where

Empirical results of MCMC sampler
The actual coverage of nominal percentiles for all parameters for simulated data, for M = 200 validation datasets: Coverage Percentile N all Sobs 50% 0.55 0.50 0.51 80% 0.83 0.82 0.81 90% 0.90 0.92 0.90 95% 0.96 0.97 0.95 98% 0.98 0.99 0.98 99% 0.99 0.99 0.99 99.9% 1.00 1.00 1.00

Mean squared error of different estimators for N and for simulated data. MSE Level of Effective Area Low Medium High N Median 215.96 121.26 68.23 Mean 291.82 168.91 95.36 Median 0.05439 0.05558 0.04578 Mean 0.07481 0.07407 0.05987

MCMC Draws

Posterior Correlations

Posterior estimates for the power-law slope and the total number of sources.

MCMC Draws

Simulated log N - log S

Uncertainties in source fluxes and a display of the power-law relationship. Posterior draws (gray), truth (blue).

The Data

Src_ID 2 3 7 18 19

Count 270 117 33 7 12

Src_area 1720 96 396 128 604

Bkg 3.16 0.19 0.61 0.22 0.96

Off_axis 4.98 5.72 6.17 6.34 4.51

Eff_area 734813.1074 670916.3154 670916.3154 319483.9597 670916.3154

Posteriors of parameters N and

^ Zezas et al. (2003) estimated a power-law slope of = 0.45. The posterior median from our analysis is = 0.38, with the 95% posterior interval consistent with competing estimators.

Note: This a posterior plot for the observed sources only (the `corrected' plot would be more useful. . . ) Evidence of a possible break in the power-law in the observed log N - log S . Given the possible non-linearity of the log(N ) - log(S ), more work is needed to allow for a broken power-law or more general parametric forms.

References

A. Zezas et al. (2004) Chandra survey of the 'Bar' region of the SMC Revista Mexicana de Astronoma y Astrofsica (Serie de Conferencias) Vol. 20. IAU Colloquium 194, pp. 205-205. R.J.A. Little, D.B. Rubin. (2002) Statistical analysis with missing data, Wiley.