Statistical Methods in Data Analysis
F. James, CERN
Moscow, June 2002
Statistics helps to solve these problems:

- Point Estimation: Find the "best" value for a parameter.
- Interval Estimation: Find a range within which the true value should lie, with a given confidence.
- Hypothesis Testing: Compare two hypotheses. Find which one is better supported by the data.
- Goodness-of-Fit Testing: Find how well one hypothesis is supported by the data.
- Decision Making: Make the best decision, based on data.

Make sure you know which of these problems you are solving; the answer depends on the question (especially for Frequentists).
Probability

All statistical methods are based on calculations of probability. Mathematical probability is an abstract (undefined) concept which obeys certain rules (Kolmogorov's axioms). We will need a specific operational definition. There are basically two such definitions we will use:

- Frequentist probability is defined as the limiting frequency of favourable outcomes in a large number of identical experiments.
- Bayesian probability is defined as the degree of belief in a favourable outcome of a single experiment.
Frequentist Probability

This probability of an event A is defined as the number of times A occurs, divided by the total number of trials, in the limit of a large number of trials:

    P(A) = \lim_{N \to \infty} \frac{N(A)}{N}

where A occurs N(A) times in N trials.
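As an illustration of this limiting-frequency definition (a sketch added here, not part of the original slides), the probability of a "favourable" outcome can be estimated from simulated trials; the value p_true = 1/6 and the trial counts are arbitrary choices.

```python
# Minimal sketch: a frequentist probability estimated as the observed
# frequency of favourable outcomes in N simulated identical trials.
import random

def frequency_estimate(n_trials, p_true=1.0 / 6.0):
    """Fraction of favourable outcomes, N(A)/N, in n_trials Bernoulli trials."""
    n_a = sum(random.random() < p_true for _ in range(n_trials))
    return n_a / n_trials

for n in (100, 10_000, 1_000_000):
    print(n, frequency_estimate(n))   # approaches p_true = 1/6 as N grows
```

As N grows, the estimate settles toward the limiting frequency, which is the sense in which the definition is operational.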
Frequentist probability is used in most scientific work, because it is objective (independent of the observer). It can (in principle) be determined to any desired accuracy. In practice, one seldom uses the definition to evaluate P [like the E-field]. It is the probability of Quantum Mechanics.

HOWEVER, a serious limitation: it can only be applied to repeatable phenomena.
Bayesian Probability

Bayesian probability is more general, since it can apply even to unrepeatable phenomena (for example, the probability that it will rain tomorrow).

- It depends not only on the phenomenon itself, but also on the state of knowledge and beliefs of the observer.
- Bayesian P(A) will in general change with time. The probability that it will rain at 12:00 on Friday will change as we get closer to that date, until it becomes either zero or one on Friday at 12:00.
- If initial conditions cannot be repeated exactly, the Bayesian probability cannot be verified by any long-term frequency.
- One operational definition is based on "the coherent bet" (de Finetti). There are other definitions. Measurement is hard.
Comparison of Frequentist and Bayesian Probabilities

Probability of drawing a red ball from a mathematical urn. The urn contains N(R) red balls and N(W) white balls; each ball is replaced after it is drawn.

    Given conditions           Frequentist P(R)       Bayesian P(R)
    N(R) + N(W) > 0            unknown                P(R) = 0.5           (1)
    one draw, ball was red     P(R) ≠ 0               0.5 < P(R) < 1       (2)
    100 draws, 40 red          P(R) = 0.40 ± 0.05     0.4 < P(R) < 0.5     (2)
    N(R) = N(W) = 1            P(R) = 0.5             P(R) = 0.5

(1) By Pascal's principle of Insufficient Reason; but what if there are two kinds of red balls?
(2) An exact value will be given in this range.
Fundamental Underlying Concepts

The Hypothesis is what we want to test, verify, measure, decide. Examples:

H: The patient has influenza.
H: The mass of the proton is m_p (unspecified).
H: The Universe is expanding at a constant rate.

A Random Variable is data which can take on different values, unpredictable except in probability, even assuming the hypothesis. P(data|hypothesis) is assumed known, provided any unknowns in the hypothesis are given some assumed values.

Example: for a Poisson process,

    P(N|\mu) = \frac{e^{-\mu}\,\mu^N}{N!}
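A quick numerical check of the Poisson formula (a sketch added here, not from the slides); the mean mu = 3.2 is an arbitrary choice, and scipy's built-in pmf is used only for comparison.

```python
# Minimal sketch: P(N|mu) = exp(-mu) * mu**N / N! evaluated directly
# and with scipy.stats.poisson for comparison.
import math
from scipy.stats import poisson

def poisson_prob(n, mu):
    return math.exp(-mu) * mu**n / math.factorial(n)

mu = 3.2
for n in range(6):
    print(n, poisson_prob(n, mu), poisson.pmf(n, mu))   # the two columns agree
```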
If the data are continuous, then P(data|hypothesis) is an example of a probability density function (pdf). Gaussian data would have the pdf:

    P(x|\mu,\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2 / 2\sigma^2}

    \int_a^b P(x|\mu,\sigma)\,dx = P(a < x < b), \qquad \int_{-\infty}^{\infty} P(x|\mu,\sigma)\,dx = 1

The Likelihood Function L is P(data|hypothesis) evaluated at the observed data, and considered as a function of the (unknowns in the) hypothesis.
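To make the distinction concrete, here is a small sketch (added, not from the slides) that evaluates L(µ) for a few hypothetical Gaussian observations with an assumed known σ; the data values are invented purely for illustration.

```python
# Minimal sketch: the likelihood L(mu) for observed Gaussian data,
# treated as a function of the unknown parameter mu.
import numpy as np
from scipy.stats import norm

data = np.array([4.8, 5.3, 5.1, 4.6])   # hypothetical observations
sigma = 0.5                              # assumed known resolution

def likelihood(mu):
    # product over the observed points of the Gaussian pdf at fixed data
    return np.prod(norm.pdf(data, loc=mu, scale=sigma))

for mu in (4.5, 5.0, 5.5):
    print(mu, likelihood(mu))   # largest of the three at mu = 5.0, nearest the sample mean (4.95)
```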
Under a change of variables µ → f(µ) or x → f(x), the functions L and P transform differently:

- The values of L(µ) are invariant, but the integral is not.
- The values of P(x) are not invariant, but the integral between corresponding points is invariant (illustrated in the sketch below).

This suggests the Golden Rule:

- Never integrate under a Likelihood Function.
- Never use the values of a pdf; use only its integral.
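The sketch below (added, not from the slides) illustrates the pdf transformation property numerically, taking x distributed as Exponential(1) and the change of variable y = x² as an arbitrary example: the pdf values at corresponding points differ, while the integrals between corresponding limits agree.

```python
# Minimal sketch: pdf values change under a change of variable,
# but integrals between corresponding points do not.
import numpy as np
from scipy.stats import expon
from scipy.integrate import quad

def pdf_x(x):
    # pdf of x ~ Exponential(1): p(x) = exp(-x) for x > 0
    return expon.pdf(x)

def pdf_y(y):
    # pdf of y = x**2, including the Jacobian dx/dy = 1/(2*sqrt(y))
    return expon.pdf(np.sqrt(y)) / (2.0 * np.sqrt(y))

x0 = 1.5
print(pdf_x(x0), pdf_y(x0**2))        # pdf values differ at corresponding points

a, b = 0.5, 2.0
print(quad(pdf_x, a, b)[0],           # P(a < x < b)
      quad(pdf_y, a**2, b**2)[0])     # the same probability in the y variable
```

The likelihood behaves the opposite way: its values at corresponding parameter points are unchanged, but its integral over the parameter has no invariant meaning, which is what the Golden Rule warns against.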
A Nuisance parameter is an unknown whose value does not interest us, but is unfortunately necessary for the calculation of P(data|hypothesis).
Bayes' Theorem

Conditional probability: P(A|B) means the probability that A is true, given that B is true. For example, P(symptom|illness), such as P(headache|influenza).

Bayes' Theorem says that the probability of both A and B being true simultaneously is:

    P(A,B) = P(A|B)\,P(B) = P(B|A)\,P(A)

which implies:

    P(B|A) = \frac{P(A|B)\,P(B)}{P(A)}

This can also be written:

    P(B|A) = \frac{P(A|B)\,P(B)}{P(A|B)\,P(B) + P(A|\mathrm{not}\,B)\,P(\mathrm{not}\,B)}
Example of Bayes' Theorem

Suppose we have a test for influenza, such that if a person has flu, the probability of a positive result is 90%:

    P(T+|flu) = 0.9      [10% false negatives]

and 1% if he doesn't have the flu:

    P(T+|not flu) = 0.01      [1% false positives]

Now patient P tests positive. What is the probability that he has the flu? The answer by Bayes' Theorem:

    P(\mathrm{flu}\mid T^+) = \frac{P(T^+|\mathrm{flu})\,P(\mathrm{flu})}{P(T^+|\mathrm{flu})\,P(\mathrm{flu}) + P(T^+|\mathrm{not\ flu})\,P(\mathrm{not\ flu})}
Bayes Prior Probability

So the answer depends on the Prior Probability of the person having flu, that is, the probability of flu in the general population. If we are in the winter, perhaps P(flu) is 1%. On the other hand, we may be dealing with a very rare disease like CJD: P(CJD) = 10^-6.

If we had a test which gave 10% false negatives and 1% false positives in the case of these two diseases, we would get the following probabilities:

    illness            flu (P = 1%)    CJD (P = 10^-6)
    P(illness|T+)      0.48            10^-4
    P(illness|T-)      0.001           10^-7
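The table can be reproduced with a few lines of Python (a sketch added here, not from the slides); the function name `posterior` and its argument names are illustrative only.

```python
# Minimal sketch: Bayes' Theorem for the screening test,
# reproducing the numbers in the table above.
def posterior(prior, p_pos_given_ill=0.9, p_pos_given_well=0.01):
    """Return (P(illness|T+), P(illness|T-)) for a given prior P(illness)."""
    p_well = 1.0 - prior
    p_pos = p_pos_given_ill * prior + p_pos_given_well * p_well
    p_neg = (1 - p_pos_given_ill) * prior + (1 - p_pos_given_well) * p_well
    return (p_pos_given_ill * prior / p_pos,
            (1 - p_pos_given_ill) * prior / p_neg)

print(posterior(0.01))    # flu:  about 0.48 and 0.001
print(posterior(1e-6))    # CJD:  about 1e-4 and 1e-7
```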
Point Estimation - Frequentist

An Estimator E is a function of the data which will be used to estimate (measure) the unknown parameter µ. It can be thought of as a method or an algorithm which is applied to the data to produce the estimate µ̂ = E(data).

The job is to find an estimator E which has certain desirable properties (given below). These properties can be seen from the sampling distribution of E, the expected distribution of µ̂ obtained when E is applied repeatedly to random data for which the true value of µ is assumed. This is known because P(data|µ) is known and the estimate µ̂ is a function of the data.
The desirable properties of E are:

- Consistency: E is consistent if the estimate µ̂ converges to the true value as the amount of data becomes very large:

      \lim_{N \to \infty} \hat\mu_N = \mu_T

- Unbiassedness: The bias of E is the expectation (over all random data) of E, minus the true value:

      B_N(\hat\mu) = E_N(\hat\mu) - \mu_T

  The subscript N is to indicate that the bias often decreases as the amount of data increases. When \lim_{N \to \infty} B_N(\hat\mu) = 0, E is said to be asymptotically unbiassed.
- Efficiency: The width of the distribution of estimates is generally measured by the variance of E:

      V(\hat\mu) = E_N\big[\hat\mu - E_N(\hat\mu)\big]^2

  and the efficiency of E is just:

      \mathrm{Eff}(\hat\mu) = \frac{V_{\min}}{V(\hat\mu)}

  where V_min is the smallest possible variance of any estimator. For an unbiassed estimator,

      V_{\min}(\hat\mu) = \frac{1}{E\!\left[-\dfrac{\partial^2 \ln L}{\partial \mu^2}\right]}

It turns out that under very general conditions, the most efficient estimator is that which maximizes the Likelihood L(µ). If there exists an E with variance V_min, it is the maximum likelihood (m.l.) estimator. [Cramér]
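As a concrete illustration (a sketch added here, using the same invented data as before), numerically maximizing the Gaussian log-likelihood in µ recovers the sample mean, which is the textbook m.l. estimator for a Gaussian with known σ.

```python
# Minimal sketch: the maximum likelihood estimate of mu for Gaussian data
# with known sigma, found by numerically minimizing -ln L(mu).
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

data = np.array([4.8, 5.3, 5.1, 4.6])   # hypothetical measurements
sigma = 0.5                              # assumed known

def neg_log_likelihood(mu):
    return -np.sum(norm.logpdf(data, loc=mu, scale=sigma))

fit = minimize_scalar(neg_log_likelihood, bounds=(0.0, 10.0), method="bounded")
print(fit.x, data.mean())                # both approximately 4.95
```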
Point Estimation - Bayesian

For parameter estimation, we can rewrite Bayes' Theorem:

    P(\mathrm{hyp}|\mathrm{data}) = \frac{P(\mathrm{data}|\mathrm{hyp})\,P(\mathrm{hyp})}{P(\mathrm{data})}

and if the hypothesis concerns the value of µ:

    P(\mu|\mathrm{data}) = \frac{P(\mathrm{data}|\mu)\,P(\mu)}{P(\mathrm{data})}

which is a probability density function in the unknown µ. The normalisation condition

    \int P(\mu|\mathrm{data})\,d\mu = 1

effectively determines P(data). Assigning names to the different factors, we get:

    Posterior pdf(µ) ∝ L(µ) × Prior pdf(µ)
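A minimal sketch of this factorization (added, not from the slides): the posterior for a single Poisson observation evaluated on a grid of µ values, with a flat prior chosen purely for illustration (its shortcomings are discussed just below).

```python
# Minimal sketch: Posterior(mu) proportional to L(mu) * Prior(mu),
# for one observed Poisson count N = 7 and a flat prior on a grid.
import numpy as np
from scipy.stats import poisson

n_obs = 7                                   # a single observed count (invented)
mu = np.linspace(0.01, 25.0, 2500)          # grid of hypothesis values
likelihood = poisson.pmf(n_obs, mu)         # L(mu) = P(N = n_obs | mu)
prior = np.ones_like(mu)                    # flat prior, for illustration only
posterior = likelihood * prior
posterior /= posterior.sum() * (mu[1] - mu[0])   # crude normalization on the grid

print(mu[np.argmax(posterior)])             # posterior maximum, close to mu = 7
```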
Then the Bayesian point estimate is usually taken as the value of µ that maximizes the Posterior probability density. If the Prior probability is taken as a uniform distribution, then the maximum of the Posterior will occur at the maximum of L(µ), which means that in practice the Bayesian point estimate is often the same as the Frequentist point estimate!

However, note that:

- The choice of a uniform Prior is not well justified in Bayesian theory.
- The choice of the maximum of the posterior is metric-dependent.
- These two dubious choices together cause the Bayesian method to give the same result as the Frequentist.
Interval Estimation - Bayesian

Here the goal is to find an interval which will contain the true value with a given probability, say 90%. Since the Posterior Probability distribution is known from Bayes' Theorem (see the previous slide), we have only to find an interval such that the integral under the Posterior pdf is equal to 0.9.

As this interval is not unique, the usual convention is to choose that interval containing the largest values of the posterior pdf.
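One simple way to realize this convention numerically (a sketch added here, continuing the gridded Poisson posterior used earlier) is to accumulate grid points in decreasing order of posterior density until 90% of the probability is collected; the endpoints of the accumulated region then approximate the interval.

```python
# Minimal sketch: a ~90% "largest posterior values" interval from a
# gridded, unimodal posterior, taking the highest-density points first.
import numpy as np
from scipy.stats import poisson

n_obs = 7
mu = np.linspace(0.01, 30.0, 3000)
post = poisson.pmf(n_obs, mu)
post /= post.sum()                        # treat grid values as probabilities

order = np.argsort(post)[::-1]            # grid points, highest posterior first
selected = order[np.cumsum(post[order]) <= 0.90]
print(mu[selected].min(), mu[selected].max())   # endpoints of the ~90% interval
```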
Note that this choice is metric-dependent. (Intervals in µ² would not be the squares of intervals in µ.) This could be avoided by taking central intervals, such that the integral of the pdf to the left of the interval is the same as that to the right. But then one could never get an upper limit. And of course the BIG problem is still the prior (discussed later).
Interval Estimation - Frequentist

The method is to find two functions of the data, F1(random data | true value) and F2(random data | true value), such that

    P(F_1 < \mathrm{true\ value} < F_2) = 0.9

Then the 90% interval is defined by F1(observed data) and F2(observed data).

If we could find such functions, this would assure the property known as coverage: if the experiment were repeated many times, and the data were treated using the functions F1 and F2 to define the interval, then the interval would contain the true value in 90% of the cases.

J. Neyman showed (1930) how to construct such functions in the most general case.
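For the simplest case, a single Gaussian measurement x with known σ, a standard choice is F1 = x - 1.645σ and F2 = x + 1.645σ; the Monte Carlo sketch below (added, with an arbitrary true value) checks the 90% coverage directly.

```python
# Minimal sketch: checking coverage by Monte Carlo for the 90% central
# interval x +- 1.645*sigma on one Gaussian measurement with known sigma.
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma = 3.7, 1.0                        # arbitrary true value
x = rng.normal(mu_true, sigma, 1_000_000)        # many repeated experiments
covered = (x - 1.645 * sigma < mu_true) & (mu_true < x + 1.645 * sigma)
print(covered.mean())                            # about 0.90, whatever mu_true is
```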
Hypothesis Testing - Frequentist

Compare two hypotheses to see which one better explains (predicts) the data: H0, the null hypothesis, and H1, the alternative hypothesis. We know P(data|H0) and P(data|H1).

If W is the space of all possible data, find a Critical Region (in which we reject H0) w ⊂ W such that

    P(\mathrm{data} \in w \mid H_0) = \alpha

is as small as possible, and at the same time

    P(\mathrm{data} \in W - w \mid H_1) = \beta

is also as small as possible.
α is the probability of rejecting H0 when it is true. This is the error of the first kind, or loss. 1 - α is the acceptance of the test. (Some books interchange the definitions of α and 1 - α.)

β is the probability of accepting H0 when H1 is true. This is the error of the second kind, or contamination. 1 - β is the power of the test.
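A minimal numerical illustration (added, with invented numbers): for H0: x ~ N(0,1), H1: x ~ N(2,1) and a one-sided critical region x > 1.5, α and β follow directly from the two cumulative distributions.

```python
# Minimal sketch: alpha and beta for a one-dimensional cut.
# H0: x ~ N(0,1), H1: x ~ N(2,1); the critical region is x > x_cut.
from scipy.stats import norm

x_cut = 1.5
alpha = norm.sf(x_cut, loc=0.0, scale=1.0)    # P(reject H0 | H0): error of the first kind
beta = norm.cdf(x_cut, loc=2.0, scale=1.0)    # P(accept H0 | H1): error of the second kind
print(alpha, 1 - alpha)                       # loss and acceptance
print(beta, 1 - beta)                         # contamination and power
```

Moving x_cut trades one error against the other, which is why both are asked to be "as small as possible" only jointly.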
Hypothesis Testing - Bayesian

Recall that according to Bayes' Theorem:

    P(\mathrm{hyp}|\mathrm{data}) = \frac{P(\mathrm{data}|\mathrm{hyp})\,P(\mathrm{hyp})}{P(\mathrm{data})}

The normalization factor P(data) can be determined for the case of parameter estimation, where all the possible values of the parameter are known, but in hypothesis testing it doesn't work, since we are only comparing two hypotheses and we cannot enumerate all possible hypotheses. However, we can still find the ratio of probabilities for two hypotheses, since the normalizations cancel:

    R = \frac{P(H_0|\mathrm{data})}{P(H_1|\mathrm{data})} = \frac{L(H_0)\,P(H_0)}{L(H_1)\,P(H_1)}
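For example (a sketch added here, with invented numbers), the ratio R for an observed Poisson count compared under two simple hypotheses with equal prior probabilities:

```python
# Minimal sketch: the posterior odds ratio R for an observed Poisson count
# under H0: mu = 3 and H1: mu = 8, with equal prior probabilities.
from scipy.stats import poisson

n_obs = 6
prior_h0, prior_h1 = 0.5, 0.5
R = (poisson.pmf(n_obs, 3.0) * prior_h0) / (poisson.pmf(n_obs, 8.0) * prior_h1)
print(R)        # R > 1 favours H0, R < 1 favours H1
```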
Goodness-of-Fit Testing (GoF)

Here we are testing only one hypothesis, H0. The alternative is "everything else", undefined.

The Frequentist method for GoF is the same as for hypothesis testing, except that now only H0 and α are known. We cannot know the power of the test, since we don't know what we are trying to exclude. We can only say that if the data fall in the critical region, they fail the test (incompatible with the hypothesis H0).

In practice, any GoF test consists of defining a test statistic (a function of the data) and a critical region (values of the test statistic which lead to rejection). If the critical region does not depend on the underlying distributions of the data, the test is called distribution-free.
Although there is theoretically no "best" GoF test, there is in practice one that is the most used statistical technique in history: Pearson's Chi-square Test, published by Karl Pearson in 1900. The test statistic is the sum of squares of deviations between the model and the data, normalized by the variance of the data:

    \chi^2 = \sum_{i=1}^{n} \frac{(x_i - a_i)^2}{V(x_i)}

where n is the number of bins or points over which we want to know if the data x are a good fit to the model a. The test is distribution-free because the critical values of χ² depend only on n, and not on the distribution of x.

The probability of χ² exceeding the observed value for a given n, assuming H0, is called the P-value.

An important but often forgotten property of Pearson's χ² statistic is that larger values of χ² imply a worse fit.
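A minimal sketch (added, with invented binned data) of the χ² statistic and its P-value, taking V(x_i) ≈ a_i as for Poisson-distributed bin contents and, as on this slide, using n degrees of freedom (no fitted parameters):

```python
# Minimal sketch: Pearson's chi-square for binned data x_i against model
# predictions a_i, and the corresponding P-value under H0.
import numpy as np
from scipy.stats import chi2

x = np.array([12.0, 19.0, 31.0, 18.0, 11.0])    # hypothetical bin contents
a = np.array([10.0, 20.0, 30.0, 20.0, 10.0])    # model prediction per bin
chisq = np.sum((x - a) ** 2 / a)                # V(x_i) approximated by a_i
p_value = chi2.sf(chisq, len(x))                # P(chi^2 > observed | H0)
print(chisq, p_value)                           # larger chi^2 gives a smaller P-value
```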
Bayesians cannot use this or any other GoF test, because the calculation of the P-value depends on the probability of data not observed, and it therefore violates the Bayesian Likelihood Principle.

Goodness-of-fit testing is the domain of Frequentist statistics. There is no credible Bayesian goodness-of-fit test.
Decision Theory

For decision-making we need to introduce a new concept, the loss incurred in making the wrong decision, or more generally the losses incurred in taking different decisions as a function of which hypothesis is true. Sometimes the negative loss (utility) is used.

Simplest possible example: decide whether to bring an umbrella to work. Let P(rain) be the (Bayesian) probability that it will rain. The loss function may be:

    Loss(umbrella if rain)        = 1
    Loss(umbrella if no rain)     = 1
    Loss(no umbrella if no rain)  = 0
    Loss(no umbrella if rain)     = 5
The most obvious criterion for making a decision is to minimize the expected loss:

    Expected loss | umbrella     = 1 × P(rain) + 1 × P(no rain) = 1
    Expected loss | no umbrella  = 5 × P(rain) + 0 × P(no rain) = 5 × P(rain)

So you will minimize the expected loss by taking an umbrella to work whenever the probability of rain is more than 1/5.
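The break-even point can be checked with a few lines (a sketch added here; the dictionary is just one way to encode the loss table above):

```python
# Minimal sketch: expected losses for the umbrella example
# as a function of P(rain), showing the 1/5 break-even point.
loss = {("umbrella", "rain"): 1, ("umbrella", "no rain"): 1,
        ("no umbrella", "no rain"): 0, ("no umbrella", "rain"): 5}

def expected_loss(decision, p_rain):
    return (loss[(decision, "rain")] * p_rain
            + loss[(decision, "no rain")] * (1 - p_rain))

for p_rain in (0.1, 0.2, 0.3):
    print(p_rain,
          expected_loss("umbrella", p_rain),
          expected_loss("no umbrella", p_rain))
# taking the umbrella wins exactly when 5 * p_rain > 1, i.e. p_rain > 1/5
```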
Decision Theory is the domain of Bayesian statistics.