DGoncz@aol.com science forum Guru Wannabe
Joined: 25 Oct 2005
Posts: 122

Posted: Mon Jul 10, 2006 11:40 am Post subject:
Re: Maximal entropy from Bayes' law?



Hello, news:sci.math.research and news:sci.stat.math readers.
It has been a while since I last posted to sci.math.research.
I just finished MTH 241 Statistics and wrote a bit in Wikipedia and
news:sci.math about Bayesian probabilities to exercise my new skills; I
worked out the power (beta) of an HIV test example, which differs in a
striking manner from the picture you get looking at the sensitivity and
specificity of the antibody test. (The forward probability, I think
that is called.)
I collected some links here and have tried to fit them into the OP with
a few quotes from the linked documents.
John Baez wrote in the referenced post that:
Quote:  I was walking on the beach with Chris Lee and James Dolan last
weekend, and we came across a problem about statistical inference.

I don't know about you, but I love it when I come across things on the
beach. :)
Quote:  Suppose we repeatedly observe a system in a well-understood but
stochastic way and get some frequency distribution of outcomes.
What's our best guess for the probability that the system is in a
given state? I'm hoping that under some conditions the Bayesian
approach gives results that match Jaynes' maximum entropy method.

In general,
http://mathworld.wolfram.com/InverseProblem.html
but specifically,
http://citeseer.ifi.unizh.ch/12595.html
"A Comparison Of Two Approaches: Maximum Entropy On The Mean (MEM) And
Bayesian Estimation (BAYES) For Inverse Problems "
may be of use.
So now the question is "What are the conditions?"
Quote:  Let's make the problem very finitistic to keep it simple.
Suppose X is a finite set of "states" and Y is a finite set
of "observation outcomes". Suppose we have a function
f: Y x X -> [0,1]
giving the conditional probability f(y|x) for the outcome of
an observation to equal y given that the system's state is x.

That seems more like a, or even *the* general observation problem. What
we see is not always what is there.
Quote: 
We can think of f as a "stochastic function" from states to
observation outcomes, sending each point of X to a probability
measure on Y. By linearity, f gives a map from probability measures
on X to probability measures on Y. I'll also call this f.
Given a probability measure q on Y, what's our best guess
for a probability measure p on X with f(p) = q? Assume such
a measure exists.

How is what we see colored by our expectation of what is there? How do
we take the color of our expectations away from what we see? Zen
provides answers to the philosophical problem. Math provides answers to
actual problems.
http://en.wikipedia.org/wiki/Prior_probability provides:
"Some attempts have been made at finding probability distributions in
some sense logically required by the nature of one's state of
uncertainty; these are a subject of philosophical controversy. "
Quote: 
One approach is Jaynesian: maximize entropy! In other words,
look at the probability measures p with f(p) = q and choose
the p with maximum entropy.
Of course to define entropy we need a "prior" measure on X;
let's use counting measure. So, we are trying to maximize
- sum_{x in X} p(x) ln(p(x))
over p satisfying the linear constraint f(p) = q. This is something
people know how to do.

http://en.wikipedia.org/wiki/Prior_probability provides:
"Another idea, championed by Edwin T. Jaynes, is to use the principle
of maximum entropy. The motivation is that the Shannon entropy of a
probability distribution measures the amount of information contained
in the distribution. The larger the entropy, the less information is
provided by the distribution. Thus, by maximizing the entropy over a
suitable set of probability distributions on X, one finds that
distribution that is least informative in the sense that it contains
the least amount of information consistent with the constraints that
define the set. "
Quote: 
Another approach is Bayesian: use Bayes' rule! The conditional
probability for any measurement outcome y given the state x is f(y|x).
Use counting measure on X as our "prior", and use Bayes' law to
update this prior when we do an observation and get the outcome y.

http://mathworld.wolfram.com/BayesTheorem.html
Quote:  In fact, let's imagine that we keep doing observations, getting
outcomes distributed according to the probability measure q, and
use these to keep updating our probability measure on X. In the
limit of infinitely many observations, this measure should converge
(almost surely) to some fixed measure on X.
Do we get the same answer using both approaches?

Once again, from http://en.wikipedia.org/wiki/Prior_probability :

"...Edwin T. Jaynes has published an argument (Jaynes 1968) based on
Lie groups that suggests that the prior for the proportion p of voters
voting for a candidate, given no other information, should be
p^(-1) * (1 - p)^(-1)
If one is so uncertain about the value of the aforementioned proportion
p that one knows only that at least one voter will vote for Smith and
at least one will not, then the conditional probability distribution of
p given this information alone is the uniform distribution on the
interval [0, 1], which is obtained by applying Bayes' Theorem to the
data set consisting of one vote for Smith and one vote against, using
the above prior."

Now that is something similar to what I did to work out the HIV testing
example. I think I could do or at least follow that, but would have
some difficulty as the population parameter p (not probability but
proportion) wasn't covered in MTH 241 Statistics. But Bayes's rule was.
I don't know my right foot from a Lie group yet.
Quote:  If so, we might say we had a "derivation" of the maximal entropy
principle from Bayes' law... or vice versa.
Feel free to fix my question; my description of the Bayesian
approach is pretty sketchy, and I may have something screwed up.

Oh, I can't do that, but I hope those who read these inserted links
will have something to add. I'm just a machinist studying math at the
community college!
http://mathworld.wolfram.com/BayesianAnalysis.html
This, I think, is more helpful, clear, rigorous, and general than the
Wikipedia article referenced in John's second post to this thread. My 2
cents...
Finally,
http://citeseer.ifi.unizh.ch/mohammaddjafari96full.html
"A Full Bayesian Approach for Inverse Problems"
Doug Goncz
Replikon Research
Seven Corners, VA 220440394 



kunzmilan@atlas.cz science forum beginner
Joined: 21 Feb 2006
Posts: 42

Posted: Sat Jul 08, 2006 11:48 am Post subject:
Re: Maximal entropy from Bayes' law?



Thomas wrote:
Quote:  From the above description, it should be clear that maximum
entropy is a shockingly strong assumption. It should only be 
used if
A) Your probability measure really is being generated by an
astronomically large number of microstates interacting. I'm
told this sometimes happens in physics.
Boltzmann, trying to explain his formula for thermodynamic
entropy, published (1877) the following table:
(7,0,0,0,0,0,0)
(6,1,0,0,0,0,0)
(5,2,0,0,0,0,0)
(4,3,0,0,0,0,0)
(5,1,1,0,0,0,0)
(4,2,1,0,0,0,0)
(3,3,1,0,0,0,0)
(3,2,2,0,0,0,0)
(4,1,1,1,0,0,0)
(3,2,1,1,0,0,0)
(2,2,2,1,0,0,0)
(3,1,1,1,1,0,0)
(2,2,1,1,1,0,0)
(2,1,1,1,1,1,0)
(1,1,1,1,1,1,1)
He calculated the entropy of each partition using the polynomial
(multinomial) coefficient, and suggested that its logarithm equals the
entropy. It is thus possible to use it even for small sets.
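A small sketch of the two counts, assuming "polynomial coefficient" means the multinomial coefficient. For a row of Boltzmann's table it is taken as 7!/prod_k w_k!, where w_k is how many of the 7 entries equal k; for a letter string it is 7!/prod_j n_j!, where the n_j are the row entries. The function names are illustrative and do not come from Boltzmann or Shannon.

```python
from collections import Counter
from math import factorial, log

# The 15 rows of the table above: the partitions of 7 into at most 7 parts.
ROWS = [
    (7, 0, 0, 0, 0, 0, 0), (6, 1, 0, 0, 0, 0, 0), (5, 2, 0, 0, 0, 0, 0),
    (4, 3, 0, 0, 0, 0, 0), (5, 1, 1, 0, 0, 0, 0), (4, 2, 1, 0, 0, 0, 0),
    (3, 3, 1, 0, 0, 0, 0), (3, 2, 2, 0, 0, 0, 0), (4, 1, 1, 1, 0, 0, 0),
    (3, 2, 1, 1, 0, 0, 0), (2, 2, 2, 1, 0, 0, 0), (3, 1, 1, 1, 1, 0, 0),
    (2, 2, 1, 1, 1, 0, 0), (2, 1, 1, 1, 1, 1, 0), (1, 1, 1, 1, 1, 1, 1),
]

def multinomial(total, groups):
    """total! / (g_1! g_2! ...), assuming the groups sum to total."""
    result = factorial(total)
    for g in groups:
        result //= factorial(g)
    return result

def boltzmann_count(row):
    # Distinct orderings of the row: group sizes are how many entries
    # share each value (zeros included).
    return multinomial(len(row), Counter(row).values())

def shannon_count(row):
    # Distinct 7-letter strings whose letter counts are the row entries.
    return multinomial(sum(row), row)

for row in ROWS:
    b, s = boltzmann_count(row), shannon_count(row)
    print(row, b, round(log(b), 3), s, round(log(s), 3))
```

The logarithm of either count can be read as an entropy for that row, which is the sense in which the construction works for a set as small as 7.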
The Shannon realization of the previous table in information theory:
(a,a,a,a,a,a,a)
(a,a,a,a,a,a,b)
(a,a,a,a,a,b,b)
(a,a,a,a,b,b,b)
(a,a,a,a,a,b,c)
(a,a,a,a,b,b,c)
(a,a,a,b,b,b,c)
(a,a,a,b,b,c,c)
(a,a,a,a,b,c,d)
(a,a,a,b,b,c,d)
(a,a,b,b,c,c,d)
(a,a,a,b,c,d,e)
(a,a,b,b,c,d,e)
(a,a,b,c,d,e,f)
(a,b,c,d,e,f,g)
Here, another polynomial coefficient is applicable, connected with
another measure of entropy. Both entropies are additive; nevertheless,
they cannot solve the mixing problem of how to distinguish strings such as:
000000111111
010101010101
111010110000.
kunzmilan 



Ian1 science forum beginner
Joined: 16 Feb 2005
Posts: 22

Posted: Fri Jul 07, 2006 11:46 am Post subject:
Re: Maximal entropy from Bayes' law?



Sorry that this reply is so broken up into separate posts. I keep
having new thoughts about it.
John Baez wrote:
Quote:  I was walking on the beach with Chris Lee and James Dolan last
weekend, and we came across a problem about statistical inference.
Suppose we repeatedly observe a system in a well-understood but
stochastic way and get some frequency distribution of outcomes.
What's our best guess for the probability that the system is in a
given state? I'm hoping that under some conditions the Bayesian
approach gives results that match Jaynes' maximum entropy method.

If we observe a number n of y's coming from a single x (denoting the
set of outcomes by Y - the set is sufficient since the y samples are
independent given x), then
P(x | Y) \propto P(Y | x) P(x) .
If we take P(x) to be uniform, then we have
P(x | Y) \propto \prod_{y \in Y} f(y, x)^{n_{y}} ,
where n_{y} is the number of occurrences of y in the n observations. As
n gets large, we will have
n_{y} ~ n f(y, x)
so
ln P(x | Y) ~ \sum_{y \in Y} n f(y, x) ln f(y, x)
so assuming we eventually observe every possible value of y, we will
have
P(x | Y) ~ e^{-n H[f](x)}
where H[f](x) is the entropy of f(y, x) as a distribution on y. As n
gets very large, P(x | Y) will concentrate on those x's which minimize
H[f](x), which is not the maximum entropy result, at least not in the
way you defined it in your post.
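A numerical sketch of the update P(x | Y) \propto \prod_y f(y, x)^{n_y} with a uniform P(x). It borrows the die example from John Baez's follow-up in this thread (X = {1,...,6}, parity reported correctly with probability 3/4). The counts, 200 "even" and 100 "odd" out of 300, are assumed for illustration, and the function names are hypothetical.

```python
import math

STATES = [1, 2, 3, 4, 5, 6]

def f(y, x):
    """Chance of reporting parity bit y (0 = even, 1 = odd) when the die shows x."""
    return 0.75 if y == x % 2 else 0.25

def posterior(counts, prior=None):
    """P(x | Y) from outcome counts {y: n_y}, computed in log space."""
    if prior is None:
        prior = {x: 1.0 / len(STATES) for x in STATES}  # uniform P(x)
    log_w = {
        x: math.log(prior[x])
        + sum(n * math.log(f(y, x)) for y, n in counts.items())
        for x in STATES
    }
    top = max(log_w.values())  # subtract the max to avoid underflow
    w = {x: math.exp(lw - top) for x, lw in log_w.items()}
    z = sum(w.values())
    return {x: wx / z for x, wx in w.items()}

# Assumed data: 300 reports, "even" two thirds of the time.
post = posterior({0: 200, 1: 100})
for x in STATES:
    print(x, f"{post[x]:.3e}")
```

With these counts the weight ends up spread evenly over the even faces, about 1/3 each, and the odd faces are suppressed by a factor of 3^100. Maximizing entropy under the constraint 3/4 P(even) + 1/4 P(odd) = 2/3 would instead give P(even) = 5/6, so the two procedures do not agree here.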
Ian. 



Ian1 science forum beginner
Joined: 16 Feb 2005
Posts: 22

Posted: Fri Jul 07, 2006 11:35 am Post subject:
Re: Maximal entropy from Bayes' law?



Dear Thomas,
thomaswc@gmail.com wrote:
Quote:  In my understanding, the "orthodox Bayesian" solution to your
setup would involve a prior on p, the probability measure on X.
If you wish, you may call this P(p) a hyperprior.
Next, you need to formalize what you mean by a best guess.
Here are two possibilities:
1) The p maximizing P(p) subject to the constraint f(p) = q.
This may not be unique.
2) The p' minimizing the expected loss function
EL(p') = sum_{p | f(p) = q} L(p', p) P(p)
where L is some loss function specifying how embarrassed you
would feel about guessing p' when the real answer is p. Note
that the expected loss minimizing p' isn't guaranteed to satisfy
f(p') = q (but it might anyhow).

So if the loss function is \delta(p', p), we are doing MAP estimation,
and we are back to (1).
Quote:  If you are willing to bend the rules a bit, you can fit maximum
entropy into this framework. Just think of maximum entropy
as an unnormalized hyperprior P(p) that says P(p) is infinitely
greater than P(q) whenever p has higher entropy than q.

The 'infinite difference' seems unnecessary to me. If the prior is
simply e^H(p), then the MAP estimate will be to maximize H(p) subject
to the constraint f(p) = q, which is exactly the maximum entropy
solution suggested by John Baez. Or am I missing something?
Quote:  From the above description, it should be clear that maximum
entropy is a shockingly strong assumption. It should only be
used if
A) Your probability measure really is being generated by an
astronomically large number of microstates interacting. I'm
told this sometimes happens in physics.

The relation between Bayes' theorem and maximum entropy that I outline
in my posts does suggest that maximum entropy should only be applied
when the 'measurement' you have is a very good estimate at least of
the mean of p. It is an interesting question when this assumption is
justified (deviations from it would mean deviations from the canonical
ensemble in statistical physics, for example, because the Bayes'
theorem argument should be applied with finite m). In physical systems,
perhaps it is because of the time-averaging that would be involved in
any real measurement?
Quote:  John Baez writes:
Another approach is Bayesian: use Bayes' rule! The conditional
probability for any measurement outcome y given the state x is f(y|x).
Use counting measure on X as our "prior", and use Bayes' law to
update this prior when we do an observation and get the outcome y.
In fact, let's imagine that we keep doing observations, getting
outcomes distributed according to the probability measure q, and
use these to keep updating our probability measure on X. In the
limit of infinitely many observations, this measure should converge
(almost surely) to some fixed measure on X.
I'm skeptical about this converging; I don't see any reason why it
wouldn't just hop around all of the various p where f(p) = q.

I do not think the rest of your post (snipped) is analysing what John
Baez was suggesting. The result certainly converges, according to my
interpretation of what he was saying.
Ian.



Ian1 science forum beginner
Joined: 16 Feb 2005
Posts: 22

Posted: Thu Jul 06, 2006 10:52 am Post subject:
Re: Maximal entropy from Bayes' law?



John Baez wrote:
Quote:  I was walking on the beach with Chris Lee and James Dolan last
weekend, and we came across a problem about statistical inference.
Suppose we repeatedly observe a system in a well-understood but
stochastic way and get some frequency distribution of outcomes.
What's our best guess for the probability that the system is in a
given state? I'm hoping that under some conditions the Bayesian
approach gives results that match Jaynes' maximum entropy method.

Further to my previous post: as you have probably noticed, the
situation you are discussing is more complex in one way, and more
simple in another, than the one Jaynes is discussing in the reference I
gave.
Jaynes is discussing the case where there is a deterministic relation
between your x and y (rather than my x and y - sorry my notation was
confusing). In other words, f(y, x) = \delta(y, F(x)) for some function
F. Then, given <F>, the mean of F under an unknown measure on X,
maximum entropy furnishes a measure on X. Alternatively, given the
average of y over m samples of y (equivalently, the average of F(x)
over m samples of x), Bayes' theorem furnishes a posterior measure on
X^{m}. Marginalizing gives a measure on X. As m -> \infty, the two
measures agree, essentially because the second method counts the number
of ways that (m - 1) x's can produce a given average value of F given
that the m^{th} value is fixed.
Your situation involves a stochastic function f, which, what is more,
is not measured directly. There are two ways to think about
constructing a measure on X, as you say. One is to take the maximum
entropy measure on X under the constraint q = f(p). (It should be
noted, though, that you do not have q. You have samples from q.) The
other is to look at n samples of y coming from a single unknown value
x, and to construct the posterior measure on x. It is hard to see how
the latter will produce a maximum entropy distribution, though, as it
does not 'count the number of ways' of producing something, or at
least, I cannot see how it does.
The combination of the two cases would involve m different points in
Y^{n} coming from m points in X (so for each x sample, there are n
samples from f(y, x)). Bayes' theorem would then construct a measure on
X^{m}, which could be marginalized to give a measure on X. Jaynes'
argument suggests that it is in the case m -> \infty (and probably,
because you have a stochastic function and not a deterministic one, n
-> \infty as well) that the equivalence will exist.
To put it another, I think equivalent, way: the constraint q = f(p) is
equivalent to a large number of constraints of Jaynes' type, because
q(y) = \int dx f(y, x) p(x) = <f_{y}>, where f_{y}(x) = f(y, x). To fit
in with his argument then, requires being given, for each y, the
average value of f_{y} over m samples of x, and then to let m ->
\infty. In practice, the values of f_{y} are not available directly,
but something similar would be the average of the frequencies of
occurrence of y across the different samples of x. For these to be
'precise' measures of the average values of the f_{y} though, indeed
to guarantee having an average value for every y, means taking n ->
\infty. This seems equivalent to the previous paragraph.
Ian. 



Ian1 science forum beginner
Joined: 16 Feb 2005
Posts: 22

Posted: Thu Jul 06, 2006 8:48 am Post subject:
Re: Maximal entropy from Bayes' law?



John Baez wrote:
Quote:  I was walking on the beach with Chris Lee and James Dolan last
weekend, and we came across a problem about statistical inference.
Suppose we repeatedly observe a system in a well-understood but
stochastic way and get some frequency distribution of outcomes.
What's our best guess for the probability that the system is in a
given state? I'm hoping that under some conditions the Bayesian
approach gives results that match Jaynes' maximum entropy method.

You know Jaynes' book, so perhaps you already know this article:
http://bayes.wustl.edu/etj/articles/stand.on.entropy.pdf.
Pages 38-41 discuss a relationship between maximum entropy and Bayes'
theorem. As I understand it, it is the following. Let X be the 'state
space', Y = X^{n}, and F be a function from X to somewhere suitable.
1) Given the constraint <F> = f, one can maximize entropy over measures
on X to get a measure on X.
2) Given that (1/n) \sum_{i = 1}^{n} F(y_{i}) = f for an unknown sample
y \in Y, one can use Bayes' theorem to compute the posterior measure
for y, and then marginalize to get the posterior measure for x \in X.
As n -> \infty, these two procedures agree. The Darwin-Fowler method in
statistical physics does more or less the same thing to get from the
microcanonical to canonical ensemble, I think.
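A rough numerical check of the agreement between procedures 1) and 2), under assumptions chosen only for illustration: X is the six faces of a die, F(x) = x, the prescribed mean is f = 4.5, n = 200, and the prior on X^{n} is uniform. The helper names are hypothetical.

```python
import math

FACES = range(1, 7)

def ways_to_sum(num_dice):
    """ways[t] = number of ordered rolls of num_dice dice with total t."""
    ways = [1]  # zero dice: one way to reach total 0
    for _ in range(num_dice):
        new = [0] * (len(ways) + 6)
        for total, w in enumerate(ways):
            if w:
                for face in FACES:
                    new[total + face] += w
        ways = new
    return ways

def conditioned_marginal(n, total):
    """P(first die = x | n dice sum to total), uniform prior on all rolls."""
    rest = ways_to_sum(n - 1)
    weights = [rest[total - x] if 0 <= total - x < len(rest) else 0
               for x in FACES]
    z = sum(weights)
    return [w / z for w in weights]

def maxent(mean, lo=-20.0, hi=20.0):
    """Maximum-entropy p(x) proportional to exp(lam * x) with the given mean."""
    def dist(lam):
        w = [math.exp(lam * x) for x in FACES]
        z = sum(w)
        return [v / z for v in w]
    for _ in range(200):  # bisection: the mean increases with lam
        mid = (lo + hi) / 2
        if sum(x * p for x, p in zip(FACES, dist(mid))) < mean:
            lo = mid
        else:
            hi = mid
    return dist((lo + hi) / 2)

n, mean = 200, 4.5
bayes = conditioned_marginal(n, int(n * mean))  # procedure 2, marginalized
gibbs = maxent(mean)                            # procedure 1
gap = max(abs(b - g) for b, g in zip(bayes, gibbs))
print(gap)
```

The gap between the two distributions should shrink as n grows, which is the sense of "agree" above. The exact counting in `ways_to_sum` is what plays the role of "the number of ways" in the argument.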
I am not sure if this really addresses your question, but perhaps it
will be helpful.
Ian. 



thomaswc@gmail.com science forum beginner
Joined: 03 Jul 2006
Posts: 1

Posted: Mon Jul 03, 2006 10:50 pm Post subject:
Re: Maximal entropy from Bayes' law?



In my understanding, the "orthodox Bayesian" solution to your
setup would involve a prior on p, the probability measure on X.
If you wish, you may call this P(p) a hyperprior.
Next, you need to formalize what you mean by a best guess.
Here are two possibilities:
1) The p maximizing P(p) subject to the constraint f(p) = q.
This may not be unique.
2) The p' minimizing the expected loss function
EL(p') = sum_{p | f(p) = q} L(p', p) P(p)
where L is some loss function specifying how embarrassed you
would feel about guessing p' when the real answer is p. Note
that the expected loss minimizing p' isn't guaranteed to satisfy
f(p') = q (but it might anyhow).
If you are willing to bend the rules a bit, you can fit maximum
entropy into this framework. Just think of maximum entropy
as an unnormalized hyperprior P(p) that says P(p) is infinitely
greater than P(q) whenever p has higher entropy than q. (You
could make this precise with probability measures valued in
non-Archimedean fields, but where's the fun in that?)
From the above description, it should be clear that maximum
entropy is a shockingly strong assumption. It should only be 
used if
A) Your probability measure really is being generated by an
astronomically large number of microstates interacting. I'm
told this sometimes happens in physics.
B) You don't care. Or perhaps more precisely, when you'd
rather compute a partition function than come up with a loss
function and do the integral in above possibility #2.
John Baez writes:
Quote:  Another approach is Bayesian: use Bayes' rule! The conditional
probability for any measurement outcome y given the state x is f(y|x).
Use counting measure on X as our "prior", and use Bayes' law to
update this prior when we do an observation and get the outcome y.
In fact, let's imagine that we keep doing observations, getting
outcomes distributed according to the probability measure q, and
use these to keep updating our probability measure on X. In the
limit of infinitely many observations, this measure should converge
(almost surely) to some fixed measure on X.

I'm skeptical about this converging; I don't see any reason why it
wouldn't just hop around all of the various p where f(p) = q.
But let's see. We'll do n observations, y_1 .. y_n, and see what
happens to
E( P( x | y_1 .. y_n ) ) = sum_{y_1} .. sum_{y_n} P( y_1 .. y_n ) P( x | y_1 .. y_n )
By Bayes' law,
E( P( x | y_1 .. y_n ) ) = sum_{y_1} .. sum_{y_n} P(x) P( y_1 .. y_n | x )
Let's assume the y_i are independent,
E( P( x | y_1 .. y_n ) ) = P(x) sum_{y_1} .. sum_{y_n} P( y_1 | x ) .. P( y_n | x )
Rearranging,
E( P( x | y_1 .. y_n ) ) = P(x) ( sum_y f(y|x) )^n
But wait! sum_y f(y|x) = 1, so
E( P( x | y_1 .. y_n ) ) = P(x)
So I guess it does converge, just to the boring P(x) and not to the
interesting maximum entropy solution.
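The identity can be checked by brute force on a small case. The sketch below assumes the die example from John Baez's follow-up (six states, a parity report that is right with probability 3/4) and a uniform prior; exact fractions are used so the comparison is not blurred by rounding. The names are illustrative.

```python
from fractions import Fraction
from itertools import product

STATES = [1, 2, 3, 4, 5, 6]
OUTCOMES = [0, 1]  # 0 = "even", 1 = "odd"
PRIOR = {x: Fraction(1, 6) for x in STATES}

def f(y, x):
    """Conditional probability of reporting y when the state is x."""
    return Fraction(3, 4) if y == x % 2 else Fraction(1, 4)

def likelihood(ys, x):
    """P(y_1 .. y_n | x) for independent observations."""
    out = Fraction(1)
    for y in ys:
        out *= f(y, x)
    return out

def expected_posterior(n):
    """E( P(x | y_1 .. y_n) ), summing over every possible data sequence."""
    expected = {x: Fraction(0) for x in STATES}
    for ys in product(OUTCOMES, repeat=n):
        joint = {x: PRIOR[x] * likelihood(ys, x) for x in STATES}
        p_ys = sum(joint.values())          # P(y_1 .. y_n)
        for x in STATES:
            expected[x] += p_ys * (joint[x] / p_ys)  # P(ys) * P(x | ys)
    return expected

print(expected_posterior(4) == PRIOR)
```

Note that the average is taken over data drawn from the prior's own predictive distribution, so it says nothing about where the posterior goes along one particular stream of observations.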
Thomas C 



John Baez science forum Guru Wannabe
Joined: 01 May 2005
Posts: 220

Posted: Sat Jul 01, 2006 7:48 am Post subject:
Re: Maximal entropy from Bayes' law?



In article <e83q7p$4i8o@odds.stat.purdue.edu>,
Herman Rubin <hrubin@stat.purdue.edu> wrote:
Quote:  In article <e81fac$29a$1@glue.ucr.edu>,
John Baez <baez@math.removethis.ucr.andthis.edu> wrote:
Let's make the problem very finitistic to keep it simple.
Suppose X is a finite set of "states" and Y is a finite set
of "observation outcomes". Suppose we have a function
f: Y x X -> [0,1]
giving the conditional probability f(y|x) for the outcome of
an observation to equal y given that the system's state is x.
As Einstein said, we should make things as simple as
possible, but not too simple.

In math, a question can only be too simple if you already know
the answer. I don't know the answer to this one.
Quote:  This is too simple, as it is quite difficult to come up with
a real problem with X finite.

That's okay. I'm trying to understand the answer to
the very simplest case of a class of conjectures before
tackling harder and more interesting ones.
Quote:  We can think of f as a "stochastic function" from states to
observation outcomes, sending each point of X to a probability
measure on Y. By linearity, f gives a map from probability measures
on X to probability measures on Y. I'll also call this f.
Also, typically the function f is of a restricted type.
We generally assume that observations are independent,
or have some restricted type of dependence. Otherwise,
little progress can be made.

That's okay - I just want to know whether my conjecture is
true, false, or if there's some actual logical inconsistency
in my statement of it.
Quote:  One approach is Jaynesian: maximize entropy! In other words,
look at the probability measures p with f(p) = q and choose
the p with maximum entropy.
This is something people know how to do.
But other than knowing how, why should we do this?

As you probably know, there's a vast literature on this
issue starting with Jaynes' book:
http://omega.albany.edu:8008/JaynesBook.html
or in some sense even the work of Boltzmann and Gibbs.
It's fascinating and controversial.
But it's sort of no fair, when someone states a mathematical conjecture
claiming that the left-hand side of some putative equation equals the
right-hand side, to ask "but why should we compute the left-hand side?"
Quote:  Another approach is Bayesian: use Bayes' rule! The conditional
probability for any measurement outcome y given the state x is f(y|x).
Use counting measure on X as our "prior", and use Bayes' law to
update this prior when we do an observation and get the outcome y.
Which Bayes rule?

I think there's a rule which says how to compute the posterior
probability for an event x if you 1) know the probability of y
given x, 2) know the probability of y, and 3) have chosen a
prior probability for x:
http://en.wikipedia.org/wiki/Bayes'_rule
If I'm confused about this, that's the sort of thing I'd
definitely like to know.
Quote:  The counting measure on X, even if it is
finite, is rarely even a remotely reasonable prior.

That's okay. I'm trying to state the simplest possible
case of a class of conjectures, so I'm taking the simplest
possible prior. If the conjecture turns out to be true,
then I can see if it's still true with a more general class
of priors.
Quote:  All
attempts to come up with reasonable "natural" priors seem
to fail at several points.

Right. I'm just choosing some completely stupid prior in
order to keep the problem simple.
Quote:  In fact, let's imagine that we keep doing observations, getting
outcomes distributed according to the probability measure q, and
use these to keep updating our probability measure on X. In the
limit of infinitely many observations, this measure should converge
(almost surely) to some fixed measure on X.
Do we get the same answer using both approaches?
This requires that Y change with each additional observation,
or that only part of Y is observed.

Okay, now you're tackling my question! Great!
But, I'm not sure I get your point. I'm definitely keeping Y fixed.
I'm letting the observation outcome y in Y depend in a stochastic
way on the state x in X. So, even if I fix a particular state x,
and keep doing observations, I'll keep getting different results y;
they'll be distributed with probabilities f(y|x). I guess I
should have said that they're independently distributed! Sorry.
So, for example, I could have a 6-sided die, so
X = {1,2,3,4,5,6}
and you could ask me if the number on it is even or odd, so
Y = {0,1}
and I could look at the die and tell you the correct answer with
probability 3/4 and the wrong answer with probability 1/4 -
say I'm not very good at this even vs. odd business, or I've
got bad eyesight. This gives the function f with
f(y|x) = 3/4 if y = x mod 2
f(y|x) = 1/4 otherwise
Now say I keep rolling the die and you keep asking me whether
the number that comes up is even or odd. Suppose I say
it comes up even 2/3 of the time.
Now comes a question: "what's the probability that I roll a 2?"
To answer this, I'll use as my prior the one where the die is fair.
I can then try to answer my question in a maximum entropy way,
or a Bayesian way, as sketched above... and my question is
whether these give the same answers.
Let me do it the maximum entropy way, just for kicks.
I want to find a probability distribution p on X that
maximizes
- sum p(x) ln(p(x))
subject to the constraint that
3/4 (p(2) + p(4) + p(6)) + 1/4 (p(1) + p(3) + p(5)) = 2/3
meaning that I report the die coming up "even" 2/3
of the time. This constraint is of the form
sum g(x) p(x) = constant
so I can use Lagrange multipliers and get
p(x) = Z exp(g(x)/c)
for some constants c and Z, which I can work out using
the fact that the sum of the p(x) should be 1, and
3/4 (p(2) + p(4) + p(6)) + 1/4 (p(1) + p(3) + p(5)) = 2/3
Hmm, I'm too tired to actually grind it out right now -
it's almost 1 am. But anyway....
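For reference, the grind is short if one takes the constraint and f exactly as stated above. Since g(x) takes only two values (3/4 on even faces, 1/4 on odd ones), p(x) = Z exp(g(x)/c) is constant on the evens and constant on the odds, and the two linear conditions pin down both values. A sketch, with hypothetical helper names:

```python
import math

EVENS, ODDS = (2, 4, 6), (1, 3, 5)

def entropy(p):
    """Shannon entropy - sum p(x) ln(p(x)) of a distribution given as a dict."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)

def reported_even(p):
    """Left-hand side of the constraint: chance that "even" is reported."""
    return 0.75 * sum(p[x] for x in EVENS) + 0.25 * sum(p[x] for x in ODDS)

# With P_even + P_odd = 1, the constraint 3/4 P_even + 1/4 P_odd = 2/3
# gives P_even = (2/3 - 1/4) / (3/4 - 1/4) = 5/6.
p_even = (2 / 3 - 1 / 4) / (3 / 4 - 1 / 4)
p = {x: p_even / 3 for x in EVENS}              # 5/18 on each even face
p.update({x: (1 - p_even) / 3 for x in ODDS})   # 1/18 on each odd face

print(p[2], reported_even(p), entropy(p))
```

So under these assumptions the maximum entropy answer to "what's the probability that I roll a 2?" is 5/18. Any other distribution satisfying the constraint has the same total weight on the evens but spreads it less evenly, which lowers the entropy.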
Quote:  And yes, in the limit with infinitely many identically
distributed observations with finite X, if the f's are
all different, anything which does not exclude some
elements of X works.

I'm sorry, I don't get what you mean about "if the f's
are all different". I'm picking one finite set X, one
finite set Y, and one function f: Y x X -> [0,1] at the
very start of the problem.
I have a feeling that communication has barely begun.
Please forgive the fact that I may be asking a weird
question and may be using terminology in a funny way. 



Herman Rubin science forum Guru
Joined: 25 Mar 2005
Posts: 730

Posted: Fri Jun 30, 2006 6:23 pm Post subject:
Re: Maximal entropy from Bayes' law?



In article <e81fac$29a$1@glue.ucr.edu>,
John Baez <baez@math.removethis.ucr.andthis.edu> wrote:
Quote:  I was walking on the beach with Chris Lee and James Dolan last
weekend, and we came across a problem about statistical inference.
Suppose we repeatedly observe a system in a well-understood but
stochastic way and get some frequency distribution of outcomes.
What's our best guess for the probability that the system is in a
given state? I'm hoping that under some conditions the Bayesian
approach gives results that match Jaynes' maximum entropy method.
Let's make the problem very finitistic to keep it simple.
Suppose X is a finite set of "states" and Y is a finite set
of "observation outcomes". Suppose we have a function
f: Y x X -> [0,1]
giving the conditional probability f(y|x) for the outcome of
an observation to equal y given that the system's state is x.

As Einstein said, we should make things as simple as
possible, but not too simple. This is too simple, as
it is quite difficult to come up with a real problem
with X finite.
Quote:  We can think of f as a "stochastic function" from states to
observation outcomes, sending each point of X to a probability
measure on Y. By linearity, f gives a map from probability measures
on X to probability measures on Y. I'll also call this f.

Also, typically the function f is of a restricted type.
We generally assume that observations are independent,
or have some restricted type of dependence. Otherwise,
little progress can be made.
Quote:  Given a probability measure q on Y, what's our best guess
for a probability measure p on X with f(p) = q? Assume such
a measure exists.
One approach is Jaynesian: maximize entropy! In other words,
look at the probability measures p with f(p) = q and choose
the p with maximum entropy.
Of course to define entropy we need a "prior" measure on X;
let's use counting measure. So, we are trying to maximize
- sum_{x in X} p(x) ln(p(x))
over p satisfying the linear constraint f(p) = q. This is something
people know how to do.

But other than knowing how, why should we do this?
Quote:  Another approach is Bayesian: use Bayes' rule! The conditional
probability for any measurement outcome y given the state x is f(y|x).
Use counting measure on X as our "prior", and use Bayes' law to
update this prior when we do an observation and get the outcome y.

Which Bayes rule? The counting measure on X, even if it is
finite, is rarely even a remotely reasonable prior. All
attempts to come up with reasonable "natural" priors seem
to fail at several points. I have given reasonable finite
examples where the universe would not be large enough to
contain the information needed for a reasonable prior.
Another approach is to consider consequences, and to try to
act in a consistent manner. In this case, it is well known
that only the product of loss and prior is relevant to the
action to be taken. More than this cannot be done except
by fiat; one can use conditional measures, and they are
always probability measures. See my paper in _Statistics
and Decisions_, 1987, for a weak set of axioms, and why
the two factors cannot be separated.
With this, maximum entropy is NOT consistent, as the prior
changes with the type of observation. A self-consistent
approach must use the SAME prior on x, regardless of which
f is considered.
Quote:  In fact, let's imagine that we keep doing observations, getting
outcomes distributed according to the probability measure q, and
use these to keep updating our probability measure on X. In the
limit of infinitely many observations, this measure should converge
(almost surely) to some fixed measure on X.
Do we get the same answer using both approaches?

This requires that Y change with each additional observation,
or that only part of Y is observed. Both of these can be
handled, and they give different results with maximum
entropy, although not with a particular Bayes prior.
And yes, in the limit with infinitely many identically
distributed observations with finite X, if the f's are
all different, anything which does not exclude some
elements of X works. With infinitely many, there are
problems, but if the problem is simple enough, again
just about anything works. But as the complexity
increases, the prior becomes more important, and I am
unaware of situations where "objectivity" can be used
to get reasonable results for reasonable sized samples,
say 10^10 or so.
Quote:  If so, we might say we had a "derivation" of the maximal entropy
principle from Bayes' law... or vice versa.
Feel free to fix my question; my description of the Bayesian
approach is pretty sketchy, and I may have something screwed up.

The Bayesian approach is to use SOME prior (and loss) in
combination. The prior to use must not depend on the
distribution f of the observations, as "maximum entropy"
does. In some cases with densities, maximum entropy can
even give what are known to be bad results, as the maximum
entropy distribution, which needs constraints, will not
have the right properties.
There are many ways to show that all self-consistent
procedures are at least approximately Bayesian. However,
it is generally the case that attempting to come up with
"natural" priors leads to difficulties, some of which can
be quite serious. This is a hard problem.
To give an example, suppose one is interested in testing
whether a parameter is within a distance q of 0. Set up
a loss-prior combination. Assume the sample size is large
enough that normality of the observations can be assumed.
I have discussed the problem, and there are things which I
can recommend, but not covering all cases.

This address is for information only. I do not claim that these views
are those of the Statistics Department or of Purdue University.
Herman Rubin, Department of Statistics, Purdue University
hrubin@stat.purdue.edu Phone: (765) 494-6054 FAX: (765) 494-0558

John Baez science forum Guru Wannabe
Joined: 01 May 2005
Posts: 220

Posted: Thu Jun 29, 2006 9:05 pm Post subject:
Maximal entropy from Bayes' law?



I was walking on the beach with Chris Lee and James Dolan last
weekend, and we came across a problem about statistical inference.
Suppose we repeatedly observe a system in a well-understood but
stochastic way and get some frequency distribution of outcomes.
What's our best guess for the probability that the system is in a
given state? I'm hoping that under some conditions the Bayesian
approach gives results that match Jaynes' maximum entropy method.
Let's make the problem very finitistic to keep it simple.
Suppose X is a finite set of "states" and Y is a finite set
of "observation outcomes". Suppose we have a function
f: Y x X -> [0,1]
giving the conditional probability f(y|x) for the outcome of
an observation to equal y given that the system's state is x.
We can think of f as a "stochastic function" from states to
observation outcomes, sending each point of X to a probability
measure on Y. By linearity, f gives a map from probability measures
on X to probability measures on Y. I'll also call this f.
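
As a concrete illustration of f acting on measures, here is a minimal
sketch with a made-up table for three states and two outcomes. The
numbers and the helper name pushforward are hypothetical; the only
formula used is q(y) = sum_x f(y|x) p(x).

```python
# f[y][x] = f(y | x); for each fixed x the entries sum to 1 over y.
f = [[0.9, 0.5, 0.1],
     [0.1, 0.5, 0.9]]

def pushforward(f, p):
    """Return q = f(p), where q(y) = sum_x f(y|x) p(x)."""
    return [sum(fyx * px for fyx, px in zip(row, p)) for row in f]

p = [0.2, 0.5, 0.3]          # a probability measure on X
q = pushforward(f, p)        # the induced probability measure on Y
print(q, sum(q))

# Linearity: pushing forward a mixture gives the mixture of pushforwards.
p2 = [1.0, 0.0, 0.0]
mix = [0.5 * a + 0.5 * b for a, b in zip(p, p2)]
lhs = pushforward(f, mix)
rhs = [0.5 * a + 0.5 * b for a, b in zip(q, pushforward(f, p2))]
print(lhs, rhs)
```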
Given a probability measure q on Y, what's our best guess
for a probability measure p on X with f(p) = q? Assume such
a measure exists.
One approach is Jaynesian: maximize entropy! In other words,
look at the probability measures p with f(p) = q and choose
the p with maximum entropy.
Of course to define entropy we need a "prior" measure on X;
let's use counting measure. So, we are trying to maximize
- sum_{x in X} p(x) ln(p(x))
over p satisfying the linear constraint f(p) = q. This is something
people know how to do.
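
A small worked instance, as a sketch only: the table f, the measure q
and the one-parameter description of the constraint set are specific to
this made-up example with three states and two outcomes. Here the
measures p with f(p) = q form a line segment, and the entropy
- sum_x p(x) ln(p(x)) is concave along it, so a ternary search is
enough to find the maximum.

```python
import math

f = [[0.9, 0.5, 0.1],    # f[y][x] = f(y | x), hypothetical numbers
     [0.1, 0.5, 0.9]]
q = [0.46, 0.54]

def entropy(p):
    """Shannon entropy with respect to counting measure."""
    return -sum(v * math.log(v) for v in p if v > 0)

def feasible(t):
    """All p with f(p) = q and total mass 1, for this particular f and q:
    p = (t, 0.9 - 2t, t + 0.1) with 0 <= t <= 0.45."""
    return [t, 0.9 - 2 * t, t + 0.1]

# Ternary search for the maximum of a concave function on [lo, hi].
lo, hi = 0.0, 0.45
for _ in range(200):
    m1 = lo + (hi - lo) / 3
    m2 = hi - (hi - lo) / 3
    if entropy(feasible(m1)) < entropy(feasible(m2)):
        lo = m1
    else:
        hi = m2

p_maxent = feasible((lo + hi) / 2)
print(p_maxent, entropy(p_maxent))
```

In general one would use Lagrange multipliers or a convex solver; the
hand-derived parametrisation just keeps the sketch free of dependencies.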
Another approach is Bayesian: use Bayes' rule! The conditional
probability for any measurement outcome y given the state x is f(y|x).
Use counting measure on X as our "prior", and use Bayes' law to
update this prior when we do an observation and get the outcome y.
In fact, let's imagine that we keep doing observations, getting
outcomes distributed according to the probability measure q, and
use these to keep updating our probability measure on X. In the
limit of infinitely many observations, this measure should converge
(almost surely) to some fixed measure on X.
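
Read literally, the updating procedure can be sketched as below. This
is one possible reading and not a definitive formulation: it assumes
the state is the same unknown x at every observation, normalises
counting measure to a uniform prior, and uses a hypothetical table f
and measure q. The names update and belief and the random seed are
likewise choices made for this sketch.

```python
import random

f = [[0.9, 0.5, 0.1],    # f[y][x] = f(y | x), hypothetical numbers
     [0.1, 0.5, 0.9]]
q = [0.46, 0.54]         # distribution of the observed outcomes

def update(belief, y):
    """One application of Bayes' law:
    belief(x) <- belief(x) * f(y|x) / normaliser."""
    w = [b * fyx for b, fyx in zip(belief, f[y])]
    total = sum(w)
    return [v / total for v in w]

rng = random.Random(0)                      # fixed seed for repeatability
belief = [1 / 3, 1 / 3, 1 / 3]              # normalised counting measure on X
for y in rng.choices([0, 1], weights=q, k=5000):
    belief = update(belief, y)

print([round(v, 6) for v in belief])
```

Under this reading the weights pile up on the single state x whose
distribution f(.|x) is closest to q in relative entropy. A formulation
that puts the prior on measures p over X, not on X itself, would behave
differently.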
Do we get the same answer using both approaches?
If so, we might say we had a "derivation" of the maximal entropy
principle from Bayes' law... or vice versa.
Feel free to fix my question; my description of the Bayesian
approach is pretty sketchy, and I may have something screwed up. 
