Daniel Shawcross Wilkerson

8 and 15 October 2006, 26 January 2009

Information Entropy is not often explained well, yet it is quite straightforward to understand and that understanding can be quite helpful in everyday life. Here we explain Information Entropy from scratch using simple mathematics and examples from everyday life, in particular deriving from first principles the best method of playing the game "Twenty Questions".

There is a popular game called Twenty Questions that works like this. One person is the "Knower" and picks a point out of the probability space of all objects (thinks of an object). The other is the "Guesser" and asks the Knower to evaluate various random variables (questions) at that point (answer the questions of the object). The Guesser wins if he can guess the object the Knower is thinking about by asking at most Twenty Questions.

Here is a stupid way to play the game:

Knower: I got one. Guesser: Is it my left big toe? Knower: No. Guesser: Is it the Washington Monument? Knower: No. Guesser: Is it Aunt June? ...

Anyone who has played this game can do better. But can we examine rigorously how, and say precisely by how much, we can do better?

When my telephone rings, I do the experiment of answering it: sometimes its a hot girl; sometimes it is a telemarketer. Even if we know which is more likely, we never know for sure. I called a girl last night and she immediately exclaimed "we were just talking about you!"; we are quite surprised when the measurement of picking up the phone has even the slightest predictability, such as being the person about whom we were just speaking. Picking up the phone is quite an interesting experiment to do.

On the other hand, there are also much less interesting experiments: we basically know how they are going to turn out. When I turn the key in my car, it starts. I have a well-maintained car and a great mechanic so it doesn't surprise me when my car starts. It isn't interesting enough to even enter into a discussion with another person when making plans with them: I don't say "I'll be there at 8pm, as long as my car starts"; I take it completely for granted. Starting my car is not a very interesting thing to do.

Further we rely on our ability to predict the difference between interesting and un-interesting experiments. In particular, we rely on most of our day consisting of rather un-interesting experiments, such a taking a step, consulting our pay-stub, or greeting a friend. If all of these events were suddenly to be come interesting, we would be in trouble. What if it was a coin-flip whether my car starts, the floor is there, my paycheck comes, or my old friend recognizes me? Life would be unlivable. We are careful to know when an interesting experiment is coming: we put our attention on it, we plan for the various eventualities, it punctuates the otherwise more predictable parts of our life.

While we may consider some things in life to be predictable or known and others random, there is in fact no fundamental separation. Each moment is an experiment. Is there a way to measure which experiments are more interesting and which are less?

A Random Variable it is a mathematical abstraction for modeling what we call in the real-world a "measurement", "observation", or "experiment". The definition that mathematicians give is that it is a function of a set you don't know anything about called the "probability space" (they add a few other technical conditions that basically just make the math work). You could think of the probability space as the complete state of the world beyond your knowledge.

That is, if my phone rings, to me whoever is at the other end of the line is random: although it might more often be one person than another, I really don't know. But to someone who has a god-like knowledge of the complete state of the universe, they know before I answer that it is my mom (forget the problems resulting from quantum mechanics for the moment). Another way to think about it is that mathematics is deterministic, so for us to be able to apply mathematics to this question, we have to squeeze all the randomness down into one thing, so we put it into the mysterious and unknowable "probability space" and then think of our random variables as being deterministic functions of it.

A "bit" is a binary decision between two things. If you ask for a
milkshake and the waiter says "chocolate or vanilla" in theory he just
wants one bit of information from you. He might ask "whipped
cream?"; he wants another bit. "Sprinkles?"; another bit. But notice
that these two possibilities and another two and another two actually
multiply for a total of eight. That is, the total number of
possibilities is two to the power of the number of bits; here `2^3
= 8`:

- chocolate no whipped cream no sprinkles,
- chocolate no whipped cream sprinkles,
- chocolate whipped cream no sprinkles,
- chocolate whipped cream sprinkles,
- vanilla no whipped cream no sprinkles,
- vanilla no whipped cream and sprinkles,
- vanilla whipped cream no sprinkles,
- vanilla whipped cream and sprinkles,

When a random variable has a particular value this is called an "event". That is, the set of possible world-states where the person calling me on the phone is my mom is an event; when we are in one of those worlds we say the "mom is calling" event has occurred. From the perspective of all possible worlds at once we say the subset of the probability space where it is mom that is calling is the event.

Note that this event really is a subset of the possible worlds space or probability space. When mom is calling, the number of birds nesting in the Campanile tower on the Berkeley campus could be two, or it could be zero; it doesn't matter, because either way my mom is calling. So an event consists of a whole subset of possible worlds. Only some of the information about a given world is relevant: its membership in the event-set; other information is not relevant.

If we look at all possible values of a random variable, we notice that their events are a "partition" of the probability space. That is, any possible world is in exactly one of them: the events cover the space and also do not overlap.

It is helpful to partition the idea of a "measurement" of a random variable into two steps:

- Pick a point in the probability space.
- Compute the random variable's function of that point and return its value.

Now, note that a random variable can be thought of as an information-losing filter for the state of the world observed in step 1. That is, when we pick a point from the probability space, we do not get to find out all of that information about that point; instead we only get to find out the value of the random variable function when computed on that point. The particular value we get out of the random variable tells us that an event has occurred, but all we know from that is that the state of the world is one of the many points in that event-set, the set of points that would give that value. That is, observing only the value of the random variable, some information about the state of the world has been lost.

Suppose you play a game where you flip a fair coin and if it comes up heads you win a dollar and if it comes up tails you win nothing. How much do you "expect" to win?

I basically expect to win fifty-cents. Now, that's a bit odd, because I can't actually win fifty-cents! However, what we mean is "expect on average of identical independent games played in the long run". We discuss further what that means exactly below, but for the moment your intuition will work rather well here.

The number fifty-cents is called the "expected value". We computed it as follows.

- Consider each possible event;
- Compute its value and its probability;
- Multiply the value and probability;
- Add those up.

That is in the previous example we did this: (1/2 * $0) + (1/2 * $1) = $0.5.

It can be shown in our mathematical model called probability theory that the amount you actually win of independent games played in the long run is in fact fifty-cents per play on average. Such results are called "Limit Theorems", the famous one being the Central Limit Theorem. These are beyond the scope of this essay but most people don't find this understanding of expectation so unintuitive, so we rely on intuition for our purposes here.

In general for a random variable that has a value that is a number (so we can add and multiply it), the expected value is the sum for each outcome of the product of the probability of the outcome and the value of the outcome. This is a useful number because in the long run, you do tend to get what you expect. If we call our random variable R, we get the following formula.

Expected Value of R = Sum {over x} Prob(R = x) * Value(x)

Now we come to the main question: can we measure how much information is preserved by a random variable? Another way to say it is, how interesting is that random variable? If a random variable told us the whole state of the world, that would be pretty interesting; we would get lots of information. However if a random variable always said '0', that would be pretty boring. How do we measure this interestingness?

Imagine a one meter wide by 10 cm high board tiled with colored rectangles. Suppose our complete state of the world, or probability space, is a point somewhere on this board. Our random variable is the color of the region that the point is in.

+------+------+------+------+------+------+------+------+ | | | | | | A | B | C | D | | | | | | +------+------+------+------+------+------+------+------+

Information about the point's vertical position has no effect on the outcome, so we just ignore it; it is part of the information that is irretrievably lost. However, we do get some information about its horizontal position from the random variable output: the color.

Let's say that the horizontal position of the point is represented by a string of zeros and ones as follows. If the point is on the left half, the first digit in the string is zero, otherwise if it is on the right half, the first digit is one. Once we are restricted to only one half of the board, we can repeat this process of naming which half of the half we are on and so on. We may pin down the position of the point as narrowly as we like.

In mathematics one just assumes that this process can continue forever, but at small enough scales, quantum mechanics intervenes. Let's imagine we go out to ten digits, so or resolution would be 2^-10 m which is about 10^-3 = one thousandths of a meter or one millimeter. So in reality a point is picked at the millimeter granularity, one in about a thousand possibilities, but we only get to find out one of four possibilities. Information is clearly lost.

Further, the information we get when we do get a color is non-uniform. Let's consider each color and see how many bits of the real ten bit string we can recover.

- D: all we find out is that the point is somewhere on the right. This just tells us that the first bit of the point is '1'. Only one bit of information is preserved.
- A: we find out that the first bit is '0' and the second is '0'. Two bits preserved.
- B: the first bit is '0', the second '1', the third '0'. Three bits.
- C: the first bit is '0', the second '1', the third '1'. Again, three bits.

Wow, sometimes we can get a whole three bits of information! Recall however that we can't control a random variable: you never know which point from the probability space is going to happen. What we really want to know is when played independently in the long run how much information do we expect to get on average?

Have you ever heard "he has a six-figure income"? If you have ever
referred to the magnitude of someone's income by the number of figures
it has then you already understand logarithm. Basically, every time
the number of figures in someone's income goes up by one, we know that
they make ten *times* more money. That is, their income is an
*exponential* function of their digits. Reversing it, their
digits are a *logarithmic* function of their income. That's
all there is to the logarithm.

Thelogarithm base 10 of x, also writtenlog_10(x), just means how many 10's you have to multiply together to get x.

For example, log_10(100) = 2, as two 10's multiplied give 100.
Further, log_10(10) = 1, because you only need one 10 to make a 10
(duh!). It is important to realize that log_10(1) = 0: if you want to
multiply things together to get one, there is nothing to do! When I
was a little kid that one drove me crazy -- I ran around the house
yelling at anyone who would listen: "No! You can't get
*something* from *no* multiplies at all!" It was
straightforward once I realized that, in the context of
multiplication, a sequence of multiplies "starts at" one, the
multiplicative identity, rather than zero, the additive identity, so
you get a one "for free" as it were.

What if x is not a power of 10? Well first notice that if you
multiply something by 3 twice, you almost get 10 -- that is the square
root of 10 is 3.16.... So if you have to multiply by 3.16 then that
only counts as multiplying by *half of a 10*, so the logarithm
of the product only goes up by 1/2. That is:

log_10(31.6) = log_10(10 * 3.16)... but the number of 10s we need to multiply together to get the

= log_10(10) + log_10(3.16) = 1 + 0.5 = 1.5.

Above, 10 is the *base* of the logarithm, but we could have
just as easily used another number, such as 2, and we would have
written log_2(x) instead. You get the idea.

To compute expected information we need to know the information value of each color-event, which we computed in the previous section, and its probability. Now note for a moment how we pick the point on the board: while some random variables may be non-uniform, the events may have different probabilities, the points in the probability space are always of the same value. That is, the probability of an event is measured as the size of the part of the probability space that results in that event.

Therefore, when considering the probability of a color-event we can just consider what percentage of the board is covered by that color. Let's consider each color again and note the probability of its occurrence.

- D: all we know is that the point is somewhere on the right. This is half the board so there is a 1/2 chance of this.
- A: one fourth of the board is covered, so the chance is 1/4.
- B: takes one eight of the board, so the chance is 1/8.
- C: again takes one eight of the board, so again the chance is 1/8.

Noticing this, we can finally compute the expected number of bits of information we get about the point on the board just from observing its color.

Expected Value of Color = (1/2 * 1) + (1/4 * 2) + (1/8 * 3) + (1/8 * 3) = 1.75

But wait, this looks very similar to the computation of information that we did above. Each time a region of color got smaller by one half, two things happened together:

- The number of bits of information we got when it did happen went up by one.
- The chance that it did happen went down by a factor of 1/2.

That is the information of an event x is the logarithm of one over its probability.

Information(x) = log_2 (1 / Prob(R = x) )

So in general the expected information or "entropy" of a random variable is the same formula as the expected value with the Value filled in with the Information.

Entropy of R = Expected Information of R = Sum {over x} Prob(R = x) * Information(x) = Sum {over x} Prob(R = x) * log_2 (1 / Prob(R = x) ) = Sum {over x} - Prob(R = x) * log_2 (Prob(R = x)) = - Sum {over x} Prob(R = x) * log_2 (Prob(R = x))

You should now be able to articulate clearly what is wrong with playing Twenty Questions the "stupid way" that we presented in the introduction: the Guesser is picking random variables (questions) that have very low entropy and so it getting very little information on average. Let's see why.

Consider a random variable with only two values with very unequal probabilities; the technical term is "an unfair coin" but I don't want to scare you with technical jargon. Let's say the probability of heads is one in a thousand. We compute the entropy of flipping the coin once as follows:

Entropy = 1/1000 * - log_2(1/1000) + 999/1000 * - log_2(999/1000) = 0.011407757737461138

That is, a very biased yes-no question has very little entropy: you might get lucky and find out a lot of information (and win the game) all at once, but mostly you will just hear a no, which has little information in it (the object is not the Washington Monument).

The best way to play this game is, as most people figure out, to ask questions that are about a fifty-fifty chance of giving you a yes. Computing the entropy of such a question we get

Entropy = 1/2 * - log_2(1/2) + 1/2 * - log_2(1/2) = 1.0

That is, if you ask a one-bit question that gives you a bit every time, you can expect to get one bit of information every time! Hmm. Deep.

It is hard to know what question gives exactly a "50-50" chance of a yes or no; however even if you ask a question that has a 1/4 chance of being yes, the entropy is almost 1:

Entropy = 1/4 * - log_2(1/4) + 3/4 * - log_2(3/4) = 0.811

So even if you are picking questions that have only around a 1/4 chance of getting a "yes" answer, you will come close to asking questions with maximum entropy and you will do well.

If we knew the set of things that people select from when asked to "think of something", we could compute questions that would come very close to giving us a yes or no; we could do even better if we knew the probability of each thing in the set. Someone seems to have done this and has built an electronic toy that does rather well at this game. It figured out "apple" from me.

Having not made such a list the average person must rely on their intuition and experience about what general properties of things are about one half likely to get a "yes" and half to get a "no". A classic one is "Is it bigger than a breadbox?" Now that you know what makes a good question when you think of your own you know how to test them: about half games you play the answer should be yes and about half the time no. Throw out questions that tend to always give the same answer.

Suppose you are getting 1 bit of information each time. After Twenty Questions you can expect twenty bits of information. How many possibilities can you distinguish between?

2^20 = 2^10 * 2^10 =~ 10^3 * 10^3 = a thousand thousand = a million

Even if your questions are not so good and you expect only 0.811 bits, computing another way you have

2^(20 * 0.811) = 76331.98 > 75 thousand

Seventy-five thousand is likely larger than the universe of
different objects that your little sister can think of.

There is much more we could do with the Twenty Questions problem. For example, once you know something about the goal object, you can ask better questions: if you find out that the object is inorganic, there is no point in asking if it reproduces sexually or asexually as only organic things reproduce. The relative entropy of a question is the entropy given what you also already know. Maximizing relative entropy at each step results in an end-to-end process called a decision tree.

The precise study of how to make decisions with only incomplete information is deep. Before modern computers, the calculations that could be done were limited; the field of study that resulted is Statistics. However now, with orders of magnitude more computational power, we can do a much better job; we now have methods that in some situations do better than people. For example in constrained circumstances computers can already do better at determining which experiment is most likely to yield the most information given what is already known.

[New Scientist, 14 January 2004 "Robot scientist outperforms humans in lab"] An intelligent robot that could free genomics researchers from routine lab chores has proven as effective as a human scientist. The robot not only performs genetics experiments, it also decides which ones to do, interprets the results and comes up with new hypotheses.

This new field is called Statistical Artificial Intelligence or Machine Learning. One day machines designed from these methods will cure your illness, drive your car, and design your house, maybe even tell you jokes. We aren't there yet, computers can hardly see people or take dictation, but the progress is inexorable.

In the process we are learning just how difficult everyday tasks done by people really are. We easily see objects, but it is a really non-trivial engineering task to write a program to just reliably see a corner of a shape in a photograph taken under real-world conditions [Harris's algorithm]. Now that I know how hard it is to get a machine to see a face or hear a word, I realize that it is hard for my brain as well. I used to be unhappy when I made such mistakes, but now I just see the algorithm making choices given what it knows in the time it has. So one surprising and helpful effect this discipline has had on me is just to have more compassion for the mistakes of myself as human being.

© Copyright 2006 Daniel S. Wilkerson