Information theory and Entropy – Tom Carter – Summary


A. How to measure complexity: as the amount of unpredicted information

We want to be able to compare two systems and say that system A is more complex than system B.

Various approaches to measuring complexity have been proposed, among them:

1. Human observation and (subjective) rating

2. Number of parts or distinct elements (what counts as a distinct part?)

3. Dimension (measured how?)

4. Number of parameters controlling the system

5. Minimal description (in which language?)

6. Information content (how do we define/measure information?)

7. Minimal generator/constructor (what machines/methods can we use?)

8. Minimum energy/time to construct

My first focus will be on measures related to how surprising or unexpected an observation or event is. This approach has been described as information theory.

B. Deriving a definition of information: I(p) = log(1/p)

We would like to develop a usable measure of the information we get from observing the occurrence of an event having probability p.

We will want our information measure I(p) to have several properties (note that along with each axiom is the motivation for choosing it):

1. We want the information measure to be a non-negative quantity: I(p) ≥ 0.

2. If an event has probability 1 (it is certain that it will occur), we get no information from the occurrence of the event: I(1) = 0.

3. If two independent events occur (whose joint probability is the product of their individual probabilities), then the information we get from observing the events is the sum of the two informations: I(p_1 * p_2) = I(p_1) + I(p_2). (This is the critical property . . . )

4. We will want our information measure to be a continuous (and, in fact, monotonic) function of the probability (slight changes in probability should result in slight changes in information).

From these axioms we can therefore derive the following:

1. I(p^2) = I(p * p) = I(p) + I(p) = 2 * I(p)

2. Thus, further, I(p^n) = n * I(p) (by induction . . . )

3. I(p) = I((p^(1/m))^m) = m * I(p^(1/m)), so

I(p^(1/m)) = (1/m) * I(p), and thus in general I(p^(n/m)) = (n/m) * I(p)

4. And thus, by continuity, we get, for 0 < p < 1 and a > 0 a real number: I(p^a) = a * I(p)

From this, we can derive the nice property: writing p = (1/b)^x (so that x = −log_b(p)), the scaling property gives I(p) = x * I(1/b), and choosing the base b so that I(1/b) = 1 yields:

I(p) = −log_b(p)

This will be our definition of the information measure:

I(p) = log_b(1/p)

[This means that the information contained in an event is inversely related to its probability: the more probable the event, the less information its occurrence carries.]
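As a quick illustration of this definition, here is a minimal sketch in Python (the base-2 logarithm is chosen so that information is measured in bits, and the function name `information` is just an illustrative choice):

```python
import math

def information(p):
    """Self-information I(p) = log2(1/p), in bits, of an event with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return math.log2(1.0 / p)

print(information(1.0))    # 0.0  -- a certain event carries no information
print(information(0.5))    # 1.0  -- a fair coin flip carries 1 bit
print(information(0.125))  # 3.0  -- rarer events carry more information

# Additivity for independent events: I(p1 * p2) = I(p1) + I(p2)
p1, p2 = 0.5, 0.25
print(math.isclose(information(p1 * p2), information(p1) + information(p2)))  # True
```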

C. Defining entropy as the average information per symbol

Suppose now that we have n symbols {a_1, a_2, . . . , a_n}, and some source is providing us with a stream of these symbols. Suppose further that the source emits the symbols with probabilities {p_1, p_2, . . . , p_n}, respectively. For now, we also assume that the symbols are emitted independently (successive symbols do not depend in any way on past symbols). What is the average amount of information we get from each symbol we see in the stream?

In N (independent) observations, we will see each symbol a_i approximately N * p_i times, so we will get total information I of:

I = ∑_i (N * p_i) * log(1/p_i)

But then, the average information we get per symbol observed will be:

I/N = (1/N) ∑_i (N * p_i) * log(1/p_i) = ∑_i p_i * log(1/p_i)

This will be our (actually Shannon's) definition of entropy: the average information per symbol in a stream of symbols, or the expected value of the information:

H(P) = ∑_i p_i * log(1/p_i)
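As a sketch of this definition (Python, base-2 logarithm so entropy comes out in bits; the function name `entropy` is an illustrative choice), the entropy is just the probability-weighted average of the per-symbol information:

```python
import math

def entropy(probs):
    """Shannon entropy H(P) = sum_i p_i * log2(1/p_i), in bits.

    `probs` holds the symbol probabilities {p_1, ..., p_n}; terms with
    p_i == 0 contribute nothing (p * log(1/p) -> 0 as p -> 0).
    """
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0   -- a fair coin: 1 bit per symbol
print(entropy([0.99, 0.01]))  # ~0.08 -- a heavily biased coin: far less than 1 bit
```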

D. What is the maximum amount of unpredicted information that a system may have?

Using the Gibbs inequality to find the maximum of the entropy function above, we get:

0 ≤ H(P) ≤ log(n)

That is, the maximum of the entropy function is the log of the number of possible events, and occurs when all the events are equally likely.
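A quick numerical illustration of this bound (a sketch in Python; the `entropy` helper from the previous sketch is repeated so the snippet stands alone, and the example distributions over n = 4 symbols are arbitrary):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: sum_i p_i * log2(1/p_i)."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

n = 4
print(math.log2(n))                       # 2.0   -- the upper bound log(n)
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0   -- uniform distribution attains the bound
print(entropy([0.7, 0.1, 0.1, 0.1]))      # ~1.36 -- non-uniform: strictly below the bound
print(entropy([0.97, 0.01, 0.01, 0.01]))  # ~0.24 -- nearly certain: close to 0
```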

An example illustrating this result: how much information can a student get from a single grade? The maximum information occurs when all grades are equally likely (e.g., in a pass/fail class, on average half the students should pass if we want to maximize the information given by the grade).

The maximum information the student gets from a grade will be:

Pass/Fail (2 grades): log(2) = 1 bit

A, B, C, D, F (5 grades): log(5) ≈ 2.3 bits

A, A-, B+, . . ., D-, F (12 grades): log(12) ≈ 3.6 bits.

Thus, using +/- grading gives the students about 1.3 more bits of information per grade than without +/-, and about 2.6 bits per grade more than pass/fail.
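The grade figures above are just the base-2 log of the number of equally likely grades; a quick check (Python, using the 12-grade count for the +/- scheme given above):

```python
import math

# Number of equally likely grades in each scheme.
schemes = {
    "Pass/Fail": 2,
    "A, B, C, D, F": 5,
    "A, A-, B+, ..., D-, F": 12,
}

for name, n in schemes.items():
    print(f"{name}: {math.log2(n):.1f} bits")
# Pass/Fail: 1.0 bits
# A, B, C, D, F: 2.3 bits
# A, A-, B+, ..., D-, F: 3.6 bits
```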

E. Other characteristics of this definition

1. These definitions of information and entropy may not match with some other uses of the terms. For example, if we know that a source will, with equal probability, transmit either the complete text of Hamlet or the complete text of Macbeth (and nothing else), then receiving the complete text of Hamlet provides us with precisely 1 bit of information.

2. It is important to recognize that our definitions of information and entropy depend only on the probability distribution. In general, it won’t make sense for us to talk about the information or the entropy of a source without specifying the probability distribution.

3. This observation (almost 🙂) accords with our intuition: two people listening to the same lecture can get very different information from it. For example, without the appropriate background, one person might not understand anything at all, and would therefore have as a probability model a completely random source; that person would get much more information than a listener who understands quite a bit, can anticipate much of what goes on, and therefore assigns non-equal probabilities to successive words.

[…]