The Mathematical Theory of Information
Chapter 1: About Information

Question 1: Can information be measured?

Answer: Yes, by any measure not violating the Law of Diminishing Information introduced in this work.

1.1 Introduction

Information is playing an increasing role in our industrialized society. A technical overview of the flourishing electronics industry stated in 1987: "On almost every technology front, the driving force behind new developments is the ever-rising demand for information. Huge amounts of data and information, larger than anyone ever dreamed of a generation ago, have to move faster and faster through processors and networks, then end up having to be stored" {Electronics,1987, p.83}.

Four years later the Industrial age had already given way to the Information age: "In 1991 companies for the first time spent more on computing and communications gear-the capital goods of the new era-than on industrial, mining, farm, and construction machines. Info tech is now as vital, and often as intangible, as the air we breathe, which is filled with radio waves. In an automobile you notice the $675 worth of steel around you, but not the $782 worth of microelectronics" {Stewart, 1994, p.55}.

The space industry provides another example, where man reaches outside the earth merely to gather information, not to mine silver on the moon. But also the more down-to-earth beet sugar industry values information: "Information has become a highly valued economic category, on par with capital and skilled labor. The collection, structuring and processing of information consequently constitute one of the focal points of business, and it is the task of process control engineering to provide adequate support for the production process in this area. The structuring of information is an essential precondition for the optimal conduct of technical processes" {Drathen, 1990, p.631}.

This book is a result of theoretical interest awakened during practical work with information engineering. I myself, as a manufacturer of process refractometers, make my living out of supplying information transmitters. A process refractometer is mounted in an industrial process pipe, and transmits an electrical signal proportional to the concentration of the liquid in the pipe. A transmitter does not cause a change in the process, in contrast to other industrial equipment, the purpose of which is to transform or transport the process medium. The sole mission of the process refractometer is to transmit information to the process control system.

Stimulated by the ever-expanding telecommunications sector, much research activity has been focused on information transfer. Claude Shannon laid the foundations of this approach to information in a paper published in the Bell System Technical Journal, July and October, 1948, later reprinted as a book, The Mathematical Theory of Communication, published in 1949. Shannon's theory is based on probability theory; he measured information as entropy, or "how much 'choice' is involved in the selection of an event or how uncertain we are of the outcome" {Shannon, 1949, Section 6}.

Any information measure defined in terms of probabilities is potentially useful in all areas where probability theory is applied: in most sciences and in many of the humanities. Information also appears in different spheres, such as biology, "Life, too, is digital information written in DNA" {Ridley, 2000, p.16}, and physics, "Giving us its as bits, the quantum presents us with physics as information" {Wheeler, 1990, p.7}. Or, somewhat differently, one can attempt to work out "physics from Fisher information" {Frieden, 1998}.

Shannon himself applies the word 'information' only in a descriptive sense to the output of an information source, and he stays resolutely within the framework of telecommunications, using the title communication theory. His followers renamed the theory 'information theory', and now it is too late to revert to the name given by Shannon. Instead, we will refer to it as the classical information theory.

There are excellent textbooks on classical information theory, aimed at students of telecommunications or computers, which also contain a wealth of material useful to others. Yet they are oriented to specific applications, and the topics are not connected by an underlying general theory.

Here we will propose a general theory of information, which will unify the seemingly disparate topics collected in this book. The general theory is based on the Law of Diminishing Information as an additional axiom of probability theory. We will show that this Law follows directly from Shannon's conclusions concerning noisy channels, extrapolated to non-linear systems.

Further, the classical information theory still contains assumptions concerning information that are specific to telecommunications, but not always explicitly acknowledged as such. In a context where they do not apply, such hidden assumptions may lead a biologist or a physicist astray. In this book we will try to make those assumptions visible, faithfully following Shannon’s reasoning, but relaxing the conditions characteristic of communications engineering.

The mathematics is kept at an undergraduate level in science. Hence, the book can be used as class material for an introductory course in information theory. For engineering students, it can be used as material complementary to traditional courses in information technology.

The book can help a scientist, and also a curious information technologist, to grasp the idea of information. All this may even justify the rather pretentious title: The Mathematical Theory of Information.

1.2 Information @bout

A fundamental but somewhat forgotten fact is that information is always information about something. The Oxford dictionary {Oxford, 1974} defines information as:

in·for·ma·tion...on/about,...2 something told; news or knowledge given: ...
Can you give me any information on/about this matter?

A-->[channel]-B-->[receiver]

Fig 1.1 Information transmission.

Information transmission is described by Figure 1.1, which we may visualize as a reader (Receiver) who gets information through a newspaper (Channel) from an article (B) about an event (A): The article (B) gives information about the event (A). A and B can be interpreted in many ways, as anything that carries information: Signals, descriptions, pictures, reports, observations, events. A and B will be called messages or, more exactly, message sets.

The channel can be a newspaper or a telephone, but it can also be your friend (Channel) telling you (Receiver) a story (B) about his achievements (A). For a scientist investigating bugs, his microscope is his information channel. The dictionary {Oxford, 1974} defines:

chan·nel ...4 (fig) any way by which news, ideas, etc may travel: He has secret channels of information.

A mathematical theory of information is naturally concerned with numerical measures of information. The quantity, or the quality, of the information in a message should be expressed as a number on a scale: the higher the number, the better the information.

This number, the information measure, we will here introduce by the definition:

(1.2.1)   inf(B@A) = the information B gives about A

In this context '@' is read as 'about'; @ is seen as an a with a tail 'bout. Alternatively '@' can be read as 'at', a contraction of about. The notation 'inf( )' is borrowed from the semantic information theory {Bar-Hillel, 1964, p.241} and should not, due to the presence of '@', be confused with 'inf' as the abbreviation of 'infimum'.

For a property, in this case informativity (1.2.1), to be measured on a one-dimensional numerical scale, transitivity is required {Raiffa, 1970, p.76}: if X gives more information than Y, and Y gives more than Z, then X gives more information than Z.

Further, in the words of Walter Nernst (translated): "... the scientist must always strive to put his propositions in a form that can be numerically expressed, that is, the qualitative concepts must have a quantitative foundation. A description of a phenomenon is either incomplete or misleading, the communication to others of an observation is aggravated, when the order of magnitude of the observed effect is missing" {Nernst, 1909, p.5}.

1.3 A Day at the Races

On a day at the races we will learn an essential point: what kind of information measure to choose depends on the application.

Betting is arranged for a race between two horses of equal swiftness, 'Zero' and 'Uno'. A Gambler can bet his dollar on either horse. If it is the right horse, the Gambler wins one dollar, but if it is the wrong horse he loses one dollar. The Gambler then has a 50% chance to win $1.00 and a 50% chance to lose $1.00. Thus his expected gain is 1/2×$1.00 - 1/2×$1.00 = ±0. This can be interpreted statistically by saying that after a day of many similar bets, the Gambler can expect to have neither a gain nor a loss.

Now, imagine that before the race an Insider gives the Gambler a tip (T) about the result (R). How large is inf(T@R) = the information value of the tip about the result?

From the Gambler's point of view, the information value of the tip is the same as his expected gain. If the tip is 100% sure (T = T_100%) the Gambler gains a sure dollar and thus

(1.3.1)   inf(T_100%@R) = $1.00

If a tip is only 75% sure (T = T_75%) the Gambler can expect to win 3 out of 4 races. Then the tip T about the result R gives the Gambler an expected gain

(1.3.2)   inf(T_75%@R) = 3/4×$1.00 - 1/4×$1.00 = $0.50

If the Gambler has to pay for the tip, then the value inf(T@R) is the upper limit for what is profitable for him to pay.
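
The Gambler's valuation in (1.3.1) and (1.3.2) can be sketched in a few lines of Python. The function name tip_value and the stake parameter are illustrative assumptions, not notation from the text; the sketch also assumes an even-money bet in which the Gambler always follows the tip.

```python
from fractions import Fraction

def tip_value(reliability, stake=Fraction(1)):
    """Expected gain in dollars of an even-money bet that follows the tip.

    reliability is the probability that the tip names the winner
    (an assumed parameterization of 'how sure' the tip is).
    """
    return reliability * stake - (1 - reliability) * stake

print(tip_value(Fraction(1)))      # the sure tip of (1.3.1)
print(tip_value(Fraction(3, 4)))   # the 75% tip of (1.3.2)
print(tip_value(Fraction(1, 2)))   # a coin-flip tip carries no value
```

Under these assumptions the value grows linearly from zero at 50% reliability to the full stake at 100%.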

We will now study another aspect, and look at the case as a pure communications engineering problem. An Engineer could be employed to design a data transmission line from the Insider to the Gambler. The only purpose of the line is to transmit the Insider's tip. From the Engineer's point of view, the amount of information in the tip is just a question of how large a capacity is required of the transmission line.

In this case the line needs only to transmit one symbol with two alternatives. Perhaps the Engineer instructs the Insider to wave a flag, either flag 0 for 'Zero wins' or flag 1 for 'Uno wins'. A symbol in a symbol set having only two alternatives, like 0 and 1, is called a bit.

The Engineer ignores the semantic aspects, he looks at the tip as just a message to transmit. In the language of the Engineer the tip contains exactly one bit of information,

(1.3.3)   inf(T@R) = 1 bit

because the transmission requires only one 0/1 symbol.

From the perspective of the Gambler the information value of the tip depends on its reliability: 100% sure means a dollar (1.3.1), 75% sure means 50 cents (1.3.2), 55% sure means only a dime. From the perspective of the Engineer, the tip contains exactly one bit of information (1.3.3), independently of its reliability.

Conclusion: It is a matter of perspective whether the information should be measured in dollars or in bits.

1.4 Bits and Entropy

In digital data processing and communications, the fundamental unit for information transmission and storage capacity is a bit, or the derived unit byte, typically 8 bits. We will now modify the race track example in the preceding section to study the communication requirements when the tip contains the predicted result of four consecutive races between the horses 'Zero' and 'Uno'. If the Insider's tip of the winners list happens to be

Race 1: Zero Race 2: Uno Race 3: Uno Race 4: Zero

he waves the flag 0110. To be able to signal all potential messages, he needs W = 16 flags, where W is the number of possible combinations.

0000    0001    0010    0011    0100    0101    0110    0111

1000    1001    1010    1011    1100    1101    1110    1111

Each flag above contains a string of 4 bits. It is customary and more convenient {Shannon, 1949, Introduction} to measure the amount of information as a 'number of bits' instead of a 'number of possible combinations'. The two numbers are mathematically connected; W = 2ⁿ where the positive real number n = number of bits, or

(1.4.1)   n = log₂(W)

where n does not have to be an integer. A flag from the collection above transmits log₂(16) = 4 bits of information. We divide by four to get the information per race and arrive at the same value as (1.3.3)

(1.4.2)   inf(T@R) = (1/4)log₂(16) = 1 bit

The key concept in communications engineering is compression, i.e. shortening messages or data without sacrificing information. If the Insider repeats a pattern, the message can be compressed. If the Insider is invariably of the opinion that 'Uno' will win once, and only once, in every set of four races, he needs only four flags

0001    0010    0100    1000

or the number of possible combinations W = 4. The information is then log₂(4) = 2 bits per 4 races. Thus the Engineer can design a sufficient flag collection based on only two bits per flag (representing 0,1,2,3 in binary form):

00  01  10  11

We can now see that the string of bits in a flag has been compressed from 4 bits to 2 bits for four races. The information per race is now

(1.4.3)   inf(T@R) = (1/n)log₂(W) = (1/4)log₂(4) = 1/2 bit

Now we increase the number of bits n while keeping the same proportion between 0 and 1. Let k represent the number of the symbol 1 in a sequence. To begin with, take n = 8 and k = n/4 = 2. Sequences that qualify are e.g.

00010100    01010000    00001100    00100001    . . . .

We count all the possible sequences and we get W = 28. Hence

(1.4.4)   inf(T@R) = (1/n)log₂(W) = (1/8)log₂(28) ~ .601 bit

In case n = 32 and k = n/4 = 8, the number of bits needed per symbol of the sequence is

(1.4.5)   inf(T@R) = (1/n)log₂(W) = (1/32)log₂(10,518,300) ~ .729 bit

We can see that with increasing sequence length n, the constraining pattern dissolves and, in proportion, more bits are needed to transmit the sequence. The general form for how (1/n)log₂(W) depends on n and k is given by elementary combinatorial analysis

(1.4.6)   (1/n)log₂(W) = (1/n)log₂(n choose k) = (1/n)log₂(n!/(k!(n-k)!))

where n! reads 'n factorial', so that e.g. 5! = 1×2×3×4×5. The points in Figure 1.2 represent (1.4.6) when k = n/4.
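
The counts used in (1.4.3) to (1.4.5) can be reproduced with a short Python sketch. The function names are illustrative assumptions; the brute-force enumeration is included only as a cross-check of the combinatorial formula for small n.

```python
from itertools import product
from math import comb, log2

def bits_per_symbol(n, k):
    """(1/n) * log2(W) for sequences of n bits containing exactly k ones."""
    return log2(comb(n, k)) / n

def count_by_enumeration(n, k):
    """Brute-force count of W by listing all 2**n sequences (small n only)."""
    return sum(1 for seq in product((0, 1), repeat=n) if sum(seq) == k)

for n in (4, 8, 32):
    k = n // 4
    print(n, k, comb(n, k), round(bits_per_symbol(n, k), 3))

# Cross-check the formula against direct enumeration for n = 8, k = 2.
print(count_by_enumeration(8, 2) == comb(8, 2))
```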

For this figure, see Chapter 1 as pdf file.

Fig 1.2 Number of bits needed (1/n)log₂(W) as a function of the number of bits n in a sequence.

We may ask, when does the entropy enter? In the classical information theory "entropy and relative entropy arise again and again as the answers to the fundamental questions in communication and statistics" {Cover&Thomas, 1991, p.11}.

From Figure 1.2 we can see that (1/n)log₂(W) approaches asymptotically an upper limit when the length n of the sequences grows. For large n, a sequence becomes indistinguishable from a random distribution where the frequency of 1 equals a probability p. In our example p = 1/4. Using the Stirling approximation (7.8.18) for the factorials in (1.4.6) we get the limit

(1.4.7)   lim_{n→∞} (1/n)log₂(n choose k) = -p log₂(p) - (1-p)log₂(1-p)

for k = np. The limit in terms of probabilities -Σ p_i log₂ p_i was called 'entropy' by Shannon and denoted by 'H'. A limit of the same form had earlier been used in thermodynamics, and given the name 'entropy' by Clausius^[1], see Section 7.13.

In our example p = 1/4 and the entropy takes the value

(1.4.8)   -Σ p_i log₂ p_i = -(1/4)log₂(1/4) - (1-1/4)log₂(1-1/4) ~ .811

in agreement with Figure 1.2. That is, the entropy is the Stirling^[2] limit of the number of bits needed to transmit a sequence when the length approaches infinity.
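
The approach to the limit can be checked numerically. The following Python sketch is an illustration under stated assumptions: the function names are invented here, and the factorials are evaluated through the log-gamma function rather than through the Stirling approximation cited in the text, so that large n stays cheap to compute.

```python
from math import lgamma, log, log2

def binary_entropy(p):
    """-p*log2(p) - (1-p)*log2(1-p), valid for 0 < p < 1."""
    return -p * log2(p) - (1 - p) * log2(1 - p)

def bits_per_symbol(n, k):
    """(1/n)*log2(n!/(k!(n-k)!)), with ln(m!) computed as lgamma(m + 1)."""
    ln_w = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return ln_w / log(2) / n

p = 0.25
for n in (4, 32, 400, 4000, 40000):
    print(n, round(bits_per_symbol(n, int(n * p)), 4))
print("entropy", round(binary_entropy(p), 4))
```

The printed values rise with n and stay below the entropy, which is consistent with the asymptotic behaviour described for Figure 1.2.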

For short sequences, however, we may be far from the Stirling limit. For n = 4 and k = 1 we manage with only (1/4)log₂(4) = .5 bits per symbol, compared to the entropy ~ .811. On the other hand, in a random sequence of length n = 4 with p = 1/4, any combination of 0 and 1 can occur, and we need as much as (1/4)log₂(16) = 1 bit per symbol.

The essence of Shannon's communication theory is his insight that the compression of long data strings can be measured by entropy. His followers have, however, elevated his entropy function to a universal information measure. Two of them have even been inspired to an entropy eulogy:

"Shannon's measure is an invention. It was designed to fill a specific need: to provide a useful measure of what is transmitted on a communication channel. It has also been shown to be the only function that satisfies certain basic requirements of information theory. In the 23 years since Shannon put forward his measure thousands of papers have been written on the subject and no one has found a replacement function, or even a need for one. On the contrary, many alternative derivations have been found. We conclude that the Shannon entropy measure is fundamental in information science, just as the Pythagorean theorem is fundamental in geometry" {Tribus&McErvine, 1971, p.180}.

1.5 Hits and Reliability

We will, however, claim that the fundamental measure of information is reliability, i.e. how well a message can be expected to be reproduced: "The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point" {Shannon, 1949, Introduction}. We define:

The reliability rel(B@A) is the probability of guessing the right message A when the message B has been received, and A is the message sent.

The reliability is used by Shannon, but in the form of "the probability of an incorrect interpretation" {Shannon, 1949, Section 14} denoted by 'q', which is equivalent to 1 - rel(B@A). Hence

(1.5.1)   rel(B@A) = 1 - q

The definition of rel(B@A) directly in terms of probabilities is given by (3.6.1). Before the message B has been received, there is already some probability of guessing the right A, i.e. the initial probability of incorrect interpretation is less than 100%, q₀ < 1. To evaluate the value of the received message B, we would rather measure the increase of reliability brought about by B

(1.5.2)   rel_abs(B@A) = (1- q) - (1 - q₀) = q₀ - q

where 'abs' indicates that the scale is absolute; rel_abs(B@A) = 0 if B contains no information about A. For example, if there is an initial 75% chance to make the right guess, 1 - q₀ = 3/4 and the chance remains at 75% after B has been received, 1 - q = 3/4, then no information has been transmitted.

In general, we will stick to conventional notation, but we will use combinations of three letters (such as 'rel' and 'ent') for information measures instead of single letters (such as the corresponding 'q' and 'H'), because we need to distinguish between a larger number of measures. The symbol @ for 'about' is the other exception: classical information theory instead uses a semicolon in its fundamental measure, which is called mutual information and written I(B;A) {Cover&Thomas, 1991, p.18}. This measure will here be written ent(B@A) and defined in (4.9.2). There is a need to stress the potential asymmetry of information: I(B;A) happens to be symmetric, that is, I(B;A) is equal to I(A;B), but other measures of the information B gives about A and A gives about B are not necessarily equal. The Sailor's tale in Section 1.6 will give an example of the asymmetry rel_abs(B@A) ≠ rel_abs(A@B).

In the theory of communication, bits and entropy can be seen as the quantitative measures, and reliability as the qualitative measure. Shannon had no reason to identify reliability as an information measure, because he considered the case of 100% reliability, rel(B@A) = 1, or more exactly: "It is possible to send information at a rate C through the channel with as small a frequency of errors or equivocation as desired by proper encoding" {Shannon, 1949, Section 13}. This result is restricted to messages that are long strings of symbols, e.g. of bits 0011010..., and the rate C is determined by a Stirling limit similar to (1.4.7).

What about shorter strings? Let the message A consist of no more than one bit, either 0 or 1, transmitted through an unreliable channel, rel(B@A) < 1. There is no way of coding to make a single transmission perfect.

Instead, we must turn to another discipline, Pattern Classification, because it addresses the problem of "detecting a single weak pulse, such as a dim flash of light or a weak radar reflection" {Duda,Hart&Stork, 2001, p.48}. Here it is a question of discriminating between two patterns, 'pulse' or 'no pulse'. Expressed in terms of information theory, the transmitted signal A can be either a 0 (absence of a pulse) or a 1 (presence of a pulse). The received signal B is unreliable, but it contains some information about A.

Another place to look for an answer is Statistics, which treats "the problem of classifying or assigning a sample to one of several possible populations" {Kullback, 1997, p.83}.

In Pattern Classification, accuracy is customarily measured in hits, i.e. the number of correct classifications. We can conclude that sometimes information should be measured on an entropy scale in bits, sometimes on a reliability scale in hits. But, as reliability is relevant for messages of any length, we can see the justification in the motto^[3]: HITS BEFORE BITS!

1.6 The Sailor's Tale

Let's imagine a ship sailing across the ocean, and a Sailor looking at the morning sky to get information about how the weather will turn out during the day:

In our model ocean, a storm can be expected every second day, on average. A red sky at morning can be expected every fourth day and that always means a storm. But a storm can appear even if the morning sky is grey. In fact, in our model half of the storms come without the warning of a red sky.

We do not need a computer to simulate these weather conditions. All we need is four cards in a hat, three black cards and one red. For every morning we draw one card: if the card is red, so is the sky, and a storm will follow during the day. If the card is black, the sky is grey and we draw a second time, now from the remaining three cards. Again, the red card means a storm, and either of the two black cards means good weather.
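
For readers who do prefer a computer, the card model can be enumerated exactly. This Python sketch is an illustration only: it assumes that every ordering of the four cards in the hat is equally likely, and the label 'no storm' for good weather is a choice made here.

```python
from fractions import Fraction
from itertools import permutations

cards = ("red", "black", "black", "black")
# All 24 orderings of the four cards, each assumed equally likely.
orders = list(permutations(cards))

joint = {}  # (sky, weather) -> probability
for first, second, *_ in orders:
    if first == "red":
        sky, weather = "red", "storm"
    else:
        sky = "grey"
        # Second draw from the remaining three cards decides the weather.
        weather = "storm" if second == "red" else "no storm"
    key = (sky, weather)
    joint[key] = joint.get(key, 0) + Fraction(1, len(orders))

for key, prob in sorted(joint.items()):
    print(key, prob)
```

The enumeration gives a red sky on one day in four, a storm on one day in two, and half of the storms under a grey sky, as the model ocean requires.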

When the Sailor expects a storm, he shortens sail, but if he is wrong, he will suffer a loss of speed. How often will he be right in his weather prediction if he has observed the morning sky?

In one case out of four he will see a red sky and he can predict a storm with 100% certainty. In three cases out of four he will see a grey sky, and he can predict that there will be no storm with a certainty of 66.6%, or in 2 cases out of 3. Hence we can calculate the reliability of his weather predictions:

(1.6.1) rel(sky@weather) = 1/4×100% + 3/4×66.6% = 75% hits

That is, he can be expected to hit the right weather in 75% of the cases based on his observations of the morning sky.

But, in case he forgot to observe the morning sky, he still has a 50% chance to predict the right weather merely by an arbitrary guess of either 'storm' or 'no storm'. Hence, observation of the morning sky has increased the prediction reliability by

(1.6.2)  rel_abs(sky@weather) = 75% - 50% = 25% hits

That is, the value of the information the morning sky gives about the weather is measured as a 25% improvement of the weather prediction hits.

Now, let's introduce a passenger on the ship. The weather conditions don't bother him. But, as an aesthete, he appreciates the morning sky. It's a pity he is a late sleeper: he never has the opportunity to enjoy this fine spectacle! The best he can do is to try to guess whether the sky had been red or grey in the morning. His only source of information is the weather observed during the day. How much information will the day's weather give about the morning's sky?

In half the cases there is no storm and the passenger knows that the morning sky had been grey. Otherwise, when there is a storm, he has a 50% chance to guess right, no matter whether he guesses 'red' or 'grey'. We can now calculate the expected hit percentage of guessing the sky colour if the weather is known:

(1.6.3)   rel(weather@sky) = 1/2×100% + 1/2×50% = 75% hits

How well would the passenger do without knowing the weather? If he guessed 'grey' every time, he could be expected to make a hit in three cases out of four, as we can see directly from our model: there are three black cards and one red in the hat. Hence his hit rate without information is 3/4×100%. The information the weather gives about the morning sky equals the difference

(1.6.4) rel_abs(weather@sky) = 75% - 3/4×100% = 75% - 75% = 0% hits

That is, knowledge of the weather during the day does not give the passenger any helpful information about the colour of the morning sky.

We have now seen an example of asymmetric information. The information the morning sky gives about the weather is not the same as what the weather gives about the morning sky, rel_abs(sky@weather) ≠ rel_abs(weather@sky). The former information gives a 25% improvement in the hit rate, the latter gives nothing.
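
The two calculations of this section can be repeated from the joint probabilities of sky and weather. The Python sketch below is an illustration under an explicit assumption: the guesser always picks the most probable alternative, which reproduces (1.6.1) to (1.6.4) but is not the formal definition (3.6.1), which lies outside this chapter. The function and variable names are invented here.

```python
from fractions import Fraction

# Joint probabilities of (sky, weather) in the model ocean.
JOINT = {
    ("red", "storm"): Fraction(1, 4),
    ("grey", "storm"): Fraction(1, 4),
    ("grey", "no storm"): Fraction(1, 2),
}

def rel(joint):
    """Hit rate when guessing the second item of each pair from the first.

    For every observation, the most probable target is guessed (assumption).
    """
    best = {}
    for (observed, _target), prob in joint.items():
        best[observed] = max(best.get(observed, 0), prob)
    return sum(best.values())

def rel_abs(joint):
    """Improvement over the best guess made without any observation."""
    prior = {}
    for (_observed, target), prob in joint.items():
        prior[target] = prior.get(target, 0) + prob
    return rel(joint) - max(prior.values())

# Swap the roles: the weather is observed, the sky is guessed.
swapped = {(weather, sky): p for (sky, weather), p in JOINT.items()}

print(rel(JOINT), rel_abs(JOINT))      # sky about weather
print(rel(swapped), rel_abs(swapped))  # weather about sky
```

Both directions reach the same hit rate of 3/4, yet only the first improves on the uninformed guess, which is where the asymmetry shows.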

1.7 The Ideal Receiver

The preceding sections have demonstrated that information can be measured on different scales: in hits, in bits, in dollars. The next question: is there some property that should be common to all the information measures?

To begin with, we imagine two newspapers, one in English and the other in Chinese, both giving the same account of an event. Then, objectively speaking, both newspapers contain the same amount of information about the event. Still, an Englishman may be of the opinion that only the English paper contains information, because he does not understand Chinese. If we allow such subjective judgements, anybody may have any opinion on the information content in a message, and no rules apply. But, what if we want to measure the information content of the message itself, disconnected from the shortcomings of a receiver?

We have already asked "Information about what?". Now we ask "Information to whom?". The true information content of a message can be imagined as the information received by somebody with a complete understanding, who is aptly called the ideal receiver in semantic information theory:

"By an 'ideal' receiver we understand, for the purposes of this illustration, a receiver with a perfect memory who 'knows' all of logic and mathematics, and together with any class of empirical sentences, all of their logical consequences. The interpretation of semantic information with the help of such a superhuman fictitious intellect should be taken only as an informal indication" {Bar-Hillel, 1964, p.224}.

The ideal receiver was conceived by Yehoshua Bar-Hillel and Rudolf Carnap in 1952, but used for illustration purposes only. We will explore the logical consequences of this idealization, leading to the mathematical theory of information.

It must be noted that the fruitful concept of the ideal receiver has some consequences that are problematic: to the ideal receiver, chess is completely trivial, once the rules are known. The same goes for mathematics. The ideal receiver could just start from the axioms and derive all possible theorems. But here is a snag: "Contrary to previous assumptions, the vast 'continent' of arithmetical truth cannot be brought into systematic order by way of specifying once for all a fixed set of axioms from which all true arithmetical statements would be formally derivable" {Nagel&Newman, 1956, p.85}. As a consequence, it is impossible to construct an ideal receiver "who knows all of mathematics". It is Gödel's proof that indicates the limits of the ideal receiver concept (Section 8.10).

The mathematical theory of information is, however, not dependent on the existence of an ideal receiver. The rôle of the ideal receiver is to connect the information measure inf(B@A) to the corresponding ordinary use of the word 'information'.

This is the same as in probability theory, where the mathematical rules for measuring probability were originally designed to correspond to the behaviour of ideal dice, because "the specific questions which stimulated the great mathematicians to think about this matter came from requests of noblemen gambling in dice or cards" {Struik, 1987, p.103}. In this way the probability measure P(a) connects to the ordinary use of 'probability', but the validity of probability theory does not depend on the existence of ideal dice.

1.8 The Law of Diminishing Information

We revert to the Englishman trying in vain to read a Chinese newspaper. When he gets the help of an interpreter, he finds out that after all there is a lot of information in the paper. We can conclude that he is not the ideal receiver (knowing all languages) because his understanding could be improved by an interpreter as an intermediary. This argument can be turned around and used to define an ideal receiver: A receiver is ideal if no intermediary can improve its performance.

I. A-->-[ ]-B-->-[R]

II. B-->-[ ]-C-->-[R]

III. A-->-[ ]-B-->-[ ]-C-->-[R]

IV. A-->-[ ]-B-->-[ ]-C-->

Fig 1.3 The ideal receiver R demonstrates the validity of the principle of diminishing information.

A reasoning in four steps, Figure 1.3, will lead from the ideal receiver to the Law of Diminishing Information:

I. Let R be an ideal receiver, who receives a message B about the event A.
II. Another receiver is constructed by adding a channel between the signal B and R, with the output C as the input to R. This receiver, consisting of an intermediary channel and R in series, cannot be a better receiver of B than R alone, because R is already an ideal receiver.
III. Hence, any intermediary channel making the information from C greater than the information from B, would contradict the premise that R is an ideal receiver. Thereby it would also contradict the principle that the information of a message should be measured as if it were received by an ideal receiver (Section 1.7).
IV. Thus, the information that C gives about A cannot be greater than the information B gives about A:

(1.8.1)   inf(C@A) ≤ inf(B@A) if A__B__C

This is the principle of diminishing information where A__B__C means a transmission chain, Figure 1.3, step IV.

The Law of Diminishing Information reads: Compared to direct reception, an intermediary can only decrease the amount of information.

The Law (1.8.1) will be given a mathematical form in Section 2.7 based on probability theory. In this formulation it will be used as the fundamental axiom of the mathematical theory of information. The Law is the pruning knife of information theory: we will argue that the Law is the necessary and sufficient condition for a mathematical function to be accepted as an information measure, i.e. qualify as inf(B@A).
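
The inequality (1.8.1) can be tried out on a small example before its formal statement in Section 2.7. The Python sketch below is an illustration, not a proof: it assumes the reliability of Section 1.5 as the measure, computed by always guessing the most probable alternative, it reuses the sky and weather probabilities of Section 1.6, and both intermediary channels are hypothetical.

```python
from fractions import Fraction

def rel(joint):
    """Hit rate when guessing A from the observed item of each (observed, A) pair."""
    best = {}
    for (observed, _target), prob in joint.items():
        best[observed] = max(best.get(observed, 0), prob)
    return sum(best.values())

def relay(joint_ba, channel):
    """Joint distribution of (C, A) after B has passed through an intermediary.

    channel[b][c] is the probability that input b comes out as c.
    """
    joint_ca = {}
    for (b, a), prob in joint_ba.items():
        for c, p_c in channel[b].items():
            joint_ca[(c, a)] = joint_ca.get((c, a), 0) + prob * p_c
    return joint_ca

# B = morning sky, A = weather.
joint_ba = {
    ("red", "storm"): Fraction(1, 4),
    ("grey", "storm"): Fraction(1, 4),
    ("grey", "no storm"): Fraction(1, 2),
}

# Hypothetical lookout who misreports the sky one time in five.
lookout = {
    "red": {"red": Fraction(4, 5), "grey": Fraction(1, 5)},
    "grey": {"grey": Fraction(4, 5), "red": Fraction(1, 5)},
}
# Hypothetical intermediary that merely renames the colours.
renaming = {"red": {"rouge": Fraction(1)}, "grey": {"gris": Fraction(1)}}

print(rel(joint_ba))                    # direct reception of B
print(rel(relay(joint_ba, lookout)))    # via the unreliable lookout
print(rel(relay(joint_ba, renaming)))   # via a lossless renaming
```

In this example the unreliable lookout lowers the hit rate and the renaming leaves it unchanged; neither raises it.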

1.9 Shannon's Noisy Channel

"The channel is merely the medium used to transmit the signal from transmitter to receiver. ... During transmission, or at one of the terminals, the signal may be perturbed by noise. This is indicated schematically in Fig. 1..." {Shannon, 1949, Introduction}, here redrawn as Figure 1.4.

For this figure, see Chapter 1 as pdf file.

Fig 1.4 Shannon's first figure.

So, here is also a case of diminishing information: information is corrupted by noise. How does the 'noisy channel' concept connect to the Law of Diminishing Information (1.8.1)? Figure 1.5 shows in three steps how the 'noisy channel' leads to the Law of Diminishing Information (1.8.1):

For this figure, see Chapter 1 as pdf file.

Fig 1.5 From 'noisy channel' to the principle of diminishing information in three steps.

I. This case corresponds to the 'noisy channel' in Shannon's Fig. 1 as it appears graphically in a textbook {Gallager, 1968, p.2}.
II. The Law of Diminishing Information (1.8.1) is concerned only with influences on the output of the channel, i.e. the input to an imagined ideal receiver.
III. As the Law of Diminishing Information (1.8.1) states that an intermediary channel can only decrease the information, we can also think that the intermediary channel is causing the signal to be "perturbed by noise". Hence the noise can be presented in the form of an intermediary channel, and we arrive at the form of the Law (1.8.1).

There is a reason to take step II and exclude the noise occurring "during transmission" and at the input "terminal" of the channel: such noise does not necessarily 'perturb' the signal. Noise can be beneficial too. In nonlinear channels, the information content of a signal may well increase due to noise!

Look at our Sailor tapping his barometer. The disturbance introduced by the tap will remove friction, and the reading will be more accurate. Also spontaneous background noise can be beneficial: "Noise often creates confusion. Try having a telephone conversation while standing on a busy street corner or listening to a radio broadcast riddled with static. Engineers have long sought means to minimize such interference. But surprisingly enough, during the past decade researchers have found that background noise is sometimes useful. Indeed, many physical systems, ranging from electronic circuits to nerve cells, actually work better amid random noise" {Moss&Wiesenfeld, 1995, p.50}. This phenomenon is known as stochastic resonance^[4].
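
The resolution stretching mentioned in footnote [4] can be illustrated with a minimal Python sketch. It is not the circuit from that project: the converter step of one unit, the sawtooth shape of the added disturbance, its amplitude of half a step, and the averaging over one full period are all assumptions made for illustration.

```python
import math

def quantize(x):
    """Ideal converter with a step of one unit: round to the nearest integer."""
    return math.floor(x + 0.5)

def averaged_reading(signal, amplitude, samples=1000):
    """Average the quantized signal over one period of an added sawtooth.

    The sawtooth (an assumed disturbance shape) sweeps evenly from
    -amplitude up to +amplitude.
    """
    total = 0
    for k in range(samples):
        disturbance = amplitude * (2.0 * k / samples - 1.0)
        total += quantize(signal + disturbance)
    return total / samples

signal = 0.3  # constant input lying between two converter levels

plain = averaged_reading(signal, amplitude=0.0)     # no disturbance added
dithered = averaged_reading(signal, amplitude=0.5)  # half a step of disturbance

print(plain, dithered)
```

Without the disturbance every sample is quantized to the same level, so averaging recovers nothing below one step. With the disturbance the average lands close to the true input of 0.3, because the fraction of samples pushed over the next level depends on where the input lies between the levels.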

Consequently, Shannon's first figure (Figure 1.4) can be misleading: it illustrates a case of information transmission typical of electrical communication, where all noise is undesired and comes from everywhere. Still, this figure is repeatedly reproduced without reservations in texts about everything from an introductory essay on information theory for scientists and engineers of all disciplines {Raisbeck, 1963, p.3} to quantum mechanics {Rothstein, 1951, p.105}.

The question about information and noise puzzled the information theorists from the very beginning: "How does noise affect information? Information is, we must steadily remember, a measure of one's freedom of choice in selecting a message. ... Thus greater freedom of choice, greater uncertainty, greater information go hand in hand. If noise is introduced, then the received message contains certain distortions, certain errors, certain extraneous material, that would certainly lead one to say that the received message exhibits, because of the effects of noise, an increased uncertainty. But if the uncertainty is increased, the information is increased... It is therefore possible for the word information to have either good or bad connotations. Uncertainty which arises by virtue of freedom of choice on the part of the sender is desirable uncertainty. Uncertainty which arises because of errors or because of the influence of noise is undesirable uncertainty" {Weaver, 1949, p.18, ed. 1963}.

Based on the Law of Diminishing Information (1.8.1), we can give an answer: uncertainty produced by the sender at the output of the source is desirable; uncertainty introduced at the output of the channel is not desirable; the other uncertainties are definitely ambiguous. Mathematically, but not verbally and graphically, Shannon gives the same answer: he defines the rate of transmission of information (4.1.13) as the difference between the uncertainty produced by the source and the uncertainty at the receiver {Shannon, 1949, Section 12}.

The Law of Diminishing Information has two faces: If a channel B__C is added to the output of a given channel A__B forming a chain A__B__C, then the added channel can be seen either as an intermediary who cannot improve the information, or as a noisemaker who can only corrupt the information. In both cases inf(C@A) <= inf(B@A).
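
The inequality inf(C@A) <= inf(B@A) can be checked numerically for one particular measure. The Python sketch below uses Shannon's mutual information as a stand-in for inf; that choice is an assumption made for illustration, since the Law is meant to cover other measures as well. The prior and the two transition matrices are invented numbers, not taken from the text.

```python
import math

def mutual_information(prior, channel):
    """Shannon mutual information in bits across a discrete channel.

    prior[i] is P(A = i); channel[i][j] is P(output = j | A = i).
    """
    n_out = len(channel[0])
    p_out = [sum(prior[i] * channel[i][j] for i in range(len(prior)))
             for j in range(n_out)]
    info = 0.0
    for i, p_a in enumerate(prior):
        for j in range(n_out):
            joint = p_a * channel[i][j]
            if joint > 0.0:
                info += joint * math.log2(joint / (p_a * p_out[j]))
    return info

def cascade(first, second):
    """Transition matrix of two channels in series (a matrix product)."""
    return [[sum(first[i][k] * second[k][j] for k in range(len(second)))
             for j in range(len(second[0]))]
            for i in range(len(first))]

prior = [0.5, 0.5]                  # hypothetical distribution of A
a_to_b = [[0.9, 0.1], [0.2, 0.8]]   # hypothetical channel A__B
b_to_c = [[0.7, 0.3], [0.1, 0.9]]   # hypothetical added channel B__C

inf_b_at_a = mutual_information(prior, a_to_b)
inf_c_at_a = mutual_information(prior, cascade(a_to_b, b_to_c))

print(inf_c_at_a <= inf_b_at_a)
```

Whether b_to_c is read as an intermediary or as a noisemaker makes no difference to the arithmetic: in both readings it is the same matrix multiplied onto the output of the first channel.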

1.10 Noise and nonlinearities

In the communication models provided by electrical engineering, the position where the noise is introduced can normally be neglected. Here "an important special case occurs when the noise is added to the signal and is independent of it (in the probability sense)" {Shannon, 1949, Section 24}. Moreover, when a linear model is used for the channel, noise added at any point can be mathematically transformed into an equivalent noise added just before the receiver. Hence, any added noise corresponds to an additional channel at the output (Figure 1.4).
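
As a worked illustration of the last two statements, consider the simplest linear channel, one that only multiplies its input by a gain g. The symbols g, N and N' are assumptions introduced here, with N an independent noise added at the input:

```latex
B' = g\,(A + N) = g\,A + g\,N = B + N', \qquad B = g\,A, \quad N' = g\,N .
```

The equivalent noise N' = gN is added to the undisturbed output B, and it is independent of A because N is. The disturbed reading B' is therefore B passed through one further channel, which gives the chain A__B__B' to which the Law (1.8.1) applies.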

In biology, linear models are less useful. Sensory perception provides important examples of beneficial noise, e.g. the saccadic eye movements. "The tremor component consists of a high frequency oscillation of the eye... It is called physiological nystagmus and ... it is indisputably important in the maintenance of vision. The effect of nystagmus is to continually shift the image on the retina... It seems that the receptors cease to respond if a steady image is projected onto them and the physiological nystagmus has the effect of continually calling fresh receptors into service" {Barber&Legge, 1976, p.56}. It may be noted that the modern view is in agreement with the conclusion, but relocates the effect from the 'receptors' to the 'neurons in the retina': the cones and rods function in a DC mode, but they are AC coupled to the subsequent nerves.

Noise is a misleading term in this case because the oscillation is a part of the design. In general, the term 'noise' must be taken in a broad sense, to include any kind of signal, incoherent or regular, introduced by chance or on purpose. In fact, 'noise' means here any added signal, provided it is independent of the information source, as indicated by the graphics in Figure 1.4 and Figure 1.5.

Hearing provides an example of consciously added fluctuations. "For more than a century, teachers of vocal music have stressed the importance of training each singer to produce a vibrato rather than a steady note. This is actually a tremor or warbling change in pitch, over a range of perhaps five to six cycles per second. ...in 1959 scientists at Oxford University discovered that the vibrato is essential in controlling one's own singing voice and keeping to an intended pitch. Without the stimulation of the deliberate variation, the singer's brain does not notice gradual changes in key" {Milne&Milne, 1965, p.53}.

In practice, electrical engineers know about beneficial noise. Many instruments contain by design a built-in source of added fluctuation (which is not necessarily a random noise): "Chopping The act of interrupting an electric current, beam of light, or beam of infrared radiation at regular intervals. This can be accomplished mechanically by rotating fan blades in the path of the beam or by placing a vibrating mirror in the path of the beam to deflect it away from its intended source at regular intervals. ... Chopping is generally used to change a direct-current signal into an alternating-current signal that can be more readily amplified" {Parker, 1989, p.383}.

I. A-->-[Phot]-B-->

II. A-->-[Chop]- - -> -[Phot]-B'-->

III. A-->-[Phot]-->--[Chop]- B'- - ->

Fig 1.6 Effect of a chopper 'Chop' on a photo-cell 'Phot'.

Figure 1.6 will be used to illustrate the difference between a linear and a nonlinear channel. The figure presents an electrotechnical system, but it can as well be thought of as a biological system:

I. The emitted signal is a weak light beam A, the channel is a photocell device 'Phot', and the output is a reading B of a meter.
II. In an effort to improve the sensitivity of the photocell, a chopper 'Chop' is mounted into the path of the light. Depending on the characteristics of the photocell, the fluctuations introduced by the chopper may either increase or decrease the information the meter reading B' contains about the incident light beam A. The variations caused by the chopper can be filtered from the meter reading.
III. Let 'Phot' be a linear channel. Then the chopper does not influence the performance of the photocell, but it causes unwanted fluctuations of the meter readings, amplified or attenuated by a linear factor. Hence, from the standpoint of the receiver, the chopper can as well be replaced by a chopper after the photocell.

In the nonlinear case there is no general rule to tell whether inf(B'@A) will be greater than inf(B@A) or the other way around. In the linear case, the Law of Diminishing Information (1.8.1) tells that inf(B'@A) <= inf(B@A). In a linear channel, noise is indeed never desirable.
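
That a fluctuation added before a nonlinear device can raise the information is easy to show in a toy model. The Python sketch below is a set of assumptions, not the photocell of Figure 1.6: the device is a hard threshold detector, the added fluctuation is Gaussian rather than a chopper, the two input levels are invented, and Shannon's mutual information stands in for inf.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a binary variable that is 1 with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def prob_fire(level, sigma, threshold=1.0):
    """Probability that level plus a Gaussian fluctuation reaches the threshold."""
    if sigma == 0.0:
        return 1.0 if level >= threshold else 0.0
    return 0.5 * math.erfc((threshold - level) / (sigma * math.sqrt(2.0)))

def information(levels, sigma):
    """Mutual information (bits) between equiprobable levels and the detector output."""
    fire = [prob_fire(a, sigma) for a in levels]
    mean_fire = sum(fire) / len(fire)
    conditional = sum(binary_entropy(p) for p in fire) / len(fire)
    return binary_entropy(mean_fire) - conditional

levels = [0.3, 0.7]  # two weak inputs, both below the threshold of 1.0

quiet = information(levels, sigma=0.0)  # no added fluctuation
noisy = information(levels, sigma=0.5)  # fluctuation added before the threshold

print(quiet, noisy)
```

Without the fluctuation the detector never fires, so its output says nothing about which level was sent. With the fluctuation the stronger level fires more often than the weaker one, and the output carries a small but nonzero amount of information. Other levels or a larger sigma can give the opposite result, in line with the remark that the nonlinear case has no general rule.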

1.11 Chains of channels

In spite of the serial form A__B__C appearing in the Law of Diminishing Information (1.8.1), the Law is not about 'information in chains'. In Figure 1.7, the 'Channel' can be of any nature, containing parallel computing, quantum effects or neural networks. The second channel in the chain can be seen as a hypothetical device, introduced to test whether a given information measure conforms to the Law (1.8.1). A physicist would call this device a perturbation {Schiff, 1955, p.151}.

A-->-[Channel]-B-->-: Im :- C - - >

Fig 1.7 Channels in series: one real, one imaginary.

Still, channels in series are notorious for distorting information, as the everyday experience of a newspaper shows: A is an accident, B is an eye-witness account of the accident to a journalist, C is what the journalist telephones to his newspaper, and D is the article that is printed. The information the article gives about the accident inf(D@A) and the information the eye-witness gives inf(B@A) may differ considerably.^[5] We tend to believe what is in the paper, but in case we happen to have first-hand information, we find out it is all wrong.

Even a child knows about information decay in a chain: "Most of us at one time played a game called Telephone or Post Office. Someone starts by whispering a message into the ear of the adjacent person. That person in turn whispers the same message into someone else's ear. After passing this way through thirty people, the message is completely transformed. Every third or fourth person in the chain heard a different message" {Aguayo, 1990, p.73}.

The principle that information decreases in a chain is known in communication theory "for a pair of cascaded channels" {Gallager, 1968, p.26}, but there the formulation is limited to the entropy function only. The first to describe this principle in mathematical terms was Laplace^[6]:

"Suppose then an incident be transmitted to us by twenty witnesses in such manner that the first has transmitted it to the second, the second to the third, and so on. Suppose again that the probability of each testimony be equal to the fraction 9/10; that of the incident resulting from the testimonies will be less than 1/8. We cannot better compare this diminution of the probability than with the extinction of the light of objects by the interposition of several pieces of glass. A relatively small number of pieces suffices to take away the view of an object that a single piece allows us to perceive in a distinct manner. The historians do not appear to have paid sufficient attention to this degradation of the probability of events when seen across a great number of successive generations" {Laplace, ed. 1951, p.13}.

Long after Laplace, this principle (in the special case of entropy) has been given a name and a contemporary interpretation: "The data processing inequality [which] can be used to show that no clever manipulation of data can improve the inferences that can be made from the data" {Cover&Thomas, 1991, p. 32}.
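
Laplace's chain of twenty witnesses, quoted above, can be made concrete with a short Python sketch. It assumes the simplest reading of the passage: every witness reports faithfully with the same probability p, independently of the others, so that the whole chain is faithful with probability p to the power of the number of witnesses. The value p = 0.9 is used for illustration.

```python
def chain_probability(p, witnesses):
    """Probability that every one of a series of independent witnesses is faithful."""
    return p ** witnesses

after_one = chain_probability(0.9, 1)
after_twenty = chain_probability(0.9, 20)

print(after_one, after_twenty)  # 0.9 and roughly 0.12, which is below 1/8
```

The comparison with panes of glass fits this arithmetic: each pane passes the same fraction of the light, and the fractions multiply.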

The Law of Diminishing Information (1.8.1) has a more general scope. It is a criterion of what qualifies as an information measure for an arbitrary channel. A chain of channels is a different matter, but from the Law (1.8.1) it follows that for every new channel added to the end of a chain, the information about the initial input decreases.
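
The decay along a lengthening chain can be sketched for the simplest textbook case. The Python example below assumes that every link is the same binary symmetric channel, that the initial input is a fair bit, and that Shannon's mutual information stands in for the information measure. The error rate of 0.05 is an invented figure, and the thirty links merely echo the thirty players of the whispering game.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a binary variable that is 1 with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def chain_information(error, links):
    """Information (bits) about a fair input bit after a chain of identical links.

    A series of binary symmetric channels is itself binary symmetric, with
    overall flip probability (1 - (1 - 2 * error) ** links) / 2.
    """
    flipped = 0.5 * (1.0 - (1.0 - 2.0 * error) ** links)
    return 1.0 - binary_entropy(flipped)

history = [chain_information(0.05, n) for n in range(1, 31)]

print(history[0], history[-1])
```

Each entry of history is no larger than the one before it, so every link added to the end of the chain leaves less information about the initial input, although in this model the information never reaches exactly zero.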

1.12 Bliss of ignorance

It may seem precarious to base a theory on such a negative sounding postulate as the Law of Diminishing Information (1.8.1). Yet, our comprehension of the world around us depends on this Law. If all information were preserved, we would be lost in the woods, overwhelmed by a thicket of details. We would not even perceive individual trees, we could not imagine the paths. The Law prescribes a haze of blissful ignorance and oblivion, through which we can see patterns emerge (Section 8.13), patterns otherwise unseen.

Without the workings of the Law, physics would be reduced to gibberish (Section 11.23). According to the underlying physical laws, an electron, for example, would carry the information about all of its previous history. But, as Rydberg stated back in 1906 as a basic assumption in atomic physics (transl.): "The electrons are, wherever they come from, always similar to each other" {Rydberg, 1906, p.13}.

We must here insert a notice that application of the mathematical theory of information to physics necessarily contains an element of speculation, in anticipation of two kinds of experiment yet to be made: the localization of atomic memory (Section 14.10), and the testing of Bell's inequalities on particles with nonzero mass (Section 13.13).

In a closed physical system, there are two main (but opposite) directions in which information disappears, the con-direction and the di-direction, corresponding to converging or diverging dynamic trajectories (Section 11.1). Chemistry provides an example of why this makes the world comprehensible: matter will either condense to a liquid or a solid, or it will diffuse to a gas. Hence, there are only three clear-cut aggregation modes that the chemist primarily has to study.

As a prototype of a closed system, we will use particles in a box (Sections 6.21-6.22): the converging orbit of a single boxed particle leading to Schrödinger's wave function (Section 14.3), and the diffusion of many particles leading to the second law of thermodynamics (Section 14.7).

Either way, information behaves as if it were destroyed: crushed by contraction, shattered by dissipation. And the ongoing destruction of information provides an arrow of time (Section 6.15).

Footnotes:

^[1] Rudolf Julius Emanuel Clausius, 1822-88, German physicist.
^[2] James Stirling (d. 1770), Scottish mathematician.
^[3] Translated from the Anglo-German phrase HITS STATT BITS coined by Jan Hajek, The Netherlands (private communication).
^[4] I myself have been involved in a design project in which the resolution of an analog to digital conversion was stretched merely by using a small capacitor to add a 50 Hz disturbance to an analog signal.
^[5] Citation distortion provides another example: a quotation from a reference cited by one author is passed on from one author to another and ends up quite different from the original citation, Section 6.11.
^[6] French astronomer and mathematician, 1749-1827.

Contact: jankahre (at) hotmail.com
