The Mathematical Theory of Information
Chapter 1: About Information
Question 1: Can information be measured?
Answer: Yes, by any measure not violating the Law of Diminishing
Information introduced in this work.
Information is playing an increasing role in our industrialized
society. A technical overview of the flourishing electronics industry
stated in 1987: "On almost every technology front, the driving
force behind new developments is the ever-rising demand for
information. Huge amounts of data and information, larger than anyone
ever dreamed of a generation ago, have to move faster and faster
through processors and networks, then end up having to be stored"
{Electronics, 1987, p.83}.
Four years later the Industrial age had already given way to the
Information age: "In 1991 companies for the first time spent more
on computing and communications gear - the capital goods of the new
era - than on industrial, mining, farm, and construction machines. Info
tech is now as vital, and often as intangible, as the air we breathe,
which is filled with radio waves. In an automobile you notice the $675
worth of steel around you, but not the $782 worth of
microelectronics" {Stewart, 1994, p.55}.
The space industry provides another example, where man reaches
outside the earth merely to gather information, not to mine silver on
the moon. But also the more down-to-earth beet sugar industry values
information: "Information has become a highly valued economic
category, on par with capital and skilled labor. The collection,
structuring and processing of information consequently constitute one
of the focal points of business, and it is the task of process control
engineering to provide adequate support for the production process in
this area. The structuring of information is an essential precondition
for the optimal conduct of technical processes" {Drathen, 1990, p.631}.
This book is a result of theoretical interest awakened during
practical work with information engineering. I myself, as a
manufacturer of process refractometers, make my living out of
supplying information transmitters. A process refractometer is mounted
in an industrial process pipe, and transmits an electrical signal
proportional to the concentration of the liquid in the pipe. A
transmitter does not cause a change in the process, in contrast to
other industrial equipment, the purpose of which is to transform or
transport the process medium. The sole mission of the process
refractometer is to transmit information to the process control
system.
Stimulated by the ever expanding telecommunications sector, much
research activity has been focused on information transfer. Claude
Shannon laid the foundations of this approach to information in a
paper published in the Bell System Technical Journal, July and
October, 1948, later reprinted as a book, The Mathematical Theory
of Communication, published in 1949. Shannon's theory is based on
probability theory; he measured information as entropy, or
"how much 'choice' is involved in the selection of an event or
how uncertain we are of the outcome" {Shannon, 1949, Section 6}.
Any information measure defined in terms of probabilities is
potentially useful in all areas where probability theory is
applied: in most sciences and in many of the humanities. And,
information appears in different spheres, such as biology,
"Life, too, is digital information written in DNA" {Ridley, 2000,
p.16}, and in physics, "Giving us its as bits, the quantum
presents us with physics as information" {Wheeler, 1990, p.7}. Or, a bit
differently, to work out "physics from Fisher information"
{Frieden, 1998}.
Shannon himself applies the word 'information' only in a
descriptive sense to the output of an information source, and he stays
resolutely within the framework of telecommunications, using the title
communication theory. His followers renamed the theory 'information
theory', and now it is too late to revert to the name given by
Shannon. Instead, we will refer to it as the classical information
theory.
There are excellent textbooks on classical information theory,
aimed at students of telecommunications or computers, but which also
contain a wealth of material useful to others. Yet, they are oriented
to specific applications, the topics are not connected by an
underlying general theory.
Here we will propose a general theory of information, which will
unify the seemingly disparate topics collected in this book. The
general theory is based on the Law of Diminishing Information as an
additional axiom of probability theory. We will show that this Law
follows directly from Shannon's conclusions concerning noisy
channels, but extrapolated to nonlinear systems.
Further, the classical information theory still contains
assumptions concerning information that are specific to
telecommunications, but not always explicitly acknowledged as such. In
a context where they do not apply, such hidden assumptions may lead a
biologist or a physicist astray. In this book we will try to make
those assumptions visible, faithfully following Shannon’s
reasoning, but relaxing the conditions characteristic of
communications engineering.
The mathematics is kept at an undergraduate level in
science. Hence, the book can be used as class material for an
introductory course in information theory. For engineering students,
it can be used as material complementary to traditional courses in
information technology.
The book can help a scientist, and also a curious information
technologist, to grasp the idea of information. All this may even
justify the rather pretentious title: The Mathematical Theory of
Information.
A fundamental, but somehow forgotten, fact is that information is
always information about something. The Oxford dictionary {Oxford, 1974}
defines information as:
in·for·ma·tion ... on/about ... 2
something told; news or knowledge given: ...
Can you give me any information on/about this matter?
A → [Channel] → B → [Receiver]
Fig 1.1 Information transmission.
Information transmission is described by Figure 1.1, which we may
visualize as a reader (Receiver) who gets information through a
newspaper (Channel) from an article (B) about an event (A): The
article (B) gives information about the event (A). A and B can be
interpreted in many ways, as anything that carries information:
Signals, descriptions, pictures, reports, observations, events. A and
B will be called messages or, more exactly, message sets.
The channel can be a newspaper or a telephone, but it can also be
your friend (Channel) telling you (Receiver) a story (B) about his
achievements (A). For a scientist investigating bugs, his microscope
is his information channel. The dictionary {Oxford, 1974} defines:
chan·nel ...4 (fig) any way by which news,
ideas, etc may travel: He has secret channels of
information.
A mathematical theory of information is naturally concerned with numerical measures of information. The quantity, or the quality, of the information in a message should be expressed as a number on a scale: the higher the number, the better the information.
This number, the information measure, we will here introduce by the definition:
(1.2.1) inf(B@A) = the information B gives about A
In this context '@' is read as 'about': @ can be seen as an
'a' with a tail, 'a-bout'. Alternatively,
'@' can be read as 'at', a contraction of
'about'. The notation 'inf( )' is
borrowed from semantic information theory {Bar-Hillel, 1964, p.241} and
should not, due to the presence of '@', be confused with
'inf' as the abbreviation of 'infimum'.
For a property, as in this case informativity (1.2.1), to be measured
on a one-dimensional numerical scale, transitivity is required
{Raiffa, 1970, p.76}:
if X gives more information than Y, and Y gives more than Z, then X
gives more information than Z.
Further, in the words of Walter Nernst (translated): "... the
scientist must always strive to put his propositions in a form that
can be numerically expressed, that is, the qualitative concepts must
have a quantitative foundation. A description of a phenomenon is
either incomplete or misleading, the communication to others of an
observation is aggravated, when the order of magnitude of the observed
effect is missing" {Nernst, 1909, p.5}.
On a day at the races we will learn an essential point: what kind
of information measure to choose depends on the application.
Betting is arranged for a race between two horses of equal
swiftness, 'Zero' and 'Uno'. A Gambler can bet his dollar
on either horse. If it is the right horse, the Gambler wins one
dollar, but if it is the wrong horse he loses one dollar. The Gambler
then has a 50% chance to win $1.00 and a 50% chance to lose
$1.00. Thus his expected gain is
1/2×$1.00 − 1/2×$1.00 = ±0.
This can be interpreted statistically by saying that after a
day of many similar bets, the Gambler can expect to have neither a
gain nor a loss.
Now, imagine that before the race an Insider gives the Gambler a
tip (T) about the result (R). How large is inf(T@R) = the
information value of the tip about the result?
From the Gambler's point of view, the information value of the tip
is the same as his expected gain. If the tip is 100% sure (T = T100%)
the Gambler gains a sure dollar and thus
(1.3.1) inf(T_{100%}@R) = $1.00
If a tip is only 75% sure (T = T_{75%}) the
Gambler can expect to win 3 out of 4 races. Then the tip T about the
result R gives the Gambler an expected gain
(1.3.2) inf(T_{75%}@R) = 3/4×$1.00 − 1/4×$1.00 = $0.50
If the Gambler has to pay for the tip, then the value inf(T@R) is
the upper limit for what is profitable for him to pay.
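The Gambler's dollar scale can be written out as a one-line expected-value calculation. The sketch below is ours, not the book's; the function name and default stake are illustrative:

```python
def tip_value(reliability, stake=1.00):
    # Expected gain in dollars from betting `stake` on a tip that is
    # right with probability `reliability`: win the stake or lose it.
    return reliability * stake - (1 - reliability) * stake

print(round(tip_value(1.00), 2))  # 1.0 -> (1.3.1), a sure dollar
print(round(tip_value(0.75), 2))  # 0.5 -> (1.3.2), fifty cents
```

A 55% sure tip comes out at about $0.10, the "dime" mentioned below.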
We will now study another aspect, and look at the case as a pure
communications engineering problem. An Engineer could be employed to
design a data transmission line from the Insider to the Gambler. The
only purpose of the line is to transmit the Insider's tip. From the
Engineer's point of view, the amount of information in the tip is just
a question of how large a capacity is required of the transmission
line.
In this case the line needs only to transmit one symbol with two
alternatives. Perhaps the Engineer instructs the Insider to wave a
flag, either flag 0 for 'Zero wins' or flag 1 for
'Uno wins'. A symbol in a symbol set having only two
alternatives, like 0 and 1, is called a bit.
The Engineer ignores the semantic aspects, he looks at the tip as
just a message to transmit. In the language of the Engineer the tip
contains exactly one bit of information,
(1.3.3) inf(T@R) = 1 bit
because the transmission requires only one 0/1 symbol.
From the perspective of the Gambler the information value of the
tip depends on its reliability: 100% sure means a dollar (1.3.1), 75% sure means 50 cents (1.3.2), 55% sure means only a
dime. From the perspective of the Engineer, the tip contains exactly
one bit of information (1.3.3),
independently of its reliability.
Conclusion: It is a matter of perspective whether the information
should be measured in dollars or in bits.
In digital data processing and communications, the fundamental unit
for information transmission and storage capacity is a bit, or the
derived unit byte, typically 8 bits. We will now modify the race track
example in the preceding section to study the communication
requirements when the tip contains the predicted result of four
consecutive races between the horses 'Zero' and 'Uno'. If
the Insider's tip of the winners list happens to be
Race 1: Zero
Race 2: Uno
Race 3: Uno
Race 4: Zero
he waves the flag 0110. To be able to signal all potential messages,
he needs W = 16 flags, where W is the number of possible
combinations.
0000 0001 0010 0011 0100 0101 0110 0111
1000 1001 1010 1011 1100 1101 1110 1111
Each flag above contains a string of 4 bits. It is customary and
more convenient {Shannon, 1949, Introduction} to
measure the amount of information as a 'number of bits'
instead of a 'number of possible combinations'. The two
numbers are mathematically connected; W = 2^{n}
where the positive real number n = number of bits, or
(1.4.1) n = log_{2}(W)
where n does not have to be an integer. A flag from the collection
above transmits log_{2}(16) = 4 bits of
information. We divide by four to get the information per race and
arrive at the same value as (1.3.3)
(1.4.2) inf(T@R) = (1/4)log_{2}(16) = 1 bit
The key concept in communications engineering is compression,
i.e. shortening messages or data without sacrificing information. If
the Insider repeats a pattern, the message can be compressed. If the
Insider is invariably of the opinion that 'Uno' will win once,
but only once, in each series of four races, he needs only four flags
0001 0010 0100 1000
or the number of possible combinations W = 4. The
information is then log_{2}(4) = 2 bits per 4 races. Thus the Engineer
can design a sufficient flag collection based on only two bits per
flag (representing 0,1,2,3 in binary form):
00 01 10 11
We can now see that the string of bits in a flag has been
compressed from 4 bits to 2 bits for four races. The information per
race is now
(1.4.3) inf(T@R) = (1/n)log_{2}(W) = (1/4)log_{2}(4) = 1/2 bit
Now we increase the number of bits n while keeping the same
proportion between 0 and 1. Let k represent the number of the symbol 1
in a sequence. To begin with, take n = 8 and k = n/4 = 2. Sequences
that qualify are e.g.
00010100 01010000 00001100 00100001 ...
We count all the possible sequences and we get
W = 28. Hence
(1.4.4) inf(T@R) = (1/n)log_{2}(W) = (1/8)log_{2}(28) ~ .601 bit
In case n = 32 and
k = n/4 = 8, the number of bits
needed to transmit the sequence is
(1.4.5) inf(T@R) = (1/n)log_{2}(W) = (1/32)log_{2}(10,518,300) ~ .729 bit
We can see that with increasing sequence length n, the constraining
pattern dissolves and, in proportion, more bits are needed to transmit
the sequence. The general form for how (1/n)log_{2}(W) depends
on n and k is given by elementary combinatorial analysis
(1.4.6) (1/n)log_{2}(W) = (1/n)log_{2}C(n,k) = (1/n)log_{2}(n!/(k!(n−k)!))
where C(n,k) is the binomial coefficient and n! reads 'n factorial',
so that e.g. 5! = 1×2×3×4×5. The points in Figure 1.2 represent
(1.4.6) when k = n/4.
For this figure, see Chapter 1 as pdf file.
Fig 1.2 Number of bits needed (1/n)log_{2}(W) as a function of the
number of bits n in a sequence.
We may ask, when does the entropy enter? In the classical
information theory "entropy and relative entropy arise again and
again as the answers to the fundamental questions in communication and
statistics" {Cover & Thomas, 1991, p.11}.
From Figure 1.2 we can see
that (1/n)log_{2}(W) approaches asymptotically an
upper limit when the length n of the sequences grows. For large n, a
sequence becomes indistinguishable from a random distribution where
the frequency of 1 equals a probability p. In our example
p = 1/4. Using the Stirling approximation
(7.8.18) for the factorials in (1.4.6) we get the limit
(1.4.7) lim_{n→∞} (1/n)log_{2}C(n,k) = −p log_{2}p − (1−p)log_{2}(1−p)
for k = np. The limit in terms of probabilities
−Σ p_{i} log_{2} p_{i} was
called 'entropy' by Shannon and denoted by 'H'. A limit of the same
form had earlier been used in thermodynamics, and given the name
'entropy' by Clausius^{[1]},
see Section 7.13.
In our example p = 1/4 and the entropy takes the value
(1.4.8) −Σp_{i} log_{2} p_{i} = −1/4 log_{2}(1/4) − (1−1/4)log_{2}(1−1/4) ~ .811
in agreement with Figure 1.2. That is, the entropy is the Stirling^{[2]} limit of the number of bits
needed to transmit a sequence when the length approaches infinity.
For short sequences, however, we may be far from the Stirling
limit. For n = 4 and k = 1 we manage with only
(1/4)log_{2}(4) = .5 bits/n compared to the entropy
~ .811. On the other hand, in a random sequence n = 4
with p = 1/4, any combination of 0 and 1 can occur, and we
need as much as (1/4)log_{2}(16) = 1 bit/n.
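The entropy limit (1.4.7) itself is a one-line formula; a small sketch, with a function name of our own choosing:

```python
import math

def entropy_bits(p):
    # Shannon entropy -p*log2(p) - (1-p)*log2(1-p), the limit (1.4.7)
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(round(entropy_bits(0.25), 3))  # 0.811, matching (1.4.8)
```

The finite-length values .5 (n = 4), .601 (n = 8) and .729 (n = 32) approach this limit from below as n grows.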
The essence of Shannon's communication theory is his insight that
the compression of long data strings can be measured by entropy. His
followers have, however, elevated his entropy function to a universal
information measure. Two of them have even been inspired to an entropy
eulogy:
"Shannon's measure is an invention. It was designed to fill a
specific need: to provide a useful measure of what is transmitted on a
communication channel. It has also been shown to be the only function
that satisfies certain basic requirements of information theory. In
the 23 years since Shannon put forward his measure thousands of papers
have been written on the subject and no one has found a replacement
function, or even a need for one. On the contrary, many alternative
derivations have been found. We conclude that the Shannon entropy
measure is fundamental in information science, just as the Pythagorean
theorem is fundamental in geometry" {Tribus & McIrvine, 1971,
p.180}.
We will, however, claim that the fundamental measure of information
is reliability, i.e. how well a message can be expected to be
reproduced: "The fundamental problem of communication is that of
reproducing at one point either exactly or approximately a message
selected at another point" {Shannon, 1949,
Introduction}. We define:
The reliability rel(B@A) is the probability to guess the right
message A when the message B has been received, and A is the message
sent.
The reliability is used by Shannon, but in the form of "the
probability of an incorrect interpretation" {Shannon, 1949, Section 14}
denoted by 'q', which is equivalent to 1 −
rel(B@A). Hence
(1.5.1) rel(B@A) = 1 − q
The definition of rel(B@A) directly in terms of probabilities is
given by (3.6.1). Before the message B has been received, there is
already some probability of guessing the right A, i.e. the initial
probability of incorrect interpretation is less than 100%,
q_{0} < 1. To evaluate the value of the
received message B, we would rather measure the increase of
reliability brought about by B
(1.5.2) rel_{abs}(B@A) = (1 − q) − (1 − q_{0}) = q_{0} − q
where 'abs' indicates that the scale is absolute;
rel_{abs}(B@A) = 0 if B contains no information
about A. For example, if there is an initial 75% chance to make the
right guess, 1 − q_{0} = 3/4 and the chance
remains at 75% after B has been received, 1 − q = 3/4, then no
information has been transmitted.
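As a worked instance of (1.5.2), with the numbers from the example above:

```python
def rel_abs(q0, q):
    # Increase in reliability brought by message B, (1.5.2):
    # (1 - q) - (1 - q0) = q0 - q
    return q0 - q

# An initial 75% chance of a right guess (q0 = 1/4) that is still
# 75% after receiving B (q = 1/4): no information transmitted.
print(rel_abs(0.25, 0.25))  # 0.0
```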
In general, we will stick to the conventional notation, but we will
use a combination of three letters (such as 'rel' and 'ent') for
information measures instead of one letter (as the corresponding 'q'
and 'H'), due to the need of distinguishing between a larger number of
measures. The symbol @ for about is the other exception,
instead the classical information theory uses a semicolon in
its fundamental measure, which is called mutual information
and written I(B;A) {Cover&Thomas, 1991,
p.18}. This measure will here be written ent(B@A) and
defined in (4.9.2). There is a need to stress the potential asymmetry
of information: while I(B;A) happens to be symmetric,
I(B;A) equal to I(A;B), measures such as rel(B@A) and rel(A@B) are not
necessarily so. The Sailor's tale in Section 1.6 will give an example of the asymmetry
rel(B@A) ≠ rel(A@B).
In the theory of communication, bits and entropy can be seen as the
quantitative measures, and reliability as the qualitative
measure. Shannon had no reason to identify reliability as an
information measure, because he considered the case of 100%
reliability, rel(B@A) = 1, or more exactly: "It is
possible to send information at a rate C through the channel with
as small a frequency of errors or equivocation as desired by
proper encoding" {Shannon, 1949, Section
13}. This result is restricted to messages that are long strings of
symbols, e.g. of bits 0011010..., and the rate
C is determined by a Stirling limit similar to (1.4.7).
What about shorter strings? Let the message A consist of no more
than one bit, either 0 or 1,
transmitted through an unreliable channel, rel(B@A) <
1. There is no way of coding to make a single transmission
perfect.
Instead, we must turn to another discipline, Pattern
Classification, because it addresses the problem of
"detecting a single weak pulse, such as a dim flash of light or a
weak radar reflection" {Duda,Hart&Stork, 2001,
p.48}. Here it is a question of discriminating between two patterns,
'pulse' or 'no pulse'. Expressed in terms of information theory, the
transmitted signal A can be either a 0 (absence of a pulse) or a 1
(presence of a pulse). The received signal B is unreliable, but it
contains some information about A.
Another place to look for an answer is Statistics, which
treats "the problem of classifying or assigning a sample to one
of several possible populations" {Kullback, 1997, p.83}.
In Pattern Classification, accuracy is customarily measured in
hits, i.e. the number of correct classifications. We can conclude that
sometimes information should be measured on an entropy scale in bits,
sometimes on a reliability scale in hits. But, as reliability is
relevant for messages of any length, we can see the justification in
the motto^{[3]}: HITS BEFORE BITS!
Let's imagine a ship sailing across the ocean, and a Sailor looking
at the morning sky to get information about how the weather will turn
out during the day:
Red sky at night,
Sailor's delight.
Red sky at morning,
Sailor take warning.
In our model ocean, a storm can be expected every second day, on an
average. A red sky at morning can be expected every fourth day and
that always means a storm. But, a storm can appear even if the morning
sky is grey. In fact, in our model half of the storms come without the
warning of a red sky.
We do not need a computer to simulate these weather conditions. All
we need is four cards in a hat, three black cards and one red. For
every morning we draw one card: if the card is red, so is the sky, and
a storm will follow during the day. If the card is black, the sky is
grey and we draw a second time, now from the remaining three
cards. Again, the red card means a storm, and one of the black cards
means good weather.
When the Sailor expects a storm, he shortens sail, but if he is
wrong, he will suffer a loss of speed. How often will he be right in
his weather prediction if he has observed the morning sky?
In one case out of four he will see a red sky and he can predict a
storm with 100% certainty. In three cases out of four he will see a
grey sky, and he can predict that there will be no storm with a
certainty of 66.6%, or in 2 cases out of 3. Hence we can calculate the
reliability of his weather predictions:
(1.6.1) rel(sky@weather) = 1/4×100% + 3/4×66.6% = 75% hits
That is, he can be expected to hit the right weather in 75% of the
cases based on his observations of the morning sky.
But if he forgot to observe the morning sky, he still has a
50% chance to predict the right weather merely by an arbitrary guess
of either 'storm' or 'no storm'. Hence, observation of the
morning sky has increased the prediction reliability by
(1.6.2) rel_{abs}(sky@weather) = 75% − 50% = 25% hits
That is, the value of the information the morning sky gives
about the weather is measured as a 25% improvement of the
weather prediction hits.
Now, let's introduce a passenger on the ship. The weather
conditions don't bother him. But, as an aesthete, he appreciates the
morning sky. It's a pity he is a late sleeper: he never has the
opportunity to enjoy this fine spectacle! The best he can do is to try
to guess whether the sky had been either red or grey in the
morning. His only source of information is the weather observed
during the day. How much information will the day's weather give about
the morning's sky?
In half the cases there is no storm and the passenger knows that
the morning sky had been grey. Otherwise, when there is a storm, he
has a 50% chance to guess right, no matter whether he guesses
'red' or 'grey'. We can now calculate the expected hit
percentage of guessing the sky colour if the weather is known:
(1.6.3) rel(weather@sky) = 1/2×100% + 1/2×50% = 75% hits
How well would the passenger do without knowing the weather? If he
guessed 'grey' every time, he could be expected to make a hit in
three cases out of four, as we can see directly from our model: there
are three black cards and one red in the hat. Hence his hit rate
without information is 3/4×100%. The information the weather
gives about the morning sky equals the difference
(1.6.4) rel_{abs}(weather@sky) = 75% − 3/4×100% = 75% − 75% = 0% hits
That is, knowledge of the weather during the day does not give the
passenger any helpful information about the colour of the morning
sky.
We have now seen an example of asymmetric information. The
information the morning sky gives about the weather is not
the same as what the weather gives about the morning sky,
rel_{abs}(sky@weather) ≠
rel_{abs}(weather@sky). The former information gives a
25% improvement in the hit rate, the latter gives nothing.
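The whole tale can be checked by exhaustive calculation over the joint distribution of the card model. The guessing rule assumed here is simply 'pick the most probable alternative', which is how the Sailor and the passenger behave above; all names in the sketch are ours:

```python
from fractions import Fraction as F

# Joint distribution of (sky, weather) in the card model:
# a red sky (1/4) always means storm; a grey sky (3/4) is
# followed by a storm with probability 1/3.
joint = {('red',  'storm'): F(1, 4),
         ('grey', 'storm'): F(1, 4),
         ('grey', 'calm'):  F(1, 2)}

def rel(joint, given):
    # Hit rate when guessing the other variable from `given`
    # ('sky' or 'weather') by picking the most probable alternative.
    i = 0 if given == 'sky' else 1
    total = F(0)
    for v in {key[i] for key in joint}:
        total += max(p for key, p in joint.items() if key[i] == v)
    return total

def rel_abs(joint, given):
    # Improvement over the best guess made without any observation.
    j = 1 if given == 'sky' else 0
    marginal = {}
    for key, p in joint.items():
        marginal[key[j]] = marginal.get(key[j], F(0)) + p
    return rel(joint, given) - max(marginal.values())

print(rel(joint, 'sky'), rel_abs(joint, 'sky'))          # 3/4 1/4 -> (1.6.1), (1.6.2)
print(rel(joint, 'weather'), rel_abs(joint, 'weather'))  # 3/4 0   -> (1.6.3), (1.6.4)
```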
The preceding sections have demonstrated that information can be
measured on different scales: in hits, in bits, in dollars. The next
question: is there some property that should be common to all the
information measures?
To begin with, we imagine two newspapers, one in English and the
other in Chinese, both giving the same account of an event. Then,
objectively speaking, both newspapers contain the same amount of
information about the event. Still, an Englishman may be of the
opinion that only the English paper contains information, because he
does not understand Chinese. If we allow such subjective judgements,
anybody may have any opinion on the information content in a message,
and no rules apply. But, what if we want to measure the information
content of the message itself, disconnected from the shortcomings of a
receiver?
We have already asked "Information about what?". Now we ask
"Information to whom?". The true information content of a
message can be imagined as the information received by somebody with a
complete understanding, who is aptly called the ideal receiver in
semantic information theory:
"By an 'ideal' receiver we understand, for the purposes of
this illustration, a receiver with a perfect memory who 'knows' all of
logic and mathematics, and together with any class of empirical
sentences, all of their logical consequences. The interpretation of
semantic information with the help of such a superhuman fictitious
intellect should be taken only as an informal indication" {Bar-Hillel,
1964, p.224}.
The ideal receiver was conceived by Yehoshua Bar-Hillel
and Rudolf Carnap in 1952, but used for illustration purposes only. We
will explore the logical consequences of this idealization, leading to
the mathematical theory of information.
It must be noted that the fruitful concept of the ideal
receiver has some consequences that are problematic: to the ideal
receiver, chess is completely trivial, once the rules are known. The
same goes for mathematics. The ideal receiver could just start from
the axioms and derive all possible theorems. But here is a snag:
"Contrary to previous assumptions, the vast 'continent' of
arithmetical truth cannot be brought into systematic order by way of
specifying once for all a fixed set of axioms from which all true
arithmetical statements would be formally derivable" {Nagel&Newman,
1956, p.85}. As a consequence, it is impossible to construct an
ideal receiver "who knows all of mathematics". It is Gödel's
proof that indicates the limits of the ideal receiver concept
(Section 8.10).
The mathematical theory of information is, however, not dependent
on the existence of an ideal receiver. The rôle of the ideal
receiver is to connect the information measure inf(B@A) to the
corresponding ordinary use of the word 'information'.
This is the same as in probability theory, where the mathematical
rules how to measure probability were originally designed to
correspond to the behaviour of ideal dice, because "the
specific questions which stimulated the great mathematicians to think
about this matter came from requests of noblemen gambling in dice or
cards" {Struik, 1987, p.103}. This way the
probability measure P(a) connects to the ordinary use of
'probability', but the validity of probability theory does
not depend on the existence of ideal dice.
We revert to the Englishman trying in vain to read a Chinese
newspaper. When he gets the help of an interpreter, he finds out that
after all there is a lot of information in the paper. We can conclude
that he is not the ideal receiver (knowing all languages) because his
understanding could be improved by an interpreter as an
intermediary. This argument can be turned around and used to define an
ideal receiver: A receiver is ideal if no intermediary can improve its
performance.
I. A → [ ] → B → [R]
II. B → [ ] → C → [R]
III. A → [ ] → B → [ ] → C → [R]
IV. A → [ ] → B → [ ] → C →
Fig 1.3 The ideal receiver R demonstrates the
validity of the principle of diminishing information.
A reasoning in four steps, Figure 1.3, will lead from the ideal
receiver to the Law of Diminishing Information:
I. Let R be an ideal receiver, who receives a message B about the event A.
II. Another receiver is constructed by adding a channel between the
signal B and R, with the output C as the input to R. This receiver,
consisting of an intermediary channel and R in series, cannot be a
better receiver of B than R alone, because R is already an ideal
receiver.
III. Hence, any intermediary channel making the information from C
greater than the information from B would contradict the premise that
R is an ideal receiver. Thereby it would also contradict the principle
that the information of a message should be measured as if it were
received by an ideal receiver (Section 1.7).
IV. Thus, the information that C gives about A cannot be greater than
the information B gives about A:
(1.8.1) inf(C@A) ≤ inf(B@A) if A→B→C
This is the principle of diminishing information, where A→B→C means a transmission chain, Figure 1.3, step IV.
The Law of Diminishing Information reads: Compared to direct
reception, an intermediary can only decrease the amount of
information.
The Law (1.8.1) will be given a
mathematical form in Section 2.7 based on probability theory. In this
formulation it will be used as the fundamental axiom of the
mathematical theory of information. The Law is the pruning knife of
information theory: we will argue that the Law is the necessary and
sufficient condition for a mathematical function to be accepted as an
information measure, i.e. qualify as inf(B@A).
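For the entropy measure, (1.8.1) is the data processing inequality of classical information theory. A minimal numerical check of the chain A→B→C; the binary symmetric channels and their error rates below are chosen arbitrarily for illustration:

```python
import math

def mutual_info(p_a, channel):
    # I(A;Out) in bits for input distribution p_a and
    # channel[i][j] = P(out = j | in = i).
    joint = [[p_a[i] * channel[i][j] for j in range(len(channel[0]))]
             for i in range(len(p_a))]
    p_out = [sum(row[j] for row in joint) for j in range(len(joint[0]))]
    bits = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                bits += p * math.log2(p / (p_a[i] * p_out[j]))
    return bits

def compose(ch1, ch2):
    # Overall channel A -> C of the transmission chain A -> B -> C.
    return [[sum(ch1[i][k] * ch2[k][j] for k in range(len(ch2)))
             for j in range(len(ch2[0]))] for i in range(len(ch1))]

def bsc(e):
    # Binary symmetric channel flipping the bit with probability e.
    return [[1 - e, e], [e, 1 - e]]

p_a = [0.5, 0.5]
ab, bc = bsc(0.1), bsc(0.2)
print(mutual_info(p_a, ab))                # I(A;B)
print(mutual_info(p_a, compose(ab, bc)))   # I(A;C), never larger than I(A;B)
```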
"The channel is merely the medium used to transmit
the signal from transmitter to receiver. ... During transmission, or
at one of the terminals, the signal may be perturbed by noise. This is
indicated schematically in Fig. 1..." {Shannon, 1949, Introduction},
here redrawn as Figure 1.4.
For this figure, see Chapter 1 as pdf
file.
Fig 1.4 Shannon's first figure.
So, here is also a case of diminishing information: information is
corrupted by noise. How does the 'noisy channel' concept connect to
the Law of Diminishing Information (1.8.1)? Figure 1.5 shows the
connection in three steps:
For this figure, see Chapter 1 as pdf file.
Fig 1.5 From 'noisy channel' to the principle of diminishing
information in three steps.
I. This case corresponds to the 'noisy channel' in Shannon's Fig. 1
as it appears graphically in a textbook {Gallager, 1968, p.2}.
II. The Law of Diminishing Information (1.8.1) is concerned only with
influences on the output of the channel, i.e. the input to an
imagined ideal receiver.
III. As the Law of Diminishing Information (1.8.1) states that an
intermediary channel can only decrease the information, we can also
think of the intermediary channel as causing the signal to be
"perturbed by noise". Hence the noise can be presented in the form of
an intermediary channel, and we arrive at the form of the Law (1.8.1).
There is a reason to take step II and exclude the noise occurring
"during transmission" and at the input "terminal" of the
channel: such noise does not necessarily 'perturb' the
signal. Noise can be beneficial too. In nonlinear channels, the
information content of a signal may well increase due to noise!
Look at our Sailor tapping his barometer. The disturbance
introduced by the tap will remove friction, and the reading will be
more accurate. Also spontaneous background noise can be beneficial:
"Noise often creates confusion. Try having a telephone
conversation while standing on a busy street corner or listening to a
radio broadcast riddled with static. Engineers have long sought means
to minimize such interference. But surprisingly enough, during the
past decade researchers have found that background noise is sometimes
useful. Indeed, many physical systems, ranging from electronic
circuits to nerve cells, actually work better amid random noise"
{Moss&Wiesenfeld, 1995,
p.50}. This phenomenon is known as stochastic resonance^{[4]}.
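A minimal sketch of stochastic resonance may make the claim concrete.
It is an illustration, not a model from this book: a sine signal too
weak to cross a hard detection threshold on its own becomes detectable
when noise is added before the threshold. The amplitudes, the
threshold, and the uniform noise distribution are all arbitrary
choices.

```python
import random
from math import sin, pi

def detector_correlation(noise_amp, seed=0):
    """Correlation between a subthreshold sine signal and the output of a
    hard threshold detector, with uniform noise of amplitude `noise_amp`
    added before the threshold (toy stochastic-resonance model)."""
    rng = random.Random(seed)
    threshold = 1.0
    n = 20000
    # Signal peaks at 0.5, well below the threshold of 1.0.
    signal = [0.5 * sin(2 * pi * 40 * t / n) for t in range(n)]
    output = [1.0 if s + rng.uniform(-noise_amp, noise_amp) > threshold else 0.0
              for s in signal]
    # Plain correlation coefficient between signal and detector output.
    ms = sum(signal) / n
    mo = sum(output) / n
    cov = sum((s - ms) * (o - mo) for s, o in zip(signal, output)) / n
    vs = sum((s - ms) ** 2 for s in signal) / n
    vo = sum((o - mo) ** 2 for o in output) / n
    return 0.0 if vs * vo == 0 else cov / (vs * vo) ** 0.5

quiet = detector_correlation(0.3)   # noise too weak: threshold never crossed
noisy = detector_correlation(1.0)   # moderate noise: the signal leaks through
```

With weak noise the detector never fires and its output carries no
information about the signal; with moderate noise the firing rate
tracks the signal peaks, so the correlation becomes clearly positive.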
Consequently, Shannon's first figure (Figure 1.4) can be
misleading: it illustrates a case of information transmission typical
of electrical communication where all noise is undesired and coming
from everywhere. Still, this figure is repeatedly reproduced without
reservations in texts about everything from an introductory essay on
information theory for scientists and engineers of all disciplines {Raisbeck, 1963,
p.3} to quantum mechanics {Rothstein, 1951, p.105}.
The question of information and noise puzzled information
theorists from the very beginning: "How does noise affect
information? Information is, we must steadily remember, a measure of
one's freedom of choice in selecting a message. ... Thus greater
freedom of choice, greater uncertainty, greater information go hand in
hand. If noise is introduced, then the received message contains
certain distortions, certain errors, certain extraneous material, that
would certainly lead one to say that the received message exhibits,
because of the effects of noise, an increased uncertainty. But if the
uncertainty is increased, the information is increased... It is
therefore possible for the word information to have either good or bad
connotations. Uncertainty which arises by virtue of freedom of choice
on the part of the sender is desirable uncertainty. Uncertainty which
arises because of errors or because of the influence of noise is
undesirable uncertainty" {Weaver, 1949, p.18,
ed. 1963}.
Based on the Law of Diminishing Information (1.8.1), we can give an answer: uncertainty produced by
the sender at the output of the source is desirable, uncertainty
introduced at the output of the channel is not desirable, the other
uncertainties are definitely ambiguous. Mathematically, but not
verbally and graphically, Shannon gives the same answer: he defines
the rate of transmission of information (4.1.13) as the difference
between the uncertainty produced by the source and the uncertainty at
the receiver {Shannon, 1949, Section 12}.
The Law of Diminishing Information has two faces: If a channel B→C
is added to the output of a given channel A→B, forming a chain
A→B→C, then the added channel can be seen either as an intermediary
who cannot improve the information, or as a noisemaker who can only
corrupt the information. In both cases inf(C@A) <= inf(B@A).
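The two faces can be checked numerically. The following sketch is not
from the book: it uses the simplest standard example, a binary
symmetric channel, with Shannon's mutual information standing in for
inf(B@A), and the crossover probabilities 0.1 and 0.2 are arbitrary.

```python
from math import log2

def h(p):
    """Binary entropy in bits; h(0) = h(1) = 0."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def bsc_info(flip):
    """Mutual information between input and output of a binary symmetric
    channel with crossover probability `flip`, for a fair-coin input."""
    return 1.0 - h(flip)

def cascade(f1, f2):
    """Crossover probability of two binary symmetric channels in series:
    the output bit is flipped iff exactly one of the two channels flips."""
    return f1 * (1 - f2) + (1 - f1) * f2

f_ab, f_bc = 0.1, 0.2                       # channels A->B and B->C
inf_B_at_A = bsc_info(f_ab)                 # information after one channel
inf_C_at_A = bsc_info(cascade(f_ab, f_bc))  # information after the chain
assert inf_C_at_A <= inf_B_at_A             # inf(C@A) <= inf(B@A)
```

Whether the second channel is read as an intermediary or as a
noisemaker, the computation is the same: cascading can only increase
the effective crossover probability toward 1/2, and the information
can only fall.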
In the communication models provided by electrical engineering, the
position where the noise is introduced can normally be neglected. Here
"an important special case occurs when the noise is added to the
signal and is independent of it (in the probability sense)" {Shannon, 1949, Section
24}. Moreover, when a linear model is used for the channel, noise
added at any point can be mathematically transformed into an
equivalent noise added just before the receiver. Hence, any added
noise corresponds to an additional channel at the output (Figure 1.4).
In biology, linear models are less useful. Sensory perception
provides important examples of beneficial noise, e.g. the saccadic eye
movements. "The tremor component consists of a high frequency
oscillation of the eye... It is called physiological nystagmus and
... it is indisputably important in the maintenance of vision. The
effect of nystagmus is to continually shift the image on the
retina... It seems that the receptors cease to respond if a steady
image is projected onto them and the physiological nystagmus has the
effect of continually calling fresh receptors into service" {Barber&Legge,
1976, p.56}. It may be noted that the modern view is in agreement
with the conclusion, but relocates the effect from the
'receptors' to the 'neurons in the retina': the cones and
rods function in a DC mode, but they are AC coupled to the subsequent
nerves.
Noise is a misleading term in this case because the oscillation is
a part of the design. In general, the term 'noise' must be taken in a
broad sense, to include any kind of signal, incoherent or regular,
introduced by chance or on purpose. In fact, 'noise' means here any
added signal, provided it is independent of the information source, as
indicated by the graphics in Figure 1.4 and Figure 1.5.
Hearing provides an example of consciously added
fluctuations. "For more than a century, teachers of vocal music
have stressed the importance of training each singer to produce a
vibrato rather than a steady note. This is actually a tremor or
warbling change in pitch, over a range of perhaps five to six cycles
per second. ...in 1959 scientists at Oxford University discovered that
the vibrato is essential in controlling one's own singing voice and
keeping to an intended pitch. Without the stimulation of the
deliberate variation, the singer's brain does not notice gradual
changes in key" {Milne&Milne, 1965,
p.53}.
In practice, electrical engineers know about beneficial noise. Many
instruments contain by design a built-in source of added fluctuation
(which is not necessarily random noise): "Chopping: The act of
interrupting an electric current, beam of light, or beam of infrared
radiation at regular intervals. This can be accomplished mechanically
by rotating fan blades in the path of the beam or by placing a
vibrating mirror in the path of the beam to deflect it away from its
intended source at regular intervals. ... Chopping is generally used
to change a direct-current signal into an alternating-current signal
that can be more readily amplified" {Parker, 1989, p.383}.
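The chopping principle can be sketched in a few lines. The numbers and
the synchronous-demodulation step below are illustrative assumptions,
not taken from the book: a DC level is turned into an alternating
signal, survives an AC-coupled stage that would block the raw DC
entirely, and is then recovered.

```python
def chop(dc_level, n=8):
    """Chop a DC level: the beam is alternately passed and blocked, so the
    amplifier sees a square wave whose amplitude carries the DC level."""
    return [dc_level if i % 2 == 0 else 0.0 for i in range(n)]

def ac_couple(signal):
    """An AC-coupled stage blocks the mean (DC) component of the signal."""
    mean = sum(signal) / len(signal)
    return [s - mean for s in signal]

def demodulate(signal):
    """Synchronous demodulation: flip the sign of the blocked phases and
    average, recovering the original chopped DC level."""
    rectified = [s if i % 2 == 0 else -s for i, s in enumerate(signal)]
    return 2 * sum(rectified) / len(rectified)

chopped = ac_couple(chop(3.0))     # the raw level 3.0 would not pass AC coupling
recovered = demodulate(chopped)    # but the chopped version carries it through
```

The point of the sketch is that the deliberately added fluctuation is
what lets the information pass a stage that destroys steady signals.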
I. A → [Phot] → B
II. A → [Chop] → [Phot] → B'
III. A → [Phot] → [Chop] → B'
Fig 1.6 Effect of a chopper 'Chop' on a photocell 'Phot'.
Figure 1.6 will be used to illustrate the difference between a
linear and a nonlinear channel. The figure presents an
electrotechnical system, but it can as well be thought of as a
biological system:
I. The emitted signal is a weak light beam A, the channel is a
photocell device 'Phot', and the output is a reading B of a
meter.
II. In an effort to improve the sensitivity of the photocell, a
chopper 'Chop' is mounted into the path of the light. Depending
on the characteristics of the photocell, the fluctuations introduced
by the chopper may either increase or decrease the information the
meter reading B' contains about the incident light beam A. The
variations caused by the chopper can be filtered from the meter
reading.
III. Let 'Phot' be a linear channel. Then the chopper does not
influence the performance of the photocell, but it causes unwanted
fluctuations of the meter readings, amplified or attenuated by a
linear factor. Hence, from the standpoint of the receiver, the chopper
can as well be replaced by a chopper after the photocell.
In the nonlinear case there is no general rule to tell whether
inf(B'@A) will be greater than inf(B@A) or the other way around. In
the linear case, the Law of Diminishing Information (1.8.1) tells that inf(B'@A) <= inf(B@A). In a
linear channel, noise is indeed never desirable.
In spite of the serial form A__B__C appearing in the Law of
Diminishing Information (1.8.1), the
Law is not about 'information in chains'. In Figure 1.7, the 'Channel' can be of
any nature, containing parallel computing, quantum effects or neural
networks. The second channel in the chain can be seen as a
hypothetical device, introduced to test whether a given information
measure conforms to the Law (1.8.1). A physicist would call this device
a perturbation {Schiff, 1955, p.151}.
A → [Channel] → B → [Im] → C
Fig 1.7 Channels in series: one real, one imaginary.
Still, channels in series are notorious for distorting information,
as the everyday experience of a newspaper shows: A is an accident, B
is an eyewitness account of the accident to a journalist, C is what
the journalist telephones to his newspaper, and D is the article that
is printed. The information the article gives about the accident
inf(D@A) and the information the eyewitness gives inf(B@A) may differ
considerably.^{[5]} We tend to believe what is in the paper, but in case
we happen to have firsthand information, we find out it is all
wrong.
Even a child knows about information decay in a chain: "Most
of us at one time played a game called Telephone or Post
Office. Someone starts by whispering a message into the ear of the
adjacent person. That person in turn whispers the same message into
someone else's ear. After passing this way through thirty people, the
message is completely transformed. Every third or fourth person in the
chain heard a different message" {Aguayo, 1990, p.73}.
The principle that information decreases in a chain is known in
communication theory "for a pair of cascaded channels" {Gallager, 1968,
p.26}, but there the formulation is limited to the entropy function
only. The first to describe this principle in mathematical terms was
Laplace^{[6]}:
"Suppose then an incident be transmitted to us by
twenty witnesses in such manner that the first has transmitted it to
the second, the second to the third, and so on. Suppose again that the
probability of each testimony be equal to the fraction 9/10; that of the
incident resulting from the testimonies will be less than 1/8. We cannot
better compare this diminution of the probability than with the
extinction of the light of objects by the interposition of several
pieces of glass. A relatively small number of pieces suffices to take
away the view of an object that a single piece allows us to perceive
in a distinct manner. The historians do not appear to have paid
sufficient attention to this degradation of the probability of events
when seen across a great number of successive generations" {Laplace,
ed. 1951, p.13}.
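Laplace's computation can be replayed in a few lines. This is a sketch
under his simple model, in which one unfaithful link suffices to lose
the incident, exactly like light through successive panes of glass
that each transmit a fraction p; his numbers are a testimony
probability of 9/10 and a resulting probability below 1/8.

```python
def chain_reliability(p, n):
    """Probability that an incident is still correctly transmitted after
    passing through n witnesses in series, each of whom relays it
    faithfully with probability p (Laplace's simple model: a single
    unfaithful link loses the incident, like one opaque pane of glass)."""
    return p ** n

# Laplace's twenty witnesses, each testimony with probability 9/10:
twenty = chain_reliability(9 / 10, 20)
assert twenty < 1 / 8   # 0.9**20 is roughly 0.12
```

Each extra witness, like each extra pane of glass, multiplies the
surviving probability by the same factor, so the decay is exponential
in the length of the chain.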
Long after Laplace, this principle (in the special case of entropy)
has been given a name and a contemporary interpretation: "The
data processing inequality [which] can be used to show that no clever
manipulation of data can improve the inferences that can be made from
the data" {Cover&Thomas,
1991, p. 32}.
The Law of Diminishing Information (1.8.1) has a more general scope. It is a criterion of
what qualifies as an information measure for an arbitrary channel. A
chain of channels is a different matter, but from the Law (1.8.1) it follows that for every new
channel added to the end of a chain, the information about the initial
input decreases.
It may seem precarious to base a theory on such a negative-sounding
postulate as the Law of Diminishing Information (1.8.1). Yet, our comprehension of the world around us
depends on this Law. If all information were preserved, we would be
lost in the woods, overwhelmed by a thicket of details. We would not
even perceive individual trees, we could not imagine the paths. The
Law prescribes a haze of blissful ignorance and oblivion, through
which we can see patterns emerge (Section 8.13), patterns otherwise
unseen.
Without the workings of the Law, physics would be reduced to
gibberish (Section 11.23). According to the underlying physical laws,
an electron, for example, would carry the information about all of its
previous history. But, as Rydberg stated back in 1906 as a basic
assumption in atomic physics (transl.): "The electrons are, wherever they come
from, always similar to each other" {Rydberg, 1906, p.13}.
We must here insert a notice that application of the mathematical
theory of information to physics necessarily contains an element of
speculation, in anticipation of two kinds of experiment yet to be
made: the localization of atomic memory (Section 14.10), and the
testing of Bell's inequalities on particles with nonzero mass (Section
13.13).
In a closed physical system, there are two main (but opposite)
directions in which information disappears, the
con-direction and the di-direction,
corresponding to converging or
diverging dynamic trajectories (Section
11.1). Chemistry provides an example of why this makes the world
comprehensible: matter will either condense to a
liquid or a solid, or it will diffuse to a
gas. Hence, there are only three clear-cut aggregation modes that the
chemist primarily has to study.
As a prototype of a closed system, we will use particles in a box
(Sections 6.21..22): The converging orbit of a single
boxed particle leading to Schrödinger's wave function (Section
14.3), and the diffusion of many particles leading to the second law
of thermodynamics (Section 14.7).
Either way, information behaves as if it were destroyed: crushed by
contraction, shattered by
dissipation. And, the ongoing destruction of
information provides an arrow of time (Section 6.15).
Footnotes:
^{[1]} Rudolf Julius Emanuel Clausius, 1822-88,
German physicist.
^{[2]} James Stirling (d. 1770), Scottish
mathematician.
^{[3]} Translated from the Anglo-German phrase HITS
STATT BITS coined by Jan Hajek, The Netherlands (private
communication).
^{[4]} I myself have been involved in a design
project in which the resolution of an analog-to-digital conversion was
stretched merely by using a small capacitor to add a 50 Hz disturbance
to an analog signal.
^{[5]} Citation distortion provides another example: a quotation from a reference cited by one author is passed on from one author to another and ends up quite different from the original citation, Section 6.11.
^{[6]} French astronomer and mathematician, 1749-1827.