
The Mathematical Theory of Information:
Questions and Answers

  • The back cover says that the classical (Shannon) information theory is a special case of The Mathematical Theory of Information. Can you be more specific? (M.B.)
  • What do you mean by colligation? (M.B.)
  • Additional questions on colligation. (Liu C, China)
  • The title of your book sounds familiar? (M.B.)
  • Shall an information measure be additive? (M.B.)
  • Why does your SUMMARY contain words where the prefixes di- and con- are in bold letters? (Anne M, USA)
  • What is the role of the Law of Diminishing Information in physics? (M.B.)
  • What is the role of the Law of Diminishing Information in biology? (M.B.)
  • Is there some problem with the range of Rényi entropy? (Jan H, The NL)
  • What do you mean by Wiener's ideal receiver? (Jan H, The NL)
  • How come some metricity fundamentals are not mentioned? (Jan H, The NL)
  • Must different information measures be ordered the same way in respect to given probability distributions? (Jan H, The NL)
  • Is it possible that various specific information measures will yield opposite orderings of informations? (Jan H, The NL) [The answer contains a new theorem.]
  • Is there more to The Law of Diminishing Information than the game of telephone or post office? (W. Small Jr.)
  • How to optimally estimate the information in biological signals (as e.g. the search for a neural code has failed)? (Frank Borg, Finland)
  • Is there a minimum energy quantum per bit-process (Landauer etc)? (Frank Borg, Finland)
  • Does information disappear into black holes or is e.g. a form of the holographic principle ('t Hooft) valid? (Frank Borg, Finland)
  • Why do you refer to Bar-Hillel's "ideal receiver" but not mention Woodward, who treats the "ideal receiver" in more detail? (Jan H, The NL)


Question: The back cover says that the classical (Shannon) information theory is a special case of The Mathematical Theory of Information. Can you be more specific? (M.B.)

Answer: You had better buy the book. Section 7.0 lists the eight conditions for the telecommunications (or classical) information theory and its entropy measure to apply:

  • One-to-one correspondence is the goal both in encoding and decoding, and there is in principle no information loss, Theorem T11 (3.10.3).
  • Symmetry follows from the one-to-one correspondence, and entropy is symmetric (4.9.9). There are, however, cases where an asymmetric information measure is appropriate (Section 1.6 The Sailor's Tale).
  • Pure probabilistic message characterization. The 'semantic' aspects are ignored, as well as error correction based on content (Section 8.1).
  • Convexity is a useful property of entropy in the mechanistic systems of technology and physics, but makes entropy unsuitable when gathering knowledge (Section 5.9) or making decisions (Section 4.12).
  • Kraft inequality characterizes the signals, i.e. a message is a demarcated string of symbols. This property leads to entropy. Information is also carried by other kinds of signals (Section 8.7).
  • Causality is involved: the sender causes the signal. Einstein causality (13.2.1) is violated by some carriers of information (Section 13.8).
  • The law of large numbers must apply, the symbol string must be long enough to make the entropy a relevant limit value (Figures 1.2 and 7.6).
  • Ergodicity must hold for the symbol strings, i.e. a long string represents the statistical properties of all the strings. Ergodic models have failed in attempts to describe lingual communication (Section 7.12).


Question: What do you mean by colligation? (M.B.)

Answer: To tie different pieces of information together in a useful way. Some information measures, such as entropy, do not allow colligation: the presence of background knowledge can only decrease the information given by a piece of information. Other information measures, such as utility, allow colligation: prior knowledge can make a piece of information more useful.

As the mental process of gathering knowledge depends on colligation, entropy is not the appropriate information measure in such applications. I met an exception when, relaxing after my writing chores, I read: "Actually, Diane didn't really know anything; her bits of information - and to give Diane her due, they were admittedly legion - floated around in a contextless sea" (Martha Grimes, The Horse You Came In On). Entropy is suitable to measure such scattered pieces of information.

Additional questions on colligation. (Liu C, China)


Question: The title of your book sounds familiar? (M.B.)

Answer: Claude Shannon wrote a book, "The Mathematical Theory of Communication", which is sometimes mistakenly cited as "The Mathematical Theory of Information". This is an example of "citation distortion", a consequence of the Law of Diminishing Information. Authors of scientific works tend to expand their own reference lists by borrowing from others' lists references they have not actually read. But if an author makes a mistake writing down the title of a reference, a borrower will repeat this mistake. So, googling with "mathematical theory of information" and "illinois press" gives a sample of authors referring to Shannon's book even though they have never seen it.


Question: Shall an information measure be additive? (M.B.)

Answer: No information measure is strictly additive. For two pieces of information A and B, an information measure is additive if the information given by A and B together equals the sum of the informations given separately. This holds e.g. for entropy, provided A and B are statistically independent. There is, however, a second condition: A and B must not give information about the same object. Entropy is additive if we find it meaningful to add information about one object to information about another object - adding apples and pears. No information measure is additive if we require additivity also for pieces of information about the same object.
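The independence case can be sketched numerically. The distributions below are my own illustrative choice; for statistically independent A and B, the joint distribution is the product of the marginals, and entropy is additive:

```python
from math import log2

def entropy(dist):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * log2(p) for p in dist if p > 0)

# Two statistically independent objects A and B (illustrative distributions).
A = [0.5, 0.5]
B = [0.25, 0.75]

# Independence: the joint distribution is the outer product of the marginals,
# and then H(A, B) = H(A) + H(B).
joint = [pa * pb for pa in A for pb in B]
additive = abs(entropy(joint) - (entropy(A) + entropy(B))) < 1e-12
```

Note that this only demonstrates additivity across different objects; adding the two entropies is meaningful exactly when A and B concern different things.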


Question: Why does your SUMMARY contain words where the prefixes di- and con- are in bold letters? (Anne M.)

Answer: The prefixes refer to the two alternative ways a physical system can reach equilibrium: by convergence or by divergence. In both cases the information about the system's original state disappears. An example: a swinging pendulum brought to rest by air resistance. The air molecules form a diffusive system, absorbing the mechanical energy and reaching a thermodynamic equilibrium. The movement of the pendulum, in turn, converges to the point of lowest potential energy, reaching a static equilibrium. The information about the initial values of direction and amplitude has disappeared.

The italics of "disappeared" underline that information is never destroyed in a physical system, it merely disappears. If we imagine that, after the pendulum is brought to rest, the velocities of all the air molecules were reversed, they would start to impinge on the pendulum, until the original swinging is restored.


Question: What is the role of the Law of Diminishing Information in physics? (M.B.)

Answer: It makes physics possible. Systems forget their past as they reach equilibrium, or rather, the initial conditions can be eliminated from their description. Otherwise, physics would be complicated beyond comprehension (Section 1.12 Bliss of ignorance).

Mathematically, we use an information gain function which must decrease with time. If we plug in the divergent hypothesis, the Second Law of Thermodynamics follows. If we plug in the convergent hypothesis, Schrödinger's wave equation follows.


Question: What is the role of the Law of Diminishing Information in biology? (M.B.)

Answer: Evolution is based on elimination. If all the intermediate forms between species were alive today (instead of being scantily recorded as fossils), zoology would be a mess. Quoting my recent birthday gift from my wife: "Why is nature not all confusion instead of the species being, as we see them, well defined? ... Why is life so lumpy?" (Darwin's Ghost by Steve Jones).

Mathematically, evolution is the result of two opposite forces: divergence by random mutations, convergence by natural selection. Both obliterate information about the past.


Question: On p.106 your (4.9.5) says that "0 <= alpha < 1 and lim_alpha --> oo", which is a contradiction. I know that this is a typo, as it was me who pointed out the problem in your draft edition. But why limit the Rényi entropies to the range 0 <= alpha < 1 only, when the Master himself has allowed alpha > 1 as well? (Jan H, The NL)

Answer: Right, there is a typo; it should be "lim_alpha --> 1", which produces the Shannon entropy. The range of alpha could possibly be extended to values 1 < alpha < oo, but my published double-Jensen proof (4.9.6) does not work there, so any extension proof is left as an exercise to the reader [as textbook authors do when they are uncomfortable with a proof].


Question: MTI p364 mentions Wiener's ideal receiver IR apparently mentioned in Papoulis 1965 = 1st ed p400 which i Luky Lukiano am just lukking into and i know that u have no Alexandrian lib on Aland so i quote :"We are given two processes g(t) and x(t). The first process is the signal s(t) or a functional of this signal. The second process x(t) is statistically related to the first, and we ASSume that it is "known" (see foot, p 387) for every t in an interval (a,b), where the end points of this interval might depend on t ." Thats all, no IR or Ideal Observer mentioned there anywhere. So IT must be the little word "known" in quotes by Papoulis that has caught our IA's [Ideal Author] attention. So lets luk at p 387 foot : "It should be understood that the data are NOT KNOWN NUMBERS; they are r.v., and when we say that they are known, we mean that the estimate ^g(t) must be expressed in terms of these r.v." This thoughtful foot on p 387 referred to the simple remark there on p 387 saying that : "In an estimation problem we are given: a. The transformation, b. some or all of the statistics c. The data x" and the foot referred to the data. The emphasis on NOT KNOWN NRS is by Papoulis himself. Just in case IA doesNt know what r.v. means IT means random variables. So Papoulis says on p 387 that The data are given, but are NOT KNOWN NRS. Thats all, no further trace of Ideal Receivering. My Q2 proper now is where is the meat ?? ie where is your Wiener's Ideal Receiver hidden here ?? Hope u r not gonna 2 play a hide & seek game with your readerz. (Jan H, The NL)

Answer: Correct, the "ideal receiver" is hidden in the sense that it is implicit in the very notion of optimum filtering, in this case Wiener-Kolmogoroff. An "ideal receiver" extracts all available information from a signal: as defined in my book, a receiver is ideal if its performance cannot be improved by any additional block (e.g. a filter) placed between the signal and the receiver (Section 1.8 The Law of Diminishing Information). The Wiener-Kolmogoroff filter satisfies this condition. Here the amount of received information is measured against the given performance criterion. The Wiener-Kolmogoroff theory provides an example of a mathematically constructed "ideal receiver". [Correct too that I have no scientific library during vacation on my solitary island.]


Question: How come that MTI does NOT even mention such fundamentals, indeed pillars of METRicity like:

  • m(X U Y) + m(X ^ Y) = m(X) + m(Y) here m() is a measure or a metric
  • c(X U Y) + c(X ^ Y) = c(X) + c(Y) here c() is a cardinality of a set
  • H(X , Y) + I(X ; Y) = H(X) + H(Y) are Shannon's entropies
  • P(x or y) + P(x & y) = P(x) + P(y) are probabilities of events x, y

(Jan H, The NL)

Answer: The Law of Diminishing Information is a metrical triangle relation that in fact prohibits numerical additivity of information measures. The closest we get is the theorem "Two messages together give at least as much information as either message alone" (Section 3.15). From the Law it also follows that subadditivity [i.e. using an example from the question, H(X) + H(Y) >= H(X,Y)] is not a necessary property of an information measure, as illustrated (Section 5.9) by an example of colligation: 0 + 0 = 1. This example violates all cases of numerical additivity.


Question: Two distributions are U = {1/2, 1/2} and W = {2/3, 1/6, 1/6}, log2(.) used:

  • Rényi's R0(U) = 1.00 < 1.58 = R0(W) ; alpha = 0 ; = log(card(.))
  • Rényi's R.5(U) = 1.00 < 1.41 = R.5(W) ; alpha = 0.5 ; = Hellinger
  • Shannon's R1(U) = 1.00 < 1.25 = R1(W) ; alpha = 1 ;
  • Rényi's R2(U) = 1.00 = 1.00 = R2(W) ; alpha = 2 ; = -log(Cont(.))
  • Cont(.) C(U) = 0.50 = 0.50 = C(W) ; no log
  • Rényi's R3(U) = 1.00 > 0.86 = R3(W) ; alpha = 3, i.e. cubic entropy

Rényi's R_alpha(P) = (1/(1 - alpha))*log(sum( p^alpha )), thus uniform distribs have R_alpha(U) = log(card(U)) for any alpha, eg card(U) = 2 and P uniform, hence R_alpha(U) = 1.00 for log2(2). Note that as alpha increases from 0 to 2, the difference R_alpha(W) - R_alpha(U) decreases until it vanishes for alpha = 2 or for Cont(). So for a pair of distribs we can get <, =, > as we like, depending on which entropy we choose. Can MTI/LDI advise which to choose when?? (Jan H, The NL)

Answer: No, any order will do for functions of mere probability distributions. The Law of Diminishing Information is only concerned with information ABOUT something, the information B gives about A. For the Law to apply, the information function must contain the conditional probabilities (i.e. the transfer function from B to A): Theorem T17, Section 3.14. There is one exception: a distribution with only one possibility, e.g. P = {1, 0, 0}, does not contain or provide any information, and consequently represents the minimum: Theorems T15 and T16, Section 3.13 Ford's formula.
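The ordering flip in the question's table can be reproduced with a short script (a sketch; the function name is mine, base-2 logs as in the question):

```python
from math import log2

def renyi(dist, alpha):
    """Rényi entropy of order alpha in bits; alpha = 1 is the Shannon limit."""
    if alpha == 1:
        return -sum(p * log2(p) for p in dist if p > 0)
    return log2(sum(p ** alpha for p in dist if p > 0)) / (1 - alpha)

U = [1/2, 1/2]
W = [2/3, 1/6, 1/6]

# As alpha grows, the ordering of R_alpha(U) and R_alpha(W) flips:
assert renyi(U, 0) < renyi(W, 0)                # 1.00 < 1.58
assert renyi(U, 0.5) < renyi(W, 0.5)            # 1.00 < 1.41...
assert renyi(U, 1) < renyi(W, 1)                # 1.00 < 1.25
assert abs(renyi(U, 2) - renyi(W, 2)) < 1e-12   # both equal 1.00
assert renyi(U, 3) > renyi(W, 3)                # 1.00 > 0.86
```

For a uniform distribution the sum is card(U) * (1/card(U))^alpha, so R_alpha(U) = log2(card(U)) for every alpha, as the question states.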


Question: Lets have the following results of medical diagnoses on a bunch of patients :


etc, where each line is one case eg a patient, the 0th column is the diagnosis X, other columns are the cues aka attributes eg symptoms Y1, Y2, Y3, etc representing a vector of cues (in parallel). All values are just discrete symbols ie letters or digits which are treated as letters. If we have enough cases it is easy to get good enough estimates of probabilities for the pairs of events (x, yi) so that we can compute info(X), info(Y1), info(Y2) etc, aaaand also info(Y1@X) , info(Y2@X) , etc ie info(Yi@X) = info(X) - info(X|Yi).

My Question proper: is it possible (for the above described general system) that various SPECific measures of info will yield OPPOSITE ORDERings of infos (similar to those in my earlier question) ie that eg infoOne(Yj@X) < infoOne(Yk@X) while infoTwo(Yj@X) > infoTwo(Yk@X) ie in plain English that One infomeasure will tell us that the cue Yj tells @bout X less than Yk does, while for Another infomeasure the "less" will become "more" for the same pair of j, k aaand j =/ k of course :-) ?????? (Jan H, The NL)

Answer: Yes! Different information measures may yield different orderings. This question leads to a new theorem, therefore it deserves a longer answer than a plain "Yes!". We compare two alternative medical tests B and C to diagnose a trait A. Which one is better? Does B give more information than C about A: inf(B@A) > inf(C@A)? And, the question proper: Is the ordering independent of the selection of information measure?

THE ORDERING THEOREM: For test B, let p1 be the probability of a correct positive test result, i.e. the conditional probability of a positive test result if somebody has the trait A. Let p2 be the conditional probability of a false positive test result if somebody does not have the trait A. For test C, the corresponding probabilities are q1 and q2. If the following two conditions are satisfied

0 <= (q1(1-p2) - q2(1-p1))/D <= 1

0 <= (q2p1 - q1p2)/D <= 1

where D is the determinant p1(1-p2) - p2(1-p1), then inf(B@A) >= inf(C@A) for all information measures.

Proof: The two conditions represent the realizability (Section 5.5 Beyond Bayes) of a transfer function -[x]- so that C-[]-A can be written as -C-[x]-B-[]-A. Hence inf(B@A) >= inf(C@A) follows directly from the Law of Diminishing Information. Q.e.d. Note that the two conditions do not require knowledge of the frequency of the trait A.

A numeric example: Let p1 = 0.9 and p2 = 0.1 and the frequency of the trait A be P(a_1) = 0.3. Make a Cartesian coordinate system with x = q1 and y = q2. If the point (q1;q2) is in the rhombus with corner points (0;0), (0.9;0.1), (0.1;0.9) and (1;1), then inf(B@A) >= inf(C@A) for all information measures. E.g. for C characterized by q1 = 0.95 and q2 = 0.6, this inequality must be valid. If (q1;q2) is in either of the two tetragons (8/9;0), (1;0), (0.9;0.1), (1;1/9) or (0;8/9), (0;1), (0.1;0.9), (1/9;1), then inf(B@A) must be <= inf(C@A).
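The two conditions of The Ordering Theorem are easy to evaluate for the numeric example (a sketch; the function name is mine):

```python
def ordering_conditions(p1, p2, q1, q2):
    """The two ratios of The Ordering Theorem. If both lie in [0, 1],
    test B (p1, p2) gives at least as much information about A as
    test C (q1, q2) for every information measure."""
    D = p1 * (1 - p2) - p2 * (1 - p1)  # the determinant
    c1 = (q1 * (1 - p2) - q2 * (1 - p1)) / D
    c2 = (q2 * p1 - q1 * p2) / D
    return c1, c2

# B: p1 = 0.9, p2 = 0.1; C: q1 = 0.95, q2 = 0.6 from the example.
c1, c2 = ordering_conditions(0.9, 0.1, 0.95, 0.6)
in_rhombus = 0 <= c1 <= 1 and 0 <= c2 <= 1  # (0.95; 0.6) lies in the rhombus
```

Here D = 0.8, c1 = 0.99375 and c2 = 0.55625, so both conditions hold and inf(B@A) >= inf(C@A) for all information measures.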

But "less" can become "more". For points (q1;q2) outside those areas, e.g. q1 = 1 and q2 > 1/9, there is no ordering rule. If q1 = 1 and 1/7 < q2 < (1/7)(315/269), then the reliability (Section 3.6 Reliability) tells us B is superior, rel(B@A) >= rel(C@A), but the parabolic entropy (Section 4.4 The semantic cont measure) tells us C is superior, cont(C@A) >= cont(B@A). If q2 = 0.157, then rel(B@A) = 0.90, rel(C@A) = 0.89, but cont(B@A) = 0.252, cont(C@A) = 0.259.
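The reversal can be checked numerically. A sketch, assuming reliability is the expected probability of a correct best guess and the cont gain is the expected reduction in cont(P) = 1 - sum(p^2); the helper names are mine:

```python
def outcomes(test, prior_a1):
    """P(result) and the posterior over (a1, a2) for a binary test,
    where test = (P(pos|a1), P(pos|a2))."""
    s1, s2 = test
    pa1, pa2 = prior_a1, 1 - prior_a1
    res = []
    for j1, j2 in ((s1 * pa1, s2 * pa2), ((1 - s1) * pa1, (1 - s2) * pa2)):
        pr = j1 + j2
        res.append((pr, (j1 / pr, j2 / pr)))
    return res

def rel(test, prior_a1):
    """Reliability: probability of a correct best guess after the test."""
    return sum(pr * max(post) for pr, post in outcomes(test, prior_a1))

def cont_gain(test, prior_a1):
    """Gain in the parabolic cont measure, cont(P) = 1 - sum(p^2)."""
    cont = lambda d: 1 - sum(p * p for p in d)
    prior = (prior_a1, 1 - prior_a1)
    post_avg = sum(pr * cont(post) for pr, post in outcomes(test, prior_a1))
    return cont(prior) - post_avg

B, C, PA1 = (0.9, 0.1), (1.0, 0.157), 0.3
# Reliability ranks B above C, but the cont measure ranks C above B:
better_by_rel = rel(B, PA1) > rel(C, PA1)
better_by_cont = cont_gain(C, PA1) > cont_gain(B, PA1)
```

With q2 = 0.157 this reproduces rel(B@A) = 0.90, rel(C@A) = 0.89, cont(B@A) = 0.252 and cont(C@A) = 0.259.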

Note: The Ordering Theorem does not prescribe a strict ordering. E.g. the measure rank (Section 5.7 Rank) is ranka(C@A) = 0 along the diagonal from (0;0) to (1;1), and ranka(C@A) = 1 for all other points (q1;q2).

For points (q1;q2) outside the domain of The Ordering Theorem, the ordering depends on the selection of information measure. Example: Some studies of breast cancer screening programs indicated that "screening had no effect on overall death rates. What screening did do was cause women to receive more medical interventions. As there was no change in overall death rates, the researchers concluded that any fall in cancer deaths was being cancelled out by deaths from the unnecessary treatments." {Geoff Watts: Safe or Sorry, New Scientist, 22 June 2002, p.35}. Consequently, to evaluate the information given by a medical test (or screening program), the benefit of a correct positive must be weighed against the cost of a false positive, r = benefit/cost. Hence the relevant measure is von Neumann utility (Section 3.4 Utility)

Nut(B@A) = Max[p1P(a_1)r - p2(1-P(a_1)), 0] + Max[(1-p1)P(a_1)r - (1-p2)(1-P(a_1)), 0]

in accordance with the example in Section 4.12 A Medical Test. The test is useless if the Max operator leads to Nut(B@A) = 0 + 0. Provided the decision process is rational, as in Nut(B@A), no medical test can be worse than useless. Another information measure, e.g. entropy, can make a spurious difference between two useless test programs.
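The utility formula can be evaluated directly. The numbers below are my own assumptions for illustration, not those of Section 4.12:

```python
def nut(p1, p2, prior_a1, r):
    """von Neumann utility of a test; r = benefit/cost weighs a correct
    positive against a false positive, and Max clips each term at zero."""
    pos = max(p1 * prior_a1 * r - p2 * (1 - prior_a1), 0.0)
    neg = max((1 - p1) * prior_a1 * r - (1 - p2) * (1 - prior_a1), 0.0)
    return pos + neg

# Assumed numbers: p1 = 0.9, p2 = 0.1, P(a_1) = 0.3, r = 1.
useful = nut(0.9, 0.1, 0.3, 1.0)   # positive: the test is worth taking
useless = nut(0.5, 0.5, 0.3, 1.0)  # 0 + 0: a useless test, never negative
```

Because each term is clipped at zero, Nut can never go below 0: a rational decision process makes a useless test exactly useless, not harmful.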


Question: Is there more to The Law of Diminishing Information than the game of telephone or post office? (W. Small Jr.)

Answer: Sure, there is more to it. The telephone represents a homogeneous chain in the sense that we can expect the transmitted information to decrease if one more player is added anywhere in the chain. This is a trivial special case of the Law of Diminishing Information: noise is never beneficial. Nothing much can be proved based on that. For a chain A-[1]-B-[2]-C where we compare inf(B@A) and inf(C@A), the second block (or channel) -[2]- is critical.

As with any measurement, we want the meter to be as good as possible. The best information meter is the ideal receiver: "If you understand a message, you probably get more information out of it than if you don't understand it". An ideal receiver can be defined as a receiver that cannot be improved upon by adding a block before it. If inf(C@A) were greater than inf(B@A), then the receiver of B cannot be an ideal receiver, because we could replace it by the better receiver -[2]-C.

The output of a block (or channel) is a transformation (deterministic or probabilistic) of the input. We can interpret such a block in different ways:

1. A block is interpreted as a Translator: Think of A as a message in Finnish, and -[2]- as a computer programmed for Finnish-to-English translation. An Englishman receiving B says there is zero information (gibberish!), but receiving C he gets useful information. Then he is not an ideal receiver, because his performance can be improved by adding a block before him. No translator can be of help to an ideal receiver.

2. A block is interpreted as Noise: Think of B as a light signal received by a photocell. Adding a disturbance (a chopper or random noise) -[2]- before the photocell improves its performance. Then the photocell by itself cannot be an ideal receiver, as the combination of disturbance and photocell forms a better receiver. Noise can be beneficial, but not to an ideal receiver.


Question: How to optimally estimate the information in biological signals (as e.g. the search for a neural code has failed)? (Frank Borg, Finland)

Answer: The Mathematical Theory of Information provides the tools for information estimation, but to answer a real-world question, the theory must be supplemented by raw material in the form of data or assumptions. This theory is general enough to apply to biological signals. In contrast, the classical (Shannon) information theory does not apply to neural signals: it assumes that information increases with the signal-to-noise ratio, whereas neural communication (Section 12.17 Neural correlates) exemplifies that noise can be beneficial (Section 1.10 Noise and nonlinearities).


Question: Is there a minimum energy quantum per bit-process (Landauer etc)? (Frank Borg, Finland)

Answer: No. A construction of a cyclic Szilard engine (Section 14.16 Szilard's engine) defies the argumentation, based on quantum uncertainty and originated by Brillouin, that "Boltzmann's constant k is shown to represent the smallest amount of negative entropy required in an observation". On the other hand, quantum uncertainty limits the information flow (bits/s) if the available power (Watts) is given (Section 14.1 The Heisenberg receiver). In that case the optimum number of signal levels is 5 (pets) instead of 2 (bits).


Question: Does information disappear into black holes or is e.g. a form of the holographic principle ('t Hooft) valid? (Frank Borg, Finland)

Answer: Assuming the laws of physics are time-reversible, information cannot be destroyed. But, information can disappear, i.e. be beyond observation (Section 11.0 The macroscopic observer). Information disappears into black holes but it's still there, preserved by holography or by "a giant tangle of superstrings" (Section 14.10 In search of hidden order). Also an atom is sort of a black hole! When an atom is excited by absorbing a photon, it will remember the exact moment of time it was excited, but this information will disappear beyond observation in less than 1 microsecond (Section 1.12 Bliss of ignorance).


Question: Why do you refer to Bar-Hillel's "ideal receiver" but not mention Woodward, who treats the "ideal receiver" in more detail? (Jan H, The NL)

Answer: I think Woodward's book is the most profound book on information theory written since Shannon's. Yet in my book the ideal receiver appears in the verbal Chapter 1, not in the mathematical Chapter 2. Hence the philosopher's description fitted in better.

In contrast to my treatise, neither of them develops the idea of the "ideal receiver" further. Bar-Hillel writes: "The interpretation of semantic information with the help of such a superhuman fictitious intellect should be taken only as an informal indication" {Bar-Hillel, 1964, p.224}. Woodward writes: "Unfortunately the difficulties of applying these simple-sounding ideas in anything like a rigid form are usually insuperable, yet the theory is of some interest for the understanding it will be found to give of the general reception problem." {Woodward, 1955, p.63}

Woodward is one of the few authors on information theory or probability theory who does not sweep the "inverse probability" under the rug. Most authors neglect the many cases where Bayes' theorem does not apply but has to be replaced by matrix inversion (Section 5.4 Beyond Bayes).



Question 2: Can information about the same object be colligated? Suppose an object A with two kinds of information I1 and I2; can I1 and I2 be colligated as a whole (e.g. I) to express A?

For example, when I hear the voice of a criminal, I know in general that he is there. But I am not sure, for his voice may be like someone else's. Then I also see his body from far away, and I think I can be more certain than by only hearing his voice. So how can the two kinds of information about the criminal be colligated together to express my increased assurance? (Liu C, China)

Answer 2: Two pieces of information I1 and I2 about the same object will as a rule COMBINE to produce a better result. I1 and I2 together give at least as much information as one message alone, (3.15.1) in my book:
(a) inf(I1*I2@A) >= inf(I1@A)
Colligation has another meaning: information I2 is more valuable if we already know I1:
(b) inf(I2@A|I1) >= inf(I2@A) (= colligation)
Colligation characterizes information measures such as reliability (3.6.1). Entropy has a property called convexity that is the opposite of colligation:
(c) ent(I2@A|I1) <= ent(I2@A) (= convexity)
Using entropy, I2 provides less information if you already know I1: the more you know, the less the value of any new information. I feel this is contrary to human experience, which tells us that the more we know the better we can appreciate new information. Entropy is good for engineering purposes, but not for the evaluation of human information.

To measure information, we must consider all potential outcomes. A numerical example: Take four potential culprits #1,...,#4. We do not know which of them is guilty, but before we get any information, the probabilities of guilt are P(#1) = 0.1, P(#2) = 0.2, P(#3) = 0.3, P(#4) = 0.4. I1 is the observation whether the voice of the culprit was high or low, I2 the observation whether he was long or short. We know that only #1 and #2 have low voices, and that only #1 and #3 are short. Using (4.9.2...4), the information values measured in bits are
(d) ent(I1*I2@A) = 1.846439345
(e) ent(I1@A) = 0.881290899
(f) ent(I2@A) = 0.970950594
(g) ent(I1@I2) = ent(I2@I1) = 0.005802149
We can see that (a) holds. We can also see that convexity holds from (5.9.12)
(h) ent(I2@A|I1) = ent(I2@A) - ent(I2@I1) = 0.965148445
For reliability (3.6.1), we get
(i) rel(I1*I2@A) = 1
(j) rel(I1@A) = 0.6
(k) rel(I2@A) = 0.7
(m) rel(I2@A|I1) = 1
Hence both inequalities (a) and (b) are satisfied. Further, I think it makes more sense in a case like this to measure information in terms of reliability instead of entropy.
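The bit values (d)...(g) and the reliabilities (i)...(k) can be checked with a short script; encoding I1 and I2 as partitions of the four culprits is my own bookkeeping:

```python
from math import log2

prior = [0.1, 0.2, 0.3, 0.4]   # guilt probabilities of culprits #1..#4
I1 = [{0, 1}, {2, 3}]          # voice: low = {#1, #2}, high = {#3, #4}
I2 = [{0, 2}, {1, 3}]          # height: short = {#1, #3}, long = {#2, #4}
I12 = [{0}, {1}, {2}, {3}]     # I1 and I2 together single out the culprit

def H(dist):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in dist if p > 0)

def ent(partition):
    """Entropy information gain: H(A) - H(A | observation)."""
    cond = sum(sum(prior[i] for i in cell)
               * H([prior[i] / sum(prior[j] for j in cell) for i in cell])
               for cell in partition)
    return H(prior) - cond

def rel(partition):
    """Reliability: probability of a correct best guess after observing."""
    return sum(max(prior[i] for i in cell) for cell in partition)
```

This reproduces ent(I1*I2@A) = 1.8464, ent(I1@A) = 0.8813, ent(I2@A) = 0.9710, rel(I1@A) = 0.6, rel(I2@A) = 0.7 and rel(I1*I2@A) = 1.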

Question 3: Maybe it is necessary to find a method to measure the colligation of different kinds of information about the same object? (Liu C, China)

Answer 3: I suggest measuring colligation as the difference

coll(B@A|C) = inf(B@A|C) - inf(B@A)

If coll() is positive, it is colligation; if it is negative, it is anticolligation or convexity. From (5.9.12) you can see that for entropy, coll() is always negative: coll() = -ent(B@C).

For your following two questions, I ask for your patience to let me start from the beginning. All books I have seen on probability theory are quite disorganized. All is made SIMPLE if you know the concept "statistically determined" (Section 2.3, page 27): all probabilities P(ai,bj,ck) are known. Then all other values can be calculated. And vice versa, if we don't know all P(ai,bj,ck), the system is not determined. The probabilities P(ai,bj,ck) provide a general description; any system can be defined by those values. Only one restriction: all the P(ai,bj,ck) must add up to 1.

Question 4: On page 113 of your book, how do you get (4.12.9) from (4.12.8)? I mean, how is the value -0.00891 obtained? What are the steps behind rel_abs(b1@A) = -0.00891? Can you give me the details? (Liu C, China)

Answer 4: The system is statistically determined because all P(ai,bj) are given in (4.12.1). From that you calculate for case I:
P(b1) = 0.0095 + 0.0495 = 0.059 (2.3.1)
P(a1) = 0.0095 + 0.0005 = 0.01 (2.3.1)
P(a2) = 1 - P(a1) = 0.99 = Max_i P(ai)
P(a1|b1) = 0.0095/0.059 (2.3.5)
P(a2|b1) = 0.0495/0.059 = Max_i P(ai|b1)
rel_abs(b1@A) = 0.059[(0.0495/0.059) - 0.99] = -0.00891 (4.12.8)
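The same steps in a short script, using only the joint probabilities quoted above from (4.12.1):

```python
# Joint probabilities P(ai, bj) from (4.12.1), case I:
P_a1b1, P_a2b1, P_a1b2 = 0.0095, 0.0495, 0.0005

P_b1 = P_a1b1 + P_a2b1     # 0.059, by (2.3.1)
P_a1 = P_a1b1 + P_a1b2     # 0.01, by (2.3.1)
P_a2 = 1 - P_a1            # 0.99, the best blind guess Max_i P(ai)
P_a2_b1 = P_a2b1 / P_b1    # Max_i P(ai|b1), by (2.3.5)

# (4.12.8): a negative absolute reliability gain for the outcome b1.
rel_abs = P_b1 * (P_a2_b1 - P_a2)
```

The result is -0.00891: seeing b1 actually lowers the probability of a correct best guess compared with guessing a2 blindly.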

Question 5: Assume there are three kinds of information X (voice), Y (height) and Z (blood type) about the same culprit O, and X, Y and Z are independent, that is P(XY)=P(X)P(Y), P(XZ)=P(X)P(Z), P(YZ)=P(Y)P(Z). Let X, Y and Z be sets as X={x1 (high), x2 (middle), x3 (low)}, Y={y1 (long), y2 (middle), y3 (short)}, Z={z1 (O-Type), z2 (B-Type), z3 (A-Type)}. Let O be the set O={o1 (culprit), o2 (not culprit)}. If P(o1)=P(o2)=0.5, it means I don't know whether a person is a culprit or not. And P(o1|x1)=0.1 means that when I know one's voice is high, then I can think he is a culprit with the probability of 0.1, or I can think he is not a culprit with probability of 0.9, that is P(o2|x1)=0.9. Further, P(o1|x2)=0.2 means that when I know one's voice is middle then I can think he is a culprit with the probability of 0.2, or I can think he is not a culprit with probability of 0.8, that is P(o2|x2)=0.8. And P(o1|x3)=0.7 means that when I know one's voice is low then I can think he is a culprit with the probability of 0.7, or I can think he is not a culprit with probability of 0.3, that is P(o2|x3)=0.3.
P(o1|x1)=0.1, P(o1|x2)=0.2, P(o1|x3)=0.7;
P(o1|y1)=0.1, P(o1|y2)=0.2, P(o1|y3)=0.7;
P(o1|z1)=0.1, P(o1|z2)=0.2, P(o1|z3)=0.7;
The probabilities of x1, x2, x3 etc. are the following (perhaps not necessary for my question):
P(x1)=0.2, P(x2)=0.5, P(x3)=0.3
P(y1)=0.2, P(y2)=0.5, P(y3)=0.3
P(z1)=0.2, P(z2)=0.5, P(z3)=0.3
Then my question is: when I know a short (y3) person with a high voice (x1) and B-Type (z2) blood, what is the probability that he is a culprit?
(Liu C, China)

Answer 5: If all P(o_s, x_i, y_j, z_k) are known, we can directly calculate any probability, as in your question
(a) P(o1|x1,y3,z2) = P(o1,x1,y3,z2)/[P(o1,x1,y3,z2)+P(o2,x1,y3,z2)]
If the P(ai,bj,ck,...) are not known explicitly, in principle it's possible to calculate them from given sufficient conditions, but in practice the calculations can be cumbersome. We start with a simple illustrative example with only two binary sets, ai and bj. We have four unknowns: P(a1,b1), P(a1,b2), P(a2,b1), P(a2,b2), and only one equation to start with
(b) P(a1,b1) + P(a1,b2) + P(a2,b1) + P(a2,b2) = 1
We need three more equations. Let's stipulate that P(a1) = 0.4. Then (2.3.1) yields
(c) P(a1,b1) + P(a1,b2) = 0.4
P(a2) = 0.6 does not provide a useful equation because it follows from (b) and (c). Instead we assume that A and B are statistically independent: P(ai,bj) = P(ai)P(bj) provides four conditions. We start with P(a1,b1) = P(a1)P(b1) and get, after some elementary algebraic calculations,
(d) P(a1,b1) P(a2,b2) - P(a1,b2)P(a2,b1) = 0
That is, the determinant of P(ai,bj) = 0. All four P(ai,bj) = P(ai)P(bj) produce the same result, so the independence condition provides only one equation (d). We need one more equation, so we add P(b1) = 0.3 and get
(e) P(a1,b1) + P(a2,b1) = 0.3
From (b) ... (e) we can solve: P(a1,b1) = 0.12, P(a1,b2) = 0.28, P(a2,b1) = 0.18, P(a2,b2) = 0.42. The system is now statistically determined and we can easily calculate whatever we want.
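Under independence, solving (b)...(e) reduces to taking the outer product of the marginals; a sketch:

```python
# Givens: P(a1) = 0.4, P(b1) = 0.3, and A, B statistically independent.
P_a = [0.4, 0.6]
P_b = [0.3, 0.7]

# Independence turns the joint table into the outer product of marginals.
joint = [[pa * pb for pb in P_b] for pa in P_a]

total = sum(sum(row) for row in joint)                       # must be 1, eq. (b)
det = joint[0][0] * joint[1][1] - joint[0][1] * joint[1][0]  # must be 0, eq. (d)
```

This yields P(a1,b1) = 0.12, P(a1,b2) = 0.28, P(a2,b1) = 0.18, P(a2,b2) = 0.42, with the probabilities summing to 1 and the determinant condition (d) satisfied.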

Your case of P(o_m, x_i, y_j, z_k) contains 2*3*3*3 = 54 unknowns, so a systematic approach is necessary. If the number of equations is less than 54, the system is not completely defined. But there is another danger too: some conditions can be conflicting. Your specification contains one such conflict. On one hand P(o1) = P(o2) = 0.5, and on the other hand, the given P(o1|xi) and P(xi) produce P(o1) = 0.1*0.2 + 0.2*0.5 + 0.7*0.3 = 0.33, which is inconsistent with P(o1) = 0.5.

Instead we make a new start from
(f) P(o1) = P(o2) = 0.5.
Then the next step would be to define P(xi|o1), P(yj|o1), P(zk|o1). IMPORTANT: if we instead start by defining P(o1|xi) etc., we end up in seyaB's theorem (Section 5.5, pages 126...129) and no end of cumbersome calculations. Always avoid seyaB! Moreover, P(xi|o1) etc. can be put on a sound empirical base. E.g. P(x1|o1) = 0.80 means that 80% of observations about o1 are "person with high voice". Also, there has been an ambiguity about xi: is it a property of somebody or a description of an observation? This ambiguity is now removed, as xi is directly connected to the observation. As a numerical example, I define:

(g) P(x1|o1) = 0.80, P(x2|o1) = 0.15, P(x3|o1) = 0.05, P(y1|o1) = 0.10, P(y2|o1) = 0.20, P(y3|o1) = 0.70, P(z1|o1) = 0.30, P(z2|o1) = 0.40, P(z3|o1) = 0.30

We also assume that the observations about o1 are independent (something we can do in an example, but a risky assumption in real-life applications):
(h) P(xi,yj,zk|o1) = P(xi|o1)P(yj|o1)P(zk|o1)
(i) P(o1,x1,y3,z2) = P(x1|o1)P(y3|o1)P(z2|o1)P(o1) = 0.8*0.7*0.4*0.5 = 0.112
The alternative o2 (= o1 is not the offender) can be thought of as representing the average citizen. For example, let 30% of the general population have a high voice, 40% a middle voice, and 30% a low voice, etc. This way we could calculate e.g.
(j) P(o2,x1,y3,z2) = P(x1|o2)P(y3|o2)P(z2|o2)P(o2) = 0.3*0.3*0.4*0.5 = 0.018
Insertion of (i) and (j) into (a) yields
(k) P(o1|x1,y3,z2) = 0.112/(0.112 + 0.018) = 0.112/0.13 = 0.86 (approx.)
The observation x1,y3,z2 increases the probability of guilt from 50% to 86%. This statement has the weakness, however, that it depends on the assumption that o1 is guilty with the a priori probability P(o1) = 0.5. If o1 is picked at random from the streets of Tokyo, maybe P(o1) = 0.000001. Then the observation x1,y3,z2 increases the probability of guilt P(o1) from 0.000001 to about 0.000006. Not very convincing, I would say (but perhaps close enough for an American jury). A more reasonable case in which to assume (f) would be deciding between two suspects, o1 and o2, with the same opportunity, the same character and the same motive.
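A minimal Python sketch of steps (h) ... (k), including the sensitivity to the prior discussed above (the `posterior` helper and its name are mine):

```python
# Posterior probability of guilt from the likelihoods (g) and a prior P(o1).
# The o2 likelihoods are the "average citizen" figures assumed in the text.
def posterior(prior_o1, lik_o1, lik_o2):
    """P(o1 | x1,y3,z2) by Bayes' rule with independent observations (h)."""
    num = prior_o1 * lik_o1
    den = num + (1 - prior_o1) * lik_o2
    return num / den

lik_o1 = 0.8 * 0.7 * 0.4   # P(x1|o1) P(y3|o1) P(z2|o1) = 0.224
lik_o2 = 0.3 * 0.3 * 0.4   # P(x1|o2) P(y3|o2) P(z2|o2) = 0.036

print(round(posterior(0.5, lik_o1, lik_o2), 2))  # 0.86: the two-suspect case (k)
print(posterior(0.000001, lik_o1, lik_o2))       # about 6e-6: random passer-by
```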

If we assume the values (i) and (j), we can calculate whatever we want. inf() denotes the generic information measure, and a numerical value cannot be calculated without a specific definition, e.g. inf() = rel(). Also, I think there is not much point in discussing rel(x1,y3,z2@o1), the reliability of an observation x1,y3,z2 related to a single suspect. Instead, we should think about the reliability of the observation x1,y3,z2 related to the set of suspects O containing o1 and o2. From (4.12.3) and (3.4.9):
(m) rel(x1,y3,z2@O) = Max_i P(oi,x1,y3,z2) = Max[0.112, 0.018] = 0.112
In the same way as we selected the form (4.12.3) from (4.12.2), we select from (3.4.8):
(n) Nut(x1,y3,z2@O) = Max_k Sigma_i P(oi,x1,y3,z2)U(oi|dk)
To obtain a numerical value, we must specify the utility U(oi|dk), for example fair odds (3.5.1), and hence by definition (3.5.4)
(p) Nut_1/P(x1,y3,z2@O) = Max_i P(x1,y3,z2|oi) = Max[0.112/0.5, 0.018/0.5] = 0.112/0.5 = 0.224
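The measures (m) and (p) can be sketched as follows (the dictionary names are mine; the numbers are the joint probabilities from (i) and (j)):

```python
# The measures (m) and (p), computed from P(o_i, x1,y3,z2) of (i) and (j).
P_joint = {"o1": 0.112, "o2": 0.018}
P_prior = {"o1": 0.5, "o2": 0.5}

# (m) rel(x1,y3,z2 @ O) = Max_i P(o_i, x1,y3,z2)
rel = max(P_joint.values())
print(rel)  # 0.112

# (p) Nut_1/P(x1,y3,z2 @ O) = Max_i P(x1,y3,z2 | o_i), under fair odds (3.5.1)
nut = max(P_joint[o] / P_prior[o] for o in P_joint)
print(nut)  # 0.224
```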

Question 6: Although I promised myself not to get involved in TI, your colligations seem irresistible. In MTI you have not written up an explicit formula for colligation as you do here above, but in retrospect I see you came damned close to it in (5.9.2). Wouldn't it be proper to say that coll(.) measures a SYNERGY between pieces of knowledge? I mean: wouldn't synergy be better than synthesis (p.136)? (Jan H, NL)

Answer 6: I would reserve COLLIGATION for the asymmetric position of B and C in inf(B@A|C) and could use SYNERGY for the symmetric inf(B*C@A). Both cases would be SYNTHESIS.

Question 7: Why do you use the channel B-A-C and not A-B-C? Or are you assuming that the actual channel is A-B-C, but you need/have to VIEW it as if it is B-A-C? If so, then why? (Jan H, NL)

Answer 7: Here the actual channel is B-A-C or with symbolic arrows B<-A->C. I try to use A as what the information is about, inf(.@A).

Question 8: You know my weakness for playing with various measures of influence between (pairs or triples of) events [i.e. not averages like inf(os)]. Can you envision a colligation between events a, b, c (i.e. not variables, i.e. sets of events A, B, C)? (Jan H, NL)

Answer 8: I see meaning in discussing individual elements a, b, c if they can be related to the total picture described by A*B*C, e.g. by defining an addition rule for building the whole picture from the individual elements. I lack understanding of such questions, such as the quantization of causation for individual elements.

Question 9: Can you show me some nontrivial, practically relevant applications (examples of use) of colligation for variables A, B, C, and if possible also for events a, b, c? (Jan H, NL)

Answer 9: I'm the wrong person to ask about PRACTICALLY RELEVANT APPS due to my lack of empirical experience. That is your area of expertise. The forensic question posed by Liu is as close as I get.

Question 10: Here you say that for entropy the colligation will always be < 0. For which inf(os) will colligation not (always) be < 0? And what will (not) always hold for cont(.)? (Jan H, NL)

Answer 10: The other Shannon measure, reliability, has the property that colligation >= 0. This is proved by

rel(B@A) = Sigma_j P(b_j)Max_i P(a_i|b_j) = Sigma_j Max_i Sigma_k P(a_i b_j c_k)

rel(B@A|C) = Sigma_k P(c_k)Sigma_j P(b_j|c_k)Max_i P(a_i|b_j c_k) =
= Sigma_j Sigma_k Max_i P(a_i b_j c_k)

As there is a separate maximization over i for each pair (j,k) in rel(B@A|C), but only one for each j in rel(B@A), we get
rel(B@A|C) >= rel(B@A)
i.e. reliability is characterized by colligation >= 0. It is possible that colligation >= 0 holds for all utility-type information measures and that colligation <= 0 holds for all measures generated by convex functions, including cont(.), but I have not checked this.
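The inequality can also be checked numerically; a sketch, assuming randomly generated joint distributions P(a_i, b_j, c_k) (the set sizes and sampling scheme are my own choices):

```python
# Numerical check of rel(B@A|C) >= rel(B@A): for random joint distributions
# the conditional reliability never falls below the unconditional one,
# because a maximum of sums cannot exceed a sum of maxima.
import itertools
import random

random.seed(1)
nA, nB, nC = 3, 4, 2

for _ in range(1000):
    w = [random.random() for _ in range(nA * nB * nC)]
    s = sum(w)
    P = {(i, j, k): w[(i * nB + j) * nC + k] / s
         for i, j, k in itertools.product(range(nA), range(nB), range(nC))}

    # rel(B@A) = Sigma_j Max_i Sigma_k P(a_i, b_j, c_k)
    rel = sum(max(sum(P[i, j, k] for k in range(nC)) for i in range(nA))
              for j in range(nB))
    # rel(B@A|C) = Sigma_j Sigma_k Max_i P(a_i, b_j, c_k)
    rel_cond = sum(max(P[i, j, k] for i in range(nA))
                   for j in range(nB) for k in range(nC))

    assert rel_cond >= rel - 1e-12

print("colligation >= 0 held in all 1000 random trials")
```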



Page v: "1.9 Shannnon's" should be "1.9 Shannon's"
Page 21, line 2: "Rydberg, 1906, p.13" should be "Rydberg, 1906, p.17"
Page 53, footnote 6: "psycholgist" should be "psychologist"
Page 79, line 32: "(Section 1.11)" should be "(Section 1.12)"
Page 82, (4.0.10) and (4.0.11): "Max_i P(a_i|b_j)U(a_i)" should be "Max_i (P(a_i|b_j)U(a_i))"
Page 84, (4.1.13): "log(P(b_j|a_i))" should be "log(P(a_i|b_j))"
Page 103, (4.8.7): "f ''(0) >= 0" should be "f '(0) >= 0"
Page 106, under (4.9.5): "lim_alpha -> oo" should be "lim_alpha -> 1"
Page 112, (4.12.1): "(1-epsilon)(1-delta_1)" should be "(1-epsilon)(1-delta_2)"
Page 124, footnote 3: "rectangle" should be "square"
Pages 126 & 490: "Woodward, 1955" should be "Woodward, 1953"
Page 128, (5.5.13), both right-hand if-conditions: "<=" should be ">="
Page 136: the number (5.9.9) is missing, but no equation is missing
Page 148, above (6.4.8): "any function f(P(a_i))" should be "any positive function f(P(a_i)) >= 0"
Page 177, line 11: "g(x)log" should be "g(x) = xlog"
Page 261, footnote 19: "{8.17p.412}" should be "{Holton, 1988, p.412}"
Page 278, (9.9.1): "/2sigma^2))" should be "/(2sigma^2))"
Page 305, (10.11.2): ") +" should be ")dx +"
Page 308, (10.13.15): "1/nsigma^2" should be "1/(nsigma^2)"
Page 399, line 32: "(Vaihinger, Section 8.13)" should be "{Vaihinger, 1927}"
Page 399, line 33: "(Planck, Section 6.23)" should be "{Planck, 1955, p.23}"
Page 429, (13.18.1): "-n^2))" should be "-n^2)"
Page 434: "{Einstein, 1907, p.1945}" should be "{Parker, 1989, p.1945}"
Page 441, (14.3.7): "(-h" should be "-(h"
Page 442, (14.3.11), (14.3.12), (14.3.13): "(-h" should be "-(h"
Page 442, (14.3.14): "(-h" should be "(h"
Page 442, (14.3.14): "V(x))psi(x)" should be "V(x))2psi(x)"
Page 464, (14.14.3): "W" should be "Delta W"
Page 469, line 17: "Bolzmann's" should be "Boltzmann's"
Pages 480, 495, 497: "Csizár" should be "Csiszár"
Page 484: there should be no empty line between Horgan and Houdini
Page 493: add on an empty line: "_-a / ^a Prefix to insert values -a and a into an integral (14.4.6)"
Page 494: "apples & pears, 37," should be "apples & pears, 35, 37,"
Page 494: "Bar-Hillel, Yehoshua, 12, 88" should be "Bar-Hillel, Yehoshua, 4, 12, 58, 81, 88"
Page 497, information properties: "content 62, 118" should be "content 62, 111, 118"
Page 499: "receiver, ideal 222, 246, 437" should be "receiver, ideal 12-14, 27, 34, 40, 48, 50, 76, 77, 221, 222, 246, 260, 291, 307, 326, 327, 364 footnote, 367, 433, 437, 474, 476"


Contact: jankahre (at) hotmail.com