# The Mathematical Theory of Information: (Re)Search Hints by Jan Hajek, The Netherlands

## Introduction

Jan Hajek, The Netherlands, has contributed this appendix about different functions used to measure information. Questions can be directed to him at hajek@matheory.info.

## (Re)Search Hints

When young, I was interested in cybernetics and in infotheory, but got an opportunity to do some IT only after 1978, when I finished my R&D work on the automated verification/validation of communication protocols, including TCP. Let me complement Jan Kåhre's MTI with some (re)search hints & pointers for readers living & working in the age of the Internet, still driven by TCP. Just search for the names and keywords (JASA = Journal of the American Statistical Assoc., LDI = the Law of Diminishing Information) and get your hits & bits.

Hint 0: Verify that the LDI follows from the fact that no (ir)reversible remapping (e.g. a cascade A-B-C) of a set of discrete symbols can increase info, and an irreversible one will usually decrease it.
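Hint 0 can be spot-checked numerically. A minimal sketch (the joint distribution and the merging map below are my own illustrative assumptions, not from the text): merging two symbols of B, an irreversible remapping B -> C, cannot raise the mutual information with A.

```python
import math

def mutual_info(joint):
    """I(A;B) in bits from a joint distribution given as a dict {(a, b): p}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0) + p
        pb[b] = pb.get(b, 0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

# Toy joint distribution over A in {0,1}, B in {0,1,2} (my own choice).
joint = {(0, 0): 0.3, (0, 1): 0.1, (0, 2): 0.1,
         (1, 0): 0.1, (1, 1): 0.1, (1, 2): 0.3}

# Irreversible remapping: merge symbols 1 and 2 of B into one symbol of C.
merged = {}
for (a, b), p in joint.items():
    c = 0 if b == 0 else 1                      # the remap B -> C
    merged[(a, c)] = merged.get((a, c), 0) + p

# I(A;B) ~ 0.151 bits, I(A;C) ~ 0.125 bits: the remapping lost information.
assert mutual_info(merged) <= mutual_info(joint) + 1e-12
```

Any deterministic remapping gives equality at best (when it is reversible); here the merge is lossy, so the inequality is strict.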

Hint 1: The hallmark of the LDI inf(C@A) <= inf(B@A) is its generality and consequently its asymmetry. The classical symmetrical inequality for cascaded channels is I(B;A) >= I(C;A) <= I(C;B). A metric distance must be symmetrical, e.g. d(A;B) = H(A|B) + H(B|A), normalized dn = d(A;B)/H(A,B); see C. Rajski, Entropy and metric spaces, pp.41-45 (vector diagram of a channel!), in the book Information Theory - 4th London Symposium, 1956, editor Cherry; Shannon 1950 on The lattice theory of information; Yasuichi Horibe 1973, Linfoot 1957 on An information measure of correlation, and Hamdan & Tsokos 1971 on An information measure of association, all 3 in Info & Control. Check and compare these via the LDI, which is in fact much stronger than the triangle inequality, since inf(C@A) <= inf(B@A) + any nonnegative number, e.g. I(C@B) or 0; hence the LDI is a scale law.
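Rajski's distance and its normalization can be computed directly from a joint distribution. A sketch (the toy distribution is my own assumption; only the identities d(A;B) = H(A|B) + H(B|A) = 2*H(A,B) - H(A) - H(B) and dn = d/H(A,B) are from the text):

```python
import math

def entropies(joint):
    """Return H(A), H(B), H(A,B) in bits from a joint {(a, b): p}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0) + p
        pb[b] = pb.get(b, 0) + p
    h = lambda ps: -sum(p * math.log2(p) for p in ps if p > 0)
    return h(pa.values()), h(pb.values()), h(joint.values())

def rajski_distance(joint):
    """d(A;B) = H(A|B) + H(B|A) = 2*H(A,B) - H(A) - H(B); also dn = d/H(A,B)."""
    ha, hb, hab = entropies(joint)
    d = 2 * hab - ha - hb
    return d, d / hab

# Toy joint distribution (my own choice).
joint = {(0, 0): 0.3, (0, 1): 0.1, (0, 2): 0.1,
         (1, 0): 0.1, (1, 1): 0.1, (1, 2): 0.3}
d, dn = rajski_distance(joint)
assert d >= 0 and 0 <= dn <= 1   # dn is a metric normalized into [0,1]
```

Note that dn = 1 - I(A;B)/H(A,B), so dn = 0 exactly when A and B determine each other, and dn = 1 when they are independent.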

Hint 2: The cont() measure and its complement 1 - cont() have several meanings [e.g. 1 - cont(A) is the expected probability of A] and applications ranging from diagnosing to codebreaking. Cont() and 1 - cont() were rediscovered many times and given many names, e.g.:

- Gini/Simpson index of diversity and concentration (1912, 1938; Nature 1949, vol.163, p.688);
- repeat rate by A. Turing & I.J. Good, used for code-breaking since WWII (also in the cryptography book by A. Sinkov, 1968; Simpson, Kullback and Leibler were cryptanalysts indexed in J. Bamford's Puzzle Palace);
- relative decrease in the proportion of incorrect predictions tau = cont(B@A)/cont(A) = (cont(A) - cont(A|B))/cont(A) = 1 - cont(A|B)/cont(A), which is eq. (31) on pp.759-760 of the survey paper on Measures of association for cross classifications, part I, JASA 1954; part III, JASA 1963, eqs. (4.4.1-3) on pp.353-354; later published as a book. Tau is also eq. (11.3-22) in the book by Y. Bishop & S. Fienberg & P. Holland, 1975, p.390, whose chap. 12 derives pseudo-Bayes estimators of probabilities employing cont(A) in K as a posterior risk;
- energie informationnelle (informational energy) by Octav Onicescu in 1966;
- C.L. Sheng & S.G.S. Shiva, On Measures of information, Proc. Nat. Electronics Conf. 1966, Ottawa, pp.798-803; watch for the bugs in eq. (40) and in the following inequalities on p.802 - still a nice paper;
- Havrda & Charvat 1967; quadratic entropy by I. Vajda 1968; JASA 1971 p.534, 1974 p.755;
- quadratic mutual information by G. Toussaint 1972; J. Zvarova 1974; Toomas Vilmansen 1972; Bhargava & Uppuluri 1975-1977; JASA 1982, pp.548-580;
- S. Watanabe in his books Knowing and Guessing, 1969, p.14, and Pattern Recognition, 1985, p.150;
- information energy, surveyed by I.J. Taneja 1989 & 1995, with L. Pardo 1991; parabolic entropy;
- Bayesian distance, thoroughly analyzed by P. Devijver 1972-1979;
- Jan van der Lubbe's papers and Ph.D. thesis on certainty, 1981;
- Pielou on Mathematical Ecology, 1977; Colette Padet 1985;
- Tsallis entropy, which since 1985 plays a role in physics.

See the Encyclopedia of Statistical Sciences and the WWWeb for diversity indices, generalized information measures, generalized divergences, and measures of association.
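The quantities behind these names are easy to compute. A sketch (toy distribution mine; the definitions cont(A) = 1 - sum_a p(a)^2 and tau = (cont(A) - cont(A|B))/cont(A) follow the formulas above):

```python
def cont(p):
    """Gini/Simpson index: cont(A) = 1 - sum_a p(a)^2."""
    return 1 - sum(x * x for x in p)

def tau(joint):
    """Goodman-Kruskal tau = (cont(A) - cont(A|B)) / cont(A):
    the relative decrease in the proportion of incorrect predictions of A
    once B is known."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0) + p
        pb[b] = pb.get(b, 0) + p
    cont_a_given_b = 0.0
    for b0, pb0 in pb.items():
        cond = [p / pb0 for (a, b), p in joint.items() if b == b0]
        cont_a_given_b += pb0 * cont(cond)      # cont(A|B) = E_b cont(A|b)
    cont_a = cont(pa.values())
    return (cont_a - cont_a_given_b) / cont_a

# Toy joint (my own choice): cont(A) = 0.5, cont(A|B) = 0.4, tau = 0.2.
joint = {(0, 0): 0.3, (0, 1): 0.1, (0, 2): 0.1,
         (1, 0): 0.1, (1, 1): 0.1, (1, 2): 0.3}
assert abs(tau(joint) - 0.2) < 1e-9
```

Note tau is asymmetrical by construction: predicting A from B is not the same task as predicting B from A, which connects to Hint 7.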

Hint 3: Imre Csiszar: f-divergence, f-information, f-informativity; Moshe Ben-Bassat: f-entropies, Bayes, probability of error; Cornelius Gutenbrunner: f-divergences as averaged minimal Bayesian risk.

Hint 4: On desiderata for (fuzzy) entropies: De Luca & Termini in Info & Control 1972 & 1974; Bruce Ebanks in J. of Mathematical Analysis and Applications 1983; do not miss his theorem 3.2 on p.32, which says that cont() is the only measure of fuzziness which satisfies all 6 desiderata.

Hint 5: On ent(B|a) as H(B|a): Nelson Blachman, The amount of information that y gives about X; Tebbe & Dwyer, Uncertainty and the probability of error, both in IEEE-IT 1968; Robert Fano's 1963 book on Transmission of Information. Try to formulate new ent(a|B), e.g. H(a|B), cont(a|B), etc.
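One point behind Blachman's title is that the information a single outcome gives behaves differently from the average: ent(B|a) = H(B|a) for a specific a can exceed H(B), even though the average H(B|A) never does. A sketch with a toy joint of my own choosing:

```python
import math

def H(p):
    """Shannon entropy in bits of a probability vector."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def ent_B_given_a(joint, a0):
    """ent(B|a) = H of the conditional distribution p(b|a=a0), in bits."""
    pa0 = sum(p for (a, b), p in joint.items() if a == a0)
    return H([p / pa0 for (a, b), p in joint.items() if a == a0])

# Toy joint (my own choice): a=0 determines B, a=1 leaves B fully uncertain.
joint = {(0, 0): 0.5, (1, 0): 0.25, (1, 1): 0.25}

hb = H([0.75, 0.25])                       # H(B), about 0.81 bits
assert ent_B_given_a(joint, 0) == 0.0      # a=0: B is determined
assert ent_B_given_a(joint, 1) == 1.0      # a=1: H(B|a) = 1 > H(B)
# Yet the average H(B|A) = 0.5*0 + 0.5*1 = 0.5 <= H(B), as it must be.
```

The same per-outcome trick can be tried with cont(), as the hint suggests for cont(a|B).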

Hint 6: Check that the probability of misclassification is bounded by

1 - rel(B@A) <= cont(A|B) <= Min[ H(A|B)/2, 1 - 2^-H(A|B) ]
rel(B@A) >= 1 - cont(A|B) >= Max[ 1 - H(A|B)/2, 2^-H(A|B) ]

see p.23/2.7a-e in the book by M. Mansuripur on Information Theory, 1987. Which other interpretations can you see in the above inequalities?
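The right-hand parts of these bounds can be spot-checked numerically; rel() depends on Kåhre's definition, so this sketch (random joints of my own making) only verifies cont(A|B) <= Min[ H(A|B)/2, 1 - 2^-H(A|B) ]:

```python
import math, random

def conditional_stats(joint):
    """Return (cont(A|B), H(A|B) in bits) from a joint {(a, b): p}."""
    pb = {}
    for (a, b), p in joint.items():
        pb[b] = pb.get(b, 0) + p
    cont_ab = h_ab = 0.0
    for b0, pb0 in pb.items():
        cond = [p / pb0 for (a, b), p in joint.items() if b == b0]
        cont_ab += pb0 * (1 - sum(x * x for x in cond))
        h_ab -= pb0 * sum(x * math.log2(x) for x in cond if x > 0)
    return cont_ab, h_ab

# Spot-check the upper bounds on 1000 random 2x3 joint distributions.
random.seed(1)
for _ in range(1000):
    w = [random.random() for _ in range(6)]
    s = sum(w)
    joint = {(a, b): w[2 * b + a] / s for a in (0, 1) for b in (0, 1, 2)}
    c, h = conditional_stats(joint)
    assert c <= min(h / 2, 1 - 2 ** (-h)) + 1e-12
```

Both bounds are tight at the symmetric binary case: for p(a|b) = (1/2, 1/2) one gets cont(A|B) = 1/2 = H(A|B)/2 = 1 - 2^-H(A|B).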

Hint 7: The asymmetry of the LDI, implying inf(B@A) =/ inf(A@B), is naturally desirable for prediction, forecasting, classification, identification and diagnostic tasks, and of course for measuring the cause-to-effect strength. The book by A. Renyi, A Diary on Information Theory, 1987, the 3rd lecture, discusses (a)symmetry and causality on pp.24-25+33, without offering a solution. Let's reopen the discussion with a simple proposal plus its criticism. The obvious need for asymmetrization of I(A;B) = I(B;A) may seem to be solved simply by normalization, since H(X|Y) <= H(X), hence
asy(B@A) = 1 - H(A|B)/H(A) = (H(A) - H(A|B))/H(A) = I(A;B)/H(A)
asy(A@B) = 1 - H(B|A)/H(B) = (H(B) - H(B|A))/H(B) = I(A;B)/H(B)
from which it is clear that asy(B@A) =/ asy(A@B) solely due to H(A) =/ H(B). Since H(X) increases with the number of distinct discrete values, a.k.a. the cardinality of X, the asymmetry is trivial if card(A) =/ card(B). The question is whether non-Shannonian informations, e.g. the tau above [asymmetrical due to cont(B@A) =/ cont(A@B)], can be substantially less simplistic in this respect. Can the LDI help?
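The proposal above is direct to compute, and a toy case (my own joint, with card(A) = 2 < card(B) = 3) shows exactly the trivial asymmetry being criticized: the two ratios differ only because H(A) < H(B).

```python
import math

def H(p):
    """Shannon entropy in bits of a probability vector."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def asy_pair(joint):
    """Return (asy(B@A), asy(A@B)) = (I(A;B)/H(A), I(A;B)/H(B))."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0) + p
        pb[b] = pb.get(b, 0) + p
    i = H(pa.values()) + H(pb.values()) - H(joint.values())  # I(A;B)
    return i / H(pa.values()), i / H(pb.values())

# Toy joint with card(A) = 2, card(B) = 3 (my own choice).
joint = {(0, 0): 0.3, (0, 1): 0.1, (0, 2): 0.1,
         (1, 0): 0.1, (1, 1): 0.1, (1, 2): 0.3}
asy_ba, asy_ab = asy_pair(joint)
assert asy_ba > asy_ab        # unequal solely because H(A) = 1 < H(B) ~ 1.52
```

With equal cardinalities and a symmetric joint the two values coincide, illustrating why this normalization alone is too simplistic an asymmetrization.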

Hint 8: M. Bongard's book on Pattern Recognition, 1970, chap.7 on Useful information, on disinformation pp.100-101; see p.121. A. Hobson & Bin-Kang Cheng, A comparison of the Shannon and Kullback information measures, J. of Statistical Physics, 1973, pp.301-310, p.305... on undesirable additivity. The book by A.M. Mathai & P.N. Rathie, Basic Concepts in Information Theory and Statistics, 1975, pp.27, 68, 79 on Kerridge inaccuracy, and pp.84, 90, 100-102, 110.