-.- .-
Causal INSIGHTS INSIDE for data mining to fight data tsunami and confounding
or
Causation and confounding as indicated by probabilistic Implication*Surprise
in relative risk RR, likelihood ratio LR, I.J.Good, Kemeny vs Popper, Google,
for data mining in epidemiology, evidence-based medicine, economy, investments
Copyright (C) 2002 - 2004, Jan Hajek, Netherlands
NO part of this document may be published, implemented, programmed, copied
or communicated by any means without an explicit & FULL reference to this
author together with the FULL title and the website WWW.MATHEORY.INFO or
WWW.MATHEORY.COM plus the COPYRIGHT note in texts and in ALL references to
this. An implicit, incomplete, indirect, disconnected or unlinked reference
(in your text and/or on www) does NOT suffice.
All based on 1st-hand experience. ALL rights reserved.
Version 1.59 of May 27, 2004, has 3252 lines of < 79+CrLf chars in ASCII,
likely to be updated soon, written with "CMfiler 6.06f" from www; submitted
to the webmaster of http://www.matheory.info aka http://www.matheory.com .
This epaper may read better (more of left margin, PgDn/Up) outside your email.
This epaper has new facilities for fast finding and browsing. Please save the
last copy and use a file differencer to see only where the versions differ:
download Visual Compare VC154 and run it as: VCOMP vers1 vers2 /k /i which
is the best and the brightest colorful comparer for plain .txt files.
Your comments (preferably interlaced into this .txt file) are welcome.
Browsers may like to repeatedly find the following markers :
!!! !! ! ?? ? { refs }
Q: Single spaced keywords on this list indicate semantical closeness :
?( asymmetr attributable etiologic B( B(~ Bayes factor beta Bonferroni
:-) as-if boost confound Cornfield Gastwirth caution chain conjecture
:-( Brin caus1( causa causes code cofa cofa0 cofa1 CI confidence confound
contingency cont( --> conv( conv1 conv2 conv3 conviction corr( correl cov(
confirm C( corroborat counterfactual degree depend DeMorgan entrop
error example expos F( F(~ F0( factual support Kemeny Gini I.J. Good
Hajek hypothe --> impli 0/0 independ infinit oo inhibit inh0( inh1(
Kahre Kemeny key likelihood LR meaning mislead MDL MEL MML LikelyThanNot
necess suffic Occam odds( PARADOX Pearson Phi Popper princip proper
ratio relative risk RR( RR(~ r2 refut relativi rapidit regraduat remov
rule NAIVE Schield SIMPLISTIC SeLn( SeLn(OR) SeLn(RR) SIC slope Shannon
Sheps surpris symmetr Spinoza Venn 2x2 table 5x2 tendency triviality
variance regress tanh TauB UNDESIRABLE weigh evidence W( W(~ WinRatio
WR www opeRation -log( sense exaggerat Folk Google 17th
-.- separates sections .- separates (sub)sections |- tables & Venn diagrams
+Contents : each +Word allows instant finding of the section; the content
of each section is much better than the Contents suggest
+Who might like to read this epaper
+Intro
+Extended abstract = Insight inside :
+Key contrasting formulas
+Key construction principles of good association measures
+MicroTutorial on key elements of probabilistic logic :
! +The simplest thinkable necessary condition for CONFOUNDING !!!
+Executive summary (read it only after the extended abstract)
+Mottos
+Combining : priorities, averages, median
+Notation, tutorial, basic insights, PARADOXical "independent implication"
+Interpreting a 2x2 contingency table wrt RR(:) = relative risk = risk ratio
!! see squashed Euler-Venn diagrams
+More tutorial notes on probabilistic logic, entropies and information
+Rescalings important wrt risk ratio = RR(:) = relative risk
+Correlation in a 2x2 contingency table
+Example (find more as example without + )
+Folks' wisdom
+Acknowledgements
+References
-.-
+Who might like to read this epaper :
This epaper started as notes to myself ( Descartes called them Cogitationes
privatae). Now it is a much improved version of my original draft tentatively
titled "Data mining = fighting the data tsunami :
When & how much the evidential event y INDICATES x as a hypothesised cause,
for doctors, engineers, investors, lawyers, researchers and scientists",
who all should be interested in this stuff. This epaper is primarily targeted
at British-style empiricists or BE's (sounds better than BSE :-).
Continental Rationalists (CR's) a la Descartes, Leibniz and Spinoza prefer to
apply deductive analytical methods to splendidly isolated and well defined
problems, while BE's a la Locke, Berkeley, Hume are not afraid of using
inductive inferential/experimental/observational methods even on messy tasks
in biostatistics, econometrics, medicine, and in military and social domains.
BE's credo is Berkeley's "Esse est percipi".
CR's credo is Descartes' "Cogito ergo sum".
-.-
+Intro :
When confronted with events, and events happen all the time, humans ask about
and search for inter-event relationships, associations, influences, reasons,
and causes, so that predictions, remedies and decision-making may be learned
from the past experiences of such or similar events. To find a cause, an
explanation, and/or a remedy is the ultimate goal, the Holy Grail of advisors,
analysts, attorneys, barristers, doctors, engineers, investigators, investors,
lawyers, philosophers, physicians, prosecutors, researchers, scientists, and
in fact of all wonderful expert human beings like you and me, who use or just
think the words "because", "due to", "door" (in Dutch :-), and "if-then".
David Hume (1711-1776) used to say that the "causation is the cement of the
Universe". Max Planck (1858-1947) quoted in { Kahre 2002, p.187 } :
"Causation is neither true nor false, it is more a heuristic principle, a
guide, and in my opinion clearly the most valuable guide that we have to
find the right way in the motley hotchpotch [= bunten Wirrwarr], in which
scientific research must take place, and reach fruitful results."
One man's mechanism is another man's black box,
wrote Patrick Suppes at Stanford, and I say:
One man's data is another woman's noise, and also:
one man's cause is another woman's effect, eg:
gene ...> hormone level ...> symptom ; or if we view the notion of specific
illness as-if real (in fact it is an abstraction), then eg:
gene ...> illness .........> symptom . In this causal chain
a researcher may see the illness as an effect caused by genes,
while a physician, GP or clinician, sees it as a cause of a symptom, eg a
pain in the neck to be removed or at least suppressed. Cause-effect
relationships are relative wrt the observer's frame of view, like EinStein
would have loved to say.
Like an implication, causation is supposed to be transitive like eg in math
if A > B and B > C then A > C.
!!! Caution: causation works in the opposite direction wrt implication. This
is so, because ideally an effect y implies a cause x, ie
a cause x is necessary for an effect y (draw a Venn diagram).
Note that an inference rule :
IF effect ie evidence THEN hypothesised cause (eg an exposure)
is reflected in the ( evidence implies hypothetical cause (eg a treatment),
while the causation goes in the opposite direction:
( exposure may cause effect or evidence ). Hence we must be careful about the
assigned meanings and about directions of arrows and notations like (a:b),
(x:y) , (y:x) , (~y:~x) , (~x:~y) , conv(x --> y) , etc. Many cues or
predictors are symptoms caused by a health disorder, but some cues are surely
the causes of an illness, so eg :
IF (wo)man THEN "(fe)male disorder likely" makes sense, but it would be
foolish to think that a disorder caused a human to be a (wo)man. Although
IF (fe)male disorder THEN (wo)man, is correct, it (usually) is pointless.
-.-
+Extended abstract = Insight inside :
.- +Key contrasting formulas :
Too many measures of statistical association were (re)invented under even more
all too suggestive names { Feinstein 2001, sect.17.6, pp.337-340 tells many }.
All of them somehow capture statistical dependence, often in form of ARR(:) or
RR(:) which both contrast P(y|x) vs P(y|~x). The key question is which formula
is the best for which purpose? For the effect y if exposed to x (eg x is a
treatment), the key CONTRASTing formulas in epidemiology and in EBM ie in
evidence-based medicine for binary x are (for multivalued x or for any other
kind of exposure just replace ~x by z ) :
ARR = P(y|x) - P(y|~x) = Absolute risk reduction aka attributable risk ,
"absolute" ie not "relative", often |ARR| too
= a/(a+b) - c/(c+d) in a 2x2 contingency table (find 2x2 )
= [ Pxy - Px*Py ]/[ Px*(1 - Px) ] <= 1 even for tiny Px > 0
= cov(x,y)/var(x) = slope(of y on x) = beta(y:x) <= 1
= 0 if x,y are independent (then also P(x|y)=Px, Pxy=Px*Py & P(y|x)=Py)
! = 0 to be enforced if Px=1 then P(y|~x)=0/0 & Py=Pxy=Px*Py & P(y|x)=Py
! = 0 to be enforced if Py=1 then P(x|~y)=0/0 & Px=Pxy=Px*Py & P(x|y)=Px
With these enforced values we get a more meaningful ARR (but Py=1 or Px=1
are too extreme to be of much importance). For DISCOUNTing of the lack
! of surprise in y, the measure P(y|x) - Py <= 1-Py is better (find SIC )
since too common y is seldom perceived as much of a risk anyway (find RDS )
ARR == PNS under exogeneity=no-confounding & monotonicity=no-prevention by
exposure to the risk factor { Pearl 2000, pp.289,291,300 } ;
PNS = Probability of Necessity and Sufficiency (in general).
NNT = 1/|ARR| = Number needed to treat for 1 more |or 1 less| good effect y
NNH = 1/|ARR| = Number needed to harm 1 more |or 1 less| by side effects z
NNS = 1/|ARR| = Number needed to screen to find 1 more |or 1 less| case
NNE = 1/|ARR| = Number needed for 1 extra effect { Feinstein 2001, p.172 }
1/|ARR| is the most realistic measure of health effects in general, as
!!! it is the least abstract & least exaggerating ie most HONEST,
!!! and moreover NNT, NNS also measure EFFORT PER EFFECT.
NNH(z:x)/NNT(y:x) is also highly informative. It should be >> 1 ie many
more have to be x-treated before 1 z-harm will occur,
while many more patients have y-improved already.
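The ARR and NNT arithmetic above is easy to mechanize. A minimal Python
sketch of mine (not from any program mentioned in this epaper); the cell
names a,b,c,d follow the 2x2 table convention, the counts are hypothetical:

```python
# ARR and NNT from a 2x2 contingency table with cell counts
#   a = exposed & effect,   b = exposed & no effect,
#   c = unexposed & effect, d = unexposed & no effect.
def arr(a, b, c, d):
    """Absolute risk reduction ARR = P(y|x) - P(y|~x) = a/(a+b) - c/(c+d)."""
    return a / (a + b) - c / (c + d)

def nnt(a, b, c, d):
    """Number needed to treat NNT = 1/|ARR|."""
    return 1.0 / abs(arr(a, b, c, d))

# Hypothetical counts: 10/100 treated get the bad effect y vs 30/100 untreated;
# ARR = 0.10 - 0.30 = -0.20, so 1/|ARR| = 5 treated per 1 effect less.
```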
OR = Odds ratio = LR+/LR- (far below find more on Odds , LR- ).
LR- = P(~x|y)/P(~x|~y) = negative LR = ( 1 - sensitivity )/specificity
LR = P( x|y)/P( x|~y) = LR+ = likelihood ratio = sensitivity/(1-specificity)
= simple Bayes factor B(x:y)
RR = P( y|x)/P( y|~x) = relative risk = risk ratio (unlike ARR, NNT, NNH,
RR(:) seems more "impressive" to the innocents) ;
RR(:) is a part of important, meaningful formulas :
PFR = -ARR/P(y|~x) = 1 - RR = Prevented fraction = -RRR :
RRR = ARR/P(y|~x) = RR - 1 = Relative risk reduction = Excess relative risk
ARP = ARR/P(y|x ) = RRR/RR = 1 - 1/RR = Attributable risk percent =
= Attributable risk for exposed =
= Attributable proportion = etiologic fraction for exposed group = EFE =
= Attributable fraction in exposure group = AFE =
= Excess risk ratio = ERR { Pearl 2000, p.292 } Since "err" is such a
common word, ERR cannot be found on www :-( My abbreviations
ARP = EFE = AFE, RDS, PFR and PRP are chosen as findable non-words.
RDS = ARR/P(~y|~x) = ARR/[ 1 - P(y|~x) ] = relative difference a la Sheps
= [ P(y|x) - P(y|~x)]/[ 1 - P(y|~x) ] for binary x ; for non-binary :
= slope(of y on x) /[ FICTIVE Max. slope of y on x ] (find as-if )
RDS = [ P(y|x) - P(y|z) ]/[ 1 - P(y| z) ] = relative difference by Sheps
= [ successful y if x minus if z ] / [ failure rate of y if z ],
as P(y|x) <= 1=MAX, the 1 - P(y| z) is the MAXImal thinkable value of
! RDS's numerator, ie 1 - P(y| z) is a meaningful normalization. Also
!! the IDEA is that failures if z are available to become successes if x
!! and that RDS is more honest than RRR, ARP ie AFE if P's are small,
as they often are, which inflates the latter measures. M.C. Sheps'
RDS of 1958 { Feinstein 2002 p.174 } can be found for z == ~x in :
- { Patricia Cheng 1997 } as eq.(16) = RDS, eq.(30) = 1 - RR = -RRR
{ Novick & Cheng 2004 } too ;
- { Pearl 2000 } on pp. 284, 292 : PS = RDS, PN = ERR = 1 - 1/RR = ARP
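To see how the CONTRASTing formulas above relate, here is a small Python
sketch of my own; the two P's are hypothetical and chosen small on purpose,
to show RRR inflating while Sheps' RDS stays modest:

```python
# The contrast measures from P(y|x) and P(y|~x); binary x, z == ~x assumed.
def risk_measures(p_y_x, p_y_nx):
    """Return (ARR, RR, RRR, ARP, RDS) as defined above."""
    arr = p_y_x - p_y_nx            # absolute risk reduction
    rr  = p_y_x / p_y_nx            # relative risk
    rrr = arr / p_y_nx              # relative risk reduction = RR - 1
    arp = arr / p_y_x               # attributable risk percent = 1 - 1/RR
    rds = arr / (1.0 - p_y_nx)      # Sheps' relative difference
    return arr, rr, rrr, arp, rds

# Small P's: P(y|x) = 0.02, P(y|~x) = 0.01
arr_, rr_, rrr_, arp_, rds_ = risk_measures(0.02, 0.01)
# ARR = 0.01 ; RR = 2 ; RRR = 1.0, ie "100% more risk!", while
# RDS = 0.01/0.99 =. 0.0101, far more modest, hence more honest here.
```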
but think! :
P(y|x) - P(y|~x) = ARR is Absolute risk reduction of the effect y , but
P(x|y) measures how much y implies x ie y Suffices for x, hence :
!! P(x|y) - P(x|~y) is a measure of y --> x or how much is x Necessary for y
!! P(x|y) - Px <= 1-Px DISCOUNTS the lack of SURPRISE in x (find SIC )
since a common x is not seen as a real CAUSE ;
if Pxy=Py then P(x|y) - Px = 1 - Px
if Pxy=Px then P(x|y) - Px = Pxy*(1/Py - 1) = P(x|y)*(1 - Py)
[ = Px *(1/Py - 1) ] <= 1 - Px (find SIC )
where [.] may suggest that Py, Px can be varied, but Pxy <= min(Px, Py).
!! P(y|x) - Py <= 1-Py DISCOUNTS the lack of SURPRISE in y (find SIC )
since a common y is not seen as a real RISK ??
if Pxy=Px then P(y|x) - Py = 1 - Py
if Pxy=Py then P(y|x) - Py = Pxy*(1/Px - 1) = P(y|x)*(1 - Px)
[ = Py*(1/Px - 1) ] <= 1 - Py (find SIC )
PRP = Pep*(RR-1)/[ 1 + Pep*(RR-1) ] where Pep = P(exposed in population)
= Population attributable risk percent ( RR is for the studied group)
= population attributable fraction = etiologic fraction for community
F(y:x) = ARR/[ P(y|x) + P(y|~x) ] = (RR -1)/(RR +1) by { Kemeny 1952 }
= -F(y:~x) and anaLogically for any mix of events x,y,~y,~x since
An event is an event is an event (sorry Gertrude :-)
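Kemeny's rescaling is a one-liner; a Python sketch of mine (the function
name and the convention for RR = oo are my choices):

```python
def kemeny_f(rr):
    """Kemeny's F = (RR - 1)/(RR + 1): rescales RR's range [0..1..oo]
    to [-1..0..1], keeping 3 fixed points with fixed meanings:
    -1 disjoint, 0 independent, +1 extreme dependence."""
    if rr == float('inf'):
        return 1.0
    return (rr - 1.0) / (rr + 1.0)
```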
!! Health effects can be expressed either in negative terms (eg ill, or dead)
or in positive terms (cured, or alive). Hence we are free to replace any
P(y|.) with P(~y|.) = 1 - P(y|.), in any formula, consistently of course. As
P(.|.)'s are often quite small, 1 - P(.|.) =. 1. The results will be then
very different, depending on our choice of +terms or -terms. These facts
create ample opportunities for honesty/dishonesty, for leading/misleading.
Clearly, if P(y|~x) < 0.5 then RDS < RRR which only seems more "impressive".
Honestly IF P(y|~x) < 0.5 THEN RDS should be used ELSE RRR should be used.
IF P(y|~x) =. 0 THEN RDS =. ARR
Of course ARR <= RDS, so ARR can never mathematically exaggerate an effect.
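The freedom to swap +terms and -terms can be demonstrated numerically;
a Python sketch with hypothetical P's of my own choosing:

```python
# Framing demo: the same hypothetical trial reported in -terms vs +terms.
p_bad_x, p_bad_nx = 0.02, 0.01                # P(bad y | x), P(bad y | ~x)
rrr_bad = abs(p_bad_x - p_bad_nx) / p_bad_nx  # 1.00 : "100% more bad outcomes"

p_ok_x, p_ok_nx = 1 - p_bad_x, 1 - p_bad_nx   # replace P(y|.) by P(~y|.)
rrr_ok = abs(p_ok_x - p_ok_nx) / p_ok_nx      # =. 0.0101 : "1% fewer survive"
# Same data, very different "relative" impressions, as warned above.
```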
.- +Key construction principles of good association measures :
P1: "Measures of association should have operationally meaningful
interpretations that are relevant in the contexts of empirical
investigations in which measures are used." { Goodman & Kruskal, 1963,
p.311, there also in the footnote } Henceforth I discuss events x, y, but
it all holds for their expected values ie averages over variables X, Y
ie sets of events too.
P2: OpeRational meaningfulness is greatly enhanced if a measure has its
range of values with 3 fixed points of fixed meanings, eg [0..1..oo] or
[-1..0..1], where the midpoint means independence, and the endpoints
mean extreme dependence (-....+), ideally an implication aka entailment.
Yet there are arguments for the range [-Px..0..1-Px] (find SIC Kahre ).
P3: Various results from a single measure should be meaningfully COMPARABLE
regardless of the total count N of all joint events in a contingency
table. This means that a measure should be built from proportions P(:)
only, without an uncancelled N. Thus measures based on ChiSquare do not
qualify for our purposes. But N must play a role in confidence intervals.
P4: To measure association means to measure statistical dependence. I can
list 16+1 = 17 equivalent conditions of independence ie equalities
lhs = rhs, like eg Pxy = Px*Py, or P(y|x) = P(y|~x), from which 2*17 = 34
measures of dependence can be made by CONTRASTing: lhs - rhs like ARR(:),
or lhs/rhs like RR(:) above, both asymmetrical wrt events x, y. Eg the
Pxy/(Px*Py) = P(x|y)/Px = P(y|x)/Py is symmetrical wrt x, y, and the
correlation coefficient is also symmetrical wrt x, y :
Sqrt(r2) = Sqrt[ beta(y:x) * beta(x:y) ]
= Sqrt[ (slope of y on x) * (slope of x on y) ]
note that -1 <= beta <= 1; find r2 below.
Measures of confirmation, evidence, indication, influence,.., and of course
causation should be DIRECTED ie ORIENTED ie ASYMMETRICAL wrt events x,y.
Asymmetry is easily obtained by taking a symmetrical association measure
(lhs - rhs) and dividing it by either lhs or rhs, or by 1 - rhs, or by
normalization with a function of one variable only, eg:
ARR(y:x) = (Pxy - Px*Py)/(Px*(1-Px)) = cov(x,y)/var(x) = beta(y:x)
= P(y|x) - P(y|~x)
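P4's symmetry claim is easy to verify numerically; a Python sketch of mine
(the P's are hypothetical; Sqrt gives the magnitude |r|, the sign comes
from Pxy - Px*Py):

```python
import math

def beta_slope(p_xy, p_x, p_y):
    """ARR(y:x) = (Pxy - Px*Py)/(Px*(1-Px)) = cov(x,y)/var(x) = beta(y:x)."""
    return (p_xy - p_x * p_y) / (p_x * (1 - p_x))

p_x, p_y, p_xy = 0.4, 0.5, 0.3
# symmetric |r| as geometric mean of the two asymmetric slopes (P4):
r_sym = math.sqrt(beta_slope(p_xy, p_x, p_y) * beta_slope(p_xy, p_y, p_x))
# direct Pearson correlation of the two 0/1 events:
r_dir = (p_xy - p_x*p_y) / math.sqrt(p_x*(1-p_x) * p_y*(1-p_y))
```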
P5: Measures of CAUSATION tendency should be decomposable into a product of
terms such that one term itself measures probabilistic IMPLICATION ie
ENTAILMENT, but the equality Measure(y:x) = Measure(~x:~y) is UNDESIRABLE.
Alas, the conviction measure conv(y --> x) = conv(~x --> ~y)
by Google's CEO { Brin et al 1997 } does not qualify (find UNDESIRABLE ).
Entailment provides a link with the notions of necessity and sufficiency
where (y implies x) == (y is Sufficient for x) == (x is Necessary for y).
P6: Measure(y:x) should yield meaningful values if Pxy = 0; and if Px = 1 :
eg: RR(y:x) = 0 if Pxy = 0 ie if x,y are disjoint events
! RR(y:x) = 1 if Px = 1 hence Py - Pxy = 0 AND YET Pxy = Px*Py,
1 means independent x,y [ find Pxy/(0/0) as special case ]
conv(y --> x) = Py*P(~x)/P(y,~x) = [ Py - Px*Py ]/[ Py - Pxy ] in general;
= P(~x)/P(~x|y) = [ 1 - Px ]/[ 1 - P(x|y) ]
= 0/0 numerically if Px = 1 whence Py = Pxy hence:
= 1 if Pxy=Px*Py also [ Py - 1*Py]/[ Py - Py ] = 1 algebra
!! = 1 - Px if Pxy=0; this is not a nice fixed value, but
1 - Px is interpretable as "semantic information content" SIC
which makes NO SENSE for Pxy=0 :-( , nevertheless :-):
1 - Px < 1 = for x,y independent, so for Pxy=0 is conv < neutral 1 :-)
!! Similarly P(~y|~x) = 1-Py makes NO SENSE for Pxy = Px*Py,
if P(~y|~x) is ~y Necessary for ~x ie x Necessary for y
To avoid overflows due to /0, such extreme/degenerated/special cases of P's
must be numerically prechecked and detected at run time and handled apart
according to the meaningful interpretation (or conventions) as just shown.
!! Since any single formula is doomed to measure a mix of at least 2 key prop-
erties ( dependence and implication mixed due to my INDEPendent-IMPlication
PARADOX ), it is a good idea to detect & report important extreme/special
cases which do not always obviously follow from the values returned. Such
! automated reporting adds semantics and avoids misreading/misinterpretation.
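The precheck-and-handle-apart advice can be sketched in Python as follows;
this is my sketch, not Brin et al.'s code, and the return conventions for
the special cases follow the interpretation just given:

```python
def conviction(p_x, p_y, p_xy):
    """conv(y --> x) = Py*P(~x)/P(y,~x), with the extreme/degenerated
    cases prechecked at run time and handled apart (see P6 above)."""
    den = p_y - p_xy                   # P(y,~x)
    if den == 0.0:
        # Px = 1 makes it 0/0: x,y are then independent, return neutral 1;
        # otherwise Pxy = Py ie y --> x deterministically: return oo
        return 1.0 if p_x == 1.0 else float('inf')
    return p_y * (1.0 - p_x) / den
```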
P7: Although it is useful to consider the values returned by measures under
extreme circumstances like eg Px=1 or Py=1, these will not occur too often,
and should be prechecked apart anyway. It is more important to choose a
measure which will return reasonable values for the application at hand.
We cannot hope that there ever will be a single universally best measure.
So much for my key construction principles. More analysis follows.
RR(y:x) is compared with a few related measures like eg:
W(y:x) = weight of evidence by I.J. Good (Turing's statistical assistant);
F(y:x) = degree of factual support by John Kemeny ( Einstein's assistant);
C(y:x) = corroboration by Karl Popper (he often called it confirmation, an
overloaded term, so Popper corroborates here to be findable);
it is funny that Sir Popper who stressed refutation has worked
out measures of confirmation, but not of refutation :-) Why?
conv(y:x) = conviction conv(y --> x) by Google's CEO Sergey Brin et al.
Such comparisons increase our insights. How well these formulas measure
causal tendency is also discussed. All this & much more was/is implemented
in my KnowledgeXplorer program KX which not only infers & indicates (ie
identifies, diagnoses, predicts, etc) but also extracts knowledge (on both
event- & variable level of interest) from the information carried by data
input in the simple format. KX has graphical and numerical outputs in
compact, comparative, hence effective forms (eg my squashed Venn diagrams).
.- +MicroTutorial on key elements of probabilistic logic :
There are 16+1 = 17 equivalent == relations for independence = , and
17 for -dependence < , and 17 for +dependence > of x,y :
the ? stands for any single symbol < , = , > consistently applied :
[Pxy ? Px*Py] == [P(x|y) ? Px] == [ P(y|x) ? Py] ==... [P(~y|~x) ? P(~y|x)]
eg
[Pxy - Px*Py ? 0 ] == [ P(y|x) - Py ? 0 ] == etc ... 17 times
[Pxy / Px*Py ? 1 ] == [ P(y|x) / Py ? 1 ] == etc ... 17 times
eg
RR(y:x) = [ P(y|x)/P(y|~x) ? 1 ] == [ P(x|y)/P(x|~y) ? 1 ] = RR(x:y)
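A numeric spot-check of the claim that the equivalent relations all agree
in their < , = , > verdict (Python sketch of mine, hypothetical P's):

```python
def sign(v, eps=1e-12):
    """-1, 0 or +1 : which of < , = , > holds."""
    return 0 if abs(v) < eps else (1 if v > 0 else -1)

p_x, p_y, p_xy = 0.3, 0.5, 0.2     # a +dependent case: Pxy > Px*Py
contrasts = [
    p_xy - p_x * p_y,                               # Pxy    - Px*Py
    p_xy / p_y - p_x,                               # P(x|y) - Px
    p_xy / p_x - p_y,                               # P(y|x) - Py
    p_xy / p_x - (p_y - p_xy) / (1 - p_x),          # P(y|x) - P(y|~x)
    (1-(p_x+p_y-p_xy))/(1-p_x) - (p_x-p_xy)/p_x,    # P(~y|~x) - P(~y|x)
]
# all five contrasts have the same sign, as the 17 equivalences promise
```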
Formulas left of an ? are candidate elements for a measure of (x CAUSES y).
Other elements for (x CAUSES y) must be derived from logic. For 2 binary
variables there are 16 different logical functions, of which only the
2 implications and 2 inhibitions are ASYMMETRIC ie DIRECTED ie ORIENTED
(the remaining 12 functions are either symmetric wrt x,y, or are functions
of 1 variable only, either x only or y only). Clearly (x CAUSES y) must
to be ASYMETRICAL wrt x,y. But there are more requirements.
P(~x,~y) = P(~(x or y)) = 1 - (Px + Py - Pxy) by DeMorgan's law
~(y,~x) == (y --> x) == (~x --> ~y) == ~(~x,y) == (x or ~y) in logic.
Let y = the observed effect ie evidence; x = a hypothesised cause of y :
P(x|y) = Pxy/Py is a NAIVE measure of how much y suffices to determine x
P(x|y) = 1 = max iff y --> x ie y implies x deterministically ie Pxy = Py
P(x|y) = Px iff x,y are independent ie Pxy = Px*Py. In extreme case
of Px = 1 it holds: if Px=1 then Py = Pxy = Px*Py AND P(x|y) = 1
!! ie y determines x 100% AND x,y are independent (seems PARADOXical).
If Px > 0 & Py > 0 & Pxy > 0 then
if Px > Py then P(x|y) > P(y|x) else
if Px = Py then P(x|y) = P(y|x) else
if Px < Py then P(x|y) < P(y|x) else Mission Impossibile.
So far on relatively unproblematic sufficiency; now on less clear measures of
necessity:
Pioneers { Buchanan & Duda 1983, p.191 } explained the rule y --> x thus :
"... let P(x|y) denote our revised belief in x upon learning that y is true.
... In a typical diagnostic situation, we think of x as a 'cause' and y as an
'effect' and view the computation of P(x|y) as an inference that the cause x
is present upon observation of the effect y." (find +Folks' for more).
My preferred wording is: P(x|y) is a NAIVE measure of how much evidence does
y provide for x ie how much y implies x as a potential cause of y, hence
!!! also how likely x CAUSES y.
!!! Note that in P(x|y) the y = evidence, x = a hypothesised cause. Ideally
x CAUSES y if Pxy = Py ie y implies x ie y --> x , eg if P(x|y) = 1.
Causation assumes that without a cause x there will be no effect y , hence
that a cause x is NECESSARY for effect y which then serves as an evidence
for that cause. From the reasoning in the last dozen lines a few candidate
measures (marked by their +pros, -cons, .neutrals ) follow :
1. P(x|y) = Pxy/Py is a NAIVE measure of how much is x necessary for y
- is not a fun(Px), eg canNOT discount lack of surprise if Px =. 1
. is = Px if Pxy = Px*Py ie x,y independent
+ is = 0 if Pxy = 0 ie x,y disjoint
+ is = 1 if Pxy = Py ie P(y,~x) = 0 from Pyx + P(y,~x) = Py
ie "without x no y" ie "x necessary for y" (draw a Venn)
ie "if y then x" ie "y sufficient for x"
!! - is = 1 see the CounterExample few lines below (also find SIC ).
!! - is a single P(.|.) while all single P's were REFUTED as measures of
confirmation or corroboration { Popper 1972, chap.X/sect.83/footn.3
p.270, and Appendix IX, 390-2, 397-8 (4.2) etc }
P(y|x) = Pxy/Px is analogical (just swap x with y)
1 iff "y follows from x" is the phrase in { Popper 1972, p.389 }
+ is used in simple Bayesian chain products for multiple cues.
!! CounterExample shows that P(.|.) is not a good enough measure of causation:
Let x = a hypothesised cause, a conjecture
y = a widely present symptom, eg having 10 fingers.
Then P(x|y) =. 1 ie Pxy =. Py since almost all with y also have x. Yet it is
neither wise to assume that y is sufficient for x,
nor wise to assume that x is necessary for y. (find SIC )
2. An alternative single P-measure of how much is x necessary for y :
P(~y|~x) = [1 -(Px + Py - Pxy)]/[1 - Px] = 1 - ([Py - Pxy]/[1 - Px])
+ is a function of Px, Py, Pxy , but:
?- is 1-Py if Pxy = Px*Py ie x,y independent; note that
1-Py is a measure of "semantic information content" SIC;
? does 1-Py make sense if x,y are independent ? I don't think so.
? (similar NO SENSE is conv(y --> x) = 1-Px if Pxy=0 ie disjoint)
?. is = 0 if 1 = Px + Py - Pxy ( unlikely to occur ? )
- is just a single P which all were REFUTED by Popper
- is <> 0 if Pxy = 0 ie x,y disjoint :-( but it can be forced:
if Pxy = 0 then NecessityOf(x for y) = 0 ELSE = P(~y|~x)
+ is = 1 if Pxy = Py , as explained next :
+ is = 1 if y --> x 100% ie if y implies x fully then :
! Pxy = Py AND P(~y|~x) = 1 = P(x|y) , which is consistent with logic:
~(y,~x) == (y --> x) == (~x --> ~y) == ~(~x,y) are all equivalent in logic.
P(~y|~x) = 1 is the only nicely interpretable fixed point.
P(~y|~x) as a candidate has arisen from my COUNTERFACTUAL reasoning:
the semantical Necessity of x for y follows from IF no x THEN no y, ie
removed or suppressed x suffices for removed or suppressed y, ie
~x implies ~y ie ~x --> ~y. The COUNTERFACTUALity in human terms says:
IF x disappears THEN y will disappear too. For more find +Folks' wisdom.
Only after I had worked out P(~y|~x) above did I come across { Hempel 1965 }
at the very end of his very long and very abstract paper I could decode his
eq.(9.11) as P(~y|~x). He derived it as a "systematic power closely related
to the degree of confirmation, or logical probability"{p.282} via his eq(9.6)
which is in fact 1 - P(.) ie SIC. On p.283 the last lines tell us why :
"Range and content of a sentence vary inversely. The more a sentence asserts,
the smaller the variety of its possible realizations, and conversely."( SIC )
"The theory of Range" is a section in { Popper 1972, sect.72/p.212-213 } where
on p.213 Popper refers the notion of [semantic] Range to { Waismann: Logische
Analyse des Wahrscheinlichkeitsbegriffes, Erkenntnis 1, 1930, p.128f. } .
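Candidate 2 with its forced zero for disjoint x,y can be sketched thus
(Python, my naming):

```python
def necessity_of_x_for_y(p_x, p_y, p_xy):
    """P(~y|~x) = [1 - (Px + Py - Pxy)]/(1 - Px), with the forced 0
    for disjoint x,y (Pxy = 0) as suggested above."""
    if p_xy == 0.0:
        return 0.0                  # forced: disjoint x,y, no necessity
    return (1.0 - (p_x + p_y - p_xy)) / (1.0 - p_x)
```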
3. -Px <= [ P(x|y) - Px ] <= 1 - Px { Kahre 2002, p.118-119 }
-Px if Pxy = 0 1 - Px if P(x|y) = 1 ie Pxy=Py , find SIC
Note that (1-Px) - (-Px) = 1 ie the absolute magnitudes of both bounds are
COMPLEMENTary. This makes sense since a REFUTATION of a conjecture means
CONFIRMation of its COMPLEMENTary conjecture. Yet users like fixed points.
4. Better measures of sufficiency and necessity are RR(:)'s or LR(:)'s, like
I.J. Good's
Qnec = P( e| h)/P( e|~h) = RR( e| h) = Lsuf , see Folk1 ;
Qsuf = P(~e|~h)/P(~e| h) = RR(~e|~h) = [1 - P(e|~h)]/[1 - P(e|h)]
= 1/Lnec
Find +Folks' wisdoms for more. These ratios of ratios have ranges with
3 semantically fixed values, which enhance opeRational interpretability,
and are not just single P's all REFUTED as measures of confirmation or
corroboration in { Popper 1972, chap.X/sect.83/footn.3/p.270, and in
Appendix IX, pp.390-392 etc }.
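Good's ratios in a Python sketch of mine (argument names follow the e, h
notation above; the numbers in the comments are hypothetical):

```python
def qnec_qsuf(p_e_h, p_e_nh):
    """I.J. Good's Qnec = P(e|h)/P(e|~h) = RR(e:h) = Lsuf and
    Qsuf = P(~e|~h)/P(~e|h) = [1 - P(e|~h)]/[1 - P(e|h)] = 1/Lnec."""
    qnec = p_e_h / p_e_nh
    qsuf = (1.0 - p_e_nh) / (1.0 - p_e_h)
    return qnec, qsuf

# independence P(e|h) = P(e|~h) gives the neutral fixed point 1 for both
```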
.- end of microtutorial .
Deeper insights into RR, LR, and into confounding are gained by dissecting
RR(:) and LR(:) thus:
RR(y:x) = P(y|x)/P(y|~x) ; y is effect, x is the hypothesized cause,
eg x is exposure or test result
= P(y,x)/P(y,~x) * ( 1 - Px)/Px
= [ P(y,x)/(Py - P(y,x)) ] * ( 1 - Px)/Px , find " confound "
= P(y,x)*( y implies x ) * SurpriseBy(x)
= P(y,x)/P(y,~x) * SurpriseBy(x)
= LikelyThanNot(y:x) * SurpriseBy(x) ; note that :
1/P(y,~x) = 1/(Py - Pyx), or 1 - P(y,~x) = 1 - (Py - Pyx), are measures of
how likely ( y implies x) ie IF y THEN x ;
recall that ~(y,~x) == (y --> x) == (~x --> ~y) == ~(~x,y) ;
note that Py - Pxy = P(~x) - P(~x,~y) in general ie
also for imperfect implication = (1-Px) - [1-(Px+Py-Pxy)] = Py-Pxy
!!! but equality of fun(y:x) = fun(~x:~y) is UNDESIRABLE for a measure of
causal tendency (find UNDESIRABLE below to find out why? ).
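The decomposition can be checked against the direct definition; a Python
sketch of mine with hypothetical P's showing how a low Px (ie a large
SurpriseBy(x)) can push RR(y:x) above 1 even when LikelyThanNot(y:x) < 1:

```python
def rr_decomposed(p_x, p_y, p_xy):
    """RR(y:x) = LikelyThanNot(y:x) * SurpriseBy(x)
               = [ Pyx/(Py - Pyx) ] * (1 - Px)/Px ."""
    return (p_xy / (p_y - p_xy)) * ((1.0 - p_x) / p_x)

def rr_direct(p_x, p_y, p_xy):
    """RR(y:x) = P(y|x)/P(y|~x)."""
    return (p_xy / p_x) / ((p_y - p_xy) / (1.0 - p_x))

# Px = 0.1, Py = 0.5, Pxy = 0.1 :
# LikelyThanNot = 0.1/0.4 = 0.25 < 1, SurpriseBy(x) = 9, so RR = 2.25 > 1.
```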
Another fun(y:x) is P(x|y) which also measures (y implies x), however :
100% implication [ P(x|y) = 1 ] = [ P(~y|~x) = 1 ], while for less than
100% implication P(x|y) <> P(~y|~x) in general since :
Pxy/Py <> [ 1 - (Px + Py - Pxy) ]/[ 1 - Px ]
Pxy/Py <> 1 - ( Py - Pxy)/[ 1 - Px ]
where by DeMorgan's rule P(~x,~y) = 1 - (Px + Py - Pxy) = P(~(x or y))
!! (y sufficient for x) ie (y implies x) ie:
(x necessary for y) ie potentially (x CAUSES y), because the removal,
blocking or reduction of ANY SINGLE necessary condition x (out of several
possibly required) will annul or suppress its consequent effect y. Draw
x enclosing y in a Venn diagram, and see that it is necessary to hit
x to have any chance of hitting the enclosed y, but not vice versa.
Hence it is the necessary condition which should be seen as a potential
cause, removal/suppression of which will remove/suppress the effect y.
1/P(x,~y) = 1/(Px - Pxy), or 1 - P(x,~y) = 1 - (Px - Pxy), are measures of
how likely ( x implies y) ie IF x THEN y ;
recall that ~(x,~y) == (x implies y) == (~y implies ~x) == ~(~y,x) ;
LR = P(x|y)/P(x|~y) = RR(x:y)
= P(x,y)/P(x,~y) * ( 1 - Py)/Py
= [ P(x,y)/(Px - P(x,y)) ] * ( 1 - Py)/Py
= P(x,y)*( x implies y ) * SurpriseBy(y)
= P(x,y)/P(x,~y) * SurpriseBy(y)
= LikelyThanNot(x:y) * SurpriseBy(y)
From RR, LR and also from a Venn diagram, it follows that since the joint
P(y,x) = P(x,y), it must be only the unequal marginal probabilities Py, Px,
which decide whether (y implies x) more or less than (x implies y) by the
rule:
!!! if Py < Px then RR(y:x) >= RR(x:y) ie LR(x:y),
if Py > Px then RR(y:x) <= RR(x:y) ie LR(x:y), where
the = occurs for x,y independent ie RR(:) = 1,
or if Pxy=0=RR(:), as my program Acaus3 asserts. For more find Py < Px below.
IF LikelyThanNot(y:x) < 1 ie Pyx < P(y,~x) ie Less likely than not,
AND SurpriseBy(x) is large enough ie Px is low enough
THEN RR(y:x) > 1 may still result due to low Px.
IF RR(y:x) > 1 AND LikelyThanNot(y:x) > 1 ie More likely than not ie
P(x|y) > 1/2 ie Pxy > Py/2 ie Pyx > P(y,~x) = Py - Pxy ie 2Pxy > Py
THEN there is a stronger reason for the conjecture that (y implies x) ie that
(x causes y), than it is if Pyx < P(y,~x) AND RR(y:x) > 1.
The P(x|y) > 0.5 has been:
- required as "the critical condition for confirming evidence" in { Rescher
1958, 1970 pp.78-79, and on p.84 swapped to P(y|x) > 0.5 };
- recommended as a potent (not just potential) Necessity N of exposure x for
case y : N > 0.5 in { Schield 2002 sect.2.3 & Appendix };
- considered in { Hesse 1975, p.81 } but dismissed as a single measure of
"confirming evidence" because P(x|y) > 1/2 "may be satisfied even if y has
decreased the confirmation of x below its initial value in which case y has
disconfirmed x". Mary Hesse (Oxford) then opted for P(x|y) > Px as the
condition for "y confirms x" aka Carnap's "positive relevance criterion".
A PARADOXical behaviour of RR(:), and of other formulas, nearby some extreme
values is identified :
!!! huge, even infinite RR(y:x) = oo is possible while y, x are almost
independent !!!
Let:
== is equivalence ; rel is >=< ie < , = , > , etc
== is IF [.] THEN [_] and vice versa, ie simultaneously IF [_] THEN [.] .
Keep in mind that there are at least 17 equivalent (in)dependence relations:
[ P(y,x) rel Py*Px ] , which divided by Px or by Py yields :
== [ P(y|x) rel Py ] == [ P(x|y) rel Px ]
== [ P(y|x) rel P(y|~x) ] == [ P(x|y) rel P(x|~y) ]
== [ P(~y|~x) rel P(~y|x) ] == [ P(~x|~y) rel P(~x|y) ] , etc.
Since both relative risk RR(:) and odds ratio OR(:) are in use, it is good
to remember their relationships:
OR(y:x) = OR(x:y) = ad/(bc) = Pxy*P(~x,~y)/[ P(x,~y)*P(~x,y) ]
[ OR(:) rel 1 ] == [ RR(:) rel 1 ] == [ Pxy rel Px*Py ]
hence
If OR(:) rel 1 (ie if Pxy rel Px*Py) then OR(:) rel RR(:), and vice versa,
eg
If OR(:) < 1 (ie if Pxy < Px*Py) then OR(:) < RR(:) ;
If OR(:) > 1 (ie if Pxy > Px*Py) then OR(:) > RR(:)
which means that if OR(:) > 1 then relative risk RR(:) will be smaller
than odds ratio OR(:). E.g. let only the OR(:) = 2.5 be known (eg
from a meta-study, so that a,b,c,d,n are not known and RR(:) is not
available). Then we may speculate about the corresponding RR(:) > 1 thus:
OR(:) = ad/(bc) =. eg (25*30) /(10*30) = 2.5 > 1,
or (25*30) /(30*10) = 2.5 ;
RR(:) = [a/(a+b)]/[c/(c+d)] =. eg [25/(25+10)]/[30/(30+30)] = 1.4 > 1,
or [25/(25+30)]/[10/(10+30)] = 1.8 , etc;
For 1 < RR(:) < OR(:) there is less risk than OR(:) suggests. Hence we
convert:
!!! RR(y:x) = OR(y:x)/[ 1 + (OR(y:x) -1)*P(y|~x) ] where P(y|~x) may
be a guesstimate.
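The conversion formula, checked in Python against the worked a,b,c,d
example above (this is my sketch):

```python
def rr_from_or(odds_ratio, p_y_nx):
    """RR(y:x) = OR/[ 1 + (OR - 1)*P(y|~x) ]; P(y|~x) may be a guesstimate."""
    return odds_ratio / (1.0 + (odds_ratio - 1.0) * p_y_nx)

# a,b,c,d = 25,10,30,30 above gives OR = (25*30)/(10*30) = 2.5 and
# P(y|~x) = 30/60 = 0.5, whence RR = [25/35]/[30/60] =. 1.4286 < OR.
```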
Keep in mind that swapping rows and/or columns in a 2x2 contingency table
at most changes OR into 1/OR, but in general it changes RR(:) to a
different value, not just to its reciprocal.
.- !!! +The simplest thinkable necessary condition for CONFOUNDING :
Let's make the search for & research of confounders easier & less expensive.
RR(y:x) = P(y|x)/P(y|~x) , y is the effect, x is the hypothesized cause,
eg x is exposure, or treatment, or test result.
Let's consider c as a competing (against x) candidate cause of y. Clearly
RR(y:c) > RR(y:x) is a necessary (but generally not sufficient) condition
for c to be, rather than x , a potential cause of y.
Less natural is the following condition by Jerome Cornfield et al. of 1959
{ reproduced in the Appendix of Schield 1999 } :
RR(c:x) > RR(y:x) is necessary for c, rather than x, to be a cause of y.
My decomposition of these RR(:)'s into :
RR(c:x) = [ Pcx/(Pc - Pcx) ] * ( 1 - Px)/Px
RR(y:x) = [ Pyx/(Py - Pyx) ] * ( 1 - Px)/Px
readily suggests that (1 - Px)/Px can be dropped from Cornfield's inequality,
ie
[ Pcx/(Pc - Pcx) ] > [ Pyx/(Py - Pyx) ]
ie
[ Pcx*Py - Pcx*Pyx ] > [ Pyx*Pc - Pyx*Pcx ] which simplifies to:
!!! P(x|c) > P(x|y) my simplest necessary condition for c overruling x
!!! P(x|c) - P(x|y) my simplest necessary absolute boost Ab > 0 needed
!!! P(x|c) / P(x|y) my simplest necessary relative boost Rb > 1 needed
[ P(x|c) = P(c|x)*Px/Pc ] > [ P(y|x)*Px/Py = P(x|y) ] by Bayes rule ;
!!! P(c|x)/Pc > P(y|x)/Py my Bayesian boost condition
!!! P(c|x) > P(y|x)*Pc/Py 2nd form of necessary cond.
P(y|x) < P(c|x)*Py/Pc 3rd form of necessary cond.
lead to measures:
!!! P(c|x)/Pc - P(y|x)/Py = ABb(c:x; y:x) my absolute Bayesian boost
!!! [ P(c|x)/Pc ]/[ P(y|x)/Py ] = RBb(c:x; y:x) my relative Bayesian boost
!!! [ P(c|x)/Pc - P(y|x)/Py ]/[ P(c|x)/Pc + P(y|x)/Py ] is my absolute
Bayesian boost kemenyzed to the range [-1..0..1]
If abs.boost < 0 or rel.boost < 1
then confounder c CANNOT replace x as a potential cause of the effect y;
ie abs.boost < 0 or rel.boost < 1 SUFFICE to REFUTE c as a competitor with
x for a cause of y . This is Popperian refutationalism opeRationalized;
see +Mottos for McGinn on Popper , and find the last Spinoza below.
If abs.boost > 0 or rel.boost > 1
then confounder c MIGHT replace x as a potential cause of the effect y,
but abs.boost > 0 or rel.boost > 1 are only necessary (but not sufficient)
conditions for c to replace x as a potential cause of y.
Below find Bailey to read that a "globally" collected P(x|y) is more stable
than P(y|x); the latter can be estimated from that P(x|y) and a local Py thus :
P(y|x) = ( P(x|y)*Py )/[ P(x|y)*Py + P(x|~y)*(1 - Py) ] by Bayes,
= 1/[ 1 + P(x|~y)/P(x|y) * (1 - Py)/Py ]
= 1/[ 1 + ( 1/LR+ ) * SurpriseBy(y) ]
= 1/[ 1 + SurpriseBy(y) / LR+ ] where
Py has to be the proportion of the effect y in POPULATION. Now it is clear
how much better it is to use my condition P(x|c) > P(x|y) for confounding.
Combining both necessary conditions for c to overrule x yields :
!! RR(y:x) < min[ RR(c:x) , RR(y:c) ] is necessary for c, rather than x,
to be a potential cause of y;
!!! P(x|y) < P(x|c) AND RR(y:x) < RR(y:c) is its simpler equivalent.
Note that the user does not have to evaluate all (remaining) subconditions
after any single one of them is found to be violated, so that c becomes an
implausible competitor of x for potential causation of y.
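A screening sketch in Python (my code; the probabilities fed in below are
hypothetical, purely for illustration) :

```python
def c_may_overrule_x(p_x_given_c, p_x_given_y, rr_y_c, rr_y_x):
    """Necessary (NOT sufficient) condition for c to overrule x as a
    potential cause of y; returning False suffices to REFUTE c."""
    if p_x_given_c / p_x_given_y <= 1.0:     # relative boost Rb must be > 1
        return False
    return rr_y_c > rr_y_x                   # and RR(y:c) must beat RR(y:x)

# hypothetical numbers :
print(c_may_overrule_x(0.8, 0.6, rr_y_c=4.0, rr_y_x=2.5))  # -> True
print(c_may_overrule_x(0.5, 0.6, rr_y_c=4.0, rr_y_x=2.5))  # -> False, refuted
```

A single violated subcondition ends the check early, as noted above.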
My new necessary condition above can also be derived from the fact that in
RR(y:x) < RR(c:x) ie in P(y|x)/P(y|~x) < P(c|x)/P(c|~x)
conditionings |. are the same on both sides of the < , hence the conditional
P(.|.)'s can be turned into joint P(.,.)'s since the conditionings annul.
In the { Encyclopedia of Statistics, Update volume 1 , on Cornfield's lemma,
pp.163-4 } J.L. Gastwirth's exact condition for (non)confounding is shown.
Let me write it in a clearer notation and then simplify it a bit:
RR(c:x) = RR(y:x) + (RR(y:x)-1)/[ (RR(y:c)-1)*P(c|~x) ]
RR(c:x) > RR(y:x) is necessary for c, rather than x, to be a cause of y;
is Cornfield's necessary (but insufficient) condition.
From Gastwirth's equality follows my more concise sufficient condition
for c , rather than x , to cause y :
RR(c:x)-1 > RR(y:x)-1 + (RR(y:x)-1)/[ (RR(y:c)-1)*P(c|~x) ]
RR(c:x)-1 > [RR(y:x)-1] * ( 1 + 1/[ (RR(y:c)-1)*P(c|~x) ] )
[RR(c:x)-1]/[ RR(y:x)-1] > 1 + 1/[ (RR(y:c)-1)*P(c|~x) ]
lhs > rhs
lhs - rhs ; (lhs - rhs)/(lhs + rhs) has a kemenyzed range [-1..0..1].
When the reading gets tough, the tough get reading. This epaper has one thing
in common with an aircraft carrier: there are multiple cables to hook onto
and so to land safely on the deck of Knowledge. There is no safety without
some redundancy at critical or remote points.
-.-
+Executive summary :
One good picture or example tells more than 10k words, but 1 formula captures
infinitely many examples (remember Pythagoras?). The table, without P(|), CI,
and RR(:), is from the handbook on evidence-based medicine aka EBM { Sackett
2000, p.77 }, but it could be economical, investment, or other data as well :
Data: Cases counted | | Information extracted by Jan Hajek :
Cue y=bad ~y=good | LR | Probab. Risk ratio 95% Confidence
xi n(y,xi) n(~y,xi) | (xi:y) | P(y|xi) RR(y:xi) interval CI(RR)
-----------------------------|--------|--------------------------------------
x1: < 15 474 20 | 51.9 | 0.96 5.9 5.34 to 6.52
x2: 15-34 175 79 | 4.8 | 0.69 2.5 2.25 to 2.78
x3: 35-64 82 171 | 1 | 0.32 1 =independ. 0.91 to 1.10
x4: 65-94 30 168 | 0.39 | 0.15 0.5 exercise
x5: > 94 48 1332 | 0.08 | 0.03 0.05 exercise
-----------------------------------------------------------------------------
Sums: n(y)=809 + 1770=n(~y)
2570 = n = sum total
P( y) = n(y)/n = 809/2570 = 0.31 = prevalence = prior ie pre-test probability
P(~y) = 1 - P(y) = 0.69
Sum_i:[ P(y|xi) ] <> 1
Sum_i:[ P(xi|y) ] = (Sum_i:[ P(xi,y)])/P(y) = P(y)/P(y) = 1.
Task: From the left half of the 5x2 contingency table of coincidence counts
extract information with opeRationally useful interpretations :
P(y|xi) = predictivity aka post-test probability of a bad outcome
RR(y:xi) = P(y|xi)/P(y|~xi) = relative risk aka risk ratio of a bad outcome
LR(xi:y) = P(xi|y)/P(xi|~y) = likelihood ratio aka simple Bayes factor.
Note that another way to evaluate the above data would be to contrast a
line against another line. That would yield at least 5*(5-1)/2 = 10 pairs of
data-lines, each pair forming a 2x2 contingency table for which RR(y:xi),
LR(xi:y) and CI(.) would be computed. The number of pairs could be doubled by
swapping the 2 lines in each pair with different RR(:), LR(:) and CI(.),
because unlike odds ratio OR(:), the RR(:) and LR(:) are not invariant under
swaps or transpositions, but they have opeRationally meaningful and useful
interpretations which OR(:) does not always have, eg relative risk.
The cue variable X could be discrete (eg binary ie dichotomous), or it can
be a continuous X split into 2 or more levels ie intervals. Here it is a
diagnostic test with 5 subintervals < 15,.., > 94, which are relevant for
use, but once judiciously chosen they stay fixed, and only the collected
classification counts matter. Finer partitioning (= quantization aka
discretization) of the continuous cue X into more subintervals xi would
decrease the joint counts n(y,xi), n(~y,xi) and thus degrade the robustness
of all results.
The solution:
P(y|xi) = P(y,xi)/P(xi) = P(y,xi)/( P(xi,y) + P(xi,~y) )
= n(y,xi)/( n(xi,y) + n(xi,~y) )
eg: = 474/(474 +20) = 0.96 or 96%
P(~y|xi) = 1 - P(y|xi) eg: = 0.04 or 4% is the predictivity of good outcome
= 20/(474 + 20) ; swapping the columns (or meanings) in the table
would turn risk ratio RR into my WinRatio WR = RR(~y:x) eg for
optimistic investors :-)
LR(xi:y) = P(xi|y) / P(xi|~y)
= [ n(y,xi)/n(y) ] / [ n(xi,~y)/n(~y) ]
eg = [ 474/809 ] / [ 20/1770 ] = 51.9
Not only is LR a bit easier to compute (from the data above) than RR, but in
medical applications LR will be more stable than RR. LR's can be collected
"globally" (eg on national scale) and via Bayes rule (find here below, or use
the nomogram at WWW.CEBM.NET Oxford) applied to the individual cases subject
to the local prevalence P(y), or applied to the individual prior probability
P(y), to obtain what we really want: the post-test probability P(y|x). Find
Bayes and Bailey below. It has been pointed out to me by prof. Brian Haynes
(McMaster University, Canada) and by prof. Paul Glasziou (Oxford) that it
would be misleading to publish P(y|xi), because a physician must use his or
her internal prior P(y) of an individual patient and update it (eg via the
nomogram) by LR(x:y) of the external population, to obtain patient's P(y|x).
So although LR may carry a more generally useful (because more robust ie
stable) partial information, RR carries information more meaningful finally
and individually: the risk ratio ie relative risk, and more (read on, pls).
RR(y:xi) = P(y|xi) / P(y|~xi)
= [ P(y,xi)/P(xi) ] / [ P(y,~xi)/( 1 - P(xi) ) ]
= [ n(y,xi)/n(xi) ] / [ n(y,~xi)/( n - n(xi) ) ]
note that n(y,~xi) = n(y) - n(y,xi); n(xi) = n(y,xi) + n(~y,xi), hence:
= [ n(y,xi)/( n(y) - n(y,xi) )] * [( n - n(xi) )/n(xi) ]
= [ 1/((n(y) / n(y,xi))- 1)] * [( n / n(xi) ) - 1 ]
eg: = [ 1/( 809/474 - 1)] * [(2570/(474+20)) - 1 ] = 5.9
or:
= [ n(y,xi)/n(xi) ] * [ ( n - n(xi) )/(n(y) - n(y,xi)) ]
eg: = [ 474/(474+20) ] * [ (2570 -(474+20))/( 809 - 474 ) ]
= 474/ 494 * 2076/335 = 5.9
Q: The meaning of P(y|xi) is quite easy to grasp, but what about RR and LR ?
A: Obviously RR(y:xi) is a relative risk as it contrasts the probability of
a bad outcome y if xi, against the probability of y if ~xi. That's easy,
opeRationally meaningful, hence useful. But there are other meanings hidden
in RR(:) and similar formulas. In this epaper we shall uncover those hidden
meanings or interpretations and properties to obtain fresh insights, eg:
RR(y:xi) = P(y|xi)/P(y|~xi)
         = P(y,xi)/P(y,~xi) * SurpriseBy(xi) [ P(y,xi)/P(y,~xi) == y --> xi ]
         = LikelyThanNot(y:xi) * SurpriseBy(xi)
LR = RR(xi:y) = P(xi|y)/P(xi|~y)
         = P(xi,y)/P(xi,~y) * SurpriseBy(y)  [ P(xi,y)/P(xi,~y) == xi --> y ]
         = LikelyThanNot(xi:y) * SurpriseBy(y)
One ounce of insight is worth one megaton of hardware. By comparing RR(:)
with other formulas we shall see how good it is. Also we shall investigate
how well does RR(:) indicate causal tendency, if any. Read on, please.
The 95% confidence interval CI of RR(:) completes my info-extraction :
n( xi) = n(y,xi) + n(~y,xi) from each row of the data
n(~xi) = n - n(xi) ; n(y,~xi) = n(y) - n(y,xi)
SeLn(RR) = standard error of Ln(RR)
= sqrt[ 1/n(y,xi) - 1/n(xi) + 1/n(y,~xi) - 1/n(~xi) ]
!! Caution: if n(y,xi) =. n(xi) or n(y,~xi) =. n(~xi)
then SeLn(RR) will be tight even for very small marginal
!! count n(xi) which obviously is UNreliable.
E.g.: n(x,y) = 3, n(x) = 3, n(y) = 54, n = 75 cases of heart patients :
RR(y:x) = [n(x,y)/n(x)] / [n(y,~x)/n(~x)] = [3/3]/[(54-3)/(75-3)] = 1.41
SeLn(RR) = sqrt( 1/3 - 1/3 + 1/(54-3) - 1/(75-3) ) = 0.0756
!! note that 1/3 - 1/3 = 0 contribution to error from 3/3 = P(y|x)
!! and even 1/1 - 1/1 = 0 :-((
Hence SeLn(RR) lacks the built-in wisdom of the old German saying "Einmal
ist keinmal" ie once is as-if never. For RR = 1.41 the CI is 1.22 to 1.64 ,
ie RR will not be outside CI in 95% of trials (to put it simply), hence also
RR = 1 (meaning independent x,y ie no relative risk) is not expected in
95% of trials. All seems fine while it is not, since low n(x) is UNRELIABLE.
The CI formula for a 95% confidence interval is:
Ln(CI) spans (Ln(RR) - 1.96*SeLn(RR)) upto (Ln(RR) + 1.96*SeLn(RR))
CI spans exp(Ln(RR) - 1.96*SeLn(RR)) upto exp(Ln(RR) + 1.96*SeLn(RR)).
The constant 2.576 is for 99% confidence intervals (are wider),
1.96 is for 95% confidence intervals (are common),
1.645 is for 90% confidence intervals (are narrower),
which means that eg in 95% of trials with the population counts we shall
get a RR(:) value within our CI which is based on much lower sample counts.
That's what the books suggest, but as I show above, RR's computed from ratios
close to 1 are misleadingly considered as if deserving our confidence :-((
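Here is the CI recipe in Python (my sketch), applied to the heart-patients
example above; note how the cancellation 1/n(x,y) - 1/n(x) = 0 keeps the
interval deceptively tight despite n(x) = 3 :

```python
import math

nxy, nx, ny, n = 3, 3, 54, 75
rr = (nxy / nx) / ((ny - nxy) / (n - nx))                 # RR(y:x)
se = math.sqrt(1/nxy - 1/nx + 1/(ny - nxy) - 1/(n - nx))  # SeLn(RR)
lo = math.exp(math.log(rr) - 1.96 * se)
hi = math.exp(math.log(rr) + 1.96 * se)
print(round(rr, 2), round(se, 4), round(lo, 2), round(hi, 2))
# -> 1.41 0.0756 1.22 1.64 despite the UNreliable n(x) = 3
```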
Let's analyze another real example, an ECG test result x at a heart clinic :
n(y,x) = 21, n(x) = 74, n(y) = 21, n = 75 patients in total data set,
from which my KnowledgeXplorer KX computed and listed (among many other tests
and results) :
n(~x) = n - n(x) = 75 - 74 = 1 ie all but 1 patient tested had x
n(y,~x) = n(y) - n(y,x) = 21 - 21 = 0 !!
P(y| x) = 21/74 = 0.28
P(x| y) = 21/21 = 1.00
P(y|~x) = 0/1 = 0 !! hence RR(y:x) = oo ie infinite !!
also the standard error SeLn(RR) = oo due to 1/n(y,~x) = 1/0 = oo
The correlation coefficient between events x,y is :
r = [ P(y,x) - Py*Px ]/sqrt[ Py*(1 - Py) * Px*(1 - Px) ] = 0.073 is very low
The coefficient of determination (does not exaggerate dependence as r does) :
r2 = r*r = beta(y:x)*beta(x:y) = 0.28*0.02 = 0.0056 is even lower
This real-world example illustrates that RR(y:x) = oo may obtain for almost
independent events x,y . I have not seen a book or a paper telling this
!!! PARADOXICAL behavior. We see that 100% implication and near independence
are not incompatible. So we can have formulas which have nice opeRational
interpretation points with identical meanings [ -1..0..1 ], eg :
degree of factual support F(y:x) by { Kemeny 1952 } which is RR rescaled, and
measure of corroboration C(y:x) by { Popper 1972 }. I rewrite both only
formally, and find them to have very similar forms. Both yield identical 0.0
in the clean-cut case of 100% independence, but each formula may yield a very
different result when a less clean-cut ie less extreme situation occurs, so
they will differ in most common situations. In the last example above we get :
F(y:x) = [ P(y|x) - P(y|~x) ] / [ P(y|x) + P(y|~x) ] { F-form 1 }
= [ 0.28 - 0 ] / [ 0.28 + 0 ] = 1 = 100% implication
= [ Pxy - Px*Py ] / [ Pxy + Px*Py - 2*Pxy*Px ] { my F-form 2 }
= 0 if x,y independent
= 0 if Px = 1 ie unsurprising x , then also Py = Pxy = Px*Py
ie x,y independent AND yet P(x|y) = 1
= 0 if Py = 1 ie x,y independent AND yet P(y|x) = 1 as Px=Pxy=Px*Py
= [ P(x|y) - Px ] / [ P(x|y) + Px - 2*Px*P(x|y) ] shows that:
= 1 iff P(x|y) = 1 (regardless of Px )
! = 1 if Px < 1 & Py = Pxy ie (y --> x), also in the extreme case:
if y = x ie for F(x:x) ie when x implies itself
= -1 if Pxy = 0 ie x,y disjoint
= -F(y:~x)
= [ RR(y:x) - 1 ]/[ RR(y:x) + 1 ]
vs the very similarly looking, yet differently behaving :
C(y:x) = [ Pxy - Px*Py ] / [ Pxy + Px*Py - Pxy*Px ] { my C-form 2 }
= [ P(y|x) - P(y|~x) ] / [ P(y|x) + Py/P(~x) ]
= [ 0.28 - 0 ] / [ 0.28 + 21/1 ] = 0.01 = near independence
= 0 if x,y independent
= 0 if Px = 1 ie unsurprising x , then also Py = Pxy = Px*Py
ie x,y independent AND yet P(x|y) = 1 ,
= 0 if Py = 1 ie x,y independent AND yet P(y|x) = 1 as Px=Pxy=Px*Py
= [ P(x|y) - Px ] / [ P(x|y) + Px - Px*P(x|y) ] shows that here:
= 1 - Px if P(x|y) = 1 eg C(y:x) = 0 if Px = 1 !!! compare with F(y:x)
= 1 - Px if Px < 1 & Py = Pxy ie (y --> x), also in the extreme case:
if y = x ie for C(x:x) ie when x implies itself
1 - Px is "semantic information content" of x ( SIC by Popper's design)
Note that P(x|y)-Px = 1-Px if P(x|y)=1 ( SIC too w/o /norm :-)
>= -1 if Pxy = 0 ie x,y disjoint
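My Python sketch below recomputes F(y:x), C(y:x) and r for the ECG example
above, making the split personality of these formulas tangible :

```python
import math

n, nx, ny, nxy = 75, 74, 21, 21
px, py, pxy = nx / n, ny / n, nxy / n
cov = pxy - px * py

F = cov / (pxy + px * py - 2 * pxy * px)   # Kemeny's factual support, form 2
C = cov / (pxy + px * py - pxy * px)       # Popper's corroboration, form 2
r = cov / math.sqrt(px * (1 - px) * py * (1 - py))
print(round(F, 2), round(C, 2), round(r, 2))   # -> 1.0 0.01 0.07
```

F shouts "100% implication" while C and r whisper "almost independent".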
!!! Such formulas, including RR(:) and LR(:) which are just rescaled F(:)'s,
may become mixed blessings because they inseparably mix measurements of
2 different properties to which each formula is differently (in)sensitive.
Conclusion: Although a single formula is handy to indicate associations of
interest, it cannot be blindly relied upon, especially not when
it yields extreme values. Other results from other formulas must be checked
alongside. Know thy formulas, and thou shalt suffer no disgrace! This is
my paraphrase of the great strategist Sun-Tzu who talked about thy enemies.
-.-
+Mottos :
Great minds discuss ideas, average minds discuss events, small minds
discuss people. { Adm. Hyman Rickover, father of the US nuclear navy, whose
assistant used to be Charley Martin, author of the best unorthodox cmdr
CMFiler which I use to do my epaperwork and for all my file handling, for
which I suggested 510 improvements (for MiniTrue only 235 :-) }
Somebody has classified people into three categories:
into the uneducated, who see only disorder;
into the half-educated, who see and follow the rules;
and into the educated, who see and appreciate the exceptions.
The computer clearly belongs to the category of half-educated.
{ Heinz Zemanek, IFIP President 1971-1974 }
Indeed, the thoughtful physician recognizes that each incremental advance in
scientific knowledge also unmasks new areas of the unknown that demand
resolution. Lewis Thomas has recently written: "The greatest single achieve-
ment of science [..] is the discovery that we are profoundly ignorant; we
know very little about nature, and we understand even less." ...
We strive for an unscalable summit, destined to be forever obscured in the
mists of undiscovered knowledge. ... However our progress is impeded by true
ignorance: lack of familiarity with that which is known and lack of compre-
hension of the need for - and the very nature of - the process of biomedical
research. { Thomas H. Weller (1915-????), Nobel prize for medicine 1954,
The mountain of the unknown; Hospital Practice, May 1982, pp.33+38+43 }
When you can measure what you are speaking about, and express it
in numbers, you know something about it. But when you cannot,
your knowledge is of a meagre and unsatisfactory kind.
{ William Thomson aka Lord Kelvin of Largs (1824-1907) }
What is measurable is manageable.
In general we mean by any concept nothing more than a set of operations;
the concept is synonymous with the corresponding set of operations. ...
The proper definition of a concept is not in terms of its properties
but in terms of actual operations. ... Meanings are operational.
{ Percy W. Bridgman (1882-1962), Nobelist; father of operationalism 1927 }
Measures of association should have operationally meaningful
interpretations that are relevant in the contexts of empirical
investigations in which measures are used.
{ Goodman & Kruskal, 1963, p.311, there also in the footnote }
The true logic of this world is the calculus of probabilities.
{ James Clerk Maxwell (1831-1879) }
An event is an event is an event { my paraphrase of Gertrude Stein who thus
spoke about a rose. Sorry Gertrude, me no Einstein, but Bayes rule and all the
independence conditions like P(y|x) = P(y|~x) hold for any mix of x,y,~y,~x }
I don't talk things, sir, said Faber, I talk the meanings of things.
... [Books] have quality. To me it means texture. ... Telling detail.
Fresh detail." { Ray Bradbury, Fahrenheit 451, Part II, pp.75, 83 }
One ounce of insight is worth one megaton of hardware.
Connect, always connect. Compare, always compare. { JH }
God is in the detail. { Mies van der Rohe, architect }
Would that I could discover truth as easily as I uncover falsehood.
{ Cicero }
Detecting error is the primary virtue, not proving truth. ...
There is nothing quite like a brilliant and beautiful theory that has been
decisively refuted.
{ Colin McGinn, Looking for a black swan = review of 4 books about/by
Karl Popper, in New York Review of Books, November 21, 2002 }
I know you believe you know what I know, but I don't know whether you
know what I don't know. { a private thought of every expert (system) }
One man's mechanism is another man's black box { Patrick Suppes, Stanford }
One man's data is another woman's noise;
one man's cause is another woman's effect. { JH }
An invasion of armies can be resisted but not an idea whose time has come.
{ Victor Hugo }
Rerum cognoscere causas = to know the causes of things. { Virgil }
Same cause, same effect { Hempel 1965, p.348 }
Seek simplicity and distrust it. { Alfred North Whitehead (1861-1947),
Cambridge, London, Harvard, co-author of Principia Mathematica }
Keep it simple, but not simplistic { Jan Hajek }
Know thy formulas and thou shalt suffer no disgrace { my paraphrase of the
greatest strategist ever, Sun-Tzu, 2500 b.PC., : Know thy enemies .. }
Math is hard. Let's go shopping. { Barbie }
-.-
+Combining : priorities, averages, median
Contrasting is done by computing an absolute difference, or a ratio ie
a relative difference. An asymmetric denominator provides for
the vital ASYMMETRY or ORIENTEDness ie DIRECTEDness. I say:
It's the denominator, student!
Combining two semantically different measures (in different units) is
generally best done by multiplying them. IF you have to decide between
2 equally available or expensive objects (eg machines) or subjects (eg
[wo]men :-) and you have no further information, knowledge, preferences
except that the 1st has equally desirable key parameters pA1 and pB1 ,
and the 2nd has equally desirable key parameters pA2 and pB2 ,
where pA's have meaning (eg units of measurement) different from pB's ,
and pA1 and pA2 have the same meaning (eg units) but different values,
and pB1 and pB2 have the same meaning (eg units) but different values,
THEN the golden rule is to buy/choose the object/subject with the larger
product pA * pB. E.g.:
(S)he1 has IQ1 = 103 and Salary1 = 60k, so that 103*60 = 6180
(S)he2 has IQ2 = 98 and Salary2 = 70k, so that 98*70 = 6860 , then
your best pick is (S)he2, ceteris paribus. Feel free to combine IQ with,
say the breast/waist ratio, or decide between 2 PC's with different MHz and
different GigaBytes for the same price, or freely available from a PC-dump.
More math on this (with a threshold value) is in { Grune 1987 }.
Different preferences can be captured by assigning weight wA to paramA and
weight wB to paramB, etc. Make sure that weights >= 0, and params >= 1,
because eg (0.4)^2 = 0.16 < 0.4, but we are free to rescale all params so
that they all will be >= 1, so there will be no problem. Then the formulas
become :
priority1 = (paramA1^wA)*(paramB1^wB)*...etc for more params
priority2 = (paramA2^wA)*(paramB2^wB)*...etc for more params
priority3 = (paramA3^wA)*( etc for 3 objects or subjects to choose from.
The maximal priority wins. I have extensively used heuristic priorities
computed & pushed in priority queues during my pioneering R&D on automated
verification of communication / networking protocols back in 1977 done via
the thinktank RAND Corp. for DARPA (find both on WWW), when TCP was still
fresh & buggy. See my epaper on APPROVER on WWW.MATHEORY.INFO or .COM
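The weighted-product priority can be sketched thus (my Python; params must be
rescaled to >= 1 and weights >= 0 as cautioned above; the numbers are the
IQ/Salary example with equal weights) :

```python
def priority(params, weights):
    """Weighted product of desirability parameters; the maximum wins."""
    assert all(p >= 1 for p in params) and all(w >= 0 for w in weights)
    prod = 1.0
    for p, w in zip(params, weights):
        prod *= p ** w
    return prod

she1 = priority([103, 60], [1, 1])           # IQ1 * Salary1 = 6180
she2 = priority([98, 70], [1, 1])            # IQ2 * Salary2 = 6860
print("she2 wins" if she2 > she1 else "she1 wins")   # -> she2 wins
```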
The above task is related to the averages based on multiplication (rather
than on addition) :
harmonic average ha = 2*A*B/(A+B) <= sqrt(A*B) = ga = geometric average
which both (unlike the arithmetic average) yield zero when either A or B is
zero. For our task above the geometric mean would be fine, while harmonic
average would be harmful as it contains (A+B) which makes no sense
! if A and B have different meanings (eg different units). Multiplication
makes sense in general, since you don't want someone with IQ = 0 or with
the breast/waist ratio = 0, do you ? See www for "weighted averages".
Arithmetic mean minimizes the variance ie the sum of squared deviations
from the mean.
The median minimizes the sum of ABSolute deviations from the median.
I cannot go here into further criteria for when to use which kind of
average and median, but a good book on statistical literacy should.
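A small numerical check (my Python sketch) of the claims in this section :

```python
import math, statistics

A, B = 4.0, 9.0
ha = 2 * A * B / (A + B)          # harmonic average
ga = math.sqrt(A * B)             # geometric average
am = (A + B) / 2                  # arithmetic average
print(ha <= ga <= am)             # -> True; ha = ga = 0 if A or B is 0

xs = [1.0, 2.0, 2.0, 3.0, 10.0]
mean, med = statistics.mean(xs), statistics.median(xs)
sq = lambda c: sum((x - c) ** 2 for x in xs)   # sum of squared deviations
ab = lambda c: sum(abs(x - c) for x in xs)     # sum of absolute deviations
print(sq(mean) < sq(med), ab(med) < ab(mean))  # -> True True
```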
-.-
+Notation, tutorial, basic insights, PARADOXical "independent implication" :
?(y:x) denotes a measure of how much the evidence y CONFIRMS x as a cause,
conjecture or hypothesis x. (y:x) means that y implies x ie y --> x,
within such a measure. Note that P(x|y), or P(x|y) - Px are y --> x
where -Px discounts the lack of surprise if Px is high (find SIC ),
while in RR(y:x) = P(y|x)/P(y|~x) it is the 1/(Py - Pxy) == y --> x
[ in /P(y|~x) ], which due to its range overrules P(y|x) == x --> y ,
while (1-Px)/Px discounts the lack of surprise if Px is high (find SIC)
but don't get misled by the form, since most measures can be rewritten
as :
[ Pxy - Px*Py ]/[ denominator1 ] = cov(x,y)/denominator1
= [ P(x|y) - Px ]/[ denominator2 ] if denominator2 = 1 then y --> x
= [ P(y|x) - Py ]/[ denominator3 ] if denominator3 = 1 then x --> y
where the numerator captures dependence (is 0 if x,y independent),
while the denominator decides implication y --> x, or x --> y.
It's the denominator, students! :-) For example :
cov(x,y)/Py = P(x|y) - Px
cov(x,y)/Px = P(y|x) - Py
cov(x,y)/var(x) = cov(x,y)/[Px*(1-Px)] = P(y|x) - P(y|~x) = ARR
cov(x,y)/[ Pxy + Px*Py - Pxy*Px ] = C(y:x) { my C-form 2 }
cov(x,y)/[ Pxy + Px*Py - 2*Pxy*Px ] = F(y:x) { my F-form 2 }
Popper, Kemeny and I.J. Good used ?(x:y), until in 1992 I.J. finally
switched to my less error prone ?(y:x) which is more mnemonical, as
it matches the 1st term which is P(y|x) in their formulas.
[0..1..oo) is a half open interval including 0 but excluding infinity oo,
where the central point 1 means stochastic independence. Since I
assume marginals Px > 0 and Py > 0, many intervals will be half open
..oo) and not ..oo] ie not closed.
as-if useful fiction a la { Vaihinger 1923 }. E.g. the product of marginal
probabilities Px*Py provides a fictional point of reference for
dependence of events x and y. Fictional, because Pxy = Px*Py occurs
rarely, but we often contrast Pxy vs Px*Py in Pxy - Px*Py or in
log(Pxy/(Px*Py)) in Shannon's mutual information formula. I realized
that the Archimedean point of reference { Arendt 1959 } may be a
special case of the useful as-if fictionalism (find Px*Py here & now).
Also the MAXimal possible values are as-if values eg for normalization.
SIC stands for "semantic information content" ( sic! :-) ; find Popper
<> unequal ie either smaller < , or greater > than something
= equal
=. nearly or approximately equal, close to
:= assignment statement in Pascal (in C it is the ambiguous sign = )
== equivalence of two terms, or synonymity of notations or terms
equivalence is a logical relationship R(a,b) such that it is:
reflexive & symmetric & transitive;
reflexive(a) := R(a,a) for all a
symmetric(a,b) := IF R(a,b) THEN R(b,a) for all pairs of a,b
transitive(a,b,c) := IF R(a,b) & R(b,c) THEN R(a,c) for all a,b,c
<= less or equal
>= greater or equal
>=< any one of the relations > = < >= <= <> consistently used
* multiplication
oo infinity (eg 1/0 = oo , but 0/0 = undefined in general, but in
expected values we take 0*0/0 = 0 , eg in entropic means; in
RR(y:x) = conv(y --> x) = 0/0 = 1 if Px = 1 ie Py = Pxy = Px*Py ! )
^ power operator, eg 3^4 = 3^2 * 3^2 = 9*9 = 81
sqr(.) == square(.) == (.)^2 = (.)*(.)
sqrt(.) is a square root of (.)
Sum_i:[ . ] is a sum over the items indexed by i within [.]
lhs, rhs abbreviate left hand side, right hand side respectively
exp(a) = e^a where e = 2.718281828 is Euler's number
ln(.) is logarithmus naturalis based on Euler's number e
exp(ln(.)) = (.) is antilogarithm aka antilog
log2(a) = ln(a)/ln(2) = ln(a)/0.69314718 = 1.442695*ln(a) , now base = 2
log(a*b) = log(a) + log(b) where log(.) is of any base, eg ln(.)
log(a/b) = log(a) - log(b) = -log(b/a); log(1/a) = -log(a)
(a/b - b/a) = -(b/a - a/b) is a logless reciprocity function of a, b
(a - 1/a) = -(1/a - a ) is a logless reciprocity function of a.
Reciprocity is desirable when creating new entropy functions.
A logless additivity can be achieved by relativistic regraduation.
x, y, e, h symbolize events viewed as-if random events r.e.s
X, Y symbolize variables viewed as-if random variables r.v.s,
here an r.v. is a set of r.e.s
~x negation (ie complement) of an event x , so that P(~.) + P(.) = 1
P(.) is a probability, a proportion or a percentage/100. Empirical P's in
general and observational P's in particular should be smoothed from
the range [0..1] to (0..1) ie to 0 < P < 1.
There are several definitions of probability, the main distinction being
frequentist (based on repetition and exchangeability) vs subjectivist
(allowing plausibility or belief). I am an unproblematic guy: an
antidogmatic Bayesian frequentist, or a data-driven empirical Bayesian.
Here I see each proportion as an approximation of a probability.
In fact a proportion is a maximum likelihood (ML) estimate of a
probability, which is ok if in c/n the count c > 5 and n = large.
I designed robust formulas for estimates when c = 0, 1, 2, 3, etc,
and data-tested their great powers in my KnowledgeXplorer aka KX.
Px == P(x) is a parentheses-less notation for P(x_i) ie P(x[i]) ie P(xi).
1-Px has range [0..1]; it linearly decreases with Px, and it measures:
+ improbability of an event x ; x may be a success or a failure;
+ surprise value of x ; the less probable, the more surprising is
x when it happens. What is too common, cannot be surprising.
What is not surprising is not interesting, carries no new meaning.
More surprising x means more "semantic information CONTENT in x" SIC ,
since the lower the Px the more possibilities it FORBIDS, EXCLUDES,
REFUTES or ELIMINATES when x occurs (find SIC , Spinoza below).
1/Px has the range [1..oo) and it hyperbolically decreases with Px.
log(1/Px) = -log(Px) ranges [0..oo); log bends the steep 1/Px down, and
measures surprise in Shannon's classical information theory.
In 1878 Charles Sanders Peirce (1839-1914) linked log(Px) to the Weber-
Fechner psychophysical law, see { Norwich 1993 }.
In the 1930s Harold Jeffreys wrote about log[ LR(x)/LR(y) ], Abraham Wald
in 1943, and Turing & Good used this "weight of evidence" during WWII.
(1-Px)/Px ranges [0..oo); is my steep measure of surprise in an event x .
E[f(x)] = Sum[Px*f(x)] = expected value of f(x) ie an arithmetic average
ie an arithmetic mean of f(x). Let f(x) = P(x) :
E[ Px ] = Sum[Px*Px] = Sum[Px^2] = expected probability of the variable X
1 - E[Px] = Sum[Px*(1 - Px)] = Sum[Px - Px*Px] = 1 - Sum[(Px)^2]
= expected probability of error or failure for r.v. X
= expected surprise = expected semantic information content SIC
= quadratic entropy, which is not only simpler and faster than
Shannon's, but also provably better for classification, identification,
recognition and diagnostic tasks. Shannon's entropies are better only for
coding. Don't tell this secret to any classical information theorist :-)
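For concreteness, my little Python comparison of the two entropies on a
3-valued variable (a sketch only) :

```python
import math

def quadratic_entropy(ps):
    return 1.0 - sum(p * p for p in ps)        # = 1 - E[Px] = expected error

def shannon_entropy(ps):                       # in bits
    return -sum(p * math.log2(p) for p in ps if p > 0)

ps = [0.5, 0.25, 0.25]
print(quadratic_entropy(ps))    # -> 0.625
print(shannon_entropy(ps))      # -> 1.5 bits
```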
Variance of an indicator event x (ie binary or Bernoulli event) is:
Var(x) = Cov(x,x) = P(x,x) - Px*Px = Px - (Px)^2 = Px*(1 -Px), since
Cov(x,y) = P(x,y) - Px*Py = covariance of events x,y in general
Px*Py is a fictitious joint probability of as-if independent events x, y;
it serves as an Archimedean point of reference (a la Arendt )
to measure dependence of x,y either by Cov(x,y) = Pxy - Px*Py or
! by Pxy/(Px*Py), (find as-if ). If Px=1 or Py=1 then Pxy = Px*Py !
P(x,y) == Pxy == P(x&y) is the joint probability of x&y . Pxy measures
co-occurrence ie compatibility of x and y. Until the early 1960s
P(x,y) was used to denote P(x|y) in the writings of Hempel, Kemeny, Popper
and Rescher, while they used P(xy) for the modern P(x,y) ie my Pxy.
Empirical and observational proportions should be smoothed to :
0 < Pxy < minimum[ Px, Py ] ie an empirical P(x,y) should be less than
its smallest marginal P. Low counts n(x,y) >= 1 are much improved by
P(x,y) =. [n(x,y) - 0.5]/n , and P(y|x) =. [n(x,y) - 0.5]/n(x)
which I may show derived exactly (ie = , not just =. ) elsewhere.
P(x|y) = Pxy/Py defines conditional probability, and Bayes rule follows:
P(x|y)*Py = Pxy = Pyx = Px*P(y|x) shows invertibility of conditioning
P(x|y)/Px = P(y|x)/Py = Pxy/(Px*Py) is my favorite form of basic Bayes as:
P(x|y) ? Px == P(y|x) ? Py , where the ? is < , = , > ; and also
P(x|y) ? P(x|~y) == P(y|x) ? P(y|~x) where the ? is applied consistently.
P(x|y)/P(y|x) = Px/Py is Milo Schield's form of basic Bayes
P(x|y) = Px*P(y|x)/Py is the basic Bayes rule of inversion,
where Px = "base rate"; IGNORING Px is people's "base rate fallacy".
Odds form of Bayes rule :
Odds(y|x) = Odds(y) * LR(x:y) { Odds local or individual,
LR "global" eg national }
= P(y|x)/P(~y|x) = (Py/(1-Py)) * P(x|y)/P(x|~y) = P(y|x)/(1 -P(y|x))
= P(y,x)/P(~y,x)
= n(y,x)/n(~y,x) = n(y,x)/[ n(x) - n(y,x) ] would be the straight, but
misleading estimate in medicine (find Bailey / Glasziou / Haynes here).
P(y|x) = Odds(y|x)/(1 + Odds(y|x)) = 1/(1/Odds(y|x) + 1)
= 1/( 1 + n(x,~y) / n(x,y) )
= n(x,y)/( n(x,~y) + n(x,y) ) = n(x,y)/n(x) = P(y|x) q.e.d.
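The odds chain just derived can be verified numerically with the x1 row
counts of the Executive summary table above (my Python sketch) :

```python
n_y, n_not_y = 809, 1770
n_yx, n_not_y_x = 474, 20            # n(y,x), n(~y,x)

odds_prior = n_y / n_not_y                       # Odds(y)
lr = (n_yx / n_y) / (n_not_y_x / n_not_y)        # LR(x:y)
odds_post = odds_prior * lr                      # Odds(y|x) = n(y,x)/n(~y,x)
p = odds_post / (1.0 + odds_post)                # P(y|x)
print(round(odds_post, 1), round(p, 2))          # -> 23.7 0.96 ie 474/494
```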
-log(Bayes rule) :
-log( P(x|y) ) = -log(Pxy/Py) =
-log(Px*P(y|x)/Py) = -log(Px) - log(P(y|x)) + log(Py) is the -log(Bayes)
Note that for only comparative purposes between several hypotheses
x_j we may ignore Py (but NEVER IGNORE the base rate Px !) since Py is
a (quasi)constant for all x_j's compared: the shortest code for max
P(x_j, y) wins. This holds for logless Bayesian decision-making too: the
maximal Pxy is the winner. This is Occam's razor opeRationalized,
as it has the minimal coding interpretation as follows :
x = unobserved/able input of a communication channel, or
unobservable hypothesis/conjecture/cause/MODEL to be inferred/induced;
y = observed/able output of a communication channel, or
available test result/evidence/outcome/DATA.
According to { Shannon, 1949, Part 9, p.60 } and provable by Kraft's
inequality, the average length of an efficient ie shortest and still
uniquely decodable code for a symbol or message z will be -log(P(z)),
in bits if the base of log(.) is 2. Hence the interpretations of our
-logarithmicized Bayes rule { Computer Journal, 1999, no.4 = special
issue on MML, MDL } are opeRationalized Occam's razors :
- MML = minimum message length (by Chris Wallace & Boulton, 1968)
- MDL = minimum description length (by Jorma Rissanen, 1977)
- MLE = minimum length encoding (by Pednault, 1988)
These themes are very very close to Kolmogorov complexity, originated in the
US by Ray Solomonoff in 1960, and by Greg Chaitin in 1968, and were designed
already into Morse code, and by Zipf's law evolved in plain language, eg:
4-letter words are so short because they are used so often. In Dutch we use
3-letter words, either because we use them more frequently, or because we
are more efficient than the Anglos :-))
Hence the total cost ie length of encoding is the sum of the cost of coding
the model x_j , plus the cost ie code size of coding the data y given
that particular model x_j. Stated more concisely :
cost or complexity = -log(likelihood) - log(prior) ,
where -log(prior) is the code length ie penalty for the model's complexity
and you know that I dont mean any models on a catwalk :-) The pop
version of Occam's "Nunquam ponenda est pluralitas sine necessitate" is the
famous KISS-rule: Keep it simple, student ! :-) Simplicity should be
preferred over complexity, subject to "ceteris paribus". Einstein used to
say: "Everything should be made as simple as possible, but not simpler".
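The comparative use of the -logarithmicized Bayes rule can be sketched in a
few lines of Python; the priors and likelihoods below are invented for
illustration only:

```python
import math

# Invented priors P(x_j) and likelihoods P(y|x_j) for 3 hypotheses x_j.
priors      = [0.60, 0.30, 0.10]
likelihoods = [0.20, 0.50, 0.90]          # P(y|x_j) for the observed y

joint      = [p * l for p, l in zip(priors, likelihoods)]   # P(x_j, y)
codelength = [-math.log2(p) - math.log2(l)                  # bits to code
              for p, l in zip(priors, likelihoods)]         # model + data

best_by_joint = max(range(3), key=lambda j: joint[j])
best_by_code  = min(range(3), key=lambda j: codelength[j])
print(best_by_joint, best_by_code)   # the same hypothesis wins both ways
```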
The MOST SIMPLISTIC, NAIVE measures of causal tendency :
P(y|x) = Pxy/Px = Sufficiency of x for y { Schield 2002, Appendix }
= Necessity of y for x { follows from the next line: }
P(x|y) = Pxy/Py = Necessity of x for y { Schield 2002, Appendix }
= Sufficiency of y for x { follows from above }
but WATCH OUT , CAUTION :
!!! let x = a disease, y = 10 fingers : P(y|x) = 1 in a large subpopulation
but it would be a semantic NONSENSE to say that x suffices for y , or
that y is necessary for x { courtesy Jan Kahre, private comm. }.
My analysis: P(y|x) = Pxy/Px is not a DECreasing function of Py, hence any
y with P(y) =. 1 ie too COMMON y will REFUTE P(y|x) as a measure.
!!! Much more complicated REFUTATIONS of all single P(.|.)'s or P(.)'s as
measures of confirmation or corroboration are in { Popper 1972, Appendix
IX, pp.390-2, 397-8 (4.2) etc, and p.270 }. P(.|.)'s should be viewed as
NAIVE, CRUDE, MOST SIMPLISTIC measures : rel. = relatively
P(x|y) = Pxy/Py = a measure of (y implies x) ie rel. how many y are x
= a measure of (x includes y) ie rel. how many y in x ;
P(y|x) = Pxy/Px = a measure of (x implies y) ie rel. how many x are y
= a measure of (y includes x) ie rel. how many x in y ;
draw a Venn diagram of targets being hit by arrows.
P(y|x)*P(x|y) = a measure of (x Sufficient for y) & (x Necessary for y)
= a measure of (y Necessary for x) & (y Sufficient for x)
= ((Pxy)^2)/(Px*Py) ; its symmetry makes it worthless as
a measure of causal tendency.
Pxy/(Px*Py) has range [0..1..oo) and measures stochastic dependence;
oo unbounded POSitive dependence of x, y
1 iff independent x, y
0 bounds NEGative dependence of x, y
0 iff disjoint x, y ; do not confuse disjoint with independent !
A fresh alternative look at old stuff ( Px*Py is as-if independence ) :
Pxy/(Px*Py) = (Pxy/Px)*(1/Py) =
= (x implies y)*( steepSurprise by y )
= (Sufficiency of x for y)*( steepSurprise by y )
= ( Necessity of y for x)*( steepSurprise by y )
= (Pxy/Py)*(1/Px) = (y implies x)*( steepSurprise by x )
= (Sufficiency of y for x)*( steepSurprise by x )
= ( Necessity of x for y)*( steepSurprise by x )
= [0..1]*[1..oo) = [0..1..oo) is the range; 1 iff independent
= symmetrical wrt x,y which may be good for coding but poor for a directed
ie oriented eg causal inferencing, hence I created :
!!
(Pxy/Px)*(1-Py) = P(y|x)*(1-Py) = (x implies y)*(linearSurprise by y )
(Pxy/Py)*(1-Px) = P(x|y)*(1-Px) = (y implies x)*(linearSurprise by x )
= [0..1]*[0..1] = [0..1] is very reasonable
! is asymmetrical wrt x, y hence is capturing causal tendency better.
I created these new measures because trivial ie unsurprising implications
are of little interest for data miners, doctors, engineers, investors,
researchers, scientists. The next formulas would overemphasize importance of
surprise, because Pxy/Px has range [0..1], while (1-Py)/Py has [0..oo) :
!
(Pxy/Px)*(1-Py)/Py = P(y|x)*(1-Py)/Py = (x implies y)*(bigSurprise by y )
= [0..1]*[0..oo) = [0..oo) { big range }
(Pxy/Py)*(1-Px)/Px = P(x|y)*(1-Px)/Px = (y implies x)*(bigSurprise by x )
= [0..1]*[0..oo) = [0..oo)
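A numeric sketch of these surprise-weighted implications, with made-up
Px, Py, Pxy (mine, chosen only to satisfy Pxy <= min(Px, Py)):

```python
# Made-up probabilities of two events x, y.
Px, Py, Pxy = 0.4, 0.5, 0.3

dep_ratio = Pxy / (Px * Py)              # steepSurprise form, 1 iff indep.
linear    = (Pxy / Px) * (1 - Py)        # (x implies y)*(linearSurprise y)
big       = (Pxy / Px) * (1 - Py) / Py   # (x implies y)*(bigSurprise y)

print(dep_ratio, linear, big)
# linear stays within [0..1]; big shares the unbounded range [0..oo)
# of dep_ratio, hence may overemphasize the surprise factor.
```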
After this synthesis we should not be surprised that the last lines form
a substantial part of a risk ratio aka relative risk :
RR(y:x) = P(y|x)/P(y|~x) is 0 for disjoint x,y ; is 1 for independent ;
= (Pxy/(Py - Pxy))*(1-Px)/Px = (y implies x)*(bigSurpriseBy x)
= [0..oo)*[0..oo) = [0..oo)
note that :
+ both factors have the same range [0..oo) hence none of them dominates
structurally ie in general;
+ in both factors both numerator and denominator are working in the same
direction for increasing the product of implies * surprise;
+ there is no counter-working within each and among factors.
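The factorization of RR(y:x) into implies * surprise is easily verified
numerically (made-up Px, Py, Pxy, mine):

```python
Px, Py, Pxy = 0.4, 0.5, 0.3          # made-up, with Pxy <= min(Px, Py)

P_y_given_x    = Pxy / Px
P_y_given_notx = (Py - Pxy) / (1 - Px)
RR = P_y_given_x / P_y_given_notx    # RR(y:x) = P(y|x)/P(y|~x)

implies  = Pxy / (Py - Pxy)          # the (y implies x) factor
surprise = (1 - Px) / Px             # the bigSurpriseBy(x) factor
print(RR, implies * surprise)        # the two agree
```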
! P(y|x) > P(y|~x) == P(x|y) > P(x|~y) == Pxy > Px*Py (derive it) which
is symmetrical ie directionless ie not oriented;
the equivalence holds for the < <> = >= <= as well, the = is in all
17 conditions of independence. On human psychological difficulties in
dealing with such causal/diagnostic tasks see { Tversky & Kahneman:
Causal schemas in judgments under uncertainty } in { Kahneman 1982,pp.122-3}
cov(x,y) = Pxy - Px*Py = covariance of events x, y (binary aka indicator)
var(x) = Pxx - Px*Px = Px*(1 - Px) = variance of an event x (autocov )
corr(x,y) = cov(x,y)/sqrt(var(x)*var(y)) = correlation of binary events x,y
>= greater or equal.
=> is meaningless in this epaper, although some use it for an implication,
which is misleading because :
(y --> x) == (y <== x) == (y subset of x) == (y implies x); note that
the <= works on Booleans represented as 0, 1 for False, True
respectively and evaluated numerically. E.g. in Pascal
(y <= x) on Boolean variables means that (y implies x).
In our probabilistic logic ( P(x|y)=1 ) == (y implies x) fully,
ie (y is Sufficient for x), ie to hit y will hit x ,
!! ie (x is Necessary for y), ie to miss x will miss y (just
draw a Venn diagram with a smaller circle y within a larger
circle x , ie with full overlap, and view these circles as
targets to be hit or missed by you, the virtual archer ).
My B(y:x), W(y:x), F(y:x) and C(y:x) have been written as ?(x:y) by ancient
authors like I.J. Good, John Kemeny and Sir Karl Popper, who were inspired
by the Odds-forms, which swap x, y via Bayes rule of inversion. However my
notation (I.J. Good used it only in his latest papers since 1992) is much
less error prone as it naturally & mnemonically abbreviates the simplest
straight forms like eg:
RR(y:x) = risk ratio = relative risk = B(y:x) = simple Bayes factor
= P(y|x) / P(y|~x)
ARR(y:x) = P(y|x) - P(y|~x) = absolute risk reduction = risk difference
= attributable risk
= a/(a+b) - c/(c+d) = (ad - bc)/[ (a+b)*(c+d) ]
= (Pxy -Px*Py)/(Px*(1-Px)) = risk increase (or risk reduction )
= cov(x,y)/var(x) = covariance(x,y)/variance(x)
!! = beta(y:x) = the slope of the probabilistic regression
line Py = beta(y:x)*Px + alpha(y:x) for indicator events x, y
ie for binary events aka Bernoulli events; -1 <= beta(:) <= 1
! 0.903 - 0.902 = 0.001 is relatively small, but the same difference:
0.003 - 0.002 = 0.001 is relatively large; absolute differences may be
misleading for some purposes, but for practical treatment effects the
RR(y:x) exaggerates risk more, and more often than ARR(:) and 1/|ARR|'s
like NNT, NNH do.
RRR(y:x) = RR(y:x) - 1 = ARR(y:x)/P(y|~x) = [ P(y|x) - P(y|~x) ] / P(y|~x)
= relative risk reduction
= excess relative risk = relative effect
F(y:x) = (P(y|x) - P(y|~x)) / (P(y|x) + P(y|~x)) = factual support
= difference / ( 2*sum/2 ) my 1st interpretation
!! = (difference/2) / arithmetic average of both P(.|.)'s
= deviation / arithmetic average
!! = (slope of y on x ) / (P(y|x) + P(y|~x)) my 2nd interpretation
= beta( y:x ) / (P(y|x) + P(y|~x)) , -1 <= beta(:) <= 1 ,
= [ cov(x,y)/var(x)] / (P(y|x) + P(y|~x))
= (Pxy -Px*Py)/(Px*(1-Px)) / (P(y|x) + P(y|~x))
= rescaled B(y:x) from [0..1..oo) to [-1..0..1] 3rd interpretation
= rescaled W(y:x) from (-oo..0..oo) to [-1..0..1] 4th interpretation
= is a combined (mixed) measure scaled [-1..0..1] of :
- how much (y implies x) , yielding +1 iff 100% implication
- how much y and x are independent, 0 iff 100% independence
= [ ad - bc ]/[ ad + bc + 2ac ]
= CF2(y:x) = ( P(x|y) - Px )/( Px*(1 - P(x|y)) + P(x|y)*(1 - Px) ) is
a certainty factor in MYCIN at Stanford rescaled by D. Heckerman, 1986,
which I recognized to be F(y:x) via my:
= (Pxy - Px*Py)/(Pxy + Px*Py - 2*Px*Pxy) my 5th interpretation
= [ RR(y:x) -1]/[ RR(y:x) +1 ] my 6th interpretation
F0(:) = [ F(:) + 1 ]/2 is F(:) linearly rescaled to [0..1/2..1] :
F0(y:x) = P(y|x)/[ P(y|x) + P(y|~x) ]
all these measures are changing co-monotonically, and they all measure
- how much the event y implies the x event. This is the directed ie
oriented ie asymmetric component of these measures;
- how much x, y are stochastically dependent ie covariate ie associate.
This is the symmetrical aspect or an association.
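Several of the listed interpretations of F(y:x) can be cross-checked on one
made-up 2x2 table (counts mine, for illustration):

```python
a, b, c, d = 30, 10, 20, 40          # made-up counts of a 2x2 table
n = a + b + c + d
Px, Py, Pxy = (a + b) / n, (a + c) / n, a / n

P1 = Pxy / Px                        # P(y|x)
P2 = (Py - Pxy) / (1 - Px)           # P(y|~x)
F  = (P1 - P2) / (P1 + P2)           # Kemeny's factual support F(y:x)
RR = P1 / P2

print(F)
print((RR - 1) / (RR + 1))                        # 6th interpretation
print((a*d - b*c) / (a*d + b*c + 2*a*c))          # count form
print((Pxy - Px*Py) / (Pxy + Px*Py - 2*Px*Pxy))   # 5th interpretation
print(P1 / (P1 + P2), (F + 1) / 2)                # F0(y:x) both ways
```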
No contortion is needed to have events x, y which are almost independent,
if we measure independence by Pxy/(Px*Py) or by (Pxy -Px*Py)/(Pxy +Px*Py)
or by (Pxy - Px*Py)/min(Pxy, Px*Py), and at the same time one event will
strongly imply the other event. But this PARADOX depends on the sensitivity
(wrt the deviations from exact independence) of the measure. Hence our
choice of a single measure should depend on our preference for what the
measure should stress: an implication, or (a deviation from) independence.
E.g. K. Popper's corroboration C(:) stresses dependence over implication,
while Kemeny's factual support F(:) stresses implication over dependence,
but neither author says so, nor has anybody noticed this so far.
Of course, we could always use two measures, one for an implication, and
the other for a deviation from independence, but the Holy Grail is a
single formula, which will inevitably combine ie mix these two aspects,
because they are almost arbitrarily (but not 100%) mixable.
A disclaimer: however impossible it may be to find the Excalibur formula
for causality, I believe it to be possible to identify formulas which come
closer to the Holy Grail than other formulas. I consider the notions of
stochastic DEPENDENCE together with probabilistic IMPLICATION (or my
INHIBITION) and SURPRISE as the key building blocks because they are well
defined (though not understood enough by too many :-(
A claimer: My goal here is to generate knowledge & understanding of the
best & the brightest inferencing formulas for what i call an INDICATION.
The formulas must provide clear opeRational interpretations, ie they must
make sense out of the data from which they were computed. There is no lack
of formulas which somehow capture an association between events. In fact
there are too many of them, with too many pros & cons.
-.-
+Interpreting a 2x2 contingency table wrt RR(:) = relative risk = risk ratio:
a , b the counts a, d are hits, and b, c are misses
c , d ie a, d concord, and b, c discord
and a+b+c+d = n = the total count of events.
!! It is useful to view such a table as a Venn diagram formed by two
rectangles, one horizontal and one vertical, with partial overlap
measured by n(x,y) = the joint count ie co-occurrence of x and y :
______________________________
| | |
| a = n( x,y) | b = n( x,~y) | n( x) = a+b
| | |
|-------------|--------------.
| | .
| c = n(~x,y) | d = n(~x,~y) . n(~x) = c+d
|_____________|...............
a+c = n(y) b+d = n(~y) N = a+b+c+d
but nothing prevents you from viewing the overlap in any of the 4 corners.
Feel free to rotate or to transpose this standard table at your own peril.
Typical semantics (one quadruple per line) may be, eg:
x ~x y ~y
test+ says K.O. test- says ok disorder not this disorder
exposed               unexposed        illness          not this illness
risk factor present risk fact.absent outcome present outcome absent
! treatment control non-case case
alleged cause cause absent effect not this effect
symptom present symptom absent possible cause not this cause
conjecture,hypothesis evidence observed
so be careful with assigning your own semantics ! We can avoid mistakes
if we stick here to the first four interpretations just listed.
The 2x2 probabilistic contingency table summarizes the dichotomies :
| y ~y | marginal sums
-----|-----------------------------------|-------------------------
x | a/n = P( x,y) , b/n = P( x,~y) | P( x) = (a + b)/n
 ~x  | c/n = P(~x,y) ,  d/n = P(~x,~y)  | P(~x) = (c + d)/n = f/n
-----|-----------------------------------|-------------------------
Sums |    P(y)             P(~y)         | 1 =  P( x,y) +P( x,~y)
     |  = (a+c)/n        = (b+d)/n = i/n |     +P(~x,y) +P(~x,~y)
In my squashed Venn diagram in 1D-land, the joint occurrences of (x&y)
ie (x,y) ie "a" are marked by ||| = a/N = Pxy :
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
ffffffffffffffxxxxxxxxxxxxxxxxxxxxxxxxxxxxxffffffffffffffffffffffffffff
-------------------------aaaaaaaaaaaaaaaaaa----------------------------
iiiiiiiiiiiiiiiiiiiiiiiiiyyyyyyyyyyyyyyyyyyyyyyyyyyyyiiiiiiiiiiiiiiiiii
11111111111111111111 A limited 1-verse of discourse 1111111111111111111
---- 1-Px ----xxxxxxxxxxxxxxxxx Px xxxxxxxx---------------- 1-Px ------
---- 1-Pxy --------------|||||| Pxy |||||||---------------- 1-Pxy -----
---- 1-Py ---------------yyyyyy Py yyyyyyyyyyyyyyyyy------ 1-Py ------
From the 4 counts ( a+b+c+d = n ) we easily obtain all P(.)'s.
From the 3 proportions or probabilities Px, Py and Pxy we can obtain
any other P(.,.) and P(.|.) containing any mix of (non)negations,
but without raw counts we cannot compute eg confidence interval CI.
The legality of P's (given or generated) can be checked by the following
Bonferroni / Frechet inequalities:
Max[ Px , Py ] <= P(x or y) <= min[ 1, Px + Py ]
Max[ 0, Px + Py - 1 ] <= Pxy <= min[ Px , Py ]
the lhs of which is the Bonferroni inequality, which becomes nontrivial
only if Px + Py > 1, in which case there will be Pxy > 0.
Pxy <= min[ P(x|y) , P(y|x) ] is my own simple inequality, also useful
for checking, and if violated then for
trimming of eg smoothed estimates.
The inequality for Pxy divided by Py, or by Px, yields my favorites :
Max[ 0, (Px + Py - 1)/Py ] <= P(x|y) <= min[ Px/Py , 1 ]
Max[ 0, (Px + Py - 1)/Px ] <= P(y|x) <= min[ Py/Px , 1 ]
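These bounds make a small legality checker; a sketch (the function name and
the tolerance eps are mine):

```python
def legal(Px, Py, Pxy, eps=1e-12):
    """Check the Frechet / Bonferroni bounds for a pair of events x, y."""
    if not (0.0 <= Px <= 1.0 and 0.0 <= Py <= 1.0):
        return False
    lower = max(0.0, Px + Py - 1.0)   # Bonferroni lower bound on Pxy
    upper = min(Px, Py)               # Frechet upper bound on Pxy
    return lower - eps <= Pxy <= upper + eps

print(legal(0.4, 0.5, 0.3))    # ok: 0 <= 0.3 <= min(0.4, 0.5)
print(legal(0.4, 0.5, 0.45))   # violates Pxy <= min(Px, Py)
print(legal(0.9, 0.8, 0.6))    # violates Pxy >= Px + Py - 1 = 0.7
```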
For the union U of m events x_i with probabilities Pi we get :
the simple Max_i:[Pi] <= P( U_i:[x_i] ) <= min( 1, Sum_i:[Pi] )
and if we know P(j,k) ie Pjk of all pairs of joint events then:
Sum_i:[Pi] - SumSum_j<k:[ Pjk ] <= P( U_i:[x_i] ) .
I have combined both inequalities into a SuperBonferroni principle :
!!!
Max( Max_i:[Pi] , Sum_i:[Pi] - SumSum_j<k:[ Pjk ] ) <= P( U_i:[x_i] )
                                                    <= min( 1, Sum_i:[Pi] )
-.-
+Properties of the relative risk RR(y:x) = (Pxy/(Py - Pxy))*(1-Px)/Px :
+ Odds(x|y) >=< 1 where >=< stands for >, =, <, >=, <=, <>
ie P(x,y) >=< P(~x,y)
ie P(x|y) >=< P(~x|y) , so that eg for the > we say that :
x occurs More Likely Than Not if y occurred, or we say equivalently :
x occurs More Likely Than Not with y , which both capture our thinking.
+ DE/INcreases with IN/DEcreasing Px; this is meaningful, because our
!!! surprise value of x DIScounts the "triviality effect" of Px =. 1 :
!! if Px =. 1 then Pxy = Py too easily occurs, and RR(y:x) = 1/0 = oo.
!! If Py = 1 then Pxy = Px and P(y,~x)=P(~x) hence RR(y:x) = 1/1 = 1,
indeed, if all are ill, there can be no risk of becoming ill.
Surprise value of x DE/INcreases with IN/DEcreasing Px in general;
(1 - Px), 1/Px, hence also my (1 - Px)/Px measures our surprise by x.
!! My new measure P(x|y)*(1 - Px) = (y implies x)*(linearSurprise by x)
= [0..1]*[0..1] = [0..1]
is simpler, but carries less meanings than RR(y:x).
! + is DOMINAted by the factor 1/(Py - Pxy) for a given exposure Px ;
this factor measures how much (y implies x). From this and from
Pxy <= min(Px, Py), but not from "SurpriseBy", follows :
!!! if Py < Px then RR(y:x) >= RR(x:y) ie LR(x:y),
!!! if Py > Px then RR(y:x) <= RR(x:y) ie LR(x:y), where
the = may occur for x,y independent ie RR(:)=1,
or if Pxy=0=RR(:), as my program Acaus3 asserts.
That "SurpriseBy" is not decisive wrt RR(y:x) >=< RR(x:y), follows
from the comparison of:
(y implies x) = 1/(Py - Pxy) vs (1 - Px)/Px = SurpriseBy(x)
ie (1 - 0)/(Py - Pxy) vs (1 - Px)/(Px - 0).
Lets write Px = k*Py to reduce RR(:) to just 2 variables Pxy, Py,
and lets compare RR(y:x) with RR(x:y) ie LR(x:y) :
(Pxy/(Py - Pxy))*(1 - Px)/Px >=< (Pxy/(Px - Pxy))*(1 - Py)/Py
ie:
(k*Py - Pxy)/(Py - Pxy) >=< k*(1 - Py)/(1 -k*Py) = Dy in shorthand
!! Pxy >=< Py*(Dy - k)/(Dy - 1) =
Py*[(1 - Py)/(1 - k*Py) - 1]/[(1 - Py)/(1 - k*Py) - 1/k]
Checked:
Solving for k the RR(y:x) = RR(x:y), where Px = k*Py , yields a
quadratic equation with two distinct real roots k1, k2 :
k1 = 1 ie Px = Py which obviously is correct
k2 = Pxy/(Py*Py) ie Py*Px = Pxy which holds for independent x,y .
+ is oo ie infinite for Py - Pxy = 0 ie P(~x,y) = 0 ie Pxy = Py
in which case y implies x fully, because then y is a SubSet of x,
ie whenever y occurs, x occurs too ; draw a Venn diagram.
+ is ASYMMETRICAL ie directed ie oriented wrt the events x, y (this unlike
correlation coefficients and other symmetrical association measures)
. is a relative measure, a ratio (while differences are absolute measures
which may mislead us since eg 0.93 - 0.92 = 0.03 - 0.02)
- is a combined measure which inseparably MIXes measuring of two key
properties:
- stochastic dependence, which is a symmetrical property, and
- probabilistic implication, which is an Asymmetrical property, which
both I see as necessary conditions for a possible CAUSAL relationship
between x and y. Hence RR(:) INDICATES potential CAUSAL TENDENCY;
+ has range [0..1..oo) with 3 opeRationally interpretable fixed points:
RR(y:x)
= oo iff Pxy = Py ie (y implies x), ie possibly (x causes y) ;
= 1 iff y and x are fully independent ie iff Pxy = Px*Py
= 0 iff (Pxy = 0) and (0 < Px < 1) ie disjoint events x,y
ie RR(:) = 0 means disjoint ie mutually exclusive events x,y
0 < RR(:) < 1 means negative dependence or correlation of x,y
1 < RR(:) <= oo means positive dependence or correlation of x,y
!! ie RR(:) has a huge unbounded range for positively dependent x,y vs
RR(:) has a small bounded range for negatively dependent x,y ,
hence both subranges are not comparable; the positive subrange is
!! much more SENSITIVE than the negative subrange. In this respect
!! F(:) is BALANCED but has no simple interpretation of risk ratio.
    = 0/0 if (Px = 0 or Py = 0 hence Pxy = 0 too).
= 1 if (Py = 1 hence Pxy = Px, P(x|y) = Px/1 ie independent x,y)
then RR(y:x) = (1-Px)/(1-Px) = 1 ie independence.
!! = 1 if (Px = 1 hence Pxy = Py, P(y|x) = Py/1 ie independent x,y)
then RR(y:x) = Pxy/(0/0) = Py/(0/0) numerically, which may
!! seem to be undetermined, but as just shown, Px = 1 means that P(y|x)
does not depend on Px, ie that x,y are independent (find 0/0 ).
RR(y:x) = P(y|x)/P(y|~x) where in many (not all) medical applications
y is a health disorder, and x is a symptom. But both { Lusted 1968 }
and { Bailey 1965, p.109, quoted: } noted that :
"P(y|x) will vary with circumstances (social, time, location), however
!! P(x|y) will have often a constant value because symptoms are a
function of a disease processes themselves, and therefore
relatively INdependent of other external circumstances. ...
so we could collect P(x|y) on a national scale, and collect Py on
a [ local/individual ] space-time scale.". The [loc/indiv] is mine.
Therefore we should compute RR(y:x) indirectly via Bayes rule ie via
P(y|x) = Py*P(x|y)/Px where P(x|y) is "global" and more stable.
+ RR(y:x) has an important advantage over its co-monotonic but nonlinear
transform F(y:x). The simple proportionality of RR(y:x) can be used to
(dis)prove confounding. Good explanations of confounding are rare, the
best introduction is in { Schield 1999 } where on p.3 we shall recognize
Cornfield's condition
P(c|a)/P(c|~a) > P(e|a)/P(e|~a) as RR(c:a) > RR(e:a)
and Fisher's
P(a|c)/P(a|~c) > P(a|e)/P(a|~e) as RR(a:c) > RR(a:e).
Be reminded that "contrary to the prevailing pattern of judgment",
as { Tversky & Kahneman 1982, p.123 } point out, it holds, in my more
general formulation :
( P(y|x) >=< P(y|~x) ) == ( P(x|y) >=< P(x|~y) ).
Hence also
( RR(y:x) >=< 1 ) == ( RR(x:y) >=< 1 ),
where >=< stands for a consistently used >, =, <, >=, <=, <> .
+ For 3 more properties see { Schield, 2002, p.4, Conclusions }.
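The fixed points and the Py vs Px ordering claims above can be probed
numerically (a sketch with made-up probabilities, not a proof):

```python
def RR(Px, Py, Pxy):
    """RR(y:x) = P(y|x)/P(y|~x) from the 3 basic probabilities."""
    return (Pxy / Px) / ((Py - Pxy) / (1 - Px))

# fixed point RR(:) = 1 for independent x, y ie Pxy = Px*Py :
print(RR(0.4, 0.5, 0.4 * 0.5))

# Py > Px implies RR(y:x) <= RR(x:y); swapping Px, Py gives RR(x:y) :
Px, Py, Pxy = 0.4, 0.5, 0.3
print(RR(Px, Py, Pxy), RR(Py, Px, Pxy))
```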
More on probabilities :
Keep in mind that it always holds:
P(.) + P(~.) = 1 eg P(y|x) + P(~y|x) = 1 ; hence also:
P(x or y) + P(~(x or y)) = 1 from which via DeMorgan's rule follows:
P(x or y) + P(~x,~y) = 1
P(x or y) + Pxy = Px + Py ; the overlap Pxy is counted twice in Px + Py,
                            as a Venn diagram shows
hence P(~x,~y) = 1 -(Px + Py - Pxy)
P(~x,~y) = P(~(x or y)) by DeMorgan's rule; he died in 1871; "his" rule
has been clearly described by Ockham aka Occam
aka Dr. Invincibilis in Summa Logicae in 1323 !
= 1 - P(x or y) = 1 - (Px + Py - Pxy) and, surprise :
! Pxy - Px*Py = Pxy*P(~x,~y) - P(x,~y)*P(~x,y) from 2x2 table's diagonals
{ with / the rhs would be Odds ratio OR , find below }
= Pxy*(1 -Px -Py +Pxy) - (Px -Pxy)*(Py -Pxy)
= cov(x,y) = covariance of 2 "as-if random" events x,y , or
indicator events aka binary/Bernoulli events.
from which follows for independent events only :
iff Pxy - Px*Py = 0 ie cov(x,y) = 0 ie
iff Pxy = Px*Py (this is equivalent to 16 other equalities)
! then Pxy*P(~x,~y) = P(x,~y)*P(~x,y) ie products on 2x2 table's diagonals
are equal; this I call the 17th condition of independence (find 17 below),
which is equivalent (==) to any of the other 4 + 3*(8/2) = 16
mutually equivalent (==) conditions of independence, like eg:
( Pxy = Px*Py ) == ( P(x|y) = Px ) == ( P(y|x) = Py ) ==
( P(y|x) = P(y|~x) ) == ( P(~y|~x) = P(~y|x) ) ==
( P(x|y) = P(x|~y) ) == ( P(~x|~y) = P(~x|y) ) == etc
Only for independent x, y it holds, via Occam-DeMorgan's rule :
P(~(~x,~y)) = 1 - (1 -Px)*(1 -Py) = Px + Py - Px*Py = P(x or y) for indep.
More of the mutually equivalent conditions of independence are obtained by
changing x into ~x, and/or y into ~y, or vice versa. Any consistent mix
of such changes will produce an equivalent condition of independence for
events, negated or not, simply because AN EVENT IS AN EVENT IS AN EVENT
(with apologies to Gertrude Stein who spoke similarly about a rose :-)
Changing the = into < or > in any of the 17 conditions of independence
will create corresponding and mutually equivalent conditions of dependence
which obviously are necessary but far from sufficient conditions for a
causal relation between 2 events x, y. For example :
( Pxy > Px*Py ) == ( P(y|x) > Py ) == ( P(x|y) > Px )
!! == ( P(y|x) > P(y|~x) ) == ( P(x|y) > P(x|~y) ) == etc.
From all these 17 inequalities of the generic form lhs > rhs we can obtain
some 6*17= 102 measures of DEPENDENCE simply by COMPARING or CONTRASTING :
Da = lhs - rhs are ABSOLUTE DEPENDENCE measures, eg P(e|h) - P(e|~h)
Da is scaled [0..1] for lhs > rhs , or [-1..1] in general,
with 0 iff x,y are fully independent
Dr = lhs / rhs are RELATIVE DEPENDENCE measures, eg P(e|h) / P(e|~h)
Dr is scaled [0..1..oo) with 1 iff x,y are fully independent.
Rescalings : log(lhs / rhs) is scaled (-oo..0..oo) in general;
(lhs - rhs )/(lhs + rhs) is scaled [-1..0..1], I call it kemenization,
= (lhs/rhs -1)/(lhs/rhs +1);
and lhs/(lhs + rhs) is scaled [0..1/2..1].
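These rescalings pack into tiny helpers (the names Da, Dr, kemenize are
mine):

```python
def Da(lhs, rhs):       return lhs - rhs                  # absolute
def Dr(lhs, rhs):       return lhs / rhs                  # relative
def kemenize(lhs, rhs): return (lhs - rhs) / (lhs + rhs)  # to [-1..0..1]

lhs, rhs = 0.75, 1.0 / 3.0    # eg P(y|x), P(y|~x); made-up values
print(Da(lhs, rhs), Dr(lhs, rhs), kemenize(lhs, rhs))
print((Dr(lhs, rhs) - 1) / (Dr(lhs, rhs) + 1))  # = kemenize(lhs, rhs)
print(lhs / (lhs + rhs))                        # the [0..1/2..1] rescaling
```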
Odds(.) = P(.)/(1 - P(.)) = 1/( 1/P(.) - 1 )
P(.) = Odds(.)/(1 + Odds(.)) = 1/( 1/Odds(.) + 1 )
P(x| y)/P(~x| y) = P(x| y)/(1 - P(x| y)) = Odds(x| y)
P(x|~y)/P(~x|~y) = P(x|~y)/(1 - P(x|~y)) = Odds(x|~y)
P(x| y)/P( x|~y) = B(x: y) = LR(x: y) = LR+ is a likelihood ratio
where B(x: y) is a simple Bayes factor = RR(x:y)
Bayes rule in odds-likelihood form :
Posterior odds on x if y = Prior odds * Likelihood ratio
= Odds(x|y) = Odds(x) * LR(y:x)
= P(x|y)/ P(~x|y) = ( Px/P(~x) ) * ( P(y|x)/P(y|~x) )
= P(x|y)/(1 -P(x|y)) = Px/(1 -Px) * ( P(y|x)/P(y|~x) )
= 1/(1/P(x|y) - 1)
In our 2x2 contingency table we have Odds ratio OR :
OR = Pxy*P(~x,~y) / [ P(x,~y)*P(~x,y) ] = (a/b)/(c/d) = a*d/(b*c)
SeLn(OR) = sqrt( 1/a + 1/b + 1/c + 1/d ) = standard error of ln(OR)
cov(x,y) = Pxy*P(~x,~y) - [ P(x,~y)*P(~x,y) ]
= Pxy - Px*Py
but OR <> Pxy /(Px*Py) , except when Pxy = Px*Py , or Pxy=0 .
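A quick numeric check of OR, its standard error, and the diagonal form of
cov(x,y), on a made-up table (counts mine):

```python
import math

a, b, c, d = 30, 10, 20, 40        # made-up 2x2 counts
n = a + b + c + d
Pxy, Pxny, Pnxy, Pnxny = a / n, b / n, c / n, d / n
Px, Py = (a + b) / n, (a + c) / n

OR = (a * d) / (b * c)
se_ln_or = math.sqrt(1/a + 1/b + 1/c + 1/d)   # standard error of ln(OR)

cov_diag = Pxy * Pnxny - Pxny * Pnxy          # diagonal form of cov(x,y)
print(OR, se_ln_or)
print(cov_diag, Pxy - Px * Py)                # both cov forms agree
```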
Relative risks RR(:) for the following 2x2 contingency table:
| e | ~e | e = effect present; ~e = effect absent
----|-----|-----|------
h | a | b | a+b h = hypothetical cause present (eg tested+ )
~h | c | d | c+d ~h = eg unexposed to environment (eg tested- )
----|-----|-----|------
| a+c | b+d | n
RR( e: h) = P(e|h)/ P(e|~h) = (a/(a+b))/(c/(c+d)) = a*(c+d)/((a+b)*c)
= (Peh/Ph)/((Pe-Peh)/(1-Ph)) from which we see that RR(e:h)
! = oo if Pe=Peh ie (a+c)=a ie P(e,~h)=0 ie c=0
Now recall that P(e|~h) + P(~e|~h) = 1, and get:
RR(~e:~h) = P(~e|~h)/P(~e|h) = (d/(c+d))/(b/(a+b))
= (1 - P(e|~h))/(1 - P(e|h))
= (1 - c/(c+d))/(1 - a/(a+b)) = d*(a+b) /(b*(c+d))
! = oo if Peh=Ph ie a=(a+b) ie P(h,~e)=0 ie b=0
RR( h: e) = P(h|e)/ P(h|~e) = (a/(a+c))/(b/(b+d)) = a*(b+d)/((a+c)*b)
= (Peh/Pe)/((Ph-Peh)/(1-Pe)) from which we see that RR(h:e)
! = oo if Ph=Peh ie (a+b)=a ie P(h,~e)=0 ie b=0
RR(~h:~e) = P(~h|~e)/P(~h|e) = (d/(b+d))/(c/(a+c))
= (1 - P(h|~e))/(1 - P(h|e))
= (1 - b/(b+d))/(1 - a/(a+c)) = d*(a+c) /(c*(b+d))
! = oo if Peh=Pe ie a=(a+c) ie P(e,~h)=0 ie c=0
ie:
for c=0 are RR( e: h) = oo = MAXImal = RR(~h:~e)
for b=0 are RR( h: e) = oo = MAXImal = RR(~e:~h)
for a=0 is RR( e: h) = 0 = minimal = RR( h: e)
for d=0 is RR(~h:~e) = 0 = minimal = RR(~e:~h)
! RR(e:h)*RR(~e:~h) = RR(h:e)*RR(~h:~e) = Peh*P(~e,~h)/( P(e,~h)*P(h,~e) )
= Peh*P(~e,~h)/( P(~e,h)*P(~h,e) )
which clearly are identical.
While these equations hold in general, you might like to meditate upon
why the 17th (find above) condition of independence of x, y consists of
the same components.
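The product identity of the four RRs, and its equality with the ratio of
the table's diagonals, check out numerically (made-up counts):

```python
a, b, c, d = 30, 10, 20, 40      # made-up 2x2 counts: h in rows, e in cols

RR_e_h   = (a / (a + b)) / (c / (c + d))   # RR( e: h)
RR_ne_nh = (d / (c + d)) / (b / (a + b))   # RR(~e:~h)
RR_h_e   = (a / (a + c)) / (b / (b + d))   # RR( h: e)
RR_nh_ne = (d / (b + d)) / (c / (a + c))   # RR(~h:~e)

print(RR_e_h * RR_ne_nh, RR_h_e * RR_nh_ne, (a * d) / (b * c))
# all three products equal the ratio of the 2x2 table's diagonals
```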
If you search www for "relative risk" , you will get 160k hits;
if you search www for "relative risk" RR , you will get 28k hits;
if you search www for "confidence interval" CI , you will search well.
-.-
+More tutorial notes on probabilistic logic, entropies and information :
Stan Ulam, the father of the H-device (Ed Teller was the mother) used
to say that "Our fortress is our mathematics." I say that here
"Our fortress is our logic." Elementary probability theory is strongly
isomorphous with the set theory, which is strongly isomorphous with logic.
There are 16 Boolean functions of 2 variables, of which 8 are commutative
wrt both variables. For the purposes of inferencing we should use ORIENTED
ie DIRECTED ie ASYMMETRIC functions only. From the remaining 8 asymmetric
logical functions 4 functions are of 1 variable only, so that only 4
asymmetric functions remain for consideration : 2 implications and 2
inhibitions, which are pairwise mutually complementary. ASYMMETRY is
!! easily obtained even from symmetrical measures of association (or
dependence) by normalization with a function of one variable only, eg :
(Pxy -Px*Py)/(Px*(1-Px)) is 0 iff x,y are independent
= cov(x,y)/var(x)
= beta(y:x)
= slope of a probabilistic regression line Py = beta(y:x)*Px + alpha(y:x)
= (P(y|x) - Py)/(1-Px)
= P(y|x) - P(y|~x) is the numerator of F(y:x) below
= ARR(y:x) = absolute risk reduction (or increase if negative).
Many measures of information are easily obtained by taking expected value
of either differences or ratios of the lhs and rhs taken from a dependence
inequality lhs > rhs mentioned above. For example we could create :
SumSum[ Pxy * Dr(y:x) ] where Dr is a relative dependence measure like
eg RR(y:x), but a single Dr = oo would make the
whole SumSum = oo, hence it is better to use
SumSum[ Pxy * Da(y:x) ] where Da is an absolute dependence measure, eg
SumSum[ Pxy *( P(y|x) - P(y|~x) ) ], or SumSum[ Pxy * F(y:x) ] .
Knowing that ( P(y|x) - P(y|~x) ) = beta(y:x) = dPy/dPx, and
knowing that Integral[ dx*( dPx/Px)^2 ] = Fisher's information, I did
realize that my :
!! SumSum[ Pxy * (F(y:x))^2 ] could serve as an
quasi-Fisher-informatized RR [ find "my 1st interpretation" of F(y:x) ].
A particularly nice & meaningfully asymmetrical (wrt variables X, Y)
information is my favorite :
Cont(X;Y) = Cont(X) - Cont(X|Y) == Gini(X;Y) = Gini(X) - Gini(X|Y)
= Sum[ Px*(1 - Px) ] - SumSum[ Pxy*(1-P(x|y)) ]
= 1 - Sum[ (Px)^2 ] - ( 1 - SumSum[ Pxy*P(x|y) ] ) hence :
!! = SumSum[ Pxy*( P(x|y) - Px ) ] my semantically clearest form
!! = Expected[ P(x|y) - Px ] ie average dependence measured
by abs. difference P(x|y) - Px , which is asymmetrical wrt x, y, and
is but 1 of the 2*17 = 34 possible simple measures of association;
= SumSum[ (square(Pxy - Px*Py)) / Py ] , compare it with Phi^2
1-Cont(X) = Sum[ Px*Px] = E[Px] = expected probability of variable X
= expected probability of success in guessing events x
= long-run proportion of correct predictions of events x
= concentration index by Gini/Herfindahl/Simpson (S. was a WWII
codebreaker like I.J.Good and Michie; they called it a "repeat rate")
Cont(X) = 1 - Sum[(Px)^2] = expected improbability of variable X
= expected error or failure rate eg in guessing events x
0 <= Cont(;) and 1 - Cont(;) <= 1 ie they saturate like P(error), while
Shannon's entropies have no upper bound.
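The asymmetry of Cont(;) vs the symmetry of Shannon's I(:) shows on a
made-up 2x2 joint distribution (numbers mine):

```python
import math

P  = [[0.3, 0.1],        # made-up joint P(x,y): rows = X, columns = Y
      [0.2, 0.4]]
PX = [sum(row) for row in P]             # marginals of X
PY = [sum(col) for col in zip(*P)]       # marginals of Y

def cont(P, PX, PY):
    """Cont(X;Y) = SumSum[ Pxy*( P(x|y) - Px ) ]."""
    return sum(P[i][j] * (P[i][j] / PY[j] - PX[i])
               for i in range(2) for j in range(2))

def mutinf(P, PX, PY):
    """Shannon's I(X:Y) = SumSum[ Pxy*log2( Pxy/(Px*Py) ) ]."""
    return sum(P[i][j] * math.log2(P[i][j] / (PX[i] * PY[j]))
               for i in range(2) for j in range(2))

PT = [list(col) for col in zip(*P)]      # transposed joint, roles swapped
print(cont(P, PX, PY), cont(PT, PY, PX))       # Cont(X;Y) <> Cont(Y;X)
print(mutinf(P, PX, PY), mutinf(PT, PY, PX))   # I(X:Y) = I(Y:X)
```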
Btw, log(.) fits with the physiological Weber-Fechner law. E.g. sound is
measured on a log-scale in decibels, and so is the pH-factor (0..7..14 =
max. alkalic). Logs work even in psychenomics, as you will feel less
than twice as happy after your salary or profits were doubled :-) For more
on infotheory in physiology see the nice book { Norwich 1993 }.
Cont(;) has been called many names, eg quadratic entropy or parabolic
entropy. Cont(;) gives provably better, sharper results than Shannon's
entropy for tasks like eg pattern classification in general, and
diagnosing, identification, prediction, forecasting, and discovery of
! causality in particular. These tasks are naturally ASYMMETRICAL requiring
Cont(X;Y) <> Cont(Y;X), while Shannon's mutual information
I(X:Y) = I(Y:X) = SumSum[ Pxy*( log(Pxy/(Px*Py)) ]
is clearly symmetrical wrt the variables X, Y.
Cont(X:Y)/Cont(X) = TauB in { Goodman & Kruskal, Part 1, 1954, p.759-760 }
where they semantized their TauB as "relative decrease in the proportion
! of incorrect predictions". See my Hint 2 & Hint 7 on WWW.MATHEORY.COM .
{ Agresti 1990, p.75 } tells us that for 2x2 contingency tables Kruskal's
TauB equals Phi^2 :
X^2 = mean square contingency { Kendall & Stuart, chap.33, p.555-557 }
= n * SumSum[ square(Pxy - Px*Py)/(Px*Py) ] is my probabilistic form
Pearson's contingency coefficient = sqrt[ X^2 / ( n + X^2 ) ].
Phi^2 = (X^2)/n   compare it with the last form of Cont(X;Y)
= SumSum[(square(Pxy - Px*Py))/(Px*Py) ] in my probabilistic form
= SumSum[ Pxy * Pxy/(Px*Py)] - 1 is a symmetrical expected
value like Shannon's mutual information :
I(X:Y) = SumSum[ Pxy*log(Pxy/(Px*Py)) ] = I(Y:X) in general; in particular
     = -0.5*ln(1 - (corr(X,Y))^2) iff X, Y are continuous Gaussian variables
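Both probabilistic forms of Phi^2 coincide, as a numeric check on a made-up
2x2 joint distribution shows:

```python
P  = [[0.3, 0.1],        # made-up joint probabilities Pxy
      [0.2, 0.4]]
PX = [sum(row) for row in P]
PY = [sum(col) for col in zip(*P)]

phi2_dev = sum((P[i][j] - PX[i] * PY[j]) ** 2 / (PX[i] * PY[j])
               for i in range(2) for j in range(2))
phi2_exp = sum(P[i][j] * P[i][j] / (PX[i] * PY[j])
               for i in range(2) for j in range(2)) - 1.0
print(phi2_dev, phi2_exp)    # both forms of Phi^2 give the same value
```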
[ 1 - Cont(X) ]/Pz = E[Px]/Pz = surprise index for the event z within the
variable X, as defined in { Weaver, Science and Imagination }. In 1949
Weaver co-authored Shannon's The Mathematical Theory of Communication.
Cont(.) was intended to measure "semantic information content" SIC. The key
idea is that the LOWER the probability of an event, the MORE possibilities
it ELIMINATES, EXCLUDES, FORBIDS, hence MORE its occurrence SURPRISEs us.
{ Kemeny 1953, p.297 } refers this insight to { Popper 1972, pp.270, 399,
400, 402 mention P(~x) = 1-Px as (semantic information) content SIC },
{ Bar-Hillel 1964, p.232 } quotes "Omnis determinatio est negatio", ie
Determinatedness is negation, ie "Bestimmen ist verneinen" by Baruch
Spinoza (1632-1677), in 1656 excommunicated from the synagogue in
Amsterdam. Btw, Occam was excommunicated from the Church in 1328 :-).
Stressing elimination of hypotheses or theories is Popperian refutationalism.
In principle any decreasing function of Px will do, but (1-Px) is surely
the simplest one possible, simpler than Shannon's log(1/Px) = -log(Px).
By combining (1 - Px) with 1/Px, I constructed SurpriseBy(x) = (1 - Px)/Px
only to find it implicit or hidden inside RR(y:x), after my rearrangement
of atomic factors in RR(:). Note that Sum[ Px*1/Px] would not work :-)
For more on Cont(;) see { Kahre, 2002 } in general, and my (Re)search
hints Hint2 & Hint7 there on pp.501-502 in particular (also on
www.matheory.info ). I could write a book(let) on Cont(.) but have to cut
it here.
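The symmetry of Shannon's I(X:Y) versus the Phi^2-like form of Cont(;) can be
checked numerically. A minimal Python sketch, assuming the joint distribution
is given as a dict of cell probabilities (helper names are mine, not from any
library):

```python
import math

def mutual_information(pxy):
    # I(X:Y) = SumSum[ Pxy*log(Pxy/(Px*Py)) ]; pxy maps (x,y) -> probability
    px, py = {}, {}
    for (x, y), p in pxy.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in pxy.items() if p > 0)

def phi2(pxy):
    # SumSum[ (Pxy - Px*Py)^2 / (Px*Py) ], the quadratic analogue above
    px, py = {}, {}
    for (x, y), p in pxy.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    return sum((p - px[x] * py[y]) ** 2 / (px[x] * py[y])
               for (x, y), p in pxy.items())

joint = {(0, 0): 0.5, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.2}  # hypothetical
swapped = {(y, x): p for (x, y), p in joint.items()}
# I(X:Y) = I(Y:X): symmetrical wrt X, Y
assert abs(mutual_information(joint) - mutual_information(swapped)) < 1e-12
```

At full independence both quantities vanish, which matches their role as
dependence measures.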
.-
Boolean logic is strongly isomorphic with the set theory, wherein
X implies Y whenever X is a subset of Y. The probability theory is
also strongly isomorphic with the set theory. Reading < both as
"a subset of" and as "less than", and noting that
[ P(x|y) > P(y|x) ] == [ Py < Px ] and v.v. , we see that
!!! Py < Px makes (y implies x) ie (x necessary for y) more plausible, while
!!! Px < Py makes (x implies y) ie (y necessary for x) more plausible,
which can be easily visualized with a Venn (aka pancakes or pizza) diagram.
0 <= P(y|x) = Pxy/Px <= 1 measures how much the event x implies y
with maximum = 1 for Pxy = Px ;
0 <= P(x,~y) = Px - Pxy <= 1 measures how little the event x implies y
or: 1 - P(y|x) = (Px - Pxy)/Px measures how little the event x implies y
or: 1/(Px - Pxy) measures how much the event x implies y
Also recall that the Bayesian probability of a j-th hypothesis x_j ,
given a vector of cue events y_c ie y..y, is (under the assumption of
independence) computed by the Bayes chain rule formula based on the
product of P(y|x) , the higher the more probable the hypothesis x_j :
P(x_j, y..y) =. P(x_j)*Product_c:( P(y_c | x_j) ) ; dont swap y, x !
A cue event = a feature/attribute/symptom/evidential/test event
" x implies y " in plaintalk :
" If x then y " is a deterministic rule, which in plain English says that
" x always leads to y " ie Px - Pxy = 0 ie Px = P(x, y); or:
" It is not so that (x and not y) occur jointly" ie P(x,~y) = 0 ;
note that Pxy + P(x,~y) = Px , hence P(x,~y) = 0 and
the Pxy = P(x) are equivalent indeed;
which is the deterministic (extremely perfect or ideal) case
which literally translates into the probabilistic formalisms:
(x implies y) == ~(x,~y) in logic, ie 1 - P(x,~y) = 1 - (Px - Pxy)
or, the smaller the P(x,~y), the more x implies y ,
hence another measure of probabilistic causal tendency (y causes x) :
conv( x --> y) = Px*P(~y)/P(x,~y) = (Px - Px*Py)/(Px - Pxy)
= Px/P(x|~y) = P(~y)/P(~y|x)
is 1 iff x,y independent; and where:
+ the larger the Pxy <= min(Px, Py), the more the (x implies y), and
+ the closer the Pxy is to Px*Py , the more independent are x, y, and
the closer the conv(:) to 1 which is the fixed point for independence
( y --> x) == (~x --> ~y) where --> is "implies" in logic; here too:
conv( y --> x) = Py*P(~x)/P(y,~x) = (Py - Px*Py)/(Py - Pxy) =
= conv(~x --> ~y) = P(~x)*Py/P(~x,y)
so their equality is logically ok, but it is
!!! UNDESIRABLE FOR A MEASURE OF CAUSAL TENDENCY. Q: why? A: because eg:
"the rain causes us to wear raincoat" is ok, but "not wearing a raincoat
causes no rain" makes NO SENSE as the Nobel prize winner Herbert Simon
pointed out in { Simon 1957, p.50-51 }. This undesirable equality does
not hold for LR(:), RR(:) and its co-monotonous transformations like eg
W(:) and F(:).
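This UNDESIRABLE contraposition equality of conviction, and its welcome
absence for RR(:), is easy to verify numerically. A minimal sketch
(hypothetical probabilities; the helper is mine):

```python
def conv(p_ante, p_cons, p_joint):
    # conv(ante --> cons) = P(ante)*P(~cons) / P(ante,~cons)
    return p_ante * (1 - p_cons) / (p_ante - p_joint)

Px, Py, Pxy = 0.3, 0.5, 0.2
# conv(y --> x): antecedent y, consequent x
conv_yx = conv(Py, Px, Pxy)
# conv(~x --> ~y): P(~x) = 1-Px, P(~y) = 1-Py, P(~x,~y) = 1-Px-Py+Pxy
conv_contra = conv(1 - Px, 1 - Py, 1 - Px - Py + Pxy)
assert abs(conv_yx - conv_contra) < 1e-12     # the undesirable equality

# RR(y:x) = P(y|x)/P(y|~x) does not contrapose:
rr_yx = (Pxy / Px) / ((Py - Pxy) / (1 - Px))
# RR(~x:~y) = P(~x|~y)/P(~x|y)
rr_contra = ((1 - Px - Py + Pxy) / (1 - Py)) / ((Py - Pxy) / Py)
assert abs(rr_yx - rr_contra) > 1e-9          # RR(y:x) <> RR(~x:~y), good
```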
(x inhibits y) == (~x, y) == ~(~(~x, y)) == ~(y implies x) in logic,
is equivalent to "y does not imply x" ;
== P(~x, y) = Py - Pxy = x inhibits y (probabilistic)
== in plaintalk "Lack of x leads to y ",
because in the perfect case we get
ideally P(~x,~y) = 0 is the deterministic, extreme case;
note that P(~x,~y) = 0 is not equivalent to Pxy = Py, because
by DeMorgan P(~x,~y) = 1 - (Px + Py - Pxy) = P(~(x or y)) always,
hence P(~x,~y) = 0 ie Px + Py - Pxy = 1 ie Px + Py = 1 + Pxy
Recall P(~x,~y) + P(~x, y) = P(~x) always
P(~x,~y) + P( x,~y) = P(~y) always
inh0(x:y) = P(~x,y)/(P(~x)*Py) = (Py -Pxy)/(Py -Px*Py) scaled [0..1..oo)
= 0 iff Pxy = Py
= 1 iff Pxy = Px*Py ie iff x,y independent
= oo iff Px = 1
inh1(x:y) = ( P(~x,y) - ( P(~x)*Py)) / ( P(~x,y) + ( P(~x)*Py) )
= ((Py -Pxy) - (Py -Px*Py)) / ( ( Py -Pxy) + (Py -Px*Py) )
= ( Px*Py - Pxy ) / ( 2*Py -Pxy -Px*Py )
= ( Pxy - Px*Py ) / ( Pxy + Px*Py -2*Py )
= [-1..0..1] by my kemenyzation
inh1(y:x) = ( Px*Py - Pxy ) / ( 2*Px -Pxy -Px*Py )
= ( Pxy - Px*Py)/ ( Pxy +Px*Py -2*Px )
x implies y == ~(y inhibits x), hence it should hold:
-inh1(y:x) = (x implies y) = caus1(x:y), and indeed, it does hold
= ( Pxy - Px*Py ) / ( 2*Px -Pxy -Px*Py ), see caus1(:) below.
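The identity caus1(x:y) = -inh1(y:x) can be spot-checked with a few
hypothetical probability triples:

```python
def inh1_yx(px, py, pxy):
    # inh1(y:x) = (Pxy - Px*Py)/(Pxy + Px*Py - 2*Px), kemenyzed inhibition
    return (pxy - px * py) / (pxy + px * py - 2 * px)

def caus1_xy(px, py, pxy):
    # caus1(x:y) = (Pxy - Px*Py)/(2*Px - Pxy - Px*Py), measures (x implies y)
    return (pxy - px * py) / (2 * px - pxy - px * py)

for px, py, pxy in [(0.3, 0.5, 0.2), (0.1, 0.2, 0.1), (0.4, 0.4, 0.16)]:
    assert abs(caus1_xy(px, py, pxy) + inh1_yx(px, py, pxy)) < 1e-12
```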
Consider again: "Lack of x (almost) always leads to y ". Clearly, it
would be wrong to tell somebody with x and y that x caused y . Hence
Pxy alone cannot measure how much the x causes y, but P(y|x) could.
Alas, P(y|x) = Pxy/Px is not a function of Py, and we believe that it is
wise to have measures which are functions of all 3 Pxy, Px and Py :
conv(x --> y) = Px*P(~y)/P(x,~y) the larger the more causation,
due to small P(x,~y) co-occurence
= (Px - Px*Py)/(Px - Pxy) is 1 if x, y are independent;
= [0..1..oo) , infinity oo iff Pxy = Px, 1 iff independent.
conv2(x --> y) = P(x implies y )/( ~( Px*P(~y)) ) larger implies more
= P(~(x,~y))/( ~( Px*P(~y)) )
= ( 1 - P( x,~y))/( 1 - Px*P(~y) ) is 1 if independent;
= [1/2..1..4/3] , 1 iff x,y independent; 4/3 iff x imp y.
From the P(~x,~y) + P(~x, y) = P(~x)
P(~x,~y) + P( x,~y) = P(~y)
for the case P(~x,~y) = 0 holds P(~y) = P( x,~y) in which case
conv(x --> y) = Px <= 1 = independence,
conv2(x --> y) = ( 1-P(~y) )/( 1 -P(~y)*Px ) <= 1 = independence;
<= 1 is due to *Px , which always is 0 <= Px <= 1 .
<= 1 in this case is good, because P(~x,~y) = 0 was shown to be
!!! equivalent to the (x inhibits y), hence x cannot imply y ,
not even a little bit, ie causation must not exceed the point
of no dependence ie point of independence, and indeed both
conv(:) and conv2(:) are <= 1 in this case, which is good.
An explanation and justification of the conv(:) measures:
+ conv(:) = fun( Px, Py, Pxy ) ie fun of all 3 defining probabilities.
+ conv has a fixed value if x, y independent , and also
has a fixed value if x implies y 100% ,
hence conv(:) has a decent opeRational interpretation.
+ conv(x --> y) = extreme when x implies y 100%
!!! ie when Pxy = Px regardless of Py (draw a Venn)
(x implies y) = ~(x,~y) in logic
= 1 - P(x,~y) in probability
{ Brin 1997 } got rid of the outer negation ~ by taking the reciprocal
value. On one hand this trick is not as clean as
!!! conv2(:), but on the other hand this trick makes the
!!! 100% implication value an extreme value REGARDLESS of Py :
conv(x --> y) = Px*P(~y)/P(x,~y) , the larger the more (x implies y)
= Px/P(x|~y) , is 1 iff independent x,y
= Px*(1 - Py)/(Px - Pxy)
= (Px - Px*Py)/(Px - Pxy) is 1 if Pxy = Px*Py ie 100% independence,
is oo if Pxy = Px ie 100% (x implies y)
oo needs a precheck for an overflow;
numerically is 0/0 if Py = 1 ie Pxy = Px (overflow) but:
correct logically is 1 if Py = 1 as Pxy = Px*Py ie x,y indep.
Or its reciprocal (since Pxy = Px is possible, while Py < 1) :
(Px - Pxy)/(Px - Px*Py) , the smaller the more (x implies y) :
is 1 if Pxy = Px*Py ie 100% independence,
is 0 if Pxy = Px ie 100% (x implies y)
numerically is 0/0 if Py = 1 ie Pxy = Px (overflow) but:
correct logically is 1 if Py = 1 as Pxy = Px*Py ie x,y indep.
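As warned above, a program computing conv(:) must precheck the /0 and 0/0
cases. A minimal sketch (the eps guard is my own choice, not prescribed by
the text):

```python
INF = float('inf')

def conv_safe(px, py, pxy, eps=1e-12):
    # conv(x --> y) = (Px - Px*Py)/(Px - Pxy) with prechecks for overflow
    if py >= 1 - eps:      # Py = 1 forces Pxy = Px*Py: logically independent
        return 1.0
    if px - pxy < eps:     # Pxy = Px: 100% (x implies y)
        return INF
    return (px - px * py) / (px - pxy)

assert conv_safe(0.3, 1.0, 0.3) == 1.0               # Py = 1 precheck
assert conv_safe(0.1, 0.2, 0.1) == INF               # x implies y fully
assert abs(conv_safe(0.3, 0.5, 0.15) - 1.0) < 1e-12  # Pxy = Px*Py: indep.
```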
Conv(:) kemenyzed by me to the scale [-1..0..1] becomes
conv1(x --> y) = ( Px*P(~y) - P(x,~y) ) / ( Px*P(~y) + P(x,~y) )
= ( Px - P(x|~y) ) / ( Px + P(x|~y) )
= ( Pxy - Px*Py)/(2*Px - Pxy - Px*Py) = -inh1(y:x) above;
= ( P(~y) - P(~y|x) ) / ( P(~y) + P(~y|x) )
which kemenyzed to the scale [0..1/2..1] becomes :
conv3(x --> y) = Px*P(~y) / ( Px*P(~y) + P(x,~y) )
or based on counterfactual reasoning (cofa) IF ~x THEN ~y :
cofa1(~x --> ~y) = ( P(~x)*Py - P(~x, y) ) / ( P(~x)*Py + P(~x, y) )
= ( Py -Px*Py - Py +Pxy ) / ( Py -Px*Py + Py -Pxy )
= ( Pxy - Px*Py )/( 2*Py -Pxy -Px*Py) = -inh1(x:y) above,
ie not( x inhibits y ).
cofa0(~x --> ~y) = P(~x)*Py / P(~x, y)
= (Py -Px*Py)/(Py -Pxy) = conv(y --> x) above,
and indeed, in logic (~x <== ~y) == (y <== x); the <== means "implies"
(and also it means "less than" if applied to 0 = false, 1 = true)
F(~x:~y) == F(~x <== ~y)
= ( P(~x|~y) - P(~x|y) ) / ( P(~x|~y) + P(~x|y) ) = -F(~x:y)
They all look reasonable, and all are scaled to [-1..0..1].
Q: which one do you like, if any, and why (not) ?
A mathematically more rigorous alternative to conv(x --> y) is my
conv2(x --> y) which does not suffer from the dangers of an overflow,
employs the exact probabilistic (x implies y) = 1 - P(x,~y) derived
from the exact logical (x implies y) == ~(x,~y).
Since we wish to have a fixed value for the independence of events x, y,
the exact implication form 1 - P(x,~y) suggests to compare it with
the negation of the fictive ie as-if independence term as follows:
conv2(x --> y) = P(x implies y)/( x,y independ ) larger implies more
= P(~(x,~y))/( ~( Px*P(~y)) )
= ( 1 - P( x,~y))/( 1 - Px*P(~y) ) is 1 if independent
= ( 1 -(Px - Pxy))/( 1 - Px*(1-Py) )
= ( 1 - Px + Pxy )/( 1 - Px +Px*Py ) is 1 if Pxy = Px*Py
This is ( 1 - Px + Px )/( 1 - Px +Px*Py ) if Pxy = Px
= 1/( 1 - Px*(1 -Py)) >= 1 if Pxy = Px,
the larger the Px and the smaller the Py, the more conv2(x --> y) exceeds 1.
When Px = Pxy (draw a Venn diagram) ie when x implies y 100%
then the numerator is 1 ie maximal, but unlike in
!! conv(x --> y) , the denominator depends on Px and Py
-.-
+Rescalings important wrt risk ratio RR(:) :
For positive u, v : u/v is scaled [ 0..1..oo) v <> 0
and W = ln(u/v) is scaled (-oo..0..oo) v <> 0
and F = (u - v )/(u + v ) is scaled [ -1..0..1 ], allows u=0 xor v=0
= (1 - v/u)/(1 + v/u) handy for graphing F=f(v/u) u <> 0
= (u/v - 1)/(u/v + 1) handy for graphing F=f(u/v) v <> 0
= (u - v )/(u + v ) rescaling I call "kemenyzation" to honor
the late John Kemeny, the Hungarian-American
co-father of BASIC, and former math-assistant to Einstein;
= tanh(W/2) = tanh(0.5*ln(u/v)) due to { I.J. Good 1983, p.160
where sinh is his mistake }
Since
atanh( z ) = 0.5*ln( (1+z)/(1-z) ) for abs(z) < 1,
W = 2*atanh( F ) = ln( (1+F)/(1-F) ) for abs(F) < 1
F0 = (F+1)/2 is linearly rescaled to [0..1/2..1], 1/2 for independence.
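These rescalings are one-liners; a quick numerical sanity check (u, v are
arbitrary positive numbers, chosen here only for illustration):

```python
import math

def kemenyze(u, v):
    # F = (u - v)/(u + v), scaled [-1..0..1]
    return (u - v) / (u + v)

u, v = 3.0, 1.5
W = math.log(u / v)                        # log-scale, (-oo..0..oo)
F = kemenyze(u, v)
assert abs(F - math.tanh(W / 2)) < 1e-12   # F = tanh(W/2)
assert abs(W - 2 * math.atanh(F)) < 1e-12  # W = 2*atanh(F)
F0 = (F + 1) / 2                           # rescaled [0..1/2..1]
assert kemenyze(1.0, 1.0) == 0.0           # independence point u = v
```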
W(y:x) = ln( P(y|x)/P(y|~x) ) is an information gain [see F(:) ]
= ln( P(y|x) ) - ln(P(y|~x) ) is additive
= ln( B(y:x) ) ie logarithmic Bayes factor
= ln(RR(y:x) ) = ln( relative risk of y if x )
= ln(Odds(x|y)/Odds(x))
is I.J. Good's "weight of evidence in favor of x provided by y".
The advantage of oo-less scalings like [-1..0..1] or [0..1/2..1] is that
they make comparisons of different formulas possible at all and more
meaningful, though not perfect. E.g. we may try to compare a value of
F(:) with that of conv1(:) which is conv(:) kemenyzed by me.
W(:)'s logarithmic scale allows addition (of otherwise multiplicable
ratios) under the assumption of independence between y, z (given x as
well as given ~x) :
W(x: y,z) = W(x:y) + W(x:z)
but when y,z are dependent we must use { I.J. Good 1989, p.56 } :
W(x: y,z) = W(x:y) + W(x: z|y)
F(:)'s cannot be simply added, but can be combined (provided y, z are
independent) according to { I.J. Good, 1989, p.56, eq.(7) } thus :
F(x: y,z) = ( F(x:y) + F(x:z) )/( 1 + F(x:y)*F(x:z) )
but when y,z are dependent we must use :
F(x: y,z) = ( F(x:y) + F(x: z|y) )/( 1 + F(x:y)*F(x: z|y) )
Seeing this, physicists, but not necessarily physicians, might recall
that 2 relativistic speeds are combined into the resultant one by means
of a regraduation function for the relativistic composition of
velocities u, v into a single resultant velocity w :
w = ( u + v )/( 1 + u*v/(c*c) ) where c is the speed of light.
P(.|.)'s maximum = 1 corresponds to the unexceedable speed of light,
in which case (with c = 1) w simplifies to our ( u + v )/( 1 + u*v ).
This relativistic addition appears in:
- { Lucas & Hodgson, pp.5-13 } is the best on regraduation (no P(.)'s )
- { Yizong Cheng & Kashyap 1989, p.628 eq.(20) }, good;
- { Good I.J. 1989, p.56 }
- { Grosof 1986, p.157 } last line, no relativity mentioned;
- { Heckerman 1986, p.180 } first line, no relativity mentioned.
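The F-combination rule is exactly the tanh addition law, which is what ties
it to the relativistic formula. A small check (hypothetical F values,
assuming y, z independent as required):

```python
import math

def combine_F(f1, f2):
    # "relativistic" combination (F1 + F2)/(1 + F1*F2), with c = 1
    return (f1 + f2) / (1 + f1 * f2)

f1, f2 = 0.6, -0.3
combined = combine_F(f1, f2)
# on the W = 2*atanh(F) scale the same combination is plain addition:
w1, w2 = 2 * math.atanh(f1), 2 * math.atanh(f2)
assert abs(combined - math.tanh((w1 + w2) / 2)) < 1e-12
assert -1 < combined < 1    # never exceeds the "speed of light" abs(F) = 1
```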
-.-
+Correlation in a 2x2 contingency table is scaled to [-1..0..1] :
corr(x,y) = [ a*d - b*c ]/ sqrt[ (a+b)*(a+c) * (b+d)*(c+d) ]
= [ Pxy*P(~x,~y) - P(~x,y)*P(x,~y) ] / sqrt[ Py*Px*P(~x)*P(~y) ]
= [ Pxy - Px*Py ] / sqrt[ Px*(1-Px) * Py*(1-Py) ]
= cov(x,y) / sqrt( var(x) * var(y) )
= correlation coefficient of binary ie Bernoulli ie indicator events
x, y is symmetrical wrt x, y
r2 = square(corr(x,y))
= ( cov(x,y)/var(x)) * ( cov(x,y)/var(y))
= beta( y:x ) * beta( x:y ) -1 <= beta <= 1
= (slope of y on x ) * (slope of x on y)
= ( P(y|x) - P(y|~x) ) * ( P(x|y) - P(x|~y) ) for events x, y
= coefficient of determination aka r^2 or r2
= ( explained variance ) / ( explained var. + unexplained variance )
= ( variance explained by regression ) / ( total variance )
= 1 - ( variance unexplained ) / ( total variance )
r2 is considered to be a more realistic (because less inflated) measure
of correlation than the corr(.,.) itself (except for the sign).
The key variance decomposition from which the above follows is :
total variance = variance explained + variance unexplained aka residual
variance. This equation I call the Pythagorean decomposition of the
total variation into its orthogonal partial variations.
It is a sad fact that very few books on statistics and/or probability
show the correlation coefficient between events.
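Here is that event-level correlation computed straight from a 2x2 table of
counts a, b, c, d as above (the counts are hypothetical), together with the
check r2 = beta(y:x)*beta(x:y):

```python
import math

# 2x2 counts: a = n(x,y), b = n(x,~y), c = n(~x,y), d = n(~x,~y)
a, b, c, d = 30, 10, 20, 40
n = a + b + c + d
Px, Py, Pxy = (a + b) / n, (a + c) / n, a / n

corr = (a * d - b * c) / math.sqrt((a + b) * (a + c) * (b + d) * (c + d))
# the same coefficient from the probabilistic form:
corr_p = (Pxy - Px * Py) / math.sqrt(Px * (1 - Px) * Py * (1 - Py))
assert abs(corr - corr_p) < 1e-12

# r2 = beta(y:x) * beta(x:y) = product of the two regression slopes
beta_yx = a / (a + b) - c / (c + d)     # P(y|x) - P(y|~x)
beta_xy = a / (a + c) - b / (b + d)     # P(x|y) - P(x|~y)
assert abs(corr ** 2 - beta_yx * beta_xy) < 1e-12
```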
Yule's coefficient of colligation { Kendall & Stuart 1977, chap.33
on Categorized data, p.539 } is also symmetrical wrt x, y:
Y = ( 1 - sqrt(b*c/(a*d)) ) / ( 1 + sqrt(b*c/(a*d)) )
= ( sqrt(a*d) - sqrt(b*c) ) / ( sqrt(a*d) + sqrt(b*c) ) kemenyzed
= tanh( 0.25*ln( a*d/(b*c) ) ) my tanhyperbolization a la I.J. Good
The formula for chi-squared (findable as X^2 , chisqr , chisquared ) :
X^2 = Sum[ ( Observed - Expected )^2 / Expected ]
= [ ( a - (a+b)(a+c)/n )^2 / ( (a+b)(a+c)/n ) +
( b - (a+b)(b+d)/n )^2 / ( (a+b)(b+d)/n ) +
( c - (a+c)(c+d)/n )^2 / ( (a+c)(c+d)/n ) +
( d - (b+d)(c+d)/n )^2 / ( (b+d)(c+d)/n ) ] is exact,
=. n*(|ad - bc| -n/2)^2 /[ (a+b)(a+c)(b+d)(c+d) ] Yates' correction
=.. n*( ad - bc )^2 /[ (a+b)(a+c)(b+d)(c+d) ] may be good enough
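For a 2x2 table the exact sum and the n*(ad - bc)^2 shortcut agree, while
Yates' correction pulls X^2 down a bit. A sketch with hypothetical counts:

```python
def chi2_exact(a, b, c, d):
    # Sum[ (Observed - Expected)^2 / Expected ] over the 4 cells
    n = a + b + c + d
    observed = [a, b, c, d]
    expected = [(a + b) * (a + c) / n, (a + b) * (b + d) / n,
                (a + c) * (c + d) / n, (b + d) * (c + d) / n]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi2_shortcut(a, b, c, d):
    # n*(ad - bc)^2 / [ (a+b)(a+c)(b+d)(c+d) ], algebraically identical
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))

def chi2_yates(a, b, c, d):
    # Yates' continuity correction: n*(|ad - bc| - n/2)^2 / [ ... ]
    n = a + b + c + d
    return n * (abs(a * d - b * c) - n / 2) ** 2 / (
        (a + b) * (a + c) * (b + d) * (c + d))

a, b, c, d = 30, 10, 20, 40
assert abs(chi2_exact(a, b, c, d) - chi2_shortcut(a, b, c, d)) < 1e-9
assert chi2_yates(a, b, c, d) < chi2_exact(a, b, c, d)
```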
.-
Although a meaningful interpretation of values is very important, it is
equally important how a measure orders the values obtained from a data set,
since we want a list of the pairs of events (x,y) sorted by the strength
of their potential causal tendency :
Note that :
P(x|y) is the diagnostic predictivity of the hypothesis x from y effect
P(y|x) is the causal predictivity of the effect y from x
P(y|x) = Py*P(x|y)/Px = Pxy/Px is the Bayes rule.
The likelihood ratio aka Bayes factor in favor of the outcome (or hypothesis)
x provided by the evidence (or predictor or cue or feature or effect) y,
aka relative risk RR is :
RR(y:x) == B(y:x)
= P(y|x) / P(y|~x) = (Pxy/Px)/( (Py - Pxy)/(1 - Px) )
= Pxy*(1 - Px) / (Px*Py - Px*Pxy) caution! /0 if Pxy = Py :-(
= (1 - Px) / (Px*Py/Pxy - Px)
= (Pxy - Px*Pxy)/( Px*Py - Px*Pxy ) which shows that:
B = 1 for Px*Py=Pxy ie for independent x,y ; and
B = oo for Py=Pxy ie "if y then x" ie y implies x
= relative odds on the event x after the event y was observed
!! = Odds(x|y)/Odds(x) = posteriorOdds / priorOdds ; { odds form }
= ( P(x|y)/(1-P(x|y)) )/( Px/(1-Px) )
= ( P(x|y)*(1-Px) )/( Px*(1-P(x|y)) )
{ note that (x|y) inverts into (y|x) via
Bayesian P(x)*P(y|x) = Pxy = P(x|y)*Py }
= P(y|x) / P(y|~x) q.e.d.
!! = ( Pxy/(Py - Pxy) )*( (1-Px)/Px )
!! shows that Py = Pxy does mean that (y implies x) so that B(y:x) = oo
!! note that when (x causes y) then (y implies x) but not necessarily
!! vice versa; the y is an effect or outcome in general;
= P(y|x)/( 1 - P(~y|~x) ) = B(y:x) because,
= P(y|x)/P( y|~x) q.e.d.
Lets compare RR(y:x) = P(y|x) / P(y|~x) = P(y|x) * (1-Px)/(Py - Pxy)
!!
with conv(y --> x) = Py / P(y|~x) = Py * (1-Px)/(Py - Pxy)
= Py * P(~x)/P(y,~x)
clearly RR(y:x) is more meaningful than the "conviction" by { Brin 1997 },
though conviction is no nonsense either :
+ both RR(y:x) and conv(y:x) equal oo if Py=Pxy ie if y implies x
+ both RR(y:x) and conv(y:x) equal 1 if Pxy=Px*Py ie y, x are independent
+ RR(y:x) equals 0 if Pxy=0 ie if y is disjoint with x, while
- conv(y:x) then equals (Py - Px*Py)/Py = P(~x) <> 0 , a minus for conviction
+ RR(y:x) is relative risk, used within other meaningful formulas
+ RR(y:x) <> RR(~x:~y) which is good, while
- conv(y:x) == conv(~x:~y) which is NO GOOD (find UNDESIRABLE above)
B(~y:~x) = P(~y|~x) / P(~y|x) == RR(~y:~x)
= [1 - P( y|~x)] / [ 1 - P(y|x)] = [ P(~y,~x)/P(~y,x)]*Px/(1 - Px)
= [ (1 -Py -Px +Pxy)/(Px -Pxy) ]* Px/(1-Px)
= [ (1 -Py)/(Px-Pxy) -1 ]* Px/(1-Px)
!! when Px=Pxy ie when (x implies y) then B(~y:~x) = oo
hence if we wish to use a B(~.:~.) instead of B( .: .),
then we must swap the events x and y. For example instead of
B( y: x) we might use :
B(~x:~y) = P(~x|~y) / P(~x|y) == RR(~x:~y)
= [ (1 -Px)/(Py-Pxy) -1 ]* Py/(1-Py)
!! when Py=Pxy ie when (y implies x) then B(~x:~y) = oo
W(y:x) = the weight of evidence for x if y happens/occurs/observed
= ln( P(y|x)/P(y|~x) ) = Qnec(y:x) { I.J. Good 1994, 1992 }
= ln( B(y:x) ) = logarithmic Bayes factor for x due to y
= ln(RR(y:x) )
= ln(Odds(x|y)/Odds(x))
W(~y:~x) = the weight of evidence against x if y absent { I.J. Good }
= ln( P(~y|~x)/P(~y|x) ) = Qsuf(y:x) { I.J. Good 1994, 1992 }
= ln( B(~y:~x) )
= ln( (1 - P( y|~x)) / (1 - P( y|x)) )
= -W(~y:x)
W(:) = 2*atanh(F(:)) = ln((1+F)/(1-F)) for abs(F) < 1
B(a:b) = P(a|b)/P(a|~b) = (Pab/Pb)/((Pa-Pab)/(1-Pb))
= oo iff Pab=Pa ie iff (a implies b)
B(b:a) = P(b|a)/P(b|~a) = (Pab/Pa)/((Pb-Pab)/(1-Pa))
= oo iff Pab=Pb ie iff (b implies a)
Q:/Quiz: could comparing (eg subtracting or dividing) B(a:b) with B(b:a)
show the DIRECTION of a possible causal tendency ??
P(~b,~a) = 1 - (Pa + Pb - Pab) = P(~(a or b)) by DeMorgan's rule
B(~b:~a) = P(~b|~a)/P(~b|a) = (1 - P(b|~a))/(1 - P(b|a))
= [P(~b,~a)/(1-Pa)] / [(Pa-Pab)/Pa]
= oo iff Pab=Pa ie iff (a implies b) like for B(a:b) or W(a:b)
which speaks against comparing ?(a:b) with ?(~b:~a) for the purpose of
deciding the direction of possible causal tendency.
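The Quiz itself can be probed numerically: pushing Pxy towards Px (ie x
implies y) blows up B(x:y) while B(y:x) stays moderate, so comparing the two
does point at a direction. A sketch with hypothetical numbers, not a proof:

```python
def B(pa, pb, pab):
    # B(a:b) = P(a|b)/P(a|~b) = (Pab/Pb) / ((Pa - Pab)/(1 - Pb))
    return (pab / pb) / ((pa - pab) / (1 - pb))

Px, Py, Pxy = 0.1, 0.4, 0.099    # Pxy close to Px: x almost implies y
b_xy = B(Px, Py, Pxy)            # blows up as Pxy -> Px
b_yx = B(Py, Px, Pxy)            # stays moderate
assert b_xy > b_yx               # the larger B points at the implied event
```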
C(y:x) = a measure of corroboration { by Karl Popper }
= (P(y|x) - Py )/( P(y|x) + Py - Pxy ) { C-form 1 }
= ( Pxy - Px*Py )/( Pxy + Px*Py - Pxy*Px ) { C-form 2 }
= (P(y|x) - P(y|~x))/( P(y|x) + Py/P(~x) ) { compare w/ F-form 1 }
= (cov(x,y)/var(x) )/( P(y|x) + Py/P(~x) ) { C-form 3a }
= beta(y:x)/( P(y|x) + Py/P(~x) ) { C-form 3b }
F(y:x) == F(y <== x) = degree of factual support of x by y
= primarily a measure of how much y implies x { by John Kemeny }
= (P(y|x) - P(y|~x))/( P(y|x) + P(y|~x) ) { F-form 1 }
= ( Pxy - Px*Py )/( Pxy + Px*Py - 2*Pxy*Px ) { F-form 2 }
= (cov(x,y)/var(x) )/( P(y|x) + P(y|~x) ) { F-form 3a }
= beta(y:x)/( P(y|x) + P(y|~x) ) { F-form 3b }
= tanh( 0.5*ln(P(y|x) / P(y|~x)) ) { F-form 4 }
= tanh( W(y:x)/2 ) { by I.J.Good, fixed }
= (difference/2) / average = deviation/mean { F-form 5 }
= ( B(y:x) - 1 ) / ( B(y:x) + 1 ) is handy also for graphing F = fun(B)
= -F(y:~x)
= (Pxy/Px - (Py-Pxy)/(1-Px)) / (Pxy/Px + (Py-Pxy)/(1-Px)) hence:
= 1 iff P(y|~x) = 0
ie iff P(y,~x) = 0 ie iff Pxy = Py
ie iff y implies x deterministically
ie iff y leads to x (always)
ie IF y THEN x (always holds)
= 0 iff x, y are independent ;
= -1 iff P(y|x) = 0
ie iff P(y,x) = 0 ie iff x, y are mutually exclusive
where :
the F-form 1 is the original one by { Kemeny & Oppenheim 1952 },
my F-form 2 is the de-conditioned one, and it does reveal that
iff Pxy=Py (see the -2* ), ie iff y implies x, then F(y:x)=1.
my F-form 3 reveals an important hidden meaning: beta(y:x) is the slope
of the implicit probabilistic regression line
of Py = beta(y:x)*Px + alpha(y:x) ;
the F-form 4 reveals that F(:) and Turing-Good's weight of evidence W(:)
are changing co-monotonically
my F-form 5 provides the most simple interpretation of F(:)
Unlike B(:) or W(:), the F(:) will not easily overflow due to /0.
The numerators tell us that for independent x, y it holds C(:) = 0 = F(:).
C(:) stresses the near independence, while F(:), W(:), B(:) stress near
implication more than near independence. Try out an example with a near
independence and simultaneously with near implication.
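Trying out exactly that (hypothetical numbers: first Pxy near Px*Py, then
Pxy near Py so that y nearly implies x):

```python
def C_popper(px, py, pxy):
    # C(y:x), Popper's corroboration, C-form 2
    return (pxy - px * py) / (pxy + px * py - pxy * px)

def F_kemeny(px, py, pxy):
    # F(y:x), Kemeny & Oppenheim's factual support, F-form 2
    return (pxy - px * py) / (pxy + px * py - 2 * pxy * px)

# near independence: Pxy close to Px*Py = 0.15
c_ind, f_ind = C_popper(0.5, 0.3, 0.155), F_kemeny(0.5, 0.3, 0.155)
# near implication: Pxy close to Py, ie y almost implies x
c_imp, f_imp = C_popper(0.5, 0.3, 0.29), F_kemeny(0.5, 0.3, 0.29)

assert abs(c_ind) < 0.05 and abs(f_ind) < 0.05   # both near 0
assert f_imp > 0.9          # F stresses near implication ...
assert c_imp < 0.5          # ... much more than C does
```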
F(x:y) == F(x <== y) = degree of factual support of y by x
= primarily a measure of how much x implies y
= (P(x|y) - P(x|~y))/( P(x|y) + P(x|~y) ) { F-form 1 }
= ( Pxy - Px*Py )/( Pxy + Px*Py - 2*Py*Pxy ) { F-form 2 }
= (Pxy/Py - (Px-Pxy)/(1-Py)) / (Pxy/Py + (Px-Pxy)/(1-Py))
= -F(x:~y)
note that iff Px=Pxy then F(x:y) = 1 == (x implies y) fully, and that
this matches Pxy/Px = 1 as maximal possible contribution to the product
for P(x_j | y..y) computed by the simple Bayesian chain rule over y..y cues
for P(x_j , y..y). Clearly a product of Pxy/Px terms over a vector of cues
y..y may be viewed as a product of the simplest (x implies y) terms.
Rescaling F(:) from [-1..0..1] to [0..1/2..1] :
F0(:) = ( F(:) + 1 )/2 , so that
F0(x:y) = P(x|y)/( P(x|y) + P(x|~y) )
F0(y:x) = P(y|x)/( P(y|x) + P(y|~x) )
Before we go further, we recall that F(:) is co-monotonical with B(:), and
B(y:x) = P(y|x) / P(y|~x)
= (Pxy/Px)/( (Py - Pxy)/(1 - Px) ) {now consider INCreasing Pxy:}
wherein Pxy/Px measures how much (x implies y) up to maximum = 1
while 1/(Py - Pxy) measures how much (y implies x) up to maximum = oo
hence iff Py = Pxy then B(y:x) = oo ie y implies x is measured by B(y:x)
and:
B(y:~x) = P(y|~x) / P(y|x)
= ((Py - Pxy)/(1 - Px)) /(Pxy/Px) {now consider DECreasing Pxy:}
wherein Py - Pxy measures how little (y implies x) with maximum = Py
while 1/(Pxy/Px) measures how little (x implies y) with maximum = oo
for Pxy=0
hence iff Pxy = 0 then B(y:~x) = oo
F(y:x)
= (P(y|x) - P(y|~x))/(P(y|x) + P(y|~x))
= ( Pxy - Px*Py )/( Pxy + Px*Py - 2*Px*Pxy )
= -F(y:~x)
F(y:~x)
= (P(y|~x) - P(y|x))/(P(y|~x) + P(y|x))
= ( Px*Py - Pxy )/( Px*Py + Pxy - 2*Px*Pxy )
= -( Pxy - Px*Py )/( Pxy + Px*Py - 2*Px*Pxy );
= -F(y:x)
F(~y:x)
= (P(~y|x) - P(~y|~x))/(P(~y|x) + P(~y|~x))
= ( Pxy - Px*Py )/( Pxy + Px*Py - 2*Px*(1 - Px + Pxy) )
= -F(~y:~x)
F(~y:~x)
= (P(~y|~x) - P(~y| x))/(P(~y|~x) + P(~y| x))
= ( Px*Py - Pxy )/( Px*Py + Pxy - 2*Px*(1 - Px + Pxy) )
= -( Pxy - Px*Py )/( Pxy + Px*Py - 2*Px*(1 - Px + Pxy) )
= -F(~y:x)
and the remaining 4 mirror images are easily obtained by swapping x and y.
Note that W(:) = 2*atanh(F(:)) = ln[ (1 + F(:))/(1 - F(:)) ] .
For example F(~y:x) = -F(~y:~x) would measure how much the hypothesis
x explains the unobserved fact y , like e.g. in common reasoning :
"if (s)he would have the health disorder x
(s)he could NOT be able to do y (eg a body movement (s)he did)",
so that from a high enough F(~y:x) we could exclude the disorder x as
an unsupported hypothesis.
-.-
+Example 1xy:
for Px=0.1 , Pxy=0.1 , Py=0.2 , visualized by a squashed Venn diagram
xxxxxxxxxx
yyyyyyyyyyyyyyyyyyyy
are P(x|y)=0.5 ie "50:50" ; P(x|~y)=(0.1 -0.1)/(1 -0.2) = 0 ie minimum
P(y|x)=1.0 ie maximum ; P(y|~x)=(0.2 -0.1)/(1 -0.1) = 1/9
and
corr(x,y) = cov(x,y)/sqrt[ var(x) * var(y) ]
= (Pxy - Px*Py)/sqrt[ Px*(1-Px) * Py*(1-Py) ]
= 0.08/0.12 =. 0.67 is the value of the correlation coefficient
between the events x, y
caus1(x:y) = ( Pxy - Px*Py ) / ( 2*Px -Pxy -Px*Py ) = 1
F(x:y) = (0.5 - 0)/(0.5 + 0) = 1
B(x:y) = P(x|y)/P(x|~y) = 0.5/0 = oo = infinity
clearly the rule IF x THEN y cannot be doubted ;
but what do we get when we swap the roles of x, y ie when our observer
will view the situation from the opposite viewpoint ? This can be done
by either swapping the values of Px with Py, or by computing F(y:x) :
+Example 1yx:
is F(y:x) = (1 -1/9)/(1 + 1/9) = 0.8 is too high for my taste
caus1(y:x) = (Pxy - Px*Py)/(2*Py -Pxy -Px*Py) = 0.29 is more reasonable,
as Pxy/Py = 0.5 hence y doesnt imply x much (although x implies y fully
as Pxy/Px = 1 );
B(y:x) = P(y|x)/P(y|~x) = 1/((0.2 - 0.1)/0.9) = 9 is (too) high.
!!! Conclusion: for measuring primarily an implication & secondarily
dependence, B(y:x) and F(y:x) are not ideal measures.
!!! Note: if Px < Py then x is more plausible to imply y, than vice versa;
if Py < Px then y is more plausible to imply x, than v.v.
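Example 1 can be reproduced mechanically (note that corr works out to
0.08/0.12 = 2/3 =. 0.67):

```python
import math

Px, Py, Pxy = 0.1, 0.2, 0.1     # Example 1: x is a subset of y, Pxy = Px

corr = (Pxy - Px * Py) / math.sqrt(Px * (1 - Px) * Py * (1 - Py))
assert abs(corr - 2 / 3) < 1e-9                 # 0.08/0.12 =. 0.67

caus1_xy = (Pxy - Px * Py) / (2 * Px - Pxy - Px * Py)
assert abs(caus1_xy - 1.0) < 1e-12              # x implies y fully

p_y_x, p_y_notx = Pxy / Px, (Py - Pxy) / (1 - Px)
F_yx = (p_y_x - p_y_notx) / (p_y_x + p_y_notx)
assert abs(F_yx - 0.8) < 1e-9                   # too high for y implies x

caus1_yx = (Pxy - Px * Py) / (2 * Py - Pxy - Px * Py)
assert abs(caus1_yx - 2 / 7) < 1e-9             # =. 0.29, more reasonable

B_yx = p_y_x / p_y_notx
assert abs(B_yx - 9.0) < 1e-9                   # (too) high
```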
+Example 3: x = drunken driver ; y = accident
P(y| x) = 0.01 = P( accident y caused by a drunken driver x )
P(y|~x) = 0.0001 = P( accident y caused by a sober driver ~x )
is how { Kahre 2002, p.186 } defines it; obviously P(y|x) > P(y|~x).
Note that without knowing either Px or Py or Pxy, we cannot obtain the
probabilities needed for caus1(x:y), F(x:y) and F(~x:~y), ie we can
compute B(y:x), F(y:x) and caus1(y:x) only :
beta(y:x) = P(y|x) - P(y|~x) =. 0.01 is the regression slope of y on x ,
is misleadingly low.
B( y: x) = P(y|x) / P(y|~x) = 100 ie (y implies x) very strongly.
F( y: x) = (B-1) / (B+1) = 0.98 =. 1 = F's upper bound
F( y: x) measures how much an accident y implies drunkenness x
(obviously an accident cannot cause drunkenness).
B(~y:~x) = (1-P(y|~x))/(1-P(y|x)) = 1.01
F(~y:~x) = (B-1)/(B+1) = 0.005 =. 0 = F's point of independence
F(~y:~x) measures how much an absence of an accident y implies that
a driver is not drunk. Here is my CRITICISM of such formulas:
According to I.J. Good, "The evidence against x if y does not happen" can
also be considered as a possible measure of x causes y . It is based on
COUNTERFACTUAL reasoning "if absent y then absent x", which I denote as
!! "Necessitistic" reasoning. I am dissatisfied with the sad fact that his
formulation leads to formulas which are not zero when Pxy = 0 ie when x, y
are DISjoint. If the above explained notion of Necessity is to be taken
seriously, and I think it should be, then Good's formulation is not good
enough.
F(~y:~x) == F(~y <== ~x)
= measures how much ~y implies ~x =
= (P(~y|~x) - P(~y| x))/(P(~y|~x) + P(~y| x)) (1)
= ( Px*Py - Pxy )/( Px*Py + Pxy - 2*Px*(1 -Px + Pxy) )
= -( Pxy - Px*Py )/( Pxy + Px*Py - 2*Px*(1 -Px + Pxy) )
= (B(~y:~x) - 1)/(B(~y:~x) + 1)
= ((1 - P(y|~x)) - (1 - P(y|x )))/((1 - P(y|~x)) + (1 - P(y|x))) from (1)
= ( - P(y|~x) + P(y|x ) )/( 2 - P(y|~x) - P(y|x) )
= ( P(y|x ) - P(y|~x) )/( 2 - P(y|~x) - P(y|x) )
= -F(~y:x)
F(~x:~y) == F(~x <== ~y)
= ( P(~x|~y) - P(~x|y) ) / ( P(~x|~y) + P(~x|y) ) = -F(~x:y)
Kemenyzed are:
Knec( y: x) = ( P(y|x) - P(y|~x) ) / ( P(y|x) + P(y|~x) )
= ( cov(x,y)/var(x) ) / ( P(y|x) + P(y|~x) )
Ksuf( y: x) = ( (1 - P(y|~x)) - (1 - P(y|x)) )
/( (1 - P(y|~x)) + (1 - P(y|x)) )
= ( P(y|x) - P(y|~x) ) / ( 2 -(P(y|x) + P(y|~x)) )
= ( cov(x,y)/var(x) ) / ( 2 -(P(y|x) + P(y|~x)) )
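With only the two conditionals of Example 3 given, Qnec, Knec and Ksuf are
directly computable:

```python
p_e_h    = 0.01      # P(e|h):  accident given a drunken driver { Kahre 2002 }
p_e_noth = 0.0001    # P(e|~h): accident given a sober driver

Qnec = p_e_h / p_e_noth                              # = RR(e:h)
Knec = (p_e_h - p_e_noth) / (p_e_h + p_e_noth)
Ksuf = (p_e_h - p_e_noth) / (2 - (p_e_h + p_e_noth))
assert abs(Qnec - 100) < 1e-9
assert abs(Knec - 0.98) < 0.001      # =. 1 : e strongly implies h
assert abs(Ksuf - 0.005) < 0.001     # =. 0 : near the independence point
```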
-.-
+Folks' wisdom :
!!! Caution: causation works in the opposite direction wrt implication. This
is so, because ideally an effect y implies a cause x, ie
a cause x is necessary for an effect y. See the short +Introduction
again. In what follows here it may be necessary to swap e, h if we want
causation. Since different folks had different mindwaves I tried not to mess
with their formulations more than necessary for this comparison.
The notions of probabilistic Necessity and Sufficiency on the event level
have been quantified differently by various good folks' wisdoms.
E.g. from { Kahre 2002, Fig.3.1, Fig.13.4 + txt } follows :
- if X is a subset of Z ie Z is a SuperSet of X ie all X are Z
then X is Sufficient (but not necessary) for Z ie X implies Z.
ie Z is a consequence of X ie IF X THEN Z rule holds, I say.
- if Y is a SuperSet of Z ie Z is a subset of Y ie all Z are Y
then Y is Necessary (but not sufficient) for Z ie Z implies Y
ie Y is a consequence of Z ie IF Z THEN Y rule holds, I say;
For 2 numbers x, y it holds (x < y) == ( y > x) , or v.v.
For 2 sets X, Y it holds (X subset of Y) == ( Y superset of X) , or v.v.
For 2 events x, y it holds (x implies y) ==> (Py >= Px) , but NOT v.v.
For 2 events we may like to answer the Q's (and from the above follow A's) :
Q: How much is x Sufficient for y ? A: as much as y is Necessary for x .
Q: How much is x Necessary for y ? A: as much as y is Sufficient for x .
Q: If Pxy=0 ie x, y are disjoint ? A: then Suf = 0 = Nec must hold.
Lets use: e = evidence, effect, outcome; h = hypothesised cause (exposure)
Hence eg P(e|h) is a NAIVE, MOST SIMPLISTIC measure of Sufficiency of h for e
because P(e|h) = 1 = max iff Peh = Ph ie Ph - Peh = 0 = P(~e,h) ie
iff h is a subset of e, ie iff h implies e then is h Sufficient for e.
Note that (h Sufficient for e) == (h subset of e), and
(h Necessary for e) == (e subset of h), hence :
P(h|e) measures how much is h Necessary for e, and
P(e|h) measures how much is h Sufficient for e.
Lets compare these with those now corrected in { Schield 2002, Appendix A } :
P(e|h) = S = "Sufficiency of exposure h for case e"
P(h|e) = N = " Necessity of exposure h for case e" (find NAIVE , SIMPLISTIC )
!! Caution: the suffixes nec, suf in ?nec, ?suf as used by various authors
say nothing about which event is necessary for which one, if
the authors do not use ?(y:x) and do not specify what these parameters
mean. I recommend the ?(y:x) to mean that (y implies x) ie
(y suffices for x) which is equivalent to (x necessary for y).
Folk1: { Richard Duda, John Gaschnig & Peter Hart: Model design in the
PROSPECTOR consultant system for mineral exploration,
in { Michie 1979, pp.159 } , { Shinghal 1992, chap.10, pp.354-358 } and
in { Buchanan & Duda 1983, p.191 } :
Lsuf = P( e| h)/P( e|~h) = RR( e: h) = Qnec by I.J. Good
Lnec = P(~e| h)/P(~e|~h) = 1/RR(~e:~h) = [1 - P(e|h)]/[1 - P(e|~h)]
= Qsuf by I.J. Good
iff Lnec = 0 then e is logically necessary for h
iff Lnec is large then ~e is supporting h (ie absence of e supports h)
Lsuf = Qnec , but in fact there is no semantic confusion, since
Lsuf denotes how much is e sufficient for h (ie h necessary for e), and
Qnec denotes how much is h necessary for e (ie e sufficient for h).
Folk2: { Brian Skyrms in James Fetzer, ed., 1988, p.172 }
Ssuf = P( e| h)/P( e|~h) = RR( e: h) = Lsuf interpreted as follows:
iff Ssuf > 1 then h has a tendency towards sufficiency for e
Snec = P(~h|~e)/P(~h| e) = RR(~h:~e) = [1 - P(h|~e)]/[1 - P(h|e)]
iff Snec > 1 then h has tendency towards necessity for e
iff Ssuf*Snec > 1 then h has tendency to cause the event e
Folk3: { I.J. Good 1994, pp.306, + comment by P. Suppes on p.314 } :
Qnec = P( e| h)/P( e|~h) = RR( e: h) = Lsuf , see Folk1 ;
Qsuf = P(~e|~h)/P(~e| h) = RR(~e:~h) = [1 - P(e|~h)]/[1 - P(e|h)]
= weight of evidence against h if e does not happen
= a measure of causal tendency, according to I.J. Good.
Iff Qnec*Qsuf > 1 then h is a prima facie cause of e, adds Suppes.
In { I.J. Good 1992, p.261 } his new insight is formulated thus :
!! "Qsuf(e:h) = Qnec(~e:~h). This identity is a generalization of the
fact that h is a STRICT SUFFICIENT CAUSE of e if and only if
~h is a STRICT NECESSARY CAUSE of ~e, as any example
makes clear." ( I.J. Good's emphasis )
!! Qnec(e:h) = Qsuf(~e:~h) in { I.J. Good 1995, p.227 }
= RR( e: h) in { I.J. Good 1994, p.314 }, in my notation;
!! Qsuf(e:h) = RR(~e:~h) is NOT ZERO if e,h are DISjoint, though my
common sense requires it to be zero there.
Folk4: S = P(e|h) = " sufficiency of exposure h for effect e "
N = P(h|e) = " necessity of exposure h for effect e " { Schield }
See Appendix A in { Schield 2002 } , his first lines left & right.
There in his section 2.2 on necessity vs. sufficiency, Milo Schield
nicely explains their contextual semantics and applicability thus:
"Unless an effect [ e ] can be produced by a single sufficient cause [ h ]
(RARE!), producing the effect requires supplying ALL of its necessary
conditions [h_i], while preventing it [e] requires removing or eliminating
only ONE of those necessary conditions." (I added the [.]'s ).
Q: Well told, but do Schield's S and N fit his semantics ?
A: No. While S is unproblematic, his N is not.
Q: What does it mean that h is strictly sufficient for an effect e ?
A: Whenever h occurs, e occurs too. This in my formal translation
means that h implies e ie Peh = Ph ie P(e|h) = 1.
Hence S = P(e|h) measures sufficiency of h for e ,
or his necessity of e for h (formally, I say).
Note: if h = bad exposure and e = bad effect, then all above fits;
      if h = good treatment for e = better health, then all above fits;
      other pairings would not fit meaningfully.
!! My view is this : we are interested in h CAUSES e (potentially).
P(h|e) = Sufficiency of e for h ie e implies h ,
or P(x|y) = Sufficiency of y for x ie y implies x .
Sufficiency is unproblematic, so I use it as a fundament.
Q: What is Nec = necessity of h for e , really ?
A: I derive Nec from the semantical definition in { Schield 2002, p.1 }
where he writes: "But epidemiology focuses more on identifying
a necessary condition [h] whose removal would reduce undesirable outcomes
[e] than on identifying sufficient conditions whose presence would produce
undesirable outcomes." His statement between [h] and [e] I formalize (by
relying on the unproblematic Sufficiency ie on implication) thus:
(no h) implies (no e) ie "no e without h" :
~h implies ~e, hence P(~e|~h) = 1 in the ideal extreme case.
Note that generally P(~e|~h) = [ 1 - (Ph + Pe - Peh) ]/[ 1 - Ph ] = 1 here
!! ie Peh = Pe ie P(h|e) = 1 in this IDEAL extreme case ONLY, while
N = P(h|e) is Schield's general necessity of h for e.
!! But in general N = P(h|e) <> P(~e|~h) which is [ see P(e|h) above ]
SUFFICIENCY of ~h for ~e, which better captures Schield's semantics.
Q: Do we need his N = P(h|e) ??
A: Not if we stick to his more meaningful (than his N ) requirement on p.1
just quoted, and opeRationalized by me thus :
!!! (Necessity of h for e) I define as (Sufficiency of ~h for ~e) == P(~e|~h)
which is a COUNTERFACTUAL: IF no h THEN no e ie "no e without h"
which is close in spirit to I.J. Good's ( see Folk3 ) verbal definition,
except for the swapped suffixes nec and suf :
Qsuf(e:h) = Qnec(~e:~h) = RR(~e:~h) in my notation as shown at Folk3 above.
"Qsuf(e:h) = Qnec(~e:~h). This identity is a generalization of the
fact that h is a STRICT SUFFICIENT CAUSE of e if and only if
~h is a STRICT NECESSARY CAUSE of ~e, as any example
makes clear." ( I.J. Good's emphasis; it took him 50 papers in 50 years ).
The semantical Necessity of h for e, ie removed h implies absence of e,
is my NecP = P(~e|~h), and not Schield's necessity N = P(h|e), which does
not fit his opeRational definition and contains no negation ~, as a
COUNTERFACTUAL should. Hence Schield's N is now deconstructed, and can be
replaced by my constructive NecP = P(~e|~h).
Summary of SufP = S , and of my NecP constructed from Schield's
!!! opeRationally meaningful verbal requirements :
SufP = P( e| h) = sufficiency of h for e defined as h implies e
NecP = P(~e|~h) = necessity of h for e defined as ~h implies ~e
hence:
SufP = P( e| h) = necessity of ~h for ~e is h implies e
Q: is my NecP ok ?
A: Not yet, since for Peh = 0 my common sense requires S=0=N ie zero, which
precludes all P(.,.) or P(.|.) containing ~ ie a NEGation.
Fix1: IF Peh = 0 THEN NecP1 = 0 ELSE NecP1 = P(~e|~h).
Fix2: Like Suf, Nec should have Peh as a factor in its numerator, so eg:
SufP2 = P(e|h) = SufP = sufficiency of h for e ie h implies e
NecP2 = Peh*NecP = Peh*P(~e|~h) = Peh*P(~(e or h))/P(~h)
= Peh*[ 1 - (Ph +Pe -Peh) ]/(1 - Ph)
= necessity of h for e
which seems reasonable, since without Peh* my original NecP will be
too often too close to 1 = max(P), hence poor anyway, as its form
[ 1 - P(e or h) ]/[ 1 - Ph ] is near 1 for small P's .
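A minimal numeric sketch of Fix2 (my function and parameter names, not the
epaper's code), showing that NecP2 vanishes for disjoint e,h while the bare
NecP does not:

```python
# Sketch of the Fix2 measures: SufP2 = P(e|h), NecP = P(~e|~h),
# NecP2 = Peh * P(~e|~h). Inputs are Peh, Ph, Pe; names are mine.

def fix2(peh, ph, pe):
    sufp2 = peh / ph                             # SufP2 = P(e|h)
    necp = (1 - (ph + pe - peh)) / (1 - ph)      # NecP  = P(~e|~h)
    necp2 = peh * necp                           # NecP2 = Peh * NecP
    return sufp2, necp, necp2

# Disjoint e and h (Peh = 0): NecP2 = 0 as common sense requires,
# while the bare NecP = P(~e|~h) stays well above zero.
sufp2, necp, necp2 = fix2(0.0, 0.3, 0.2)
```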
Folk5: Jan Hajek's RR-formulas (derived as my criticism of Folk4):
RRsuf = indication of how much the presence of h implies e
= RR(h:e) = RR-sufficiency of h for e
= P(h|e)/P(h|~e)
= [Peh/(Ph - Peh)] * (1 - Pe)/Pe (zzz)
= Peh*( h implies e) / Odds(e) , note that the "implies" factor
1/(Ph - Peh) comes from P(h|~e), and that it is more
influential than P(h| e).
RRnec = indication of how much absence of h implies absence of e
= how much ~h implies ~e
= derived from my "an event is an event is an event" & (zzz)
= RR(~h:~e)
= [ P(~e,~h)/( P(~h) - P(~e,~h) ) ] * (1 - P(~e))/P(~e)
= [ P(~e,~h)/ P( e,~h) ] * Pe/P(~e)
= [ P(~e,~h)/P(~e) ]/[ P( e,~h)/Pe ]
= P(~h|~e)/P(~h|e)
= [1-P(h|~e)]/[1-P(h|e)]
!! RRnec should = 0 if Peh = 0, yet RRnec <> 0 here, but it will if we use
RRnec*Peh in analogy to NecP2 at the end of Folk4 .
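A hedged sketch (my names, toy numbers) of the Folk5 formulas, including the
Peh* repair of RRnec for the Peh = 0 case:

```python
# Sketch of RRsuf = RR(h:e) from eq (zzz) and RRnec = RR(~h:~e),
# plus the repaired Peh*RRnec. Inputs are Peh, Ph, Pe; names are mine.

def folk5(peh, ph, pe):
    rrsuf = (peh / (ph - peh)) * ((1 - pe) / pe)        # P(h|e)/P(h|~e)
    p_not_e_not_h = 1 - (ph + pe - peh)                 # P(~e,~h)
    p_e_not_h = pe - peh                                # P( e,~h)
    rrnec = (p_not_e_not_h / p_e_not_h) * (pe / (1 - pe))  # P(~h|~e)/P(~h|e)
    return rrsuf, rrnec, peh * rrnec                    # last: Peh*RRnec

rrsuf, rrnec, rrnec_fixed = folk5(0.15, 0.3, 0.2)
# rrsuf ~ (0.15/0.15)*(0.8/0.2) = 4 ; rrnec ~ (0.65/0.05)*(0.25) = 3.25 ;
# with Peh = 0 the repaired Peh*RRnec vanishes while bare RRnec does not.
```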
Since (h causes e) corresponds to (e implies h) we may have to swap e with h
in some formulas above to get (h causes e). I have not always done so, in
order to keep other authors' formulas as close to their originals as
reasonable.
.-
Finally let's look critically at the relation between causation and logical
implication. "Rain causes us to wear a raincoat" does make sense, while
"NOT wearing a raincoat causes it NOT to rain" is an obvious NONSENSE, even
in a clean lab-like context with no shelter and our absolute unwillingness to
become wet. Let x = rain and y = wearing a raincoat.
The 1st statement translates to ( x causes y );
the 2nd statement translates to (~y causes ~x ). Because nobody knows
how to formulate a perfect operator "x causes y", we substitute it with
"y implies x" (the swapped x, y is not the point, it doesnt matter just now).
Now the 1st statement translates to ( y implies x );
the 2nd statement translates to (~x implies ~y ).
But now we are in trouble, as ( y implies x ) == ( ~x implies ~y ) in logic,
and ideally in probabilities : ( Py = Pxy ) == ( P(~x) = P(~x,~y) ) ie:
in imperfect real situations : ( Py - Pxy ) = ( P(~x) - P(~x,~y) ) ie:
Py - Pxy = (1 - Px) - (1 -(Px + Py - Pxy))
= 1 - Px - 1 + Px + Py - Pxy = Py - Pxy q.e.d.
Hence such a simple difference doesn't work as we would like it to for
a cause.
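The algebraic identity just derived can be checked numerically; a small
sketch with assumed toy probabilities (names are mine):

```python
# Check that (Py - Pxy) == (P(~x) - P(~x,~y)) for any consistent triple,
# confirming that the simple difference cannot tell (y implies x) apart
# from (~x implies ~y).

def gaps(px, py, pxy):
    gap_impl = py - pxy                       # residue of "y implies x"
    p_not_x = 1 - px
    p_not_x_not_y = 1 - (px + py - pxy)       # P(~x,~y) = 1 - P(x or y)
    gap_contra = p_not_x - p_not_x_not_y      # residue of "~x implies ~y"
    return gap_impl, gap_contra

for px, py, pxy in [(0.4, 0.25, 0.1), (0.7, 0.5, 0.45), (0.2, 0.2, 0.2)]:
    a, b = gaps(px, py, pxy)
    assert abs(a - b) < 1e-12    # the two gaps are always equal, q.e.d.
```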
So what about the corresponding relative risks ? Let's check:
RR(y:x) <> RR(~x:~y) ie they are not equal
P(y|x)/P(y|~x) <> P(~x|~y)/P(~x|y) ie:
(Pxy/(Py-Pxy))*((1-Px)/Px) <> ( (1-(Px+Py-Pxy))/(Py-Pxy) ) * (Py/(1-Py))
where we see the (y implies x) factor 1/(Py - Pxy) on both sides of the <> .
Hence despite the <> both RR's will become oo ie infinite if (y implies x)
perfectly, ie whenever Pxy = Py. Otherwise, RR(y:x) is quite well behaved:
RR increases with Pxy, and decreases with Px, which is reasonable as
explained far above.
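A numeric sketch (assumed toy values, my names) of the last point: the two
RR's differ in general, yet share the 1/(Py - Pxy) factor and so blow up
together as Pxy approaches Py:

```python
# RR(y:x) vs RR(~x:~y), taken directly from the two formulas above.

def rr_pair(px, py, pxy):
    rr_yx = (pxy / (py - pxy)) * ((1 - px) / px)
    rr_nx_ny = ((1 - (px + py - pxy)) / (py - pxy)) * (py / (1 - py))
    return rr_yx, rr_nx_ny

a, b = rr_pair(0.5, 0.25, 0.10)    # moderate overlap: the two differ
c, d = rr_pair(0.5, 0.25, 0.2499)  # Pxy near Py: both explode together
```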
Conclusion: an implication cannot substitute for causation in all its
aspects, at least not in this case, hence not in general. But I don't know
any other necessary (but not always sufficient) indicators of causal
tendency than :
+ dependence (ie symmetrical association like eg correlation),
+ implication (ie asymmetrical association which is a transitive operation,
  a subset in a subset in a subset in a subset ..etc),
+ SurpriseBy,
+ and time-ordering (a cause prior to the effect).
Caution: repeatedly find UNDESIRABLE above.
++ find key construction principles above for more and sharper formulations
-.-
+Acknowledgements (in alphabetic order) :
Jan Kahre of Aland & Helsinki, Finland, is an excellent discussion partner;
Leon Osinski of NL, is the best imaginable chief of a scientific library;
Mari Voipio of Helsinki is the best webmistress thinkable (multilingual too).
-.-
+References { refs } :
Q: Which refs are books and which are papers ?
A: Unlike in the titles of (e)papers here listed, in the titles of books and
periodicals all words start with a CAP, except for the insignificants
like eg.: a, and, der, die, das, for, from, in, of, or, to, the, with, etc.
Recalling Goethe's wisdom "In der Beschraenkung zeigt sich der Meister"
(only in limitation does the master show himself) I say:
A too long list of refs would be almost as worthless as a too short list of
refs. Therefore I did not include many (historically) relevant authors. Of
those left unreferenced I mention at least one key group of German-speaking
Jewish emigre philosophers of scientific explanation/confirmation
( & = co-op'ed ) :
C.G. Hempel & Paul Oppenheim & John G. Kemeny (Hungarian co-father of BASIC)
& Olaf Helmer (he fathered the Delphi method of forecasting at RAND Corp.),
Hans Reichenbach, Rudolf Carnap & Y. Bar-Hillel, Herbert Feigl, Kurt Grelling,
and Karl R. Popper (Austrian) who played apart and refuted them all :-)
Computer Journal, UK, no.4, 1999, is a special issue on:
- MML = minimum message length (by Chris Wallace & Boulton, 1968)
- MDL = minimum description length (by Jorma Rissanen, 1977)
- MLE = minimum length encoding (by Pednault, 1988)
These themes are very close to Kolmogorov's complexity, originated in the
US by Occamite inductionists (as I call them) Ray Solomonoff in 1960, and
Greg Chaitin in 1968.
Arendt Hannah: The Human Condition, 1959; on Archimedean point see
pp.237 last line - 239, 260, more in her Index.
Agresti Alan: Analysis of Ordinal Categorical Data, 1984;
see p.45 for a math-definition of Simpson's paradox for events A, B, C.
Agresti Alan: Categorical Data Analysis, 1st ed. 1990;
see pp.24-25 & 75/3.24 on Goodman & Kruskal's TauB, Gini and Theil.
Agresti Alan: An Introduction to Categorical Data Analysis, 1996.
Alvarez Sergio A.: An exact analytical relation among recall, precision,
and classification accuracy in information retrieval, 2002, see
http://www.cs.bc.edu/~alvarez/APR/aprformula.pdf
Anderberg M.R.: Cluster Analysis for Applications, 1973.
Bailey N.T.J.: Probability methods of diagnosis based on small samples; in
Mathematics and Computer Science in Biology and Medicine, Oxford 1964,
2nd printing 1966, pp.103-110
Blachman Nelson M.: Noise and Its Effects on Communication, 1966.
Blachman Nelson M.: The amount of information that y gives about X,
IEEE Transactions on Information Theory, IT-14, Jan. 1968, 27-31
Blalock Hubert M.: Causal Inferences in Nonexperimental Research, 1964;
start on p.62, on p.67 is his partial correlation coefficient.
Blalock Hubert M.: An Introduction to Social Research, 1970; on p.68 starts
Inferring causal relationships from partial correlations.
Bar-Hillel Yehoshua: Language and Information, 1964, Addison-Wesley;
the key paper is on pp.221-274. The Introductory chapter tells that its
original & 1st author in 1952 was Rudolf Carnap.
Bar-Hillel Y., Carnap Rudolf: Semantic information, pp.503-511+512, in
the book Communication Theory, 1953, Jackson W. (Willis) editor ; also
in the British Journal for the Philosophy of Science, Aug. 1953. It is
much shorter than the 1952 paper reprinted in Bar-Hillel, 1964, pp.221-274
Brin Sergey, Motwani R., Ullman Jeffrey. D., Tsur Shalom: Dynamic itemset
counting and implication rules for market basket data, Proc. of the 1997
ACM SIGMOD Int. Conf. on Management of Data, 255-264. Sergey Brin is the
co-founding father & CEO of Google, Inc.
Buchanan Bruce G., Duda Richard O.: Principles of rule-based expert systems;
in Advances in Computers, vol.22, 1983, Yovits M.(ed).
Cheng Patricia W.: From covariation to causation: a causal power theory;
(aka "power PC theory"), Psychological Review, 104 (1997), 367-405 = 39pp!
Also see { Novick & Cheng 2004 }
Cheng Yizong, Kashyap Rangasami L.: A study of associative evidential
reasoning, IEEE Trans. on Pattern Analysis and Machine Intelligence,
vol.11, no.6, June 1989, pp. 623-631.
Cohen L. Jonathan: Knowledge and Language, 2002, Kluwer Academic Publishers;
! on p.180 in the eq.(13.8) both D should be ~D i.e. complements of D.
DeWeese M.R., Meister M.: How to measure the information gained from one
symbol, Network: Computation Neural Systems 10, 1999, p.328.
They partially reinvented Nelson Blachman's fine work (see these refs).
Duda Richard, Gaschnig John, Hart Peter: Model design in the Prospector
consultant system for mineral exploration; see pp.159 in { Michie 1979 }.
Eddy David M.: Probabilistic reasoning in clinical medicine: problems and
opportunities, in Kahneman D., 1982, Judgment Under Uncertainty, 249-267.
Eells Ellery: Probabilistic Causality, 1991.
Fano Robert M.: Transmission of Information, 1961.
Feinstein Alvan R.: Principles of Medical Statistics, 2002, 701 pp; professor
Feinstein, M.D. (1925-2001), at Yale (medicine), studied math & statistics;
chap.10, pp.170-175 are on proportionate increments, on "excellent" NNE ie
NNT, NNH; on honesty vs deceptively impressive magnified results.
Chap.17, pp.332,337-340 are on fractions, rates, ratios OR(:), risks RR(:).
! On p.340 the etiologic fraction should be e(r-1)/[ e(r-1) +1 ] ;
! on p.444, eq.21.15 for negative likelihood ratio has swapped numerator
and denominator: it should be (1-sensitivity)/specificity; above
it should be (c/n1)/(d/n2) instead of the swapped ratio.
Fitelson Branden: Studies in Bayesian confirmation theory, Ph.D. thesis,
University of Wisconsin-Madison, 2001, where I.J. Good's sinh should be
tanh as I told him. Easily found on www.
Gigerenzer Gerd: Adaptive Thinking ; and his other fine books & papers.
Good I.J. (Irving Jack), born 1916 in London as "Isidore Jacob Gudak"
who unlike Good is findable on WWW. He has been Alan Turing's main stats
assistant during WWII when they were busily decoding other gentlemen's
emails in Bletchley Park, UK. To understand my strange phrase "other
gentlemen", you should know that the pre-WWII US Secretary of State Henry
Stimson said that "Gentlemen do not read each other's mail". Tell it
to your favorite 3-letter-word gov. agency :-) With Turing they were
codebreaking secret codes produced by the German Enigma machine, of
which some were passed to the British by the Polish resistance and
Polish cryptologists who had done useful preanalysis (eg Rejewski).
I.J. has published over 1900 notes and papers (numbered 1-1900, and then
not to confuse their numbers for dates, 2000-etc :-), of which some 50+
are on his favorite weight of evidence W(:).
! In my notation W(y:x) is his old W(x:y), and similarly with B(:), F(:).
Only since 1992, in his newest papers, did he switch to my notation (y:x).
Good I.J.: The mathematics of philosophy: a brief review of my work; in
Critical Rationalism, Metaphysics and Science, 1995, Jarvie I.C. & Laor N.
editors, pp.211-238.
Good I.J.: Legal responsibility and causation; pp.25-59 in the book Machine
Intelligence 15, 1999, K. Furukawa, ed. Also see Michie in this volume 15.
Good I.J.: Causal tendency, necessitivity and sufficientivity: an updated
review; pp.293-315 in "Patrick Suppes: Scientific Philosopher", vol.1,
P. Humphreys, ed., 1994, Kluwer Academic Publ.
I.J. explains his fresh but surprisingly late insights (delayed 50 years)
into the semantics of two W(:)'s, renamed by him to Qnec(y:x) , Qsuf(y:x),
like mine ?(y:x)'s here, ie no longer his old W(x:y)'s.
On pp.312-315 Patrick Suppes comments on I.J.'s paper.
Good I.J.: Tendencies to be sufficient or necessary causes, pp.261-262 in
Journal of Statistical Computation and Simulation, 44, 1992. This is a
preliminary note on Good's belated insight. 1992 - 1942 = 50 years
of delay. Delay is the deadliest form of denial :-).
Good I.J.: Speculations concerning the future of statistics, Journal
of Statistical Planning and Inference, vol. 25 (1990), 441-66.
Good I.J.: Abstract of "Speculations concerning the future of statistics",
The American Statistician, May 1990, vol. 44/2., 132-133.
Good I.J.: On the combination of pieces of evidence; Journal of Statistical
Computation and Simulation, 31, 1989, pp.54-58; followed by "Yet another
argument for the explicatum of weight of evidence" on pp.58-59.
Good I.J.: The interface between statistics and philosophy of science;
Statistical Science, 1988, vol.3, no.4, pp.386-412;
for W(:) see pp.389-390, 393-394 left low! + discussion & rejoinder p.409.
Good I.J.: Good Thinking - The Foundations of Probability and Its
Applications, 1983, University of Minnesota Press. It reprints but a
fraction of his some 1500 papers and notes written until 1983.
On p.160 Kemeny & Oppenheim's degree of factual support F(:) is discussed;
! on p.160 up: I.J. Good's sinh(.) should be tanh(.).
Goodman Leo A., Kruskal William H.: Measures of Association for Cross
Classifications, 1979, 146 pp. Originally published under the same title
in the Journal of the American Statistical Association (JASA), parts 1-4:
part 1 in vol.49, 1954, pp.732-764; on TauB see pp.759-760
part 2 in vol.54, 1959, pp.123-163;
part 3 in vol.58, 1963, pp.310-364; on TauB see pp.353-354
part 4 in vol.67, 1972, pp. , on TauB see sect. 2.4
On ordinal measures see { Kruskal, 1958 } in JASA 53, 1958, pp.814-861.
Goodman Steven N.: Toward evidence-based medical statistics. Two parts:
1. The P value fallacy, pp. 995-1004, 2. The Bayes factor, pp.1005-1013;
discussion by Frank Davidoff: Standing statistics right up, pp.1019-1021;
all in Annals of Internal Medicine, 1999, very good, Goodman :-)
Grosof Benjamin N.: Evidential confirmation as transformed probability;
pp.153-166 in Uncertainty in Artificial Intelligence, L.N. Kanal
and J.F. Lemmer (editors), vol.1, 1986. I found that on p.159 his
! B == (1+C)/2 is in fact the rescaling in { Kemeny 1952, p.323 }, the
last two lines lead to F(:) rescaled on the first lines of p.324,
here & now findable as F0(:)
Grune Dick: How to compare the incomparable, Information Processing Letters,
24, 1987, 177-181.
Heckerman David R.: Probabilistic interpretations for MYCIN's certainty
factors; pp.167-196 in Uncertainty in Artificial Intelligence, L.N. Kanal
and J.F. Lemmer (eds), vol.1, 1986. I succeeded in rewriting his eq.(31)
! for the certainty factor CF2 on p.179 into Kemeny's F(:).
Heckerman has more papers in other volumes of this series of proceedings.
Hempel C.G.: Aspects of Scientific Explanation, 1965; pp.245-290 are chap.10,
Studies in the logic of explanation, reprinted from Philosophy of Science,
15 (reprinted paper of 1948 with Paul Oppenheim).
Hesse Mary: Bayesian methods; in Induction, Probability and Confirmation,
1975, Minnesota Studies in the Philosophy of Science, vol.6
Kac Mark: Enigmas of Chance, an autobiography, 1985.
Kahneman D. (ed): Judgment Under Uncertainty, 1982. He has won Nobel Prize
(economics, 2002) for this kind of work done with the late Amos Tversky.
Kahre Jan: The Mathematical Theory of Information, 2002, Kluwer Academic;
to find in his book formulas like eg Cont(.) use his special Index on
pp.491-493. See www.matheory.com or www.matheory.info for Errata + more.
! on p.120 eq(5.2.8) is P(x|y) - Px = Kahre's corroboration, x = cause,
! on p.186 eq(6.23.2) is P(y|x) - Py, risk is no corroboration; y = evidence
Kemeny John G., Oppenheim Paul: Degree of factual support; Philosophy of
Science, vol. 19, issue 4, Oct. 1952, 307-324. The footnote 1 on p.307
tells that Kemeny was de facto the author. Caution: on pp.320 & 324 his
! oldfashioned P(.,.) represents the modern P(.|.). On p.324 the first two
! lines should be bracketized and read P(E|H)/[ P(E|H) + P(E|~H) ], which is
findable here & now as F0( . An excellent paper worthy of (y)our attention !
Kemeny John G.: A logical measure function, Journal of Symbolic Logic, 18/4,
December 1953, 289-308. On p.307 in his F(:) there are missing negation
! bars ~ over H's in both 2nd terms. Except for p.297 on Popperian
elimination of models (find SIC here & now), there is no need to read
this paper if you read his much better one of 1952.
Kendall M.G., Stuart A.: The Advanced Theory of Statistics, 1977, vol.2.
Kruskal William H.: Ordinal measures of association, JASA 53, 1958, 814-861.
Lucas J.R., Hodgson P.E.: Spacetime and Electromagnetism, 1990;
see pp.5-13 on regraduation of speeds to rapidities.
Lusted L.B.: Introduction to Medical Decision Making, 1968.
Michie Donald: Adapting Good's Q theory to the causation of individual
events; pp.60-86 in Machine Intelligence 15, Furukawa K., Michie D. and
Muggleton S. (eds). During WWII at the age of 18, Michie was the youngest
codebreaker assisting I.J. Good who was Alan Turing's statistical assistant
Michie Donald (ed): Expert Systems in the Micro Electronic Age, 1979;
Norwich Kenneth: Information, sensation, and perception, 1993, Acad. Press.
Novick Laura R., Cheng Patricia W.: Assessing interactive causal influence;
Psychological Review, 111/2, 2004, pp.455-485 = 31 pages! Also see
{ Cheng Pat.W. 1997 }
Pang-Ning Tan, Kumar Vipin, Srivastava Jaideep: Selecting the right
interestingness measure for association patterns; kdd2002-interest.ps
is a comparative study of 21 measures of "interestingness"
Pearl Judea: Causality: Models, Reasoning, Inference, 2000; see at least
pp.284, 291-294, 300, 308; his references to Shep should be Sheps, and on
! p.304 in the Note under Tab.9.3 ERR = 1 - P(y|x')/P(y|x) would be correct
Popper Karl: Conjectures and Refutations, 1963, Routledge and Kegan Paul.
Popper Karl: The Logic of Scientific Discovery, 6th impression (revised),
March 1972; new appendices, Appendix IX (on corroboration) to his original
Logik der Forschung, 1935 (in his Index his Gehalt means SIC here & now).
His oldfashioned P(y,x) actually means modern P(y|x).
Renyi Alfred: New version of the probabilistic generalization of the large
sieve, Acta Mathematica Academiae Scientiarum Hungaricae, vol. 10, 1959,
217-226; his correlation coefficient R between events on p.221 is
also found in Kemeny & Oppenheim, 1952, p.314, eq.(7).
Renyi Alfred: Selected papers of Alfred Renyi, 1976, in 3 volumes.
Renyi Alfred: A Diary on Information Theory, 1987. The 3rd lecture
discusses asymmetry and causality on pp.24-25+33.
Rescher N.: Scientific Explanation, 1970. See pp.76-95 for the chap.10 =
The logic of evidence, where his Pr(p,q) actually means P(p|q). Very nice
methodology of derivation, but the result is not spectacular :-)
! Note that on p.84 he suddenly switches from Pr(p|q) to Pr(q|p). Why ?
Romesburg H.C.: Cluster Analysis for Researchers, 1984.
Sackett David L., Straus Sharon, Richardson W. Scott, Rosenberg William,
Haynes Brian: Evidence-Based Medicine - How to Practice EBM, 2nd ed, 2000.
There is a Glossary of EBM terms, and Appendix 1 on Confidence intervals
( CI ), written by Douglas G. Altman of Oxford, UK.
! Errors and typos reported by me:
p. 73: in the footnote d/(c+d) should be d/(b+d)
p. 73: prevalence 32% should be 31%
p. 76: < 95 should be >= 95
p. 79: in the nomogram 1000 & 0.001 should be moved toward their neighbours
p. 80: CGPs should be CPGs
p.236: RRR = 1 - RR = 1 - p2/p1 should be RRR = 1 - RR = 1 - p1/p2
p.237: Odds ratio's SE of logOR should contain only + + + no - -
p.238: specificity is b/(b+d) should be d/(b+d)
p.238: Table 3.5 should be Table 3.3 (what is /82 ?? )
p.239: likelihood ratios LR+ LR- SE are all wrong (several typos, results)
p.250: journals journals should be journals
p.252: the first increase should be increases
some 30+ typos are listed at http://www.cebm.utoronto.ca/search.htm ,
yet there is hope: the 3rd edition is in the works.
Schield Milo, Burnham Tom: Algebraic relationships between relative risk,
phi and measures of necessity and sufficiency; ASA 2002; on www too.
Find NAIVE , SIMPLISTIC here & now.
Schield Milo: Simpson's paradox and Cornfield's conditions; ASA 1999; on www
too; an excellent multi-angle explanation of confounding, which is a very
important subject, yet seldom & poorly explained in books on statistics.
His section 8 can be complemented by reading { Agresti 1984, p45 } for a
definition of Simpson's paradox for events A, B, C.
Schield Milo, Burnham Tom: Confounder-induced spuriousity and reversal
for binary data: algebraic conditions using a non-interactive linear model;
2003, on www too (slides nearby).
Shannon C.E., Weaver W.: The Mathematical Theory of Communication, 1949;
different printings differ in page numberings. Here I refer to the 4th
printing of the paperback edition, Sept. 1969, Univ. of Illinois Press.
Sheps Mindel C.: An examination of some methods of comparing several rates
or proportions; Biometrics, 15 (1959), pp.87-97.
Shinghal R.: Formal Concepts in Artificial Intelligence, 1992; see chap.10
on Plausible reasoning in expert systems, pp.347-389, nice tables on
! pp.355-357, where in Fig.10.3 there are two typos in the necessity N which
should be N = [1 - P(e|h)]/[1 - P(e|~h)];
! on p.352 just above 29. in the mid term (...) of the equation, both
~e should be e like in the section 10.2.11.
Simon Herbert: Models of Man, 1957. See pp.50-51 & 54.
Stoyanov J.M.: Counterexamples in Probability, 1987.
Suppes Patrick: A Probabilistic Theory of Causality, 1970.
Tversky Amos, Kahneman Daniel: Causal schemas in judgments under
uncertainty; in Kahneman D.: Judgment Under Uncertainty, 1982, 117-128.
Vaihinger Hans: Die Philosophie des Als Ob (The Philosophy of As-If),
Volks-Ausgabe, 1923.
Van Rijsbergen C.J.: Information Retrieval, 2nd ed., 1979.
Weaver Warren: Science and Imagination, 19??; the section on "Probability,
rarity, interest and surprise" has originally appeared in Scientific
Monthly, LXVII ie 67, no.6, December 1948, pp.390 ??
Woodward P.M.: Probability and Information Theory, with Applications to
Radar, 1953, Pergamon Press, 128 pages only. The 2nd edition of 1964 has
136 pages as it contains an additional chapter 8.
-.-