-.-
FlashHash & BackHash for all-efficient multiple recycled hashes
e.g. for dispersed virtual hashing in Bloom filters (tutorial + math)
Copyright (C) June 2002 - 2004, Jan Hajek, Netherlands
Version 1.21 of April 28, 2004, 1686 lines of < 78+CrLf chars in ASCII,
written with "CMfiler 6.06f" from WWW; submitted to the webmaster of
http://www.matheory.info aka www.matheory.com
This epaper has facilities for fast finding and browsing. Please save the
last copy and use a file differencer to see only how versions differ.
Your comments (preferably interlaced into this .txt file) are welcome.
NO part of this document may be published, implemented, programmed,
copied or communicated by any means without an explicit & full
reference to this author together with the full title and the website
WWW.MATHEORY.INFO plus the COPYRIGHT note in the texts and in ALL
references to this. An implicit, incomplete, indirect, disconnected or
unlinked reference (in the text and/or on www) does NOT suffice.
New terms: FlashHash , BackHash ,
recycled randomness , rotated/recycled hash , combinatorial implosion
combinatorially imploded , hashcrosstalk , paired hashes , hashpair ,
dispersed hasher , dispersed virtual hashing , distributed hashing ,
distributed virtual hasher , monolithic Bloom filter , Bloom filter of
the 1st/2nd/3rd kind , virtually stored , cumulated false positives ,
bitsharing , bitsharing factor , reproducible acyclic pseudo-entropy ,
imported/injected pseudo-entropy , hashclash , clashhash ,
and ALL inflections of ALL these new terms are all also subject to my
conditions above. ALL based on experience. ALL rights reserved.
Browsers may like to search here for the following markers :
!!! !! ! ?? ? { comments } -.- separates sections |- ------- framed stuff
[ refs ]
-.-
+Contents:
+Executive summary = the beef at fiesta entropicana
+Mottos
+Basic & new terms, abbreviations & symbols
+Bloom filters (BFs) = a tutorial + a quantitative design study
+A comedy of errors = compact virtual hashing CH or HC
vs. distributed / dispersed virtual hashing DH or BF
+The proof of the pudding = Tab. 3
+FlashHashing & BackHashing = (m)eat the beef again
+More explanations and hints for programmers
+Bitwise rotations aka circular shifts aka barrelshifts
+On hashing functions
+Summary of flashhashing
+From mission impossible via combinatorial implosion
to mission accomplished
+References
-.-
+Executive summary = the beef at fiesta entropicana :
FlashHash can be easily understood in intuitive terms thus: An item, state,
or a string-of-bits S is laboriously and irreversibly reduced by a HashFun(S)
to a number called hash h0 viewed as a string of 0's and 1's 0011...01011
(or call them pearls and beads :-). By repeatedly closing the string h0
into a necklace, rotating it by a few bits, and reopening or cutting it, k
different open-string designs are easily produced. If the original h0 is
remembered i.e. saved/stacked in a store, then all k strings can be almost
effortlessly reproduced or regained from h0 by BackHash. The net effect
is that k sufficiently stochastically independent hashes are obtained at
almost no expense from a single call of a HashFun(S). This metaphor was
inspired by recalling the poems by Hermann Hesse (1877-1962) near the end
of his 1946 Nobel Prize winning novel Das Glasperlenspiel, or The Glass
Bead Game, aka Magister Ludi (= The Master of the Game), which some folks
see as a precursor of computing culture, algorYthmics, and WWWeb (see WWW).
But I also designed non-rotated versions based on "injected pseudo-entropy".
Doubting Thomases may now like to find 22 occurrences of "independ" here.
This new class of FlashHash algorithms is all-efficient i.e. extremely fast,
space saving (all [re]generations are done by blitzy incremental bitshifting
fewliners), short and simple, hence easy to do for both (wo)men and
(wo)machines when generating k > 1 hashes necessary e.g. for well designed
& well working Bloom filters. FlashHash was created especially for, but
not only for, CPU-bound apps like e.g. Gerard Holzmann's (Bell Labs) SPIN's
bitstate model checking. Dispersed virtual hashers (DH) like Bloom filters
(BF) are explained first and compared with compact virtual hashers CH, all
without a full collision resolution which would require full storage of all
distinct S's during hasher's life-cycle. The main purpose of a virtual
hasher is to avoid storage of full S's. Both dynamic and static design
formulas for (non)optimized Bloom filters are given (find also "IT & TI"):
f(n) = (1 - (1 - 1/M)^(k*n))^k is the instantaneous probability of a
false positive AFTER the n-th item was "stored" in a
BF[0..M-1] of bits; f(n) may be called a false+rate or a failrate.
f(n) is the "general dynamics" formula for BFs.
fo = 1/2^k and k*n = M*ln(2) or k*n = M*2/3 are for a BF filled
optimally wrt memory utilization (useful in quasistatic apps
when the final n can be estimated or fixed beforehand). At such
a load, f has dropped exponentially with k increased linearly.
fo is the "optimal statics" formula for BFs.
Fn = Sum_i=1..n( f(i) ) is the cumulated number of false positives
(for large n use compiled code on a fast PC);
= Integral_i=1..n[ f(i) ] for fast numerically evaluated approximation,
which is very good for very large n .
Fn < n*f(n) ; display Fn as a float, as Fn < 1 or Fn > 1 may result.
Fn is directly verifiable or falsifiable empirically (see below) in easy
experiments based on simple I/O through a BF and counting the I-O's:
Fn = (number of unique items in) - (number of unique items out)
= number of unique items lost. (unique = distinct = no duplicate).
Pnoloss = Prod_i=1..n( 1 - f(i)) < 1 is the probability of NO loss at all;
 if Pnoloss is too near to 1 then its complement reads better:
Ploss = 1 - Prod_i=1..n( 1 - f(i)) < 1
= 1 - probability of NO loss of any item or state
= probability of ONE OR MORE items or states lost
= probability of AT LEAST ONE item or state lost
= 1 - exp( Integral_i=1..n[ ln(1 - f(i)) ] ) for fast numerically
evaluated approximation which is very good for very large n.
(all AS IF during insertions of n unique i.e. distinct items)
Ploss is a computable theoretical number which (unlike Fn ) is neither
directly verifiable nor falsifiable in the spirit of Karl Popper's
philosophy of science. Hence Ploss is a quasi-Platonic quantity.
Ploss is introduced only for comparisons with other work (find Pnc , Pno ).
Let's start with a 2-liner version in edu-C of FlashHash for monolithic BF1
with hashcrosstalk, which is the simplest version. Put overflow checking off
especially if you have to use signed integer h0 in 2-complement arithmetic
which has asymmetric subranges of absolute values for integers < 0 and > 0 .
Choose your moduli M1, M well; they should either be prime numbers, or if
M1, M are powers of 2, use M-1, M1-1 as masks i.e. bitwise AND-ed with h0.
When trying out FlashHash, do not optimize prematurely; first try with the
% M operator with M prime, then with & Mmin1 [i.e. & (M-1) ], compare and
check. Start with a simple version. Don't be pound foolish and penny wise :-)
1st version of FlashHash for a monolithic BF1 employs rotation of bits in h0:
d = w - c; h0 = HashFun(S); h = h0 % M; /* use h or h[1] = h; */
for j = 2 to k do begin h0 = (h0 << c) | (h0 >> d); h = h0 % M; end;
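A minimal compilable C sketch of this 1st version, assuming an unsigned
32-bit h0 (so the barrelshift is well defined, with no sign bit trouble),
illustrative c = 13, k = 5, a prime M, and the bytewise 1-liner from below
standing in for any HashFun(S):

```c
#include <stdint.h>

enum { W = 32, C = 13, K = 5, M = 1048573 };    /* M prime; d = W - C */

/* hypothetical stand-in for any HashFun(S): h0 = h0*257 + next byte */
static uint32_t hashfun(const char *s) {
    uint32_t h0 = 1;
    while (*s) h0 = (h0 << 8) + h0 + (uint8_t)*s++;
    return h0;
}

/* 1st version: k recycled hashes from ONE hashfun call, by rotation */
static void flashhash1(const char *s, uint32_t h[K]) {
    uint32_t h0 = hashfun(s);
    h[0] = h0 % M;                           /* use h or h[1] = h above */
    for (int j = 1; j < K; j++) {
        h0 = (h0 << C) | (h0 >> (W - C));    /* barrelshift by c bits */
        h[j] = h0 % M;
    }
}
```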
2nd version of FlashHash for a monolithic BF1. Unlike the 1st version, the
if j > 1 then ... allows application-dependent blitzy additional in-line
code to be concentrated at one place, following each h = h0 % M;
typically the in-line code will test/check or set bits in a BF[0..M-1].
Like the 1st version, this 2nd one also saves precious bits from being
skipped and wasted, although a c prime wrt w would finally visit all
w bits in h0 . Wasting means that the c distance between bits will
have to be smaller for the given w and k , if k <= 1 + (w-1) div c
is wanted. Smaller distance c makes hashes less independent.
d = w - c; h0 = HashFun(S);
for j = 1 to k do
begin
if j > 1 then /* may take this line out, but will waste the 1st c bits */
h0 = (h0 << c) | (h0 >> d); /* bits rotated by c bits up ie to msb */
h = h0 % M; /* Here test/set h-th bit BF1[h], or do h[j] = h; */
end;
3rd version of FlashHash for monolithic BF1 (I call it Fiesta enTropicana):
/* Initially fill an array ENTROPIC[1..kmax] of integers with arbitrary,
preferably huge integers typed e.g. by Eddington's monkeys. These huge
integers must remain CONSTant during the whole lifecycle of a BF as an
"entropic database" (which, btw, cannot be read by any 3-letter word
agency like, say, KGA [find below]). In BPascal I just declared &
banged in these "pseudo-entropic numbers" :
here kmax = 11
Const ENTROPIC : array [ 1..11 ] of LongInt = ( 194758204,
293745367, 389486698, 406837398, 585769263, 669402120,
734068134, 836141173, 928486384, 035590389, 107805073 );
Note that these are NOT numbers from a random generator, because
it is desirable to obtain entropic bitstrings by means of blitzy
xor which flips bits over the whole width of a LongInt h0 , hence
we need huge integers within the range of LongInt h0 . Quiz: why
my entropic numbers have their first digits 1, 2, 3, .. etc ?
*/
h0 = HashFun(S); /* Use this version if c too low, e.g. if k > w/3 */
for j = 1 to k do /* use h or do h[j] = h; */
begin
if j > 1 then /* ..PIC[1] above allows to take THIS one line out */
h0 = h0 xor ENTROPIC[j]; h = h0 % M; /* put inline use of h */
end;
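The 3rd version in compilable C, assuming unsigned 32-bit arithmetic and a
prime M; the table is the author's, except that the leading zero of
035590389 is dropped because C would read it as an (invalid) octal literal:

```c
#include <stdint.h>

enum { KMAX = 11, M = 1048573 };   /* M prime */

/* the CONSTant "pseudo-entropic" table, fixed for the BF's lifecycle */
static const uint32_t ENTROPIC[KMAX] = {
    194758204u, 293745367u, 389486698u, 406837398u, 585769263u,
    669402120u, 734068134u, 836141173u, 928486384u,  35590389u,
    107805073u };

/* 3rd version: k hashes by xoring h0 with injected pseudo-entropy */
static void flashhash3(uint32_t h0, int k, uint32_t h[]) {
    for (int j = 0; j < k; j++) {    /* j = 0 uses h0 as delivered */
        if (j > 0) h0 ^= ENTROPIC[j];
        h[j] = h0 % M;               /* put inline use of h here */
    }
}
```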
4th version of FlashHash for a monolithic BF1 is just another implementation
of the 3rd version (also with IMPORTED/INJECTED pseudo-ENTROPY ):
h0 = HashFun(S);
for j = 1 to k do /* Use this version if k > w/3 ie if c too low */
begin /* if j > 1 then... is not needed anymore, but allowed */
/* xor with k distinct, arbitrary, huge numbers but */
case j of /* all k nrs within the range of integer h0 */
begin /* These nrs may be typed by Eddington's monkeys */
1: h0 = h0 xor 132950380;
2: h0 = h0 xor 293745367; 3: h0 = h0 xor 389486698;
4: h0 = h0 xor 406815298; 5: h0 = h0 xor 585021763;
6: h0 = h0 xor 669402120; 7: h0 = h0 xor 734481234;
8: h0 = h0 xor 836141173; /* ... etc up to kmax ever needed */
else: h0 = h0 xor 007805073; /* Correctness insurance */
end case;
h = h0 % M; /* or: h = offset + (h0 % M1) for BF3 or BF2 */
/* Use h or do h[j] = h; i.e. your in-line code */
end for;
5th version of FlashHash for a monolithic BF1 combines the 2nd & 3rd version
so that higher number of hashes k are easily obtained without c too
low and without too many entropic numbers typed by Eddington's monkeys:
/* c prime is recommended as it will visit all w bits wrap around */
c = 13; d = w - c; /* may now be constants eg 13 and 19 = 32 - 13 */
h0 = HashFun(S); i = 0;
for j = 1 to k do /* no if j > 1 then.., as we have enough entropy now */
begin /* declare and fill ENTROPIC[1..kmax]; may recycle % Emax */
if (j & 1) = 0 then h0 = h0 xor ENTROPIC[++i] /* or: PIC[(++i) % Emax] */
else h0 = (h0 << c) | (h0 >> d);
h = h0 % M; /* Here inline test/set h-th bit BF1[h], or do h[j] = h; */
end;
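The 5th, hybrid version in compilable C, again assuming unsigned 32-bit
h0; the choices Emax = 4 and c = 13 are illustrative, and the table
entries are taken from the author's table above:

```c
#include <stdint.h>

enum { W = 32, C = 13, EMAX = 4 };

/* a small recyclable entropic table (entries from the author's table) */
static const uint32_t ENTROPIC[EMAX] = {
    194758204u, 293745367u, 389486698u, 406837398u };

/* 5th, hybrid version: odd rounds rotate h0, even rounds xor in
   recycled injected entropy, so a large k needs neither a tiny c nor
   a long typed-in constant table */
static void flashhash5(uint32_t h0, int k, uint32_t m, uint32_t h[]) {
    int i = 0;
    for (int j = 1; j <= k; j++) {
        if ((j & 1) == 0) h0 ^= ENTROPIC[i++ % EMAX];
        else              h0 = (h0 << C) | (h0 >> (W - C));
        h[j - 1] = h0 % m;        /* inline test/set BF1 bit, or store */
    }
}
```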
The 5 versions above are readily adapted to my favorite BF3 versions below.
|--- No overkill & no overhead with blitzy FlashHash & BackHash :
!
! - No more than one call by FlashHash of a HashFun(S) per item, or
! string, or state S is needed. The received wisdom was k calls.
! - No other call on any other function is needed, not even on a slooow
! pseudorandom generator, also not desirable wrt XOR above.
! - No slooow floating point arithmetic needed.
! - No slooow multiplication needed, not even in integer arithmetic!
! - No division or the mod-op % is needed iff M or M1 is a power of 2.
!
! Making k calls of a HashFun(S) looks now hopelessly inefficient.
! Even k calls of a Random( h0 ) would be hopelessly inefficient,
! simply because k is a small number, at most few dozens
! (because we want to use our bits well), hence we do not need a
! computationally expensive pseudorandom sequence of numbers with
! a long period.
! Non-rotational versions of FH use a small set of entropic numbers.
! Non-rotational and rotational versions were combined into a HYBRID
! version nr. 5 which will RECYCLE (via % Emax) the INJECTED ENTROPY if
! h0 = h0 xor ENTROPIC[ (++i) % Emax] where Emax < k is allowed.
!
! By now even entrenched fundamentalists will have to admit that their
! traditional computations of k separate HashFuns per S are about as
! efficient as shooting at flies (or bits :-) with a nuclear howitzer,
! at least in the context of BFs and other non-Manichean (find below)
! apps i.e. where no enemies or adversaries are expected (find KGA ).
!
|------------
The latter versions (re)produce REPRODUCIBLE ACYCLIC PSEUDO-ENTROPY,
which may become useful when we want a larger k , for which c would
become too small and bit-patterns less
independent when c < 3 or 4. In fact any reproducible set or sequence of
remappings of h0 will do, provided they will be sufficiently pseudo-
random, pseudo-chaotic, deterministically chaotic i.e. pseudo-entropic
(ideally MaxEntropic) w.r.t. the Bloom filter BF[0..M-1]. Perfectionists'
unattainable Platonic ideal of k stochastically independent hashing
functions laboriously recomputing from S uniformly distributed hash-
values is now not unlike the monuments of the Ideal Society Leaders (read
Plato's Republic). Where are those Ideal Ideologists and their monuments
now ? On graveyards together with their victims: 100000000 = 100 MegaDead
murdered by the Kommies! That's the result of an ideology applied to
people. An ideology in computer science does not waste lives, just cycles.
But waiting in a queue for food at a store in a Kommie kantry was also
wasting life-cycles of people who were there buried alive behind the
barbed wire, watch towers manned by the subservient opportunists trying
hard to obtain favors from their Ideal Leaders. Now with FlashHash you do
not have to waste your life-cycles anymore when waiting for a result of,
say, a model checker like SPIN, and you do not have to rerun it and wait
again and again as you can get now 1000 times higher quality of a result
in a single run at no extra cost of your boringly repeated sequential
re-running, re-waiting, re-staring at crt i.e. your re-wasting your
limited biocycles.
FlashHash in edu-C for monolithic BF1 with hashcrosstalk :
/* 2nd version with comments : the 2nd version allows concentration of
 in-line code following the h = h0 % M; */
/* M = k*M1; for comparisons between BF1 and BF3 or BF2 */
/* Mmin1 = M - 1; for faster % iff M is a power of 2 ; see below */
d = w - c; /* w = nr bits in h0; use max c < w/2 for k <= 1 +(w-1)/c */
h0 = HashFun(S); /* Store THIS h0 now iff needed */
for j = 1 to k do /* generates k hashes per item or state S : */
begin /* max odd or prime c <= (max c allowed) will be the best c */
if j > 1 then h0 = (h0 << c) | (h0 >> d); /* rotate bits */
h = h0 % M ; /* put HERE inline code using h */
/* or: h = h0 & Mmin1; this is faster; use it iff M power of 2 */
end; /* h = might be h[j] = ... ; use h before and in the loop */
FlashHash in edu-C for BFs without hashcrosstalk i.e. for BF3 and BF2 :
/* FlashHashing is likely to work better on partitioned / segmented
BFs. FlashHashing can work almost as well as k independently
computed HashFun(S). "Works" means "achieves low Fn", see Tab. 3.
Huge "addressing bitspace" M = k*M1 bits may need larger w in
bits, which also allows larger distance c bits i.e. independence.
*/
M = k*M1; M1min1 = M1 - 1; /* or: M1 = M div k i.e. int/int */
d = w - c; /* w = nr bits/word; use max c < w/2 for k <= 1 +(w-1)/c */
offset = -M1; h0 = HashFun(S); /* Store THIS h0 now iff needed */
for j = 1 to k do /* generates k hashes per item or state S : */
begin /* max odd or prime c <= (max c allowed) may be the best c */
if j > 1 then h0 = (h0 << c) | (h0 >> d); /* rotate bits */
offset += M1; /* For j = 1 is offset = 0 */
h = offset + (h0 % M1); /* Put HERE inline code using h */
/* or: h = offset + (h0 & M1min1), iff M1 power of 2 , is faster */
end; /* h = might be h[j] = ... ; Use h within loop or store it */
BackHash is always like FlashHash without h0 = HashFun(S) but with
h0 = FetchFromStore; (e.g. from a stack) in its place. For easy
maintenance BackHash can be integrated with FlashHash by means of a
parametrized condition :
if wantFH then h0 = HashFun(S) else h0 = FetchFromStore;
BackHash regenerates from one hash h0 by bitshifts all k hashes per S.
As long as h0's are stored & available e.g. on a stack, only one h0 per
each S, no HashFun(S) needs to be recomputed for the S's done so far.
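This FlashHash = BackHash identity can be seen in a minimal C sketch for a
BF3, assuming unsigned 32-bit h0 and illustrative k = 4, prime M1 = 65521,
c = 13: whether h0 comes fresh from HashFun(S) or is fetched back from a
store, the same k segment-local bitindexes come out, since only h0 matters:

```c
#include <stdint.h>

enum { W = 32, C = 13, K = 4, M1 = 65521 };   /* M1 prime; M = K*M1 */

/* FlashHash/BackHash core for a logically partitioned BF3[0..K*M1-1]:
   from ONE h0 derive the K bitindexes offset + (h0 % M1), each landing
   in its own segment of M1 bits */
static void flashhash_bf3(uint32_t h0, uint32_t h[K]) {
    uint32_t offset = 0;
    for (int j = 0; j < K; j++) {
        if (j > 0) h0 = (h0 << C) | (h0 >> (W - C));  /* rotate bits */
        h[j] = offset + h0 % M1;   /* put inline use of h here */
        offset += M1;              /* next segment of M1 bits */
    }
}
```

Calling flashhash_bf3 twice with the same h0 (the second call playing the
role of BackHash with h0 = FetchFromStore) reproduces all k hashes without
any HashFun(S) recomputation.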
HashFun(S) may be your favorite, or this Bytewise 1-liner (or better):
h0 = 1; for ix = loB to upB do h0 = (h0 << 8) + h0 + S[ ix ];
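That 1-liner, transcribed into compilable C on unsigned 32-bit integers;
note that (h0 << 8) + h0 equals h0*257, so each byte multiplies the
running hash by 257 (wrapping mod 2^32) and adds the byte:

```c
#include <stdint.h>
#include <stddef.h>

/* the bytewise 1-liner: h0 = (h0 << 8) + h0 + S[ix] over S[loB..upB] */
static uint32_t hashfun_bytewise(const uint8_t *S, size_t len) {
    uint32_t h0 = 1;
    for (size_t ix = 0; ix < len; ix++)
        h0 = (h0 << 8) + h0 + S[ix];   /* h0 = h0*257 + byte, wraps */
    return h0;
}
```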
The logically partitioned BF3[0..M-1] of bits, with M = k*M1, is my favorite
for the following reasons :
- The declaration & access is as simple as for a monolithic BF1[0..M-1], yet:
- There is no direct hashcrosstalk between pairs of distinct states S1, S2.
Alas, there is an indirect hashcrosstalk among 3-tuples of states.
Combinatorial implosion works via the (k! - 1) nonmatching permutations as
at least an insurance or last ditch defense against direct hashcrosstalk.
- Smaller modulus M1 will distribute more uniformly than M = k*M1.
- The h = h0 % M1 uses fewer bits from h0 than the h = h0 % M does.
- If we wish to use a prime modulus M1 , then there is no need for finding
  primes M =~ k*M1 for a BF1. With a single prime M1 we get k partitions,
  each with a prime number M1 of bits. This means easier and less error
  prone changes during tests and maintenance.
Many technical details are pointed out, quantitative evaluations and
comparisons are tabulated, and ready to use algorithmic auxiliaries are
worked out including "hashpairs" i.e. paired hashes as if from a 64 bit
machine, probably better than a single 64 bit hash. Clearly, the bigger
the M , the more hashbits w >> c are needed for a lower Fn.
This I call the "reversed/inverted stochastic pigeonhole principle".
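A hypothetical sketch of such a hashpair: two 32-bit hashes h0 and h00
computed in one pass over S with different seeds and multipliers, glued
into one 64 bit hash. The seeds 1 and 5381 and multipliers 257 and 33 are
my illustrative choices, not the author's:

```c
#include <stdint.h>
#include <stddef.h>

/* pair two 32-bit hashes into a 64 bit "hashpair" (w = 64) so that
   huge addressing bitspaces M can be addressed by a single value */
static uint64_t hashpair(const uint8_t *S, size_t len) {
    uint32_t h0 = 1, h00 = 5381;          /* two different seeds */
    for (size_t ix = 0; ix < len; ix++) {
        h0  = (h0  << 8) + h0  + S[ix];   /* h0  = h0*257 + byte */
        h00 = (h00 << 5) + h00 + S[ix];   /* h00 = h00*33 + byte */
    }
    return ((uint64_t)h00 << 32) | h0;    /* high half h00, low half h0 */
}
```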
Hasty cognoscenti or unbelieving connoisseurs may now jump to my term
"hashcrosstalk" and to the "k!" permutations argumentation for logically
segmented "BF3" of the 3rd kind. For quality-performance numbers find
"Tab.", "Fn" and "w = 64" without quotes. The proof of the pudding is in
"Tab. 3". The cardinals of fundamental computing science are invited to
look through the telescope (translation: run FlashHash before you judge).
-.-
+Mottos :
Simple, few parts, easy to maintain.
[ Gen. Chuck Yeager, USAF, tested Bell X-1; The Right Stuff ]
"The idea is crazy, but not crazy enough."
[ Niels Bohr (1885-1962), 1922 Nobel Prize for physics ]
"When a distinguished but elderly scientist (JH: over 30 in physics :-)
states that something is possible, he is probably right.
When he states that something is impossible, he is very probably wrong."
"The only way to discover the limits of the possible
is to go beyond them to the impossible."
"Any sufficiently advanced technology is indistinguishable from magic."
[ 3 laws of Arthur C. Clarke; invented a geostationary satellite in 1945 ]
- Expect nothing short of a miracle, as only where miracles are expected,
miracles are common.
- Never say never.
- Don't say it with flowers. Do IT with Bloom filters!
- Bloom Bit said: Just turn me On.
-.-
+Basic & new terms, abbreviations & symbols (henceforth in edu-C+BP/Ada) :
bag a multiset (often implemented with hopefully small counters).
item a chunk of data, e.g. a data vector, record, state i.e. a string S .
S may be a state vector of a finite state machine, or a state S
in a model checker or a protocol verifier using a combinatorially
explosive reachability analysis to systematically generate and check
as many reachable states S as feasible (ideally all S's) like e.g.
started by [Hajek 1977, 1978a, 1978b, 1979], [Holzmann 1988, 1990].
hashfun = a short word for a hashing function HashFun(S) which reduces an
item i.e. a string-of-bytes-ie-of-bits S to a number called hash.
Hashing is an irreversible reduction stronger (= produces less bits)
than a reversible compression. The original S cannot be recovered
from its hash by an inverse operation like a decompression would be.
An ideal, perfect hashfun would provide one-to-one mapping of unique
S's into unique i.e. distinct numbers called hashes. Alas ...... :
hashclash = clashhash = my new words for a hashing collision when two or
more hashes clash i.e. become hashing synonyma i.e. become equal
while they ideally should differ. Hashclashing is an undesirable but
generally unavoidable many-to-one mapping. Only for particular,
small sets of fixed i.e. a priori known items (typically the
reserved words in a programming language) a perfect hashing i.e.
one-to-one mapping may be possible.
false positive = an error of the Fn-kind: let after n trials or inserts
Fn be called "cumulated false positives". Then (# means number of):
Fn = # wanted (to get or to know) misidentified as unwanted
= # forecasts which did not come out { who wants non-forecasts? }
= # irrelevant documents retrieved { who wants irrelevantia? }
= # healthy persons diagnosed as ill { Docs see ill as + }
= # foes misidentified as friends { if you want to id foes }, or
= # friends misjudged as foes { if you want to id friends }
= # new i.e. unseen states or strings S misidentified as old
i.e. as if already present i.e. as if "stored" in a BF, so
that consequently they will be lost because discarded as if
(pseudo)duplicates.
Bloom filter (BF) = a dispersed virtual hasher for probabilistic
set membership testing i.e. for checking if some S almost
certainly is/was present or seen before (deja vu). The items or
states S are not actually stored in a BF; a new S sets
k bits in a BF i.e. it bitmarks a BF-bitmap to look AS IF stored,
hence the items are "stored" only VIRTUALLY. A new S sets
k bits, but some of them may be set already, hence insertion of
n distinct S's into a BF sets fewer than k*n distinct bits,
so that bits become shared i.e. "overloaded". Filtering refers
to BF's discrimination capability to filter out duplicates.
Unlike the monolithic BF1, slotted i.e. segmented BF2 or BF3 are
partitioned either "physically" or logically and therefore do
not suffer from hashcrosstalk like BF1 does.
Unlike compact virtual hashers which keep their bits together in
adjacent positions, dispersed virtual hashers distribute their bits
over the whole monolithic BF1[0..M-1] or over its segments of M1
bits in logically partitioned BF3[0..M-1] or in BF2.
Note0: a BF[0..M-1] of bits must be initialized to all bits = 0.
Of course we could also initialize all bits to 1 and do resetting
of bits to 0 instead of setting bits to 1 .
Note1: ideally the k bitindexes per item, string, or state S should
be all distinct, but it is not a must. If they are distinct,
less gamble will take place at the cost of some programming and
runtime overhead spent on (usually superfluous) checking that
they indeed are distinct k bitindexes per S. By distinct per S
I do not mean all k*n bitindexes being distinct (which would
be the unattainable Platonic ideal of a perfect Ideal Hasher
playing in the same league as Maxwell's demon, Laplace's
omniscient demon, Ideal Observer, or Carnap & Bar-Hillel's
Ideal Receiver [Kahre 2002]). Moreover even an Ideal Being cannot
violate the pigeonhole principle (from combinatorics) in its
elementary form, i.e. (s)he cannot put 1001 distinct pigeons into
M = 1000 pigeonholes without at least one pair of pigeons sharing
a hole. The same applies to setting M+1 bits in a BF[0..M-1], but:
Note2: Due to bitsharing, k hashfuns applied to n distinct states will
always set fewer than k*n bits in a well filled BF.
In fact on one hand bitsharing saves precious memory, but on the
other hand increases the chances of faults of the false positive
type only which will lead to some losses (omissions, collisions).
See the "optimal statics" formula for the fo . Hence :
- pessimists see negatively sounding clashes :-( , while
- optimists see positively sounding bitsavings due to bitsharing :-)
Note3: not even an Ideal Being can individually address more than (i.e.
separately each of all), say, 2^32 bits with a single hash h0
which is an unsigned integer of 32 bits. Therefore BFs with huge
M may need two hashes h0 and h00 forming a hashpair of a width
w = 64 bits (find "w= 64").
This is my "reversed/inverted pigeonhole principle", as trivial
as the pigeonhole principle (from combinatorics) in its elementary
form.
hashcrosstalk = my new term for the nowhere even mentioned hashclashes of
the 2nd/3rd kind. There are 3 kinds of clashes between S1, S2 in BFs :
Explained in shorthand (here S2 and S1 are distinct items or states) :
If HashFunA(S1) = HashFunA(S2) then hashclash of the 1st kind;
If HashFunA(S1) = HashFunB(S2) then hashclash of the 2nd kind in BF1;
If HashFunA(S1) = HashFunB(S1) then hashclash of the 3rd kind in BF1;
If HashFunA(S1) <> HashFunA(S1) then irreproducible nondeterminism due
to a bug (e.g. an uninitialized variable).
There are no other possibilities to be considered when only a single
pair of S1, S2 is considered in splendid isolation. Alas, as Dr.
Dragan Bosnacki has pointed out, a 3rd state S3 may be lost due to a
hashcrosstalk via S1, S2 over the (logical) boundaries of the segments
of partitioned BFs i.e. BF2 and BF3. Quiz: how ? Hint: create a case.
Hence "BFs without hashcrosstalk" holds only for pairs of states in
splendid isolation from the third state S3 . Therefore a partitioning
of BFs has more limited, yet non-negligible practical consequences
(find "favorite" here).
Clashes between splendidly isolated S1, S2 explained in longhand :
- Clashes of the 1st kind (in BF1, BF2, BF3) : distinct states
S1, S2 to which the exactly same hashfun is applied may produce
the same hash-address (e.g. due to the % M1 operation);
- Clashes of the 2nd kind (in BF1 only) : different hashfuns on
S1, S2 may produce the same hash-address. This cannot happen in
BF2 or BF3 which are partitioned, hence different hashfuns
work only their own strictly separate parts of BF[0..M-1].
- Clashes of the 3rd kind (in BF1 only) : k different hashfuns on
S1 only may produce fewer than k different hashes. This can
happen only on BF1 with a poor hashfun, or too small M .
Note: by hashcrosstalk I will mean that of the 2nd or 3rd kind in BF1,
unless specified otherwise.
FlashHash is a multihashing function employing a single HashFun(S) to
produce a single hash h0 from which all k hashes are obtained
i.e. derived (e.g. for a BF by stepwise rotations of hashbits
or by injecting entropy ).
BackHash is like FlashHash, but without any hashfun. BH recovers hashes
by bitwise rotations from the stored 0-th flashhash h0.
msb the most significant bit (in a number called hash here).
{ } embrace comments like /* */ in the C programming language.
^ raises to power neither in C nor in BPascal.
2^32 = 4294967296; 2^24 = 16777216; 2^20 = 1048576; 2^16 = 65536;
2^31 = 2147483648 = MaxLongInt + 1 in BPascal ( = a Mersenne prime + 1 )
= abs(MinLongInt) in 2-complement arithmetic will overflow
ln(2) = 0.6931 is e-based Napier's logarithmus naturalis e.g. log in C
:= assignment operator in ADA, BP; i.e. = in C mixes with math
= equality operator as in math i.e. == in C
<> not equal relational operator i.e. != in C
% modulus operator i.e. mod in BP aka BPascal
div integer division operator i.e. / in C is "overloaded"
| bitwise inclusive OR operator i.e. OR in BP is "overloaded"
& bitwise AND operator i.e. AND in BP is "overloaded"
xor bitwise XOR operator i.e. ^ in C mixes with math
<< bitwise leftshift operator i.e. shL in BPascal on integers
>> bitwise rightshift operator i.e. shR in BPascal on integers
Note: the % and << are also used a couple of times each in their usual
mathematical meanings of "percent" and "much smaller" respectively.
-.-
+Bloom filters (BFs) = a tutorial + a quantitative study :
BFs typically work as approximate set/bag membership testers/checkers.
Membership testing means checking the current/past presence/occurrence,
or its opposite i.e. absence/nonoccurrence by testing/checking k bits.
BFs employ probabilistic bitmapping technique of multiple hashes per
item (e.g. a state S) for marking/setting and testing/checking multiple
bits scattered (with semantic overload or overlap) over a bitmap i.e.
in one or more bitarrays. The three versions of BFs are:
Bloom filter of the 1st kind (BF1) is monolithic:
BF1 is a single bitmap BF1[0..M-1] of bits, where all k hashfuns
per each state or string S operate i.e. pseudorandomly set and
test/check k bits per S over the whole range of a bitmap:
h2 h1 - - - hk h3 all k bits set by one S
v v v v
01111001010001....011110010001100100001
^ ^ ^ ^
h3 h1 - - - h2 hk all k bits set by another S
Bloom filter of the 2nd kind (BF2) is partitioned / segmented "physically":
BF2 consists of k smaller bitmaps, each BF2[0..M1-1] of bits, so that
k*M1 = M bits, each reserved for only one j-th hashfun:
0110010100101 M1 bits used by hashfun nr j = 1 only
0010010000100 M1 bits used by hashfun nr j = 2 only
- - - - - - - - - - - -
0100010100100 M1 bits used by hashfun nr j = k only
BF2 is handy when e.g. a compiler does not allow one to declare,
or the OS to allocate, a single huge hashtable as big as needed.
Bloom filter of the 3rd kind (BF3) is partitioned / segmented logically:
BF3 is BF2 in a single bitarray BF3[0..M-1] of bits where each hashfun
works its own reserved subrange of M1 bits so that k*M1 = M bits.
|.........|.........|..........|.........|
For BF3 the j-th hash is then finalized thus :
Offset is the absolute bitindex of the 0-th bit in a segment of BF3
as-if BF2[0..M1-1]; the 1st segment (for j=1) has offset = 0;
2nd offset = M1, etc, i.e. :
offset = M1*(j - 1); { in loops reduce * to repeated + M1 }
h = offset + ( h % M1 ); /* in C */
or faster iff M1 is a power of 2, and M1min1 = M1 - 1 :
h = offset + ( h & M1min1); /* in C */
h := offset + (abs(h) AND M1min1); { in BPascal }
offset := M1*LongInt(j - 1); { safely typecast in BPascal }
h := offset + (abs(h) mod M1); { safe for any M1 in BPascal }
The abs(h) must be used and overflow checking must be off if a
signed integer h is in 2-complement arithmetic (with asymmetric
subranges). An overflow of the offset M1*(j-1) must be avoided
even if the overflow checking, if any, is off. This bad bug did
hit me when my M was decreased! Quiz: How come that smaller
numbers could cause an overflow, but larger did not (in BPascal)?
Although I do not make many bugs anymore, or see them instantly,
it took me hours to kill this badbug. Hint: M1 was a constant, j
was a byte. Despite such badbugs, BP was love at dz 1st byte.
In all three BF-schemes the same bit can be shared i.e. set/tested by
hashfuns generated from different items or states. However unlike in BF1,
there is no "hashcrosstalk" in slotted i.e. partitioned i.e. k-segmented
Bloom filters like BF2 and BF3, where a j-th hashfun cannot share a bit
with an i-th hashfun if i <> j. Hence BF2 and BF3 are much less
semantically bit-mixed than a monolithic BF1. This will turn out to be the
conditio sine qua non for quality flashhashing.
BFs are resilient (wrt external errors) due to the dispersion of bits
(not unlike holograms) belonging to each particular item. The bits' overlap
i.e. bitsharing i.e. semantic overloading of individual bits shared by
different hashfuns (originated from different hashed items S) makes BFs
memory-efficient but also (internally or inherently) error-prone like any
imperfect hashing scheme without a collision resolution backup. However
the false positive rate is calculable hence under math control at any
moment, and can be updated and reported in-flight.
BFs are dispersed bithashers without collision resolution, hence with
calculable risks of errors of the false positive kind only. Approximate
set membership testing and multiset/bag membership checking are
operations classically used by spelling checkers without a fully stored
dictionary, or by model checking programs like SPIN without explicitly
storing the whole generated states of, say, communicating computational
processes [Holzmann 1988, 1990, Bell Labs]; Holzmann calls Bloom
filtering bitstate hashing. Nowadays model checking means and does
what automated
verification or validation of protocols (e.g. for networks, communication,
security, passwords, etc) used to mean and to do earlier [Hajek 1977, 1978a,
1978b, 1979].
BFs fall into the category of VIRTUAL HASHing algorithms [Morris 1968],
[Murray 1970, p. 180-183], [Bloom 1970, p. 423, Methods 1 & 2] which form
a subclass of hashing algorithms. Virtual hashing without any collision
resolution was invented and loosely described at the end of the classical
paper by [Morris 1968, Bell Labs], where on pp. 42-43 he used the terms
virtual scatter, and also virtual hash 3 times on p. 43. Virtual scatter
or hashing has been mathematically analyzed for the first time by a
computer scientist from Cornell University [Murray 1970].
BFs clearly fall, in my view, under the general paradigm of HINTing
algorithms. Note that this insight is not a commonly received wisdom
as it is hard to find, but see [Lampson & Sproul 1979, pp. 100-102, 3.6].
BFs can be viewed as algorithmic versions of ancient notched-edge cards
manual filing systems i.e. needle-card based manual documentation &
retrieval systems where card columns were at a premium, so the ancients
had to be extremely inventive. Read on the method of SUPERIMPOSED CODING
[Knuth vol.3], or on SUPERIMPOSED CODES [Roberts 1979, Bell Labs],
[McIlroy 1982, p. 95, Bell Labs], also called Zatocoding or Peekaboo,
no joke!
BFs operate optimally (w.r.t. the memory use under the given load i.e.
at their capacity to store items with the required false positive rate)
under the maximum entropy condition (for BFs simply the equal chances
that a bit is On or Off on average), hence can be understood in terms
of (non)classical information theories like e.g. Shannon's classical
Mathematical Theory of Communication [Shannon 1948], and Jan Kahre's
unifying neoclassical Mathematical Theory of Information [Kahre 2002].
BFs, being virtual hashers, behave virtually i.e. almost AS IF they
really encompassed a much larger "name-space" of a set/bag with a much
larger cardinality i.e. number of possible names, states, strings or items.
In this respect, like all virtualities, BFs have their philosophical patron
in the little known German philosopher of ALS OB (= AS IF) Hans Vaihinger
[Vaihinger 1923], [Kahre 2002, see his Indexes].
BFs are predictable:
In this section all explanations pertain to the BF1 scheme only, but apply
also to the BF2 and BF3 schemes (unless stated otherwise) which behave
virtually w.r.t. BF1 i.e. almost AS IF they were BF1. All formulas are
derived under the assumption of stochastic independence, as usual.
After n items were stored in BF1 by setting k bits (some may have been
1 already and will be shared) per item, the false positive rate is:
f(n) = (1 - (1 - 1/M)^(k*n))^k is the instantaneous probability of a
false positive AFTER the n-th item was "stored" in a
BF[0..M-1] of bits; f(n) may be called a false+rate or a failrate.
f(n) is the "general dynamics" formula for BFs.
For a large enough M bits, f(n) is very well approximated by
f(n) = (1 - exp(-k*n/M))^k , k*n < M; optimal k*n = M*ln(2) or M*2/3
From the latter formula for f(n) one obtains
fo = 1/2^k and k*n = M*ln(2) or k*n = M*2/3 for a BF filled
optimally wrt memory utilization (useful in quasistatic apps
when the final n can be estimated or fixed beforehand). At such
a load, f has dropped exponentially with k increased linearly.
fo is the "optimal statics" formula for BFs.
From the instantaneous f(n) one obtains the cumulated-so-far number of
false positives Fn (not explained anywhere and misunderstood everywhere):
Fn = Sum_i=1..n( f(i) ) is the cumulated number of false positives,
= Integral_i=1..n[ f(i) ] for fast numerically evaluated approximation
which is very good for very large n .
Fn < n*f(n) ; display Fn as a float, as Fn < 1 or Fn > 1 may result.
Fn is directly verifiable or falsifiable empirically (see below) in easy
experiments based on simple I/O through a BF and counting the I-O's:
Fn = (number of distinct items In) - (number of distinct items Out)
= number of distinct items lost. (distinct = unique = no duplicates)
Ploss = 1 - Prod_i=1..n( 1 - f(i) ) < 1 , check it in programs;
= 1 - probability of NO loss of any item or state
= probability of a loss of ONE OR MORE items or states
= probability of a loss of AT LEAST ONE item or state
= 1 - exp( Integral_i=1..n[ ln(1 - f(i)) ] ) for fast numerically
evaluated approximation which is very good for very large n .
(all AS IF during insertions of n unique i.e. distinct items)
Ploss is a computable theoretical number which (unlike Fn) is neither
directly verifiable nor falsifiable in the spirit of Karl Popper's
philosophy of science. Hence Ploss is a quasi-Platonic quantity.
Ploss is introduced only for comparisons with earlier work by others.
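The formulas above are easy to check numerically. Here is a small sketch
in runnable Python (used here only because it runs as-is; the toy sizes
M = 65536, k = 5 are my illustrative assumptions, not values from the
text) computing f(n) by the exact formula, Fn by straight summation and
Ploss by straight multiplication; at the optimal load k*n = M*ln(2) the
instantaneous f(n) indeed lands near fo = 1/2^k, and Fn < n*f(n) :

```python
import math

def f_inst(i, M, k):
    """Instantaneous false positive rate AFTER i items were 'stored'."""
    return (1.0 - (1.0 - 1.0 / M) ** (k * i)) ** k

M, k = 65536, 5                       # illustrative toy sizes (assumptions)
n = int(M * math.log(2) / k)          # optimal load: k*n = M*ln(2)

f_n   = f_inst(n, M, k)               # the "general dynamics" value
fo    = 0.5 ** k                      # the "optimal statics" value 1/2^k
Fn    = sum(f_inst(i, M, k) for i in range(1, n + 1))
Ploss = 1.0 - math.prod(1.0 - f_inst(i, M, k) for i in range(1, n + 1))

print(f_n, fo, Fn, Ploss)
```

Note that at this heavy load Ploss gets so close to 1 that a float
displays it as 1.0 exactly, which illustrates why Fn (here about 60 lost
items) is the more informative quantity.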
Fn and Ploss are as exact as reasonably possible and cannot be
approximated by nice analytical formulas without making large errors, but
can be easily computed numerically. Fn may be approximated as an integral
and also simply counted in practice, so that this theoretical model Fn
can be easily verified by empirical "measurement" thus: Input & "store"
n unique (i.e. known to be nonduplicate) items (e.g. strings) into a well
filled BF with not too large M and count how many items have NOT passed
thru this BFilter. Uniqueness is necessary for a simple & fast yet reliable
measurement, otherwise one might be losing all but the 1st of just a few
bad-luck "multiplicates" if each were reinserted many times. This method
was used
for creating a reliable Tab. 3. Be aware of the fact that the differences
between the theoretical Fn and the empirical Fn are due to the total
systemic properties of the "Grey Box = HashFun(S) + items S input", i.e.:
1. The quality of the hashes used, which interacts with:
2. the quality of unique items (their differences, size and redundancy).
3. The amount of unique items (the more items the lower the variance of Fn)
Note that from nonunique items any Fn or Ploss may result :-)
The criterion of optimality for BFs is usually memory utilization when
the tolerable f is given i.e. specified beforehand. The approximate
f-formula (without the optimality condition) is briefly mentioned
in [Knuth, vol.3, any ed., just see Bloom]. In the next e-paper I will
derive these design formulas in several ways, including my derivation
from the very 1st principles. Here it suffices to say that an up to
4-way trade-off is involved among f, M, k, n :
f = false positive rate (AS IF present while absent) i.e. false+rate
(errors, omissions, collisions are not specific hence ambiguous).
M = memory space in bits available in a single BF[0..M-1] of bits.
n = number of items or states (virtually AS-IF-)"stored" in BF.
k = number of hashes and bits (0 or 1) set per item;
proportionally increases CPU-time spent on hashing, and in some
applications like e.g. model checking with SPIN, larger k may
need extra memory for additional data structures. It is this
k-dependent CPU-time (and probably an additional space overhead)
which FlashHash & BackHash can drastically reduce to almost zero.
A quantitative example:
M = 400 MBytes * 8 bits/Byte = 3200 Megabits are available;
f = 0.001 is acceptable i.e. false positive rate of a BF
AFTER n items were stored is one lost item per
1000 "stored" (the AFTER is important!)
Q: How should we design our BFs ? asks a programmer.
k = ? i.e. how to choose the number of hashfuns k ?
n = ? i.e. how many unique items S can we store for the k chosen ?
A: You may repeatedly solve one of the above formulas numerically
for different pairs of k and n. Or, if you believe information
theoreticians, you just use the formula for optimally utilized BF
i.e. operating at its "optimally filled" (though not full) capacity:
fo = (0.5)^k = 1/2^k where 0.5 is the probability of an average
bit being On (this is the true meaning), or Off,
because 0.5 happens to mean equal chances of 1 vs. 0.
Hence fo is calculable without a calc and even without a piece of paper
if you manage powers of 2: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024,
so that for our task at hand we have
fo = 1/1024 = 1/2^10 < 0.001 required, hence our optimized k is
k = 10 hashes per item or state S "stored" in a BF.
The capacity n of our optimally filled (yet far from full) BF
may be obtained by solving for n one of the
general f-formulas far above, which yields
k = ln(2)*M/n = 0.6931*M/n > 0.5*M/n ; hence
n = ln(2)*M/k = 0.6931*M/k hence in our case:
n = 0.6931*3200 Megabits /(10 bits/item stored)
= 222 MegaItems > 0.5*3200/10 = 160 MegaItems (or Mstates in SPIN).
The > is due to the semantic overload or overlap i.e. bitsharing.
Note that 222/160 = 1.39, and 160/222 = 0.72. This means that
setting 100 bits actually sets i.e. turns On only 72 distinct bits on
average.
k*n = M*2/3 is my engineering rule of thumb with a small extra
safety margin. This together with fo = 1/2^k allows
us to design near optimally operating Bloom filters by heart.
The expected cumulated number of false positives is easily computed
numerically, e.g. for M = 3200 Megabits and n = 80 MegaStates we get:
f = (1 - (1 - 1/M)^(k*n))^k = 0.0000002804 if nr. hashfuns k = 10;
Fn = 2.29 states missed out of 80 MegaStates "stored" with k = 10; this
Fn << 22.43 = f*n = 0.0000002804*80000000 , which would hold only iff 80M
random membership tests were done AFTER all 80 MegaStates were "stored"
already; that is the correct interpretation of the product f*n , which
would be an improper calculation of Fn .
Tab. 1: M = 400 MBytes, n = 80 MegaStates; cumulated losses Fn :
k = 2 3 4 5 6 7 8 9 10 11 12 13 14
Fn = 64372 7743 1363 314 89 30 11 4.847 2.288 1.176 0.652 0.386 0.248
Ploss = 1 1 1 1 1 1 1 0.992 0.899 0.692 0.479 0.320 0.215
This table says a lot:
- It desperately cries for k > 2 for BFs (e.g. bitstate hashing in SPIN ).
- It shows that Fn provides much more information than Ploss does,
especially when more than 1 item or state is expected to be lost. To
tell that bitstate in SPIN will lose 64372 states is about 60k times more
informative than to say that at least one state will be lost.
- In fact Ploss approximates Fn, albeit a bit roughly when losses exceed
1 state :-) Then Ploss tells that 1 or more states will be lost, while
Fn quantitatively informs us how many states or items will be lost.
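Tab. 1's k = 10 column can be re-derived with a few lines of Python. The
straight sum over 80 million terms is slow, so this sketch (the
integration routine is my own, not from the text) evaluates the integral
approximation of Fn by the trapezoid rule; the result lands close to the
quoted Fn = 2.29 and Ploss = 0.899 :

```python
import math

def Fn_integral(M, k, n, steps=200_000):
    """Integral_0..n of (1 - exp(-k*i/M))^k di by the trapezoid rule."""
    h = n / steps
    total = 0.0
    for s in range(steps + 1):
        fi = (1.0 - math.exp(-k * (s * h) / M)) ** k
        total += fi / 2.0 if s in (0, steps) else fi
    return total * h

M = 3200e6                   # 400 MBytes * 8 bits
n = 80e6                     # 80 MegaStates
Fn = Fn_integral(M, 10, n)
Ploss = 1.0 - math.exp(-Fn)  # ok since every f(i) << 1 here
print(Fn, Ploss)
```

The small residual discrepancy against the tabulated values is the price
of replacing the exact discrete sum by an integral.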
-.-
+A comedy of errors = compact virtual hashing CH or HC
vs. distributed / dispersed virtual hashing DH or BF :
Medical warning! This section is for healthy hearts & strong nerves only!
Lets now compare a BF with the classical virtual hasher (which does not
disperse individual bits over a bitmap) invented by [Morris 1968], analyzed
by [Murray 1970], and reinvented by [Wolper & Leroy 1993] who have called
it inventively HASHCOMPACT, not to be confused with a similar though not
identical COMPACT HASH by [Cleary 1984].
Caution! The curves in fig. 2 in [Wolper & Leroy 1993] are "approximate"
by the factor 17.32 i.e. are 1632 % off mark. Lets look at one
clear point there: for n = 10^7 , k = 10 hashfuns and M = 1 Gigabits my
numerical integrations as well as the straight summation both yield Fn ,
and straight numerical computations yield the product Ploss :
Fn = 0.0000577302 = expected cumulated number of lost items or states
Ploss = 0.0000577285 = probability of ONE OR MORE losses i.e.
= probability of AT LEAST ONE loss
= 1 - probability of NO loss (omission, failure, fault)
during insertions of n unique i.e. distinct items.
Ploss is semantically equivalent to 1 - Pnc = 0.001 = 10^-3 proclaimed for
the fig. 2 in [Wolper & Leroy 1993]. But here they are not equal, since
(1 - Pnc)/Ploss = 0.001/0.0000577285 = 17.32 is the factor by which their
equations and fig.2 are "approximate" wrt The Truth of Ploss & Fn.
That Ploss will approximate my new Fn was pointed out to me by Dr. Ulrich
Stern (ex-Stanford University), when he noticed my misinterpretation of Pnc
and a small typo in the previous version of this paper, all after I kept
insisting for weeks on the incorrectness of the graphs in fig. 2 in
[Wolper & Leroy 1993] which has been accepted as ok for almost 10 years.
Due to the continuities and orderings of the curves in their fig. 2, the
single point here identified as 1632 % in error makes their fig. 2 invalid.
However, Ploss cannot approximate Fn in general because it always holds
Ploss <= 1 , while Fn > 1 is quite a common number of lost items.
Quiz: Why then is Ploss = 0.0000577 so close to Fn = 0.0000577 above ?
Ploss (always <= 1) can be computed theoretically but not easily falsified
as the Popperian philosophy of science requires, nor directly counted like
Fn > 1 which can always be enforced by decreasing M or by increasing k*n.
Moreover Ploss contains much less information than Fn does (see Tab. 1).
For the same point k = 10 , n = 10^7 , M = 10^9 bits I computed three
values which must be semantically equivalent (all under the assumption of
stochastic independence) :
Pnc = probability of NO collision i.e. NOT a single state S lost;
Pno = probability of NO omission i.e. NOT a single state S lost;
Ploss = probability of AT LEAST ONE state lost i.e. of ONE OR MORE items
lost, which is the complement of Pnc as well as of Pno .
Since numbers too close to 1 do not read well (we are used to
ignoring chains of 00000's but not of 99999's), complements are
preferably computed and displayed on screen and here :
Ploss = 0.0000577 =~ 6*10^-5 introduced here above is the True Value,
unlikely to be improved upon, hence the claimed
0.001 = 10^-3 is far from the True Value for the given k, n, M ;
1 - Pnc = 0.0009995 =~ 10^-3 from the equation (4) in [Wolper & Leroy, 1993]
is erroneous but consistent with their fig. 2;
1 - Pno = 0.0000909 =~ 10^-4 from the e-paper [Wolper, Stern, Leroy, Dill ]
is 11 times smaller than in their fig. 8 which
is identical with fig. 2 in [Wolper & Leroy 1993].
Their equation has improved since 1993 but the graphs have not :-) because:
(1 - Pnc)/Ploss = 17.32; (1 - Pnc)/(1 - Pno) = 11
(1 - Pno)/Ploss = 1.58
Another clearly identifiable point in their identical figs. 2 and 8 is for
k = 5 , n = 10^4 , M = 10^6 bits which yields:
Fn = 0.000468157 by numerical integration
0.000468293 by numerical summation which is the True Value
Ploss = 0.000468047 by numerical integration of the logarithmicized formula
0.000468183 by numerical multiplication which is the True Value
1 - Pnc = 0.00312 by the approx-approx equation (4) [Wolper, Leroy 1993]
which is 312 % of their proclaimed 0.001 for fig. 2.
1 - Pno = 0.00052 by the approx-approx equation after their equation (5)
in [Wolper, Stern, Leroy, Dill ],
which is 52 % of their proclaimed 0.001 for fig. 8.
The value 0.001 is too far from the True Value for the k, n, M given above.
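Both clearly identifiable points dissected above can be re-checked
independently. The Python sketch below (the integration routine is mine;
the k, n, M triples are the ones quoted above) reproduces Fn =~ 0.0000577
for the first point, the factor of about 17.3, and Fn =~ 0.000468 for the
second point:

```python
import math

def Fn_integral(M, k, n, steps=200_000):
    """Integral_0..n of (1 - exp(-k*i/M))^k di by the trapezoid rule."""
    h = n / steps
    total = 0.0
    for s in range(steps + 1):
        fi = (1.0 - math.exp(-k * (s * h) / M)) ** k
        total += fi / 2.0 if s in (0, steps) else fi
    return total * h

Fn1 = Fn_integral(1e9, 10, 1e7)   # the k = 10, n = 10^7, M = 10^9 point
ratio = 0.001 / Fn1               # here Ploss =~ Fn1, as both are << 1
Fn2 = Fn_integral(1e6, 5, 1e4)    # the k = 5, n = 10^4, M = 10^6 point
print(Fn1, ratio, Fn2)
```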
Their equations (3) for Pnc and their identical (5) for Pno have built in a
1st order approximation of i*k bits being On after i items or states were
inserted i.e. virtually "stored" in a BF or bitstate hasher. This fundamental
weakness cannot be removed by improving their 2nd order approximations.
Moreover I could not get a meaningful result from a straight numerical
evaluation of their equations (3) or (5), despite my careful
logarithmization with overflow checking On, BPascal's 10-byte extended
floating point type, and relatively small numbers like k = 5, n = 10^4 and
M = 1 Megabit, which still form a clear point in their fig. 2.
Note that here I tear down the formulas and fig. 2 in [Wolper & Leroy 1993]
and fig. 8 in [Wolper, Stern, Leroy, Dill ] concerning Holzmann's "bitstate
hashing" [Holzmann 1988,1990] aka Bloom filtering [Bloom 1970], [Knuth 1973],
aka "probabilistic hashing" [Sincovec & Wiener 1986, p.396] aka "multiple
hash functions" [Reingold & Hansen 1986, p.410-413]. I did not check yet
Wolperian theories behind hashcompact HC [Wolper & Leroy 1993] aka compact
hashing CH [Cleary 1984] aka virtual hashing since [Morris 1968, Bell Labs],
virtual scatter [Murray 1970, Cornell Univ.] and virtual hash [Wiener 1986].
Since our comparison assumed that 400 MBytes of fast memory are available
and accessible, saving memory makes no sense here, and since hashcompact
cannot take more than 80 MegaStates, only one nontrivial comparison makes
sense :
Q: Which fault rates f are achievable for M = 400 MB and n = 80 MegaStates?
A: BF cannot win over HC for the given M, n because for k = 28 BF's lowest
here achievable losses are Fn = 0.017 =~ Ploss = 0.017 > 0.0013 = 0.13 %
claimed for HC in [Wolper, Stern, Leroy, Dill ].
On the other hand, at the time when HC with k = 40 bits/state has filled all
the available bitspace of M = 3200 Megabits, a BF with k = 28 ( k <> 28
would worsen Fn and PLoss ) has used only about M/2 = 1600 Megabits.
The optimal bitsharing factor Obsf = 0.5*M/(k*n) = 0.5*3200/(28*80) = 0.71
i.e. from the k*n bits ever set 29 % were already On hence were set again
so that 29 % of the On-bits are bitshared and thus semantically overloaded.
When HC has used (almost) all M bits, HC will start to lose each & every new
state simply because there will be no single bit left to store, while BF can
go on quite comfortably much further, until it will start losing much more
often than acceptable. Note the minimal Fn = 0.017 =~ 0.017 = Ploss for k=28:
Tab. 2: Fn = the smaller the better; M = 3200 Megabits
k Ploss =~ Fn for Fn < 1, but here BF cannot achieve Fn < 0.017
HC: 40 Ploss = BEST 0.0013 n < 80 MegaS = HC's abs.limit, HC used ALL bits
BF: 40 Fn = 0.036 n = 80 MegaS, BF far from full (bitsharing)
BF: 28 Fn = best 0.017 n = 80 MegaS, BF used only M/2 = 50% of bits
BF: 20 Fn = 0.038 n = 80 MegaS, BF far from full (bitsharing)
BF: 10 Fn = 2.26 n = 80 MegaS, BF far from full (bitsharing)
BF: 2 Fn = 64372 lost; n = 80 MegaS, BF far from full (bitsharing)
Here BF is the loser in general, and for k = 2 a Huge Loser in particular.
But again, when HC uses all M bits available, HC with k = 40
will turn into a Henceforth-All-Loser long before BF with k = 28 (or less)
will start to lose unacceptably. So there is no clear cut criterion or rule
of thumb for situations when n is not known beforehand.
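The BF rows of Tab. 2 can be re-checked with the same kind of numerical
integration. This sketch (routine and step count are my choices; M, n and
the k values are the table's) computes the cumulated losses Fn for
k = 20, 28, 40; among these, k = 28 indeed comes out best, near 0.017 :

```python
import math

def Fn_integral(M, k, n, steps=100_000):
    """Integral_0..n of (1 - exp(-k*i/M))^k di by the trapezoid rule."""
    h = n / steps
    total = 0.0
    for s in range(steps + 1):
        fi = (1.0 - math.exp(-k * (s * h) / M)) ** k
        total += fi / 2.0 if s in (0, steps) else fi
    return total * h

M, n = 3200e6, 80e6               # Tab. 2's memory and load
Fn20 = Fn_integral(M, 20, n)
Fn28 = Fn_integral(M, 28, n)
Fn40 = Fn_integral(M, 40, n)
print(Fn20, Fn28, Fn40)
```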
One clear advantage of classical virtual COMPACT HASHers (CH like HC i.e.
hashcompact) over neoclassical virtual DISPERSED HASHers (DH like BF)
with k > a few has until now been their speed, but this difference is
virtually annihilated by the new algorithm called FlashHash, described
for the first time here above and in more detail below.
-.-
+The proof of the pudding = Tab. 3 :
Here are some numbers for a BF (operating with my 1-liner hashfun) on
n = 63609 real-world strings (all unique!, many similar, ave. length 19),
M1 = 116663 bits/segment in a partitioned BF3, and M = k*M1 in BF1,
hence (k*n)/M = (k*n)/(k*M1) = n/M1 = 63609/116663 = 0.545, well filled.
As said above, Fn = number of lost strings during 1 to n distinct
insertions (i.e. no duplicates [re]inserted).
Tab. 3: BF1 is monolithic; BF3 is partitioned; cumulated losses Fn:
FlashHash Yes/No ; k = 2 3 4 5 6 7 8 9 10
Theor. num. sum BF1: No Fn = 4205 1353 460 162 59 22 8 3 1
Theor. integral BF1: No Fn = 4272 1385 474 169 61 23 9 3 1
Really counted BF1: No Fn = 4259 1390 434 151 64 17 6 2 1
Really counted BF1: Yes Fn = 4316 1411 460 159 64 33 25 7 7
rotated by c < w=32 bits c = 13 13 7 7 5 5 4 3 3
Really counted BF1: Yes Fn = 4360 1387 503 160 59 33 11 8 3
= version with h0 := h0 xor EnTroPic[j];
Further for BF3 all really counted cumulated losses Fn:
- k hashfuns/S :
NO FlashHash No Fn = 4331 1395 482 178 64 24 8 4 0
- 1 hashfun/S only: Yes Fn = 4259 1366 465 161 56 24 12 5 3
= version with h0 := h0 xor EnTroPic[j];
- 1 hashfun/S only:
FlashHash w = 32 Yes Fn = 4312 1399 488 175 70 30 13 11 5
rotated by c bits c = 13 13 7 7 5 5 4 3 3
- 2 hashfuns/S only:
FlashHash w = 64 Yes Fn = 4354 1354 465 164 70 20 10 2 1
rotated c <= 13 c = 13 13 13 13 12 10 9 7 7
- 2 hashfuns/S only:
FlashHash w = 64 Yes Fn = 4311 1390 503 174 65 20 10 2 1
rotated c <= 11 c = 11 11 11 11 11 10 9 7 7
Number of hashes: k = 2 3 4 5 6 7 8 9 10
The table empirically validates that a k-flashhashed single h0 := HashFun(S)
can compete with k separately (re)computed HashFun(S). The theoretical
k!-based reason follows further below (find "k!" in this text).
FlashHash with 1 hashfun was competitive up to k = 5 and only then
the decreasing c began to take its toll when h0, h had w = 32 bits/word.
Of course the word width w > 32 bits would help, and so did "hash pairing"
of two 32 bit results h0, h00 of two hashfuns into one final "hashpair"
of w = 64 bits. The results were close to those of k separate hashfuns
and also close to the theoretical cumulated losses Fn for BF1.
Just as it is impossible to address large spaces with too few bits, a
larger M may require a larger word width w > 32 bits for the hashes (find
"w = 64").
The IT & TI engineering rule of thumb for Fn in a flashhashed BF3 or BF2:
For your M, n and k compute my new theoretical summing formula for BF1
Fn = Sum_i=1..n( f(i) ) with f(i) = (1 - (1 - 1/M)^(k*i))^k and M = k*M1
(do not forget to initialize Sum := 0 :-) and this Fn multiply by the
following corrective ratio from the Tab. 3 :
( flashhashed Fn[k] ) / ( theoretical numerical sum Fn[k] ) .
The safety margin built into this engineering rule is that the Tab. 3 is
based on my simple (good enough for my apps) 1-liner HashFun(S) :
h0 := (h0 << 5) + h0 + (S[ix] & 31); for subset-of-ASCII strings.
Common technical sense tells us that: the better the HashFun(S) and the
larger the distance c and the width-bits w of the 0-th hash h0 (as if
concatenated and rotated with h00 ), the lower the actual Fn will be.
Competent readers using better hashfuns should achieve better results,
especially if their S is densely packed i.e. without too many redundant
bits contributing no information (hence my << 5, & 31 ; find these below).
Competent users of FlashHash will obtain their custom designed corrective
ratios based on their versions of my Tab. 3 obtained for their own hashfuns
working on their typical sets of states or items.
Tab. 3 above shows that there is no need to compute k HashFun(S)
iff k > 1 hashes are obtained by means of FlashHash :
1. The k > 2 greatly decreases BF's losses Fn and the failrate f ;
2. for k < 6 only 1 HashFun(S) will suffice iff FlashHash is employed;
3. for k > 6 only 1 HashFun(S) may suffice iff FlashHash is employed,
but 2 HashFun(S) will further decrease Fn and f .
Inevitably, like with all algorithms, there surely will be some range(s)
of parameters for which a compact hash CH or hashcompact HC will be worse
or better than a dispersed hash a la BF. To determine these ranges is left
as an exercise for the spinners (= kind, thoughtful readers in general and
the reinventors of compact hash or hashcompact in particular). This paper
gives them proper formulas for the cumulated losses Fn.
-.-
+FlashHashing & BackHashing = (m)eat the beef again :
The basic idea of flashhashing is to compute only one hash h0 (as good as
possible or available; here good means that its bits are quite independent)
and then to derive all k hashes from it by means of one stepwise rotation
of h0's bits per each hash. Since the above analysis was done under the
assumption of k independent & uniform hashing functions I am now
trespassing into the holy ground of those theoreticians who would claim
that my flying machine cannot fly, as Professor Simon Newcomb, then
the US "astronomer-royal" residing at Harvard, and Professor Lord Kelvin in
the UK have claimed [Newcomb 1903]. To take their fear of flying away, I can
say more than "Sirs, with all due respect, I have flown my FlashHash". Lets
switch from BF1 to BF3 (or BF2), to really appreciate why BF3 is our favorite
scheme. As said, BF3 or BF2 do not suffer from hashcrosstalk i.e. no bits are
shared by j-th and i-th hashfuns if i <> j. This observation makes FlashHash
and BackHash workable. As all k hashes are generated in FIXED ORDER, each
works only its own j-th segment of BF3. It is possible that states S1 and S2
will generate the same k hashes, but the higher the k, the more unlikely
it is that hashes will be generated in the same order; after all, the number
of their permutations is k! (this is a factorial, not an exclamation! :-) and
only one of the k! permutations will match. The same in still other words:
When two different states Si, Sj produce two different 0-th hashes h0i, h0j
and FlashHash rotated h0i by Zi steps and h0j by Zj steps, then both hashes
may become identical, and they both would match the same bits in a monolithic
BF1, but not in a partitioned / segmented BF3 or BF2 where each hash has its
own segment or partition. Against such mismatches in BF1 resulting from the
rotated h0 works the COMBINATORIALLY IMPLOSIVE nonmatching logic of k! .
The flashhashing fewliner is presented in edu-C+BP, the lingua franca of
ancient algorithmicians, defined above as a mixture of BP-Sanskrit and
C-Latin:
Unsigned integers :
If you have to use signed integers e.g. in a 2-complement arithmetic
i.e. if abs(MinInt) <> abs(MaxInt) then you must put overflow checking
(if any) off, and for speed put it off anyway, since flying as fast as
possible without a parachute is the right fun, isn't it ?
M = nr. of bits in a BF[0..M-1]; for BF1 it is the modulus: h := h0 % M
M1 = nr. of bits in each segment of a BF3 or BF2, where h := h0 % M1
j = the order number of the j-th hash produced per one call of HashFun(S)
h0 = the 0-th hash produced by HashFun(S) called only once per S
h = any hash subsequently (re)generated from h0 ; h could be h[j]
w = number of bits in h0 and h i.e. hash-word width in bits;
often w = 32 bits, or w = 64 bits for h0 hashpaired with h00.
k = total number of hashes (not hashfuns' calls) to be used per item S ;
c = number of bits shifted; 0 < c < w , the larger the c the better
because mutually more distant bits will be more stochastically
independent. Pre-check to make sure that:
use c such that k <= 1 + ((w-1) div c) and odd c < (w div 2) ;
c prime w.r.t. w would guarantee that all w bits will be
visited even in the case that c*k > w - 1.
An item S is a chunk of data to be hashed, e.g. a string of variable size,
or a fixed sized data vector like e.g. a bitstring or a bytestring
representing anything, e.g. a state S of a model, mail or femail :-)
FlashHash for BF3 or BF2 (i.e. no hashcrosstalk) to be run only once per S.
This version and that in the executive summary above are both
functionally equivalent but coded differently :
d := w - c; { use max c < w/2 for k <= 1 +((w-1) div c), w = #bits in h0 }
M1 := M div k; { i.e. M := k*M1; }
M1min1 := M1 -1; h0 := HashFun(S); { store THIS h0 now, iff needed }
offset := 0; h := h0 % M1; { h := h0 & M1min1; is faster iff M1 = 2^i }
{ Use/test/set BF3[h] for j = 1 , or store in h[1] := }
for j := 2 to k do { generates k > 1 hashes per item or state S : }
begin { max prime or odd c <= (max c allowed) will be the best c }
h0 := (h0 << c) | (h0 >> d); { h0 rotated c bits to the left }
offset := offset + M1;
h := offset + (h0 % M1); { Use/set BF3[h] for k=j, or: h[j]:= }
{ or: h := offset + (h0 & M1min1); is allowed iff M1 power of 2 }
end;{ h := might be h[j] := ... ; use h before and in the loop }
BackHash is like FlashHash with h0 := HashFun(S) replaced with
h0 := FetchFromStore; (e.g. from a stack)
Iff h0's are stored (e.g. stacked, only one h0 per each S), no HashFun(S)
has to be redone for S's done so far, as BackHash can regenerate all k
hashes per S from its single h0 by bitshifts or, surprise, even
without h0 if flashhashing was done with k*c = w e.g. 32, and if no
final transformation like h := offset + (h0 % M1) was applied, which
would mean that a nonsegmented BF1 would have to be used. This seductive
save-no-h0 trick is not recommended and therefore not discussed.
Quiz: Why? Having read so far, the answer should not be too difficult.
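Before the BF1 variant, here is the BF3 fewliner transcribed into a
self-contained runnable Python sketch. The names BF3 and flashhash, the
toy segment size M1 = 2^16 and the test strings are my illustrative
choices; the bytewise 1-liner hashfun and the pairing c = 7 for k = 5
follow the text:

```python
W = 32
MASK = (1 << W) - 1

def hashfun(s):
    """Bytewise 1-liner in the style of the text:
    h0 := (h0 << 5) + h0 + (ch & 31), truncated to w = 32 bits."""
    h0 = 1
    for ch in s.encode():
        h0 = ((h0 << 5) + h0 + (ch & 31)) & MASK
    return h0

def rotl(h0, c):
    """Barrelshift h0 left by c bits: (h0 << c) | (h0 >> (w - c))."""
    return ((h0 << c) | (h0 >> (W - c))) & MASK

def flashhash(h0, k, c, M1):
    """Derive k hashes from the single 0-th hash h0 by repeated rotation.
    The j-th hash indexes its own segment, so there is no hashcrosstalk."""
    hs = [h0 % M1]                      # j = 1 uses h0 itself
    for j in range(2, k + 1):
        h0 = rotl(h0, c)
        hs.append((j - 1) * M1 + h0 % M1)
    return hs

class BF3:
    """Partitioned (k-segmented) Bloom filter driven by FlashHash."""
    def __init__(self, M1, k, c):
        self.M1, self.k, self.c = M1, k, c
        self.bit = bytearray(k * M1)    # one byte per bit, for clarity
    def insert(self, s):
        for h in flashhash(hashfun(s), self.k, self.c, self.M1):
            self.bit[h] = 1
    def member(self, s):
        return all(self.bit[h]
                   for h in flashhash(hashfun(s), self.k, self.c, self.M1))

bf = BF3(M1=1 << 16, k=5, c=7)          # c = 7 for k = 5, as in the text
for i in range(100):
    bf.insert("state-%d" % i)
```

With k = 5 and c = 7 the pre-check k <= 1 + ((w-1) div c) = 5 holds, so
each of the five hashes draws on a distinct window of h0's bits.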
FlashHash for monolithic BF1 (with hashcrosstalk) to be run only once per S:
{ M := k*M1; for comparisons of this BF1 with BF3 or BF2 }
{ Mmin1 := M - 1; for faster % iff M is a power of 2 ; see below }
d := w - c; { w = #bits in h0; use max c < w/2 for k <= 1 +((w-1) div c) }
h0 := HashFun(S); { store THIS h0 now, iff needed }
h := h0 % M; { h := h0 & Mmin1; is faster; use iff M power of 2 }
{ Use/check/set BF1[h] for j = 1 , or store in h[1] := }
for j := 2 to k do { generates k > 1 hashes per item or state S : }
begin { max prime or odd c <= (max c allowed) will be the best c }
h0 := (h0 << c) | (h0 >> d); { h0 rotated by c bits to the left }
h := h0 % M; { Here use i.e. test/set BF1[h] for the j-th hash, or: h[j] }
{ or: h := h0 & Mmin1; is faster but allowed iff M is a power of 2 }
end;{ h := might be h[j] := ... ; use h before and in the loop }
FlashHash in edu-C for monolithic BF1 with hashcrosstalk; run it once per S:
/* This version allows concentration of an inline code at one place */
/* M = k*M1; for comparisons between BF1 and BF3 or BF2 */
/* Mmin1 = M - 1; for faster % iff M is a power of 2 ; see below */
d = w - c; /* w = nr bits in h0; use max c < w/2 for k <= 1 +(w-1)/c */
h0 = HashFun(S); /* Store THIS h0 now iff needed */
for j = 1 to k do /* generates k hashes per item or state S : */
begin /* max odd or prime c <= (max c allowed) will be the best c */
if j > 1 then h0 = (h0 << c) | (h0 >> d); /* rotate bits */
h = h0 % M ; /* put HERE inline code using h */
/* or: h = h0 & Mmin1; this is faster; use it iff M power of 2 */
end; /* h = might be h[j] = ... ; use h before and in the loop */
-.-
+More explanations and hints for programmers :
HashFun(S) above may be your favorite, or this Bytewise 1-liner (or better):
h0 := 1; for ix := loB to upB do h0 := (h0 << 8) + h0 + S[ ix ];
h0 := 0 would be ok, but h0 := 1 may be better. Quiz: why ?
FH & BH employ a repeated barrelshift which cyclically step-wise rotates
the bits of the hash word h0 by c bits to the left i.e. by (j-1)*c bits
w.r.t. the original hash h0. Each turn of the loop yields one of the k
hashes per item S so that FlashHash works as an ultraefficient
multihashing function. BackHash is even faster.
If you need to remember but do not want to store all k hashes per item
( or per state when generating too many states e.g. for model checking
in SPIN ), then you can do with storing only the 0-th hash h0 and
recover i.e. reconstruct all k hashes with BackHash which is much
faster than FlashHash, because no hashfun needs to be ever recomputed
again as long as the 0-th hash h0 from FlashHash is still stored and
readily accessible e.g. on a stack or in a queue (LIFO / FIFO).
Here is how to generate c and d for the chosen k :
Either precompute c for h0's word width w = 32 bits thus:
for ctry := 1 to w do { huge M=k*M1 may need w = 64 bits i.e. h0 and h00 }
if k > ( 1 + ((w -1) div ctry) )
then begin c := ctry -1; Goto cLa end;
cLa:
{ c = 13 or 11 are large enough distances for bits' independence }
if c > 13 then c := 13; { prevents c > 32/2, e.g. c = 31 for k = 2 }
if (c % 2) = 0 then c := c - 1; { odd c is better than even c }
d := w - c;
Or pre-set c as follows, with odd c preferred:
if k > 1 then
case k of { the best c seems to be c = prime or odd: }
2, 3: c:=13; 4: c:=9; 5: c:=7;
6, 7: c:= 5; 8: c:=4; 9, 10: c:=3;
else: { here can be the "Either" loop from above }
end;
{ c = 13 or 11 are sufficient distances for bits' independence }
if c > 13 then c := 13; { prevents c > 32/2, e.g. c = 31 for k = 2 }
if (c % 2) = 0 then c := c - 1; { odd c is better than even c }
d := w - c;
Check( 1 < k <= 1 + ((w - 1) div c ) );
Check( 1 < c ); { 1 << c makes hashes less dependent }
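The two variants above should agree. Here is a Python transcription of
the "Either" loop (my own, with the Goto replaced by a break) followed by
the two corrections; for k = 2..10 it reproduces the case table, except
that for k = 8 the odd-c correction turns the case table's c = 4 into
c = 3 :

```python
def choose_c(k, w=32):
    """Rotation distance c for k hashes in a w-bit word: the 'Either'
    loop from above (Goto replaced by break) plus the two corrections."""
    c = w - 1                      # fallback; the loop may not fire for k = 1
    for ctry in range(1, w + 1):
        if k > 1 + (w - 1) // ctry:
            c = ctry - 1
            break
    if c > 13:                     # c = 13 or 11 suffice for independence
        c = 13
    if c % 2 == 0:                 # odd c is better than even c
        c -= 1
    return c

print({k: choose_c(k) for k in range(2, 11)})
```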
By now it should be clear that FlashHash deserves its name: its blitzy
performance in time & space derives from the "incremental" re-computation
based on computationally simple & blitzy bitshifting, which can be justified
by not too simplistic reasoning about k! permutations of k hashes for
k-segmented, hence "combinatorially imploded", Bloom filters of the 2nd and
3rd kind, i.e. for BF2 and BF3 only. The negative tests on BF1 mentioned
above have empirically confirmed this conclusion. Positive results are in
Tab. 3. By negative test results I mean that a flashhashed BF1 will almost
always make more false positives than a flashhashed BF3 or BF2, but the
difference need not be large.
So far on FlashHash. No more spinning about how to obtain k hashes in
general, and how to do it fast & memory-efficiently in particular, i.e.
how (not to have) to save them all in some applications like e.g. SPIN.
Despite the "Never say never", I firmly believe that no more efficient
multihash algorithms than FlashHash & BackHash ever existed or will
exist, as they have reached the absolute limit of what is possible.
-.-
+Bitwise rotations aka circular bitshifts aka barrelshifts :
The << , >> are bitwise shiftLeft, shiftRight ( shL , shR in BPascal);
the filling bits are zeroes.
The | is a bitwise OR in C i.e. bitwise OR on integers in BP.
BarrelshiftLeft by c bits of an unsigned 32 bit integer h0:
d := 32 - c;
h := (h0 << c) | (h0 >> d );
e.g. h := (h0 << 8) | (h0 >> 24);
e.g. h := (h0 << 1) | (h0 >> 31);
BarrelshiftLeft between two 32 bit words h0 and h00 taken AS IF they
form one w = 64 bit word (see its lower losses Fn in Tab. 3 above):
d := 32 - c;
h0 := HashFun1( S ); h00 := HashFun2( S );
t0 := h0 >> d; { t0 is a temporary aka shunt like for swap }
h0 := (h0 << c) | (h00 >> d);
h00 := (h00 << c) | t0;
h := offset + (h0 % M1); { for BF3. For BF1 use h := h0 % M; }
{ h := offset + (h0 & M1min1) iff M1 = 2^i ; is fastest }
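The shunt version above, wrapped as a C function for clarity (rotl2x32
is an illustrative name; h0 acts as the more significant word of the
64-bit pair):

```c
#include <assert.h>
#include <stdint.h>

/* Rotate the 64-bit pair (h0:h00) left by c bits, h0 being the high
   word, exactly as the three-line shunt version above does. */
static void rotl2x32(uint32_t *h0, uint32_t *h00, unsigned c) {
    unsigned d = 32 - c;
    uint32_t t0 = *h0 >> d;             /* shunt: bits falling off h0  */
    *h0  = (*h0  << c) | (*h00 >> d);   /* h0 refilled from top of h00 */
    *h00 = (*h00 << c) | t0;            /* shunt recycled into low word */
}
```

A quick way to convince yourself it is correct is to compare it with a
true 64-bit rotation of the concatenated value.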
BarrelshiftLeft by cL = 5 bits of a w = 20 bits integer bitstring
within a 32 bits word h0 so that 12 msbs are 000000000000 :
offbits := 32 - w; {= 12 bits to be put Off i.e. zeroed }
offvalue := 1048575; {= 2^w - 1 = 00..12x..000111...20x..111 }
offpluscL := offbits + cL; {= 12+5 = 17 }
h := (((h0 & offvalue) << offpluscL) >> offbits)    { safe Version 1 }
   | ((h0 & offvalue) >> (32 - offpluscL)); { mask msbs off here too,
                                          else garbage bits rotate in }
or: iff we know that the 32-w msbits are all 0 then it suffices :
h := ((h0 << offpluscL) >> offbits) | (h0 >> (32 - offpluscL)); { V2 }
E.g. when we computed a h0 of 32 bits and we want to test AS IF h0 has
w = 20 bits only, we use Version 1 for j = 1st hash, and V2 for all
further hashes j = 2 to k, per state S. Of course we could use the
Version 1 for all j as well. A new S must start with Version 1.
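Both versions as C functions (a sketch; names are illustrative). Note
that Version 1 masks the right operand as well, so that garbage in the
12 msbs cannot rotate in:

```c
#include <assert.h>
#include <stdint.h>

enum { W = 20, CL = 5, OFFBITS = 32 - W,          /* 12 bits zeroed    */
       OFFVALUE = (1u << W) - 1,                  /* 2^w - 1 = 1048575 */
       OFFPLUSCL = OFFBITS + CL };                /* 12 + 5 = 17       */

/* Version 1: safe even when the 12 msbs of h0 hold garbage. */
static uint32_t rot20_v1(uint32_t h0) {
    return (((h0 & OFFVALUE) << OFFPLUSCL) >> OFFBITS)
         | ((h0 & OFFVALUE) >> (32 - OFFPLUSCL));
}

/* Version 2: valid only when the 32 - w msbits are already zero. */
static uint32_t rot20_v2(uint32_t h0) {
    return ((h0 << OFFPLUSCL) >> OFFBITS) | (h0 >> (32 - OFFPLUSCL));
}
```

Both agree on clean inputs, and Version 1 also agrees with a plain
20-bit rotate-left-by-5 even when the msbs are dirty.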
-.-
+On hashing functions :
If you know how to obtain multiple mutually almost independent hashes, then
Bloom filters are quite easy to program, but less easy to understand, i.e.
to derive BF-formulas and to choose design parameters in order to be able
to exploit their engineering trade-offs properly for a near optimal design
and operation. Good hashes are the keys to good BFs.
To obtain excellent HashFun(S) just search the WWWeb for: "Bob Jenkins"
and hashing. Bob Jenkins works for Oracle and his hashfun is a funnel-
less miracle. Please consult the WWWeb on how hashing funnels may harm.
To try out FlashHash you need just one hashfun. If you do not have one
handy and you do not want to study & test hard, you can use the quick
& clean (yet good enough for many not too critical applications) 1-liner
operating bytewise on an item or a dataslab S[loB..upB] treated as a
string or an array of bytes to be hashed. As words consist of characters
or bytes, and bytes consist of bits, this hashfun is a pretty flexible
1-liner. When hashing ASCII strings, masking away 1, 2 or 3 msb's will
improve the quality of hashes and decrease the losses Fn and the failrate f.
h0 is an unsigned integer of, say, 32 bits. Turn off overflow checking,
if any, so that the intended wraparounds do no harm, and for speed anyway:
h0 := 1; for ix := loB to upB do h0 := (h0 << 8) + h0 + S[ ix ];
h0 := 0 might be worse. Quiz: guess why?
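The 1-liner as a self-contained C function (a sketch; hash257 is an
illustrative name):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* The bytewise 1-liner: h0 = h0*257 + S[ix], computed with a shift-add.
   On unsigned arithmetic any overflow simply wraps, which is harmless. */
static uint32_t hash257(const unsigned char *S, size_t len) {
    uint32_t h0 = 1;               /* h0 := 0 might be worse */
    for (size_t ix = 0; ix < len; ix++)
        h0 = (h0 << 8) + h0 + S[ix];
    return h0;
}
```

The shift-add form and the h0*257 + S[ix] form give identical results.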
Specialized 1-liners: { Masking msb's off with & : }
For strings in ASCII < 128 decimal: (h0 << 7) + h0 + (S[ix] & 127);
for smaller subsets of chars, e.g.: (h0 << 6) + h0 + (S[ix] & 63);
for even smaller subsets, e.g.: (h0 << 5) + h0 + (S[ix] & 31);
where some chars may be pre-mapped (i.e. pre-converted to other,
preferably unused or rarely used characters) which does not harm and
actually is likely to improve the hashes and lower the faults Fn, f .
On an unsigned integer my 1-liner is just a fast way of repeatedly computing
its mathematical equivalent h0 := h0*257 + S[ix]; note that 257 is a prime
number (this is not so important as long as it is an odd number. Quiz:
why odd ?). But do not forget that the goal of hashing is not math, but an
(ir)reversible compression of a chunk S into its much shorter, hence
generally non-unique, signature aka fingerprint called hash, i.e. here
the final numerical value of h0, which is used for accessing-by-indexing
an array called hash table (of bits for BFs).
If the range of h0 (e.g. 2^32 -1 = 4294967295 bits = 536870912 bytes)
is larger than the highest index number M-1 of the last bit in the
hashtable[ 0..M-1 ] of bits, then:
If M is a power of 2 i.e. M = 2^Int where Int > 0
then the range of the final hash value h0 can be halved (or quartered etc)
by zeroing its msb(s); the AND-masking will do it:
h0 := h0 & mask; { where mask = 0..0111..111 }
else clip the final h0 with the post-final h0 := h0 % M1;
where M1 is preferably a prime not too close to such "round numbers"
like powers of 2, 3,.., 10, etc.
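Both reductions in C (a sketch; the table sizes are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Reduce a 32-bit h0 into table indices 0..M-1 when M = 2^i:
   AND-masking zeroes the msbs, which is the fastest reduction. */
static uint32_t reduce_pow2(uint32_t h0, uint32_t M) {
    return h0 & (M - 1);              /* mask = 0..0111..111 */
}

/* Otherwise clip with modulus; M1 preferably a prime away from
   "round numbers" like powers of 2, 3, .., 10. */
static uint32_t reduce_mod(uint32_t h0, uint32_t M1) {
    return h0 % M1;
}
```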
If k different hashfuns are needed then we may obtain them as follows:
Precomputations (done only initially, only once per session):
1st: Generate a sequence of numbers loB..upB and
store it in Seq[1, loB..upB];
2nd: Initialize a pseudorandom generator of integers;
3rd: Generate k pseudorandomly permuted sequences of the nrs loB..upB:
for j := 1 to k do pseudorandomly Swap the elements in
Seq[j, loB..upB] until k pseudorandom sequences (each
containing all numbers loB..upB permuted) are stored in Seq.
Note: shake well, mix well i.e. do rather more swaps than less,
say, 10*(upB - loB + 1) or more swaps.
The hashing proper (done repeatedly, k times per hashed item):
4th: The j-th hash is obtained by scanning & hashing the item S[loB..upB]
in the order given by its precomputed pseudorandom Seq[j, loB..upB].
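A C sketch of steps 1 to 4 (all names and sizes illustrative; a tiny LCG
stands in for the pseudorandom generator, and the 257-multiplier 1-liner
above is reused as the scanning hashfun):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define K   4          /* number of hashfuns (illustrative)          */
#define LEN 16         /* item length upB - loB + 1 (illustrative)   */

static size_t Seq[K][LEN];     /* k precomputed permuted scan orders */

/* Tiny LCG standing in for "a pseudorandom generator of integers". */
static uint32_t seed = 12345u;
static uint32_t prand(void) { return seed = seed * 1103515245u + 12345u; }

/* Steps 1-3: store k well-shaken permutations of 0..LEN-1 in Seq. */
static void precompute_seqs(void) {
    for (int j = 0; j < K; j++) {
        for (size_t i = 0; i < LEN; i++) Seq[j][i] = i;
        for (int s = 0; s < 10 * LEN; s++) {       /* shake well, mix well */
            size_t a = prand() % LEN, b = prand() % LEN;
            size_t t = Seq[j][a]; Seq[j][a] = Seq[j][b]; Seq[j][b] = t;
        }
    }
}

/* Step 4: the j-th hash scans S in the j-th precomputed order. */
static uint32_t hash_j(int j, const unsigned char *S) {
    uint32_t h0 = 1;
    for (size_t i = 0; i < LEN; i++)
        h0 = (h0 << 8) + h0 + S[Seq[j][i]];
    return h0;
}
```

Each hash_j(j, S) scans the same bytes in a different precomputed order,
yielding k fairly independent hashes at the cost of k full scans.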
This simple process yields k fairly independent hashfuns, albeit at a
price of considerable CPU-time spent. Such an overhead may be small for
I/O-bound applications which do lots of reading/writing (e.g. of strings)
from/to a disk, which is 1000s of times slower than a fast memory. However,
for CPU-bound "generative" applications, like e.g. model checking in SPIN,
which generate millions of states each hundreds of bytes long, state
(re)computation and storage are the dominating performance factors. For such
apps I developed flashhashing, which of course can be used in I/O-bound
apps too.
-.-
+Summary of FlashHashing (FH) :
FH requires you to have only one (very) good hashfun h0, and it
will (re)generate more decent hashes from it in no measurable time.
FH makes it unnecessary to store more than one hash h0, because all k
hashes per state or item can be recovered from h0 in no measurable time
with a fewliner called BackHash, which is almost identical to FlashHash
but much faster, as BackHash does not really execute any hashfun; it just
recovers hashes from the one h0 stored per state S.
Of course the hashes will not be as independent as they could be if
generated separately, but this suboptimality can be partially
compensated by increasing by 1 the number k of hashes used. Just
remember the common-sensical engineering rules of thumb :
"The bigger the M = k*M1 , the larger the w will be better for low Fn.
"The larger the constant c, the more independent the hashes will be"
(it rhymes, so our Alzheimers can easily remember it :-) For example
if SPIN model checkers want k = 10 or 11 hashes, then they can use
c = 3 if the unsigned integers h0, h have 32 bits. The maximal
c can be easily precomputed or predefined for k = 2, 3, 4,.., 10,
.. up to 32 in which case it must obviously be poor c = 1. The general
condition which must be satisfied is k <= 1 + ((w - 1) div c) ,
hence we may do:
if k < 1 + ((w - 1) div c) then
begin Increase k by 1 and use as the last extra hash:
h0 := (h0 << (w - 1)) | (h0 >> 1);
{ e.g. h0 := (h0 << 31 ) | (h0 >> 1); }
end.
In fact c might vary (according to a fixed reproducible pattern!) during
the scan, but constants make our lives easy and computations breezy :-)
-.-
+From mission impossible via combinatorial implosion to mission accomplished
It would be no fun to write & to read this e-paper in the conventional dull,
historically falsified style using reductionistic terms "merely" and
"nothing but" or the phrases I hate, like "it obviously/easily follows",
since not much is obvious to anybody but few genuine geniuses belonging to
the category of Ideal Observers (IO) or Ideal Receivers (IR) [Kahre 2002].
For such Superior Beings (almost) all mathematics is nothing but a rather
trivial tautology not worth pursuing. Having to rely most of the time
on received wisdoms, I keep being surprised by what finally shows up to be
possible in science and technology in general and in IT in particular, as
IT is less limited by "material" restrictions than most other technologies.
I am writing this section because initially I believed that for high
quality Bloom filtering it is impossible to avoid generating k separate
hashfuns. I believed that, because I had my take of papers by theoreticians.
However, a few times before this, I have done things deemed impossible:
- Beating down the combinatorial explosions during protocol verification
by APPROVER = the first Automated & Proven Protocol Verifier introduced
for the first time at the Lake Arrowhead Workshop [Hajek 1977, 1978a].
- In 1988 I beat combinatorial explosion down again when generating
  one solution of the N nonattacking queens problem for all chessboards
  up to N = 1800 (by 1800) on an XT PC, while the then world record (on a
  PC ?) was for N = 96 by an ex-Stanford professor Dr. Harold Stone, then
  the manager of the advanced architectures studies at the IBM Thomas J.
  Watson Research Center [Stone & Stone 1987], [Hajek 1990]. My bad luck
  was that at about the same time two groups of researchers succeeded
  in turning this NP?-problem (yes, I have asked professor Stephen Cook at
  the University of Toronto, who should be called Dr. NP) into N.P. i.e.
  into No Problem, finally for N = 3M [Sosic & Gu 1990], and the NASA
  researchers working on the scheduling problem for the Hubble telescope
  [Minton, Johnston, Philips, Laird 1990], albeit on fast workstations with
  lots of memory, not on an XT with unexpandable 520 kBytes, which were my
  absolutely limiting factor. Again a mission impossible has become
  routine (for cognoscenti). Unless you are an expert on CSPs, please feel
  free to try N-queens for N > 100 to get a feeling for what I am talking
  about.
- About 1990 I decided to try my hand & head on the problem of in-flight
deletions from open addressed hash tables which has been described as if
impossible without their complete rehashing: "The situation can be
improved, by periodically rehashing the table" [Standish 1980, pp.157-9],
"The only solution to the problem of degraded search times is the ultimate
(total) reconstruction of the table by a process called rehashing"
[Reingold 1986, p.406 (1983, p.361) ]. Despite being explicitly warned
by these good professors, and implicitly by others who offered no true
solution to such true deletions, I succeeded in constructing a simple but
not simplistic algorithm for in-flight deletions. Weeks later, I found
its basic idea described, though not worked out (guess where? :-).
Only 12 years later, when adding this section to this e-paper, did I
notice a hint (but no more) in [Reingold 1986 & 1983, section 7.4,
Exercise 13].
- These three examples, all somehow related to the task at hand, suffice.
I am writing up these experiences mainly to show that one (wo)man's
impossibility is another (wo)man's possibility or even probability, and
that especially in algorithmics there are few known boundaries to human
ingenuity. After all, the Laws of InfoProcessing (LIPs :-) still wait to
be uncovered by an info-Ein Stein, but see [Kahre 2002] for a nice try.
Recalling these hands on-&-in experiences and still believing in the
impossibility of avoiding the computation of k hashfuns as independent
as possible, I searched for indirect inspiration by re-scanning the
works referred to further in this section. Amazingly, it did not take
more than keeping the problem in mind for a week before I saw the light.
There are impossibilities and "impossibilities"; see [Richards 1975]
who starts with a children's verse: "While one man was proving it couldn't
be done, another man tried and he did IT." See also [Garfield 1977] together
with [Newcomb 1903] reprinted there. More "down to the PC" remarks are
from [Knuth 1977, p.69]: "interpolation ... works better in spite of the
proof that binary search is best", followed on p. 80 by the Conclusions
of this most eminent professor emeritus (Stanford) of algorithmics: "Even
the 'best possible' algorithm can sometimes be improved if we change the
ground rules." FlashHash does change the classical rules; read on to find
out which one. Concerning classical approaches: "The word 'Classical'
means only one thing in science: it's wrong!" is how [Frieden 1983,1991,2001,
the 1st page of the chap. 16] has paraphrased Robert J. Oppenheimer, the
co-father of our nuclear age. Oppenheimer has also said that "There are
children playing in the street who could solve my top problems in physics,
because they have modes of perception that I lost long ago." His statement
fits perfectly with that of a zen master Shunryu Suzuki in his book Zen
Mind, Beginner's Mind: "In the beginner's mind there are many possibilities,
but in expert's there are few. ... You should be rather grateful for the
weeds you have in your mind, because eventually they will enrich your
practice. ... True understanding is actual practice itself."
Om Mani Padme Hum. Om Wagi Shori Mum. Om Vajra Pani Hum.
In this e-paper the "classically impossible" was reached :
- by literally "recycled randomness" (which may seem to be an oxymoron)
i.e. by stepping over the mental threshold of the classical requirement
of k independent hashfuns, by relativization of their independence
w.r.t. the observer, human, animal, machine or an abstract Gedanken-It.
In fact, any programmed randomness is an oxymoron as well, as it is
pseudorandom anyway. It is my firm opinion that the requirement of
randomness for applications should be defined w.r.t. the observer, who
does NOT (!) have to be the omniscient Ideal Observer (IO like e.g. the
well known Maxwell's demon, or Laplace's demon) quite similar to the
Ideal Receiver (IR) of Bar-Hillel & Carnap employed by [Kahre 2002], as
long as we do not deal with such 3-letter word Organizations like e.g.
KGA (= Kommietet of Gruesome Adversaries) and other Manichean devils
[Wiener 1954, chap. II & XI, spelled Manichaean]. By not being paranoid
when not necessary i.e. by assuming only entropic demons of confusion
and not of willful malice, we can compensate for the weak spots in the
"recycled randomness" by employing what I call "combinatorial implosion"
explained next. Manichean demons can be coped with by employing miniMax
strategies which minimize maximal possible losses. The price paid for
such (over)cautious miniMaxing is average losses higher than with other,
less conservative strategies.
- by incremental computations pushed to the extreme - to cyclicity, which
perfectly fits with my "recycled randomness" - which allows blitzy &
simple constructions of multiple hashes, and their reconstructibility
or recoverability by means of literal recycling.
- by recognizing that there is "hashcrosstalk" in the usual BF (BF1) and
by the key insight that the "no crosstalk" property of segmented Bloom
filters of the 2nd and 3rd kind (BF2 and BF3) can be exploited because
of k! permutations of potential matches (of which always only one will
actually match i.e. the chances of an unwanted pseudo-match decrease
hyperexponentially with k ). More often than not the combinatorial
expressions like k! (which is about the worst simple one we can get)
are our enemies we have to fight in combinatorially explosive tasks
(e.g. in constraint satisfaction problems aka CSP's like the mentioned
classical N-queens problem). But here the k! is our ally which works
against the weaknesses inherent in "recycled hashes". Hence my new
terminus technicus "combinatorial implosion". It would be interesting
to try to employ it on a wider basis i.e. to try to attack algorithmic
problems by explicit or even systematic reference to "combinatorial
implosion".
The synergistic effect of these elements is that k hashes are recycled,
and if necessary recovered, from a single result h0 of a single call per
S of one HashFun(S) only.
I started this e-paper with the necklace metaphor inspired by Hermann
Hesse's poems, containing such jewels like e.g. the following fragments:
"It is a game whose rules are nice and neat."
.......
"Those hieroglyphs once so significant { Those progs ... }
that now are only colored bits of glass, { bits of chips }
he lets them roll until their force is spent { roll i.e. recycle bits }
and silently they vanish in the sand." { silicon is made from sand }
.......
"Proof that for every color may be found { .. for every state S be found }
in music a proper corresponding key." { in an algorYthm a ... hashkey }
.......
"The pattern sings like crystal constellations, { crystal clear bitpatt }
and when we tell our beads, we serve the whole, ..." { we count our bits }
Telling i.e. counting the beads (or bits :-) is what The Tibetan Book of
the Dead mentions exactly at what I have identified as its culmination
point:
"Even at the time that the pebbles are being counted out, be not frightened,
nor terrified; tell no lies; and fear not the Lord of Death. .....
The Lords of Death (aka the Executive Furies) are thine own hallucinations.
... Apart from one's own hallucinations, in reality there are no such
things existing outside oneself like Lord of Death, or god, or demon, ...
Act so as to recognize this."
-.-
+References [ refs ] :
In the titles of books and periodicals all words start with a CAP (except
for the words like eg: a, the, and, or, in, of, to, with, der, die, das).
Bloom B.H.: Some techniques and trade-offs affecting large data base
retrieval times, Proc. ACM Nat. Conf., 24, 1969, 83-95.
Bloom B.H.: Space/time trade-offs in hash coding with allowable errors,
CACM 13/7, July 1970, 422-426.
Carter L., Floyd R., Gill J., Markowsky G., Wegman M.: Exact and
approximate membership testers, Proc. 10th Annual ACM Symp. on
Theory of Computation, San Diego, 1978, 59-65; they refer to Bloom 1970.
Cleary J.G.: Compact hash tables using bidirectional linear probing,
IEEE Trans. on Computers, C-33/9, Sept. 1984, 828-834.
Czech Z.J. et al: Perfect hashing, Theoretical Computer Science, vol. 182,
1997, 1-143 = a strong but long story for the strong on long weekends :-)
Frieden Roy. B.: Probability, Statistical Optics, and Data Testing,
1st ed. 1983, 2nd ed. 1991 has new chap. 17. In the 3rd ed. of 2001
start reading the vastly expanded chap. 17 on p. 413-etc.
Garfield Eugene: Negative science and "The outlook for the flying
machine", Current Comments in his Current Contents, Nr. 26, June 27,
1977, 155-166; followed by [Newcomb 1903].
Gremillion L.L.: Designing a Bloom filter for differential access,
CACM 25/9 (1982), 600-604.
Hajek Jan: the 1st presentation of APPROVER = Automated & Proven PROtocol
VERifier at the Lake Arrowhead Workshop, California, August 1977.
From its very beginning, APPROVER was sufficiently advanced to perform
reachability analysis of protocols specified as procedures in Burroughs
Algol, so that much more primitive finite state models (FSA) had to be
translated into Algol and not vice versa. ARPA's (later: DARPA) TCP =
Transmission Control Program (later: Protocol) has been specified by Dr.
Carl Sunshine, Rand Corp., and design bugs were found in TCP by APPROVER
[Sunshine 1978]. ARPANET was the mother of the Internet, which still runs on TCP.
Hajek Jan: Progress report on the automatic and proven protocol verifier,
Computer Communication Review, ACM SigComm 8/1, January 1978, 15-16.
Hajek Jan: Automatically verified data transfer protocols, Proc. 4th
Intl. Computer Communications Conference, Kyoto, Sept. 1978, 749-756,
with 4 protocols and their 4 Effective Progress/Stagnation graphs aka
P/S-graphs. On p. 752 left column, 2nd line from below: please fix the
otherwise nonexistent ":= phaseA" into the correct ":= phasesA".
Hajek Jan: Protocols verified by APPROVER, Computer Communication Review,
ACM SigComm 9/1, January 1979, 32-34.
Hajek Jan: A new dynamic backtracking algorithm for constraint satisfaction
problems, technical report CC-Note 50, 1990, 18 pages.
Holzmann Gerard J.: An improved protocol reachability analysis technique,
Software Practice and Experience, 18/2, Feb. 1988, 137-161, on pp. 144
and 161 refers to [Morris 1968] and to [McIlroy 1982].
Holzmann Gerard J.: Algorithms for automated protocol verification, AT&T
Technical Journal, Jan./Feb. 1990, 32-44.
Kahre Jan: The Mathematical Theory of Information, 2002 book, Kluwer pub.
Knuth Donald: The Art of Computer Programming, vol. 3: Sorting and
Searching, 1973, see his index for Superimposed coding, and for Bloom
method for which the correct approximate yet sufficiently precise
formula is on p. 562 mid in the 1st edition, and on p. 573 up in the
2nd edition.
Knuth D. E.: Algorithms, Scientific American, April 1977, 63-80.
Lampson B.W., Sproul R.F.: An operating system for a single-user machine,
Proc. 7th Symp. on Operating Systems Principles, Dec. 1979, 98-105.
McIlroy M.D.: Development of a spelling checker, IEEE Trans on Commun.,
COM-30/1, Jan. 1982, 91-99; hashing, superimposed codes, Bloom, Morris,
Floyd & Gill on p. 95 and in the references, where [10] is Knuth vol.3.
Minton Steven, Johnston Mark D., Philips Andrew B., Laird Philip: Solving
large scale constraint satisfaction problems using a heuristic repair
method, International Joint Conference on Artificial Intelligence,
1990, IJCAI-90, 17-24.
Morris R.: Scatter storage techniques, CACM 11/1, 1968, 38-44.
Mullin J.K.: A second look at Bloom filters, CACM 26/8 (1983), 570-571.
Murray D.M.: A scatter storage for dictionary lookups, Journal of Library
Automation, Sept. 1970, 173-201, mathematical analysis on pp. 177-183,
virtual hashing analyzed on pp. 180-183 by a computer scientist from
Cornell University.
Newcomb Simon: The outlook for the flying machine, The Independent - A
Weekly Magazine, Oct. 22, 1903. Reprinted in [Garfield 1977, 167-172].
Recommended WWW search: Newcomb AND flying
Nix Robert: Experience with a space efficient way to store a dictionary,
CACM 24/5, 1981, 297-298; refers Carter et al., describes Bloom filter
without directly referring to Bloom (but Carter et al do refer to Bloom).
Ramakrishna M.V.: Practical performance of Bloom filters and parallel
free-text searching, CACM 32/10 (1989) 1237-1239; the reference 3. to
Gremillion should be Sept. 1982 and not July 1982.
Reingold Ed.M., Hansen W.J.: Data Structures in Pascal, 1986 book (there
was a 1983 edition), section 7.4.3 on Multiple hash functions, 410-413;
their math on pp. 412-413 starts with the assumption that maximum amount
of entropy is good (for Bloom filters, but Bloom is not named).
Richards Ian: Impossibility, Mathematics Magazine, Nov.-Dec. 1975, 249-262
Robert Ch.S.: Partial-match retrieval via the method of superimposed
codes, Proceedings of the IEEE, 67/12, Dec. 1979, 1624-1642.
Shannon C.E.: The Mathematical Theory of Communication, 1948, book 1949.
Sincovec R.F., Wiener R.S.: Data Structures Using Modula-2, 1986 book,
pp. 396-408 on probabilistic hashing and on virtual hashing.
Sosic Rok, Gu Jun: A polynomial time algorithm for the N-queens problem,
ACM Sigart Bulletin, vol. 1, no. 3, 1990, 7-11. Two more articles in
ACM Sigart Bulletin, vol. 2, no. 2, 1991, 8 and 22-24. Further in
ACM Sigart Bulletin, vol. 3, no. 1, 1992, 8-12.
Stallings W.: Password generation by Bloom filters, Dr. Dobbs Journal,
Aug. 1994, 119-etc, formulas are ok, but fig.2 is not.
Standish Th. A.: Data Structure Techniques, 1980 book.
Stone Harold. S., Stone Janice M.: Efficient search techniques - An
empirical study of the N-queens problem, IBM J. Research & Development,
vol. 31, no. 4, July 1987, 464-474.
Private communication by letters with Harold Stone in 1988 & 1989.
Sunshine Carl A.: Survey of protocol definition and verification
techniques, ACM Computer Communication review, vol.8, July 1978, nr.3,
35-41, see Table 1 for APPROVER & Hajek & ARPA TCP on p. 39.
Vaihinger Hans: Die Philosophie des Als Ob - Volks-Ausgabe, 1923 book.
Wiener Norbert: The Human Use of Human Beings - Cybernetics and Society,
2nd (expanded) edition 1954, (1st in 1950).
Wiener Richard S.: An efficient virtual hash algorithm for a spelling
checker, Journal of Pascal, Ada & Modula-2, Jan.-Feb. 1986, 23-29;
with modules programmed in Modula-2.
Wolper Pierre, Leroy Denis: Reliable hashing without collision detection,
in Computer Aided Verification, Proc. Int. Workshop, Elounda, Crete,
Lecture Notes in Computer Science, Vol. 697, Springer-Verlag, book,
June 1993, pp. 59-70.
Wolper Pierre, Stern Ulrich, Leroy Denis, Dill David L.: Reliable
probabilistic verification using hash compaction, on WWWeb only, 30 pp.
(temporarily withdrawn by Ulrich Stern)
-.-