auto92089 at hushmail.com
2001-Jun-05 05:06 UTC
11.82 bpw clarification (was Re: ssh-keygen(1) misinfo: English prose entropy 0.6 - 1.3 b/char!)
>Trolling is such fun, isn't it?I was trying to making merriment, not to provoke emotional responses. I am conflicted on publicly responding, but decided to clarify just in case your confusion is shared by others with better impulse control. Unfortunately, an unchallenged statement made in the company or under the scrutiny of experts is often assumed true, which is really the original problem I sought to fix! I notice that the maintainers have quietly fixed the problem in the web-accessible ssh-keygen manpage and presumably latest code - thanks. One shudders at how many crypto-enthusiasts have read that page and failed to notice or correct it!>Entropy rate does not mean thatIf you read the manpage which I quoted for your convenience, it said "1-2 bits of entropy per word". Nowhere has anyone mentioned "rate". Even so, a rate is the relation between two measurements, and we could easily have bits per character or bits per word, depending on the definition of "symbol" used. As such, my interpretation (and I suspect many other people's) is that the manpage was describing the entropy of an entire word (...given the word's preceeding context). I suspect the original author of the manpage is from Finland also; perhaps the confusion is due to some language difference.>if you are shown an English word, >character by character, how long does it take to guess the whole word? >Try a couple of them: >moro_ >idio_ >trol_What you are alluding to is the conditional probability of the (in your examples, last) character dependent on the preceeding ones. While that is an interesting measurement, it's not the one under discussion, and has units of "bits per character" (or letter). http://cs.fit.edu/~mmahoney/dissertation/entropy1.html "Refining the Estimated Entropy of English by Shannon Game Simulation" A reasonable primer on Shannon's estimates of entropy. http://www.stanford.edu/~vjsriniv/project/entropy_of_english_9.htm Covers both of Shannon's methods for estimating H for English (by character using n-grams, by word using Zipf's relations) http://www.voynich.nu/wordent.html "From digraph entropy to word entropy in the Voynich MS" This is cited here because it relates bits per letter and bits per word by the trivial relation of summing, and because it's very cool. Note that for comparisons of different lengths you need to pad each word with a word-end symbol, the first of which may add some entropy (as in "word" vs. "words"). There's a ton of interesting articles which pop up after searching for "entropy" and "shannon" and ilk, but I must stop somewhere and introduce the coup de grace, courtesy of some obscure man who worked for the telephone company. Note that this is the independent or 0-order entropy; I was unable to find a handy reference with contextual word entropy estimates, but clearly is far from 1-2 bpw: "Prediction and Entropy of Printed English", by C.E. Shannon Bell Syst. Tech. J. 30:50-64 1951 "...Using logarithmic scales for probability and rank, the curve approximates a straight line with slope -1; this, if p_n is the probability of the nth most frequent word, we have, roughly p_n = 0.1/n . Zipf has pointed out that this type of formula, p_n = k/n, gives a rather good approximation to the word probabilities in many different languages. The formula [above] clearly can not hold indefinitely since the total probablility SUM p_n must be unity, while SUM_1^inf 0.1/n is infinite. If we assume (in the absence of any better estimate) that the formula p_n = 0.1/n holds out to the n at which the total probability is unity, and that p_n = 0 for larger n; we find that the critical n is the word rank 8,727. The entropy is then: -SUM_1^8727 p_n log_2 p_n = 11.82 bits per word." YHNBT. YHL. HAND. I, Zone Lee Free, encrypted, secure Web-based email at www.hushmail.com
Markus Friedl
2001-Jun-05 09:17 UTC
11.82 bpw clarification (was Re: ssh-keygen(1) misinfo: English prose entropy 0.6 - 1.3 b/char!)
On Mon, Jun 04, 2001 at 11:06:38PM -0600, auto92089 at hushmail.com wrote:> I notice that the maintainers have quietly fixed the problem > in the web-accessible ssh-keygen manpage and presumably > latest code - thanks.this was not 'quietly fixed' since fixed it in the CVS. however, i did not find a name your mail, so i could not give credit.> One shudders at how many crypto-enthusiasts > have read that page and failed to notice or correct it!it's in the ssh-keygen manpage since 1995