Hashing has various cryptographic uses, including the generation of fixed-length keys from arbitrary key phrases, and the accumulation or "distillation" of randomness or "entropy." Here we have a discussion on CRC polynomials, comments on hash127 (said to be faster than CRC), and a discussion resulting from "reverse CRC."
Subject: Re: So THAT's How It Works !!! Date: 09 Oct 1998 07:46:13 +0200 From: Kai Harrekilde-Petersen <khp@dolphinics.no> Message-ID: <kk3ogrm1e56.fsf@sciway.dolphinics.no> References: <handleym-0810982209530001@macip2-149.apple.com> <eeeF0JHwF.D8o@netcom.com> Newsgroups: comp.arch Lines: 26 handleym@ricochet.net (Maynard Handley) writes: > Maybe I'm wrong but, at least for CRCs, the impression I have is that all > the magic numbers and polynomials involved were simply "discovered", > they're not part of some massively deep theory. One simply uses them > because they're there, not because of any other remarkable features they > may have. Maynard, you're wrong on this one. CRC are heavily coupled to Error Correcting Codes (ECC), and both have a large body of math behind them. The polynomials and the numbers may seem 'magical' to you, but they aren't. ECC's (and CRC's for that matter) aren't "discovered"; they are _constructed_ or 'designed'. For example, a Reed-Solomon code used in Satellite communications is a RS(233,32,16) code: It takes 233 bytes of data, adds 32 bytes of parity, and can correct 16 independent 1-byte errors. Generally, you can design an RS code with 255-N bytes of data, adds N bytes of parity, and can correct int(N/2) independent errors. The math department of your local University or College, will probably have a course in Error Correcting Codes. --Kai
Subject: Re: So THAT's How It Works !!! Date: 09 Oct 1998 07:50:05 +0200 From: Kai Harrekilde-Petersen <khp@dolphinics.no> Message-ID: <kk3lnmq1dyq.fsf@sciway.dolphinics.no> References: <slrn71r7fj.8tq.smithz@kriek.cae.wisc.edu> <eeeF0JHwF.D8o@netcom.com> Newsgroups: comp.arch Lines: 21 smithz@kriek.cae.wisc.edu (Zak Smith) writes: > On Fri, 9 Oct 1998 03:05:03 GMT, Mark Thorson <eee@netcom.com> wrote: > >You divide by a prime number slightly larger than the > >data. E.g. if you're sending a 16-bit word, you use > >the following magic numbers: 11000000000000101 (CRC-16) > >or 10001000000100001 (CCITT v.24). > > Err, these aren't prime numbers: > > 11000000000000101 = 1 + 4 + 2^15 + 2^16 = 98309 = 37 * 2657 > 10001000000100001 = 1 + 2^5 + 2^12 + 2^16 = 69665 = 5 * 13933 > > Is that what you meant? We're talking polynomials in base 2, and that's a bit different from plain math. I cannot tell from looking at the polynomials above whether they are primitive or not - but if they are the official CRC's they should be. --Kai
Subject: Re: So THAT's How It Works !!! Date: 09 Oct 1998 12:15:50 +0200 From: Kai Harrekilde-Petersen <khp@dolphinics.no> Message-ID: <kk37lya11nt.fsf@sciway.dolphinics.no> References: <361de1a6.1117943@news.megsinet.net> <kk3lnmq1dyq.fsf@sciway.dolphinics.no> Newsgroups: comp.arch Lines: 19 msimon@tefbbs.com writes: > Are number only prime in a given base? > > I don't think so. You seem to miss the point. We are talking about polynomials, working in finite fields. In the finite field (2^0) 1+1=0. In ordinary math 1+1=2. The basic definitions of math in finitie fields and in N (the set of integer numbers) are the same, but when you do the matj, the results are quite different (eg: see above). As far as I learned at university, this means that you cannot use "ordinary" math to determine if a certain polynomial is primitive in certain Finite Field. --Kai
Subject: Re: So THAT's How It Works !!! Date: Fri, 9 Oct 1998 12:30:10 GMT From: dik@cwi.nl (Dik T. Winter) Message-ID: <F0K82A.2KM@cwi.nl> References: <361de590.2113019@news.megsinet.net> <kk37lya11nt.fsf@sciway.dolphinics.no> Newsgroups: comp.arch Lines: 28 In article <361de590.2113019@news.megsinet.net> msimon@tefbbs.com writes: > The CRC compares remainders. > > The reason for dividing by a prime is to distribute the remainders. But the CRC does not divide by a prime. It calculates the remainder of a polynomial by a primitive polynomial over the field mod 2. And that is something completely different. Or more precise: consider a message consisting of 0's and 1's. Now write the message as a polynomial in a variable X, e.g. the message 1011011000101 becomes 1.X^12 + 0.X^11 + 1.X^10 + 1.X^9 + 0.X^8 ... (where ^ is exponentiation), this can be shortened to X^12 + X^10 + X^9 + X^7 + X^6 + X^2 + 1. The complete message is divided by the CRC generating polynomial, for CCITT X^16 + X^12 + X^5 + 1. The remainder is the CRC (which is a polynomial but trivially can be written in 0's and 1's). Note that the CCITT when written as a binary number is *not* prime (it is divisable by 5), but as a polynomial over GF(2) it is primitive, i.e. *not* divisible by another polynomial over GF(2). The reason for all this is that it is much simpler to implement in hardware, only a shift register with feedback is needed. -- dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131 home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Subject: Re: So THAT's How It Works !!! Date: Fri, 09 Oct 1998 18:48:52 GMT From: ritter@io.com (Terry Ritter) Message-ID: <361e5a7b.8169985@news.io.com> References: <F0K82A.2KM@cwi.nl> Newsgroups: comp.arch Lines: 74 On Fri, 9 Oct 1998 12:30:10 GMT, in <F0K82A.2KM@cwi.nl>, in comp.arch dik@cwi.nl (Dik T. Winter) wrote: > >The complete message is divided by the CRC generating polynomial, for >CCITT X^16 + X^12 + X^5 + 1. The remainder is the CRC (which is >a polynomial but trivially can be written in 0's and 1's). >Note that the CCITT when written as a binary number is *not* prime (it >is divisable by 5), but as a polynomial over GF(2) it is primitive, i.e. >*not* divisible by another polynomial over GF(2). Alas, I happen to know that 11 is a factor, and we don't have to argue about it, we can show it: (This is mod-2 math. For a quick introduction, see my Crypto Glossary.) | 1111000000011111 | ------------------ | 11 ) 10001000000100001 ( bits set = 16+12+5+0 ) | 11 | -- | 10 | 11 | -- | 10 | 11 | -- | 11 | 11 | -- | 000000010 | 11 | -- | 10 | 11 | -- | 10 | 11 | -- | 10 | 11 | -- | 11 | 11 | -- | 0 remainder (!!!) Check: | 1111000000011111 | x 11 | ---------------- | 1111000000011111 | 1111000000011111 | ----------------- | 10001000000100001 = 16+12+5+1 bits set So, no, CRC-CCITT is *not* irreducible. (Which also means it is *certainly* not primitive!) These first-generation standard CRC's (along with CRC-16) were constructed by taking an irreducible times the poly 11. The poly 11 effectively provides the equivalent of the parity used in the earlier systems. As I recall, the argument was that CRC would thus never be worse than parity. --- Terry Ritter ritter@io.com http://www.io.com/~ritter/ Crypto Glossary 1998-08-27: http://www.io.com/~ritter/GLOSSARY.HTM
Subject: Re: So THAT's How It Works !!! Date: Mon, 12 Oct 1998 09:17:00 GMT From: dik@cwi.nl (Dik T. Winter) Message-ID: <F0pJ4D.A5J@cwi.nl> References: <361e5a7b.8169985@news.io.com> Newsgroups: comp.arch Lines: 14 In article <361e5a7b.8169985@news.io.com> ritter@io.com (Terry Ritter) writes: > On Fri, 9 Oct 1998 12:30:10 GMT, in <F0K82A.2KM@cwi.nl>, in comp.arch > dik@cwi.nl (Dik T. Winter) wrote: > >is divisable by 5), but as a polynomial over GF(2) it is primitive, i.e. > >*not* divisible by another polynomial over GF(2). > > Alas, I happen to know that 11 is a factor, and we don't have to argue > about it, we can show it: Indeed yes, as it has an even number of non-zero coefficients it is obviously divisible by X+1. Argh. -- dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131 home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/ Nyx is not responsible for the actions of its users. Our AUP / Free Speech Policy are at http://www.nyx.net/policies/ Direct complaints to abuse@nyx.net
Subject: Re: So THAT's How It Works !!! Date: Sun, 11 Oct 1998 15:43:46 GMT From: colin@nyx10.nyx.net (Colin Plumb) Message-ID: <908120626.503631@iris.nyx.net> References: <eeeF0JHwF.D8o@netcom.com> Newsgroups: comp.arch Lines: 35 In article <eeeF0JHwF.D8o@netcom.com>, Mark Thorson <eee@netcom.com> wrote: > For 20 years I've been hearing about this important > error checking algorithm called CRC (cyclic redundancy > checking), and I never really needed to know what it > was until today. > > It turns out to be really simple. You do an integer > divide of the word you are sending, and tack on the > remainder to the message. The receiver does the same > integer divide and if the remainders match, there's > probably no error. > > You divide by a prime number slightly larger than the > data. E.g. if you're sending a 16-bit word, you use > the following magic numbers: 11000000000000101 (CRC-16) > or 10001000000100001 (CCITT v.24). If the remainders > match, that catches all single and double bit errors, > plus lots besides. > > Whoa! How simple something can be when you actually > look and see how it works! This is *basically* the idea, except for the fact that: - These aren't integers. Indeed, these aren't "numbers" as non-mathematicians understand the concept. (Nonethless, the analogies are strong, as in both cases, you're working in a ring.) - They aren't primes. The polynomial equivalent is "irreducibility", meaning having no non-trivial divisors, aand both polynomials are reducible. - Irreducibility isn't enough, you need primitivity as well. But still, it *is* basically the same concept. -- -Colin
Subject: Re: So THAT's How It Works !!! Date: Sun, 11 Oct 1998 18:30:13 GMT From: ritter@io.com (Terry Ritter) Message-ID: <3620f88e.5661791@news.io.com> References: <908120626.503631@iris.nyx.net> Newsgroups: comp.arch Lines: 30 On Sun, 11 Oct 1998 15:43:46 GMT, in <908120626.503631@iris.nyx.net>, in comp.arch colin@nyx10.nyx.net (Colin Plumb) wrote: >[...] >- Irreducibility isn't enough, you need primitivity as well. No, we don't *need* primitivity (or even irreducibility), obviously, since the early CRC polys were composite. Nor, I think, does it even help very much, in the normal low-error-rate situation. There have even been proposals for software validation by arbitrary random poly, thus preventing virus writers from knowing the CRC they would encounter. Each user would just take a poly at random. I do suspect that primitivity could provide some additional CRC "guarantees" in particular cases. But in the long run there are always vast numbers of error patterns that are undetectable with any particular CRC poly, primitive or irreducible. This is acceptable because most error patterns *are* detectable. And most of that performance is delivered simply by having some irreducible factor of reasonable size. But, on the other hand, if we can get a primitive, why not? --- Terry Ritter ritter@io.com http://www.io.com/~ritter/ Crypto Glossary 1998-08-27: http://www.io.com/~ritter/GLOSSARY.HTM
Subject: Guaranteed message authentication faster than MD5 Date: 6 Apr 1999 23:34:22 GMT From: djb@koobera.math.uic.edu (D. J. Bernstein) Message-ID: <1999Apr623.34.22.28665@koobera.math.uic.edu> Newsgroups: sci.crypt Lines: 18 The first release of my hash127 package is available from http://pobox.com/~djb/hash127.html hash127 computes a 16-byte secret-key signature of an arbitrarily long message. An attacker cannot guess a signature of a different message with probability better than b/2^{131}, where b is the total number of bits in the messages and signatures. hash127 is the first ``universal hash function'' at this security level to break the MD5 speed barrier. The first release of hash127 is faster than the best asm versions of MD5 on a variety of architectures. hash127 can be used as a fast replacement for HMAC-MD5 or HMAC-SHA. I've set up a mailing list for discussions related to hash127. To join, send an empty message to hash127-subscribe@list.cr.yp.to. ---Dan
Subject: Re: Guaranteed message authentication faster than MD5 Date: Thu, 8 Apr 1999 06:24:14 GMT From: phr@netcom.com (Paul Rubin) Message-ID: <phrF9uxsE.KuA@netcom.com> References: <1999Apr623.34.22.28665@koobera.math.uic.edu> Newsgroups: sci.crypt Lines: 8 In article <1999Apr623.34.22.28665@koobera.math.uic.edu>, D. J. Bernstein <djb@koobera.math.uic.edu> wrote: >hash127 computes a 16-byte secret-key signature of an arbitrarily long >message. An attacker cannot guess a signature of a different message >with probability better than b/2^{131}, where b is the total number of >bits in the messages and signatures. Care to explain anything about the algorithm?!
Subject: Re: Guaranteed message authentication faster than MD5 Date: 9 Apr 1999 04:17:13 GMT From: djb@koobera.math.uic.edu (D. J. Bernstein) Message-ID: <1999Apr904.17.13.12567@koobera.math.uic.edu> References: <phrF9uxsE.KuA@netcom.com> Newsgroups: sci.crypt Lines: 29 Paul Rubin <phr@netcom.com> wrote: > Care to explain anything about the algorithm?! What the code does is compute a polynomial s = (r^{n+1} + m[0] r^n + m[1] r^{n-1} + ... + m[n-1] r + x) mod p at very high speed, given 32-bit chunks m[0],m[1],...,m[n-1] and big secret integers r,x. Here p is a public prime, namely 2^127-1. See http://pobox.com/~djb/papers/hash127.dvi for some computational details. If you have just one message m[0],m[1],...,m[n-1] to sign, you can use s directly as a signature of the message. It's easy to prove that an attacker, even with infinite computer power, has a negligible chance of correctly guessing a signature of a different message. If you have many messages to sign, you can either * use the same r for each message, always use x = 0, and feed s through a secret 128-bit function before sending it; or * use the same r for each message, and generate a new x for each message by feeding a nonce through a secret function. Both methods are provably safe against an opponent who can't predict the secret function. The first method doesn't need nonces; the second method has lower latency in some applications. ---Dan
Subject: Re: Guaranteed message authentication faster than MD5 Date: Sat, 10 Apr 1999 14:25:30 GMT From: Scott Fluhrer <sfluhrer@ix.netcom.com> Message-ID: <7enmg9$44h@dfw-ixnews10.ix.netcom.com> References: <1999Apr904.17.13.12567@koobera.math.uic.edu> Newsgroups: sci.crypt Lines: 29 In article <1999Apr904.17.13.12567@koobera.math.uic.edu>, djb@koobera.math.uic.edu (D. J. Bernstein) wrote: >Paul Rubin <phr@netcom.com> wrote: >> Care to explain anything about the algorithm?! > >What the code does is compute a polynomial > > s = (r^{n+1} + m[0] r^n + m[1] r^{n-1} + ... + m[n-1] r + x) mod p > >at very high speed, given 32-bit chunks m[0],m[1],...,m[n-1] and big >secret integers r,x. Here p is a public prime, namely 2^127-1. See >http://pobox.com/~djb/papers/hash127.dvi for some computational details. > >If you have just one message m[0],m[1],...,m[n-1] to sign, you can use s >directly as a signature of the message. It's easy to prove that an >attacker, even with infinite computer power, has a negligible chance of >correctly guessing a signature of a different message. The most obvious weakness in this is that anyone who has access to r and x can easily create collisions (that is, two different messages with the same signature). This reduces where this algorithm can be used to ones where both the sender and the receiver are trusted. For example, this cannot be used as part of a signature scheme. -- poncho
Subject: Re: Guaranteed message authentication faster than MD5 Date: 10 Apr 1999 22:36:11 GMT From: djb@koobera.math.uic.edu (D. J. Bernstein) Message-ID: <1999Apr1022.36.11.22467@koobera.math.uic.edu> References: <7enmg9$44h@dfw-ixnews10.ix.netcom.com> Newsgroups: sci.crypt Lines: 32 Secret-key signatures (also known as ``message authentication codes'') are much faster than public-key signatures. That's why they're used in ssh and many other systems. There's no need for the public to be able to check the signatures on ssh packets. Today most people use secret-key signatures based on MD5 or SHA, such as HMAC-MD5: The signature of m under secret key k is MD5(k,MD5(k+blah,m)). hash127 can be used for the same applications. hash127 is faster than HMAC-MD5 for messages of any size, and the underlying function includes a mathematical guarantee that it will never need to be replaced. Scott Fluhrer <sfluhrer@ix.netcom.com> wrote: > This reduces where this algorithm can be used to ones > where both the sender and the receiver are trusted. Actually, secret-key signatures can be used for untrusted communication. For example, one natural structure for signed person-to-person email is g^a, g^b, t, s, message where g^a is the sender's public key, g^b is the receiver's public key, t is a nonce chosen by the sender, and s is a signature of the message under a shared secret key derived from g^a, g^b, and t. The security properties here are just like face-to-face communication: the receiver knows that the message is authentic, but he can't prove it to someone else later. ---Dan
Subject: Re: Guaranteed message authentication faster than MD5 Date: Wed, 14 Apr 1999 18:16:20 GMT From: pierrepuzer@hotmail.com Message-ID: <7f2m1b$lm7$1@nnrp1.dejanews.com> References: <1999Apr623.34.22.28665@koobera.math.uic.edu> Newsgroups: sci.crypt Lines: 12 D. J. Bernstein wrote: > hash127 is the first ``universal hash function'' at this security level > to break the MD5 speed barrier. Is it possible to use hash127 or ideas from it to construct a provably unpredictable random function? How, exactly, do you prove the safeness of hash127? I read the abstract but it has no proof. -----------== Posted via Deja News, The Discussion Network ==---------- http://www.dejanews.com/ Search, Read, Discuss, or Start Your Own
Subject: Re: Guaranteed message authentication faster than MD5 Date: 14 Apr 1999 23:09:39 GMT From: djb@koobera.math.uic.edu (D. J. Bernstein) Message-ID: <1999Apr1423.09.39.15909@koobera.math.uic.edu> References: <7f2m1b$lm7$1@nnrp1.dejanews.com> Newsgroups: sci.crypt Lines: 50 > How, exactly, do you prove the safeness of hash127? Pick any two distinct messages, m and m', and (not necessarily distinct) integers s and s' between 0 and p-1, where p = 2^127-1. Say (r,x) is a uniform random 256-bit secret key. The chance of s being a signature of m is at least 1/2^127. The chance of s being a signature of m _and_ s' being a signature of m' is at most (3 max{n,n'} + 6)/2^255 where n and n' are the lengths of m and m' respectively. Thus the conditional probability of s' being a signature of m', given that s is a signature of m, is at most (3 max{n,n'} + 6)/2^128. It doesn't matter how much time the attacker spends after he sees m and s; any guess he comes up with for m' and s' will almost certainly be wrong. Proof that s is a signature of m with probability at least 2^129/2^256: There are 2^128 choices of r. For each choice of r, there are at least two choices of x satisfying s = (r^(n+1) + r^n m[0] + r^(n-1) m[1] + ... + r m[n-1] + x) mod p. So there are at least 2^129 choices of (r,x). Proof that s is a signature of m and s' is a signature of m' with probability at most (6 max{n,n'} + 12)/2^256: If s = (r^(n+1) + r^n m[0] + r^(n-1) m[1] + ... + r m[n-1] + x) mod p and s' = (r^(n'+1) + r^n' m'[0] + r^(n'-1) m'[1] + ... + r m'[n'-1] + x) mod p then r is a root mod p of the polynomial t^(n+1) + ... + t m[n-1] - s - t^(n'+1) - ... - t m'[n'-1] - s'. That polynomial is nonzero since m and m' are different, so it has at most max{n,n'} + 1 roots mod p, and therefore at most 2 max{n,n'} + 2 roots between 0 and 2p-1, as well as possibly 2p and 2p+1. For each of those 2 max{n,n'} + 4 choices of r there are at most 3 choices of x. > Is it possible to use hash127 or ideas from it to construct a provably > unpredictable random function? Nobody knows how to prove that one can efficiently stretch (say) 256 uniform random bits into (say) a string of 512 unpredictable random bits. This seems substantially more difficult than proving P<NP. What's interesting about authentication is that you only need to generate a few random bits for each message, no matter how long the messages are. ---Dan
Subject: Re: Guaranteed message authentication faster than MD5 Date: Thu, 15 Apr 1999 10:01:52 GMT From: puzer@my-dejanews.com Message-ID: <7f4dec$4ut$1@nnrp1.dejanews.com> References: <1999Apr1423.09.39.15909@koobera.math.uic.edu> Newsgroups: sci.crypt Lines: 20 > Nobody knows how to prove that one can efficiently stretch (say) 256 > uniform random bits into (say) a string of 512 unpredictable random > bits. However, is it possible to construct a random function from hash127 that is unpredictable enough in practice although one cannot prove it? SURF probably does the job but a single polynomial would look nicer than a generalized Feistel sequence. > Secret-key signatures are used in ssh and many other systems. There's no > need for the public to be able to check the signatures on ssh packets. Why not make an ssh replacement using hash127, Snuffle/SURF or SEOC, ucspi-tcp, ptyget and some key-exhange software? Is key-exhange the only missing piece? -----------== Posted via Deja News, The Discussion Network ==---------- http://www.dejanews.com/ Search, Read, Discuss, or Start Your Own
Subject: Extreme lossy text compression Date: Fri, 16 Apr 1999 00:20:22 GMT From: gteabo@my-dejanews.com Message-ID: <7f5vnu$hj1$1@nnrp1.dejanews.com> References: <1999Apr1423.09.39.15909@koobera.math.uic.edu> Newsgroups: sci.crypt Lines: 28 I have an idea to reduce redundancy in a large data store. When a plaintext is submitted to storage, I want to be able to easily determine whether that exact plaintext already exists in my storage. Briefly, my idea is to process some plaintext (which might be from 1KB to 10KB in length), and create, say, a 256 byte "key" that is derived from that plaintext that will be unique with a high degree of certainty. It would be kind of like an extreme "lossy" compression scheme. We'd be taking up to 10000 bytes of plaintext and creating a 256 byte "key" useful for matching identical plain texts. Then when a new plaintext is submitted, I can generate my 256 byte "key" and quickly search my database to determine if that 256 byte "key" already exists. Unlike most things discussed in sci.crypt: 1. I don't need to be able to recreate the plaintext from the key. 2. I don't care about security, so headers and things are perfectly fine. Let's assume that I won't be able to compare the original plaintexts to verify that they are actually identical. I need a high degree of confidence based upon the keys alone. Has anyone ever heard or seen anything like this before? -----------== Posted via Deja News, The Discussion Network ==---------- http://www.dejanews.com/ Search, Read, Discuss, or Start Your Own
Subject: Re: Extreme lossy text compression Date: Thu, 15 Apr 1999 19:11:05 -0600 From: John Myre <jmyre@sandia.gov> Message-ID: <37168E29.3A73F371@sandia.gov> References: <7f5vnu$hj1$1@nnrp1.dejanews.com> Newsgroups: sci.crypt Lines: 64 (see original question below) Any good hash (e.g. SHA-1) ought to work. The output is even shorter (20 bytes in the case of SHA-1) than requested, and so more efficient. The theory is that a good hash seems like a "random function" of the input and so the likelyhood of a false match (a "collision") is the same as two random values of the hash length being the same. For SHA-1 this means a 1 in 2^160 chance of error (*really* small). Intuitively, I think of it this way: we say a hash function is cryptographically secure when we don't think an adversary could generate a false match on purpose even with extraordinary amounts of compute power. If that is so, then an accidental false match must also be extremely unlikely, or else the adversary could generate a false match merely by trying random inputs. The way this is different from more ordinary check value functions (CRC's and the like) is that we (essentially) remove the chance that a few simple changes to a document would result in the same check value. An ordinary check value function has some structure which an adversary could use to figure out how to get a false match; given such a structure a "few" changes might work, and a "few" changes can happen by accident. If a 20 byte hash doesn't seem long enough, you could concatenate two or more different hashes (e.g., SHA-1 plus MD5, or a hash of the document plus a hash of the document modified in some simple way, like adding a blank at the end). (Of course, I am taking for granted that in fact you want even a tiny (e.g., 1 byte) change to show up as a difference; that you don't care if you store umpty-ump versions of the same document.) gteabo@my-dejanews.com wrote: > > I have an idea to reduce redundancy in a large data store. When a plaintext > is submitted to storage, I want to be able to easily determine whether that > exact plaintext already exists in my storage. > > Briefly, my idea is to process some plaintext (which might be from 1KB to 10KB > in length), and create, say, a 256 byte "key" that is derived from that > plaintext that will be unique with a high degree of certainty. > > It would be kind of like an extreme "lossy" compression scheme. We'd be > taking up to 10000 bytes of plaintext and creating a 256 byte "key" useful for > matching identical plain texts. > > Then when a new plaintext is submitted, I can generate my 256 byte "key" and > quickly search my database to determine if that 256 byte "key" already exists. > > Unlike most things discussed in sci.crypt: > 1. I don't need to be able to recreate the plaintext from the key. > 2. I don't care about security, so headers and things are perfectly fine. > > Let's assume that I won't be able to compare the original plaintexts to verify > that they are actually identical. I need a high degree of confidence based > upon the keys alone. > > Has anyone ever heard or seen anything like this before? > > -----------== Posted via Deja News, The Discussion Network ==---------- > http://www.dejanews.com/ Search, Read, Discuss, or Start Your Own
Subject: Re: Extreme lossy text compression Date: Fri, 16 Apr 1999 18:12:45 GMT From: gteabo@my-dejanews.com Message-ID: <7f7uil$8e8$1@nnrp1.dejanews.com> References: <37168E29.3A73F371@sandia.gov> Newsgroups: sci.crypt Lines: 68 John, I am very impressed and grateful for your reply. I think I understand what you mean by a "good hash" based upon your description of its characteristics. You mentioned two flavors: SHA-1 and MD5. While I have seen C source code posted for SHA-1 by an individual, I wondered whether there is any authoritative site where proven source can be found? Is there a web site that is considered the "Hash Authority". What about licensing these hashes? Who owns them? Is it considered freeware, shareware, or what? Or is it more just a programming structure like a "nested loop" which isn't really owned by anybody? To answer your question, in actual fact, I DO want to count even a 1 byte change as a change, so a good hash sounds BETTER than perfect for my needs. I wonder whether sufficient randomization can be achieved in just 20 bytes considering the non-random characteristic of my inputted plaintext. Generally the plaintext is going to be ordinary English language text (like that found in Usenet or in ordinary Email), so there will be a high concentration of the 26 letters A-Z upper and lower case, plus spaces and some punctuation. Can a "good hash" still output sufficiently random results from that input? Do hash developers run any statistical tests to back up their theory? Thanks, Geoff In article <37168E29.3A73F371@sandia.gov>, John Myre <jmyre@sandia.gov> wrote: > > Any good hash (e.g. SHA-1) ought to work. The output is even > shorter (20 bytes in the case of SHA-1) than requested, and so > more efficient. The theory is that a good hash seems like a > "random function" of the input and so the likelyhood of a false > match (a "collision") is the same as two random values of the > hash length being the same. > > I am taking for granted that in fact you want even a > tiny (e.g., 1 byte) change to show up as a difference; that you > don't care if you store umpty-ump versions of the same document.) > > gteabo@my-dejanews.com wrote: > > > > I have an idea to reduce redundancy in a large data store. When a plaintext > > is submitted to storage, I want to be able to easily determine whether that > > exact plaintext already exists in my storage. > > > > It would be kind of like an extreme "lossy" compression scheme. We'd be > > taking up to 10000 bytes of plaintext and creating a 256 byte "key" for > > matching identical plaintexts. > > > > Then when a new plaintext is submitted, I can generate my 256 byte "key" and > > quickly search my database to determine if that 256 byte "key" exists. > > > > Unlike most things discussed in sci.crypt: > > 1. I don't need to be able to recreate the plaintext from the key. > > 2. I don't care about security, so headers and things are perfectly fine. > > > > Assume I won't be able to compare the original plaintexts to verify > > that they are actually identical. I need a high degree of confidence based > > upon the keys alone. -----------== Posted via Deja News, The Discussion Network ==---------- http://www.dejanews.com/ Search, Read, Discuss, or Start Your Own
Subject: Re: Extreme lossy text compression Date: Fri, 16 Apr 1999 22:43:57 -0400 From: Geoff Thorpe <geoff@raas.co.nz> Message-ID: <3717F56D.2B1DA355@raas.co.nz> References: <7f7uil$8e8$1@nnrp1.dejanews.com> Newsgroups: sci.crypt Lines: 133 Hi there, > I think I understand what you mean by a "good hash" based upon your > description of its characteristics. You mentioned two flavors: SHA-1 and > MD5. While I have seen C source code posted for SHA-1 by an individual, I > wondered whether there is any authoritative site where proven source can be > found? Is there a web site that is considered the "Hash Authority". Loads of implementations are available in loads of places. I use cryptlib (http://www.cs.auckland.ac.nz/~pgut001/cryptlib/) and mostly through my Java wrapper for it (http://www.hoopal.com/geoff/jniCryptlib) ... if ALL you need is a hash function this would be somewhat like swatting flies with a large bus. I've just checked and cryptlib is using Eric Young's code (from SSLeay) for the hashing - that's under a GNU license I think ... anyway, you can probably just suck that code out (for SHA1) and use it, bearing in mind you can't sell or redistribute the result without taking a close look at the license. (BTW: For "proven source" - The various FIPS standards demostrate test vectors to check your implementation against, Peter has included these in cryptlib for start-up self-testing and you could probably consult his code to see the test vectors). > What about licensing these hashes? Who owns them? Is it considered > freeware, shareware, or what? Or is it more just a programming structure like > a "nested loop" which isn't really owned by anybody? Not sure about the licensing - but for sure it's not just a "nested loop" because this is more than a hash-function we're talking, it's a SECURE hash function which may be more than you actually need. > To answer your question, in actual fact, I DO want to count even a 1 byte > change as a change, so a good hash sounds BETTER than perfect for my needs. Yeah - but do you need a secure hash? The strength(s) of SHA1 (and MD5 and the other crypto-hash functions) is not just that they provide a good "ID" of a given piece of data (with your required property of any change in the input generating, with all probability, a corresponding change in output), but also that they are collision-resistant (as John said) ... that is, a hacker has a very tough job to find a piece of input data that would generate a chosen hash-value ... if you're just looking for a good index into your data-store, this is not an issue ... because you're not going to be deliberately trying to create texts that ARE different to anything there but have a matching hash-value just to force your application run an extra lookup to check. Maybe all you need is to use a standard hash function (or even cut your own if you feel daring) - lots of XORs and you're away [;-) Basically a standard hash (should) give you; -------------------------------------------- - If A and B are the same data, their hashes match (ie it's totally deterministic - the output is a function of the data only, no external or random information is used in the function). - If A and B are different (and independant/unrelated), then the probability their hashes match is pretty close to one in "m" (where "m" is the number of possible hash values, hopefully this would be equal to 2^n where n is the number of bits in the hash-value). This assumes however that A and B are independant - in most cases this probability REALLY isn't that microscopic if somebody is trying to intentionally construct a B for the purpose of matching A's output. If your texts are likely to be similar quite often - you might want to deterministically "munge" your input before applying a "standard" hash ... in other words choose a method for mashing up the contents to spread any differences in the text around ... then compression is always handy for hiding redundancy. A "secure hash" gives you; -------------------------- - the same as a standard hash function, EXCEPT; - even if somebody tries to construct a B for the purpose of matching A's output, they will (hopefully) fail to improve the probability much beyond 1 in "m" - this is why secure hashes are used in cryptographic "signature" techniques. The collision probabilty stays astronomically small even when you're TRYING to get them to collide. So a secure hash will solve your problems and by design, shouldn't require you to take the additional steps I suggested (if you had to then by definition, it's vulnerable to the problems those steps try to avoid). You just have to decide on convenience and/or performance. In C, on most platforms (OS and hardware), it's possible to get some pretty quick hashes going and from my experience; in network systems, particularly involving large databases, the time taken in a secure hash will be the last of your worries [;-) > I wonder whether sufficient randomization can be achieved in just 20 bytes > considering the non-random characteristic of my inputted plaintext. > Generally the plaintext is going to be ordinary English language text (like > that found in Usenet or in ordinary Email), so there will be a high > concentration of the 26 letters A-Z upper and lower case, plus spaces and > some punctuation. Can a "good hash" still output sufficiently random results > from that input? Do hash developers run any statistical tests to back up > their theory? See my above comment - making any change in the output data (be it appending something to it, transforming it all in some rational way, or just changing one byte a little bit) should make radical changes in the output of a secure hash function (and probably most good regular hash functions). Compressing data is a particularly good idea on language text (and should be considered when storing masses of it - especially if it's for archiving and not really high-load processing), but is not necessary at all in a secure hash. However, it may be handy for performance properties. English language text usually compresses a lot with typical algorithms (ZIP, etc) - you might want to try benchmarking the speed trade-offs on using or not using compression first (the compression takes a little time, but reduces the time required to hash because you're hashing less data). So a secure hash (unless it's been broken) is as good as the size of the hash-values it maps data to. And 20 bytes can hold a lot of different hash values. If the number of records in your database tables got close to even a 5-byte number I would be staggered. Some quick (and probably way-off) probability calculations; If you had 256^5 (maximum 5-byte number - just over a thousand billion) records in your database, which is a little optimistic - the probability that ANY two records had the same SHA1 hash would be less than 1 in 256^10 (that last huge number squared). Some realists would say you don't even have to run a check if two texts have matching hash-values - the chance that they do but aren't the same text is ... well ... unlikely (to grossly understate things). Another comment (although you probably don't want any more); if each record is storing a 20-byte hash in a seperate field (as a fixed-length ASCII field - 27 chars using BASE64, or 40 chars using hex-representation), you can optimize your database querying dramatically either by using it as your primary key (if you're prepared to take that astronomically small chance of a collision), or by putting an index on it (again, if you're prepared to take the non-risk then you could put a UNIQUE constraint index on it). The hash will provide a good serial number for each record, and with the index it would speed up your checks for duplicates. Cheers, Geoff
Subject: Re: Extreme lossy text compression Date: Sat, 17 Apr 1999 19:51:31 GMT From: Geoffrey Teabo <gteabo@my-dejanews.com> Message-ID: <7faoo0$iiv$1@nnrp1.dejanews.com> References: <3717F56D.2B1DA355@raas.co.nz> Newsgroups: sci.crypt Lines: 27 Geoff, Thank you for your TIME to write such a lengthy and VERY CLEAR post. > Maybe all you need is to use a standard hash function < If SHA1 and MD5 are well known secure hashes, what's a well known standard hash? > So a secure hash will solve your problems and by design, shouldn't > require you to take the additional steps I suggested By "additional steps" you mean the pre-hash "munge" you suggested for a standard hash, right? Maybe a secure hash is better for me just for its "munge"-free quality! FINALLY: I just want to re-ask one other question, which no one responded to... Do hash developers run any statistical TESTS to back up their theory of random output, particularly for secure hashes? Thanks, again, Geoff Teabo -----------== Posted via Deja News, The Discussion Network ==---------- http://www.dejanews.com/ Search, Read, Discuss, or Start Your Own
Subject: Re: Extreme lossy text compression Date: Sun, 18 Apr 1999 14:26:23 -0400 From: Geoff Thorpe <geoff@raas.co.nz> Message-ID: <371A23CF.650ACC8B@raas.co.nz> References: <7faoo0$iiv$1@nnrp1.dejanews.com> Newsgroups: sci.crypt Lines: 79 Hi there, > Thank you for your TIME to write such a lengthy and VERY CLEAR post. My pleasure - I've had to play with essentially the same issues you have a few times so if I can save you a little time as a result, great. > If SHA1 and MD5 are well known secure hashes, what's a well known standard > hash? Well here you've got me a little - I'm a mathematician interested in crypto who happens to make his $$ undercover as a programmer, not a computer scientist well versed in all Knuth's volumes of essential reading [;-) However, some others have mentioned a couple in these threads - the Universal Hash Function, and somebody also mentioned a simple (& fast) cute example involving XOR-ing in the blocks of the message one at a time (after seeding it with some constant key-value and padding incomplete blocks with zeroes I think?). This could suffice, although if you expect a LOT of your texts to be very similar but not quite the same this method would probably have a statistically better chance of collisions - just a hunch though, I'm not going to even try to justify that feeling though. > By "additional steps" you mean the pre-hash "munge" you suggested for a > standard hash, right? Maybe a secure hash is better for me just for its > "munge"-free quality! The munge would just be an idea to avoid what I was mentioning just before about my "suspected" weakness of the XOR-based hasher ... if the differences in many texts are isloted, possibly positioned on unfortunate block boundaries or some such thing, then maybe some of the hash values will come out similar - and worse, the same. A munge step would just be an attempt to make sure that any differences are given their best chance of influencing the hash value (or more precisely, influencing all of the hash value). A paranoid consideration, that's all. > FINALLY: I just want to re-ask one other question, which no one responded > to... Do hash developers run any statistical TESTS to back up their theory of > random output, particularly for secure hashes? Absolutely ... and even if they don't, there's a veritable army of cryptanalysts out there just waiting to puncture even the smallest hole in a "secure" hash algorithm (or any other class of crypto algorithm for that matter). If someone even found a highly contrived way to choose desirable texts that would cause the output hash values to exhibit ANY kind of pattern, other than random noise, then that hash algorithm would be discarded as a "cryptographic hash algorithm" and would turn overnight into a "standard hash" (and a slow one moreover). Many PRNG (psuedo-random number generators) take "random-info" from hardware, eg. hard-drive timings, kernel stats, mouse activity, blah blah - but although that info may be random, it may also be predictable structured meaning that if you just stick the numbers end on end, and XOR-hash the result you would have a VERY weak PRNG by cryptographic standards. What the PRNG's usually (or should) do is just keep throwing such "random-info" into a secure hash function - so as you can see, by design these hashes need to have the property that subtle changes in data, or data that is structurally similar every time, needs to still generate seemingly unpredictable output unless the data matches exactly. SHA1 and MD5 are dedicated secure hash algorithms, but there are others that are derived from symmetric ciphers - that gives you an indication of how seriously these algorithms are taken. They not only want a good statistical "spread" for input values - they want it to stay statistically uniform in the face of a determined attack to bias the outputs. In short, the answer to your question (for secure hashes) is yes. For standard hashes, I believe there is some basic theoretical results to the effect that given random (unknown) data A, the output value has equal probability of landing on all possible hash values etc, and similar probabilistic results about changes in input causing (in all probability) changes in output. I find that it's all very easy if you're prepared to accept hand-waving arguments without any mathematical details [;-) Well, best of luck, Geoff
Subject: Re: Extreme lossy text compression Date: 16 Apr 1999 16:01:06 GMT From: djb@koobera.math.uic.edu (D. J. Bernstein) Message-ID: <1999Apr1616.01.06.26203@koobera.math.uic.edu> References: <7f5vnu$hj1$1@nnrp1.dejanews.com> Newsgroups: sci.crypt Lines: 17 > Briefly, my idea is to process some plaintext (which might be from 1KB to 10KB > in length), and create, say, a 256 byte "key" that is derived from that > plaintext that will be unique with a high degree of certainty. Will the results be kept secret? If so, you can use a ``universal hash function.'' Universal hash functions are fast and guaranteed secure. See http://pobox.com/~djb/hash127.html for sample code. Even if the results are public, can you keep one short secret? If so, you can feed a universal hash function through a function such as Rijndael. This is still fast, and it's secure if Rijndael is. If nothing is secret, you'll have to resort to a ``collision-resistant hash function'' such as SHA-1. This is not as fast as a universal hash function. ---Dan
Subject: Re: Extreme lossy text compression Date: Fri, 16 Apr 1999 18:58:31 GMT From: gteabo@my-dejanews.com Message-ID: <7f818h$b5n$1@nnrp1.dejanews.com> References: <1999Apr1616.01.06.26203@koobera.math.uic.edu> Newsgroups: sci.crypt Lines: 66 Dan, Thank you for your reply. I'm not clear why secrecy is an issue. Maybe you're using the term "secret" in a technical sense that I don't understand. My exact application involves archiving. Imagine a program that would monitor news articles coming over the wire from AP or Reuters. I want to create a hash so that I can quickly search a database to determine whether that particular IDENTICAL newsarticle was ever received previously. So the whole system is secret, in a sense. The hash results will be created and stored on a backed closed system without any user interface. Keeping "one short secret" isn't a problem. Collision avoidance is my TOP priority. Security isn't. After all, the writers at AP aren't exactly composing their news stories to try to hack my system! They're just reporting the news. The number of articles that might be held in this system could be as many as 100,000 per year at first, growing to 200,000 per year within 5 years. Even if we purge every 20 years, that pace of growth will create a 1,000,000 article database within the likely life of this system. I need to know that with 1,000,000 different plaintexts in storage, that the next new unique article that comes down the pike has the near impossibility of creating hash identical to a NON-IDENTICAL article that was stored previously. The probability of two plaintext articles creating the same hash is what I need to minimize. It would also be nice if the hash were fast, of course. Is there some website or resource that you regard as the "Hash Authority" where I could learn about the different things you mention in your Email. I need to learn a little about all this, and I don't want to BUG you or the newsgroup for every last detail. Thanks for your ideas, and the link to your sample code. I'll check it out! Geoff In article <1999Apr1616.01.06.26203@koobera.math.uic.edu>, djb@koobera.math.uic.edu (D. J. Bernstein) wrote: > > Will the results be kept secret? If so, you can use a ``universal hash > function.'' Universal hash functions are fast and guaranteed secure. See > http://pobox.com/~djb/hash127.html for sample code. > > Even if the results are public, can you keep one short secret? If so, > you can feed a universal hash function through a function such as > Rijndael. This is still fast, and it's secure if Rijndael is. > > If nothing is secret, you'll have to resort to a ``collision-resistant > hash function'' such as SHA-1. This is not as fast as a universal hash > function. > Geoffrey Teabo wrote.... > > Briefly, my idea is to process plaintext (from 1KB to 10KB > > in length), and create, say, a 256 byte "key" that is derived from that > > plaintext that will be unique with a high degree of certainty. -----------== Posted via Deja News, The Discussion Network ==---------- http://www.dejanews.com/ Search, Read, Discuss, or Start Your Own
Subject: Re: Extreme lossy text compression Date: Fri, 16 Apr 1999 22:25:00 +0200 From: Terje Mathisen <Terje.Mathisen@hda.hydro.com> Message-ID: <37179C9C.569E@hda.hydro.com> References: <7f818h$b5n$1@nnrp1.dejanews.com> Newsgroups: sci.crypt Lines: 90 gteabo@my-dejanews.com wrote: > > Dan, > > Thank you for your reply. > > I'm not clear why secrecy is an issue. Maybe you're using the term "secret" > in a technical sense that I don't understand. > > My exact application involves archiving. Imagine a program that would monitor > news articles coming over the wire from AP or Reuters. I want to create a > hash so that I can quickly search a database to determine whether that > particular IDENTICAL newsarticle was ever received previously. > > So the whole system is secret, in a sense. The hash results will be created > and stored on a backed closed system without any user interface. Keeping "one > short secret" isn't a problem. > > Collision avoidance is my TOP priority. Security isn't. After all, the > writers at AP aren't exactly composing their news stories to try to hack my > system! They're just reporting the news. > > The number of articles that might be held in this system could be as many as > 100,000 per year at first, growing to 200,000 per year within 5 years. Even > if we purge every 20 years, that pace of growth will create a 1,000,000 > article database within the likely life of this system. > > I need to know that with 1,000,000 different plaintexts in storage, that the > next new unique article that comes down the pike has the near impossibility of > creating hash identical to a NON-IDENTICAL article that was stored previously. > > The probability of two plaintext articles creating the same hash is what I > need to minimize. > > It would also be nice if the hash were fast, of course. OK, in this case you have a few options: 1) (The simplest:) Calculate the total number of articles your system needs to handle over it's lifetime, and you can then determine the minimum number of bits needed in a hash to keep the chance of any collisions below an arbitrary limit. BTW, I believe this is the 'Birthday problem', i.e. in a group of 24 people, there is a more than 50% chance of at least two people having the same birthday. I don't remember the approximate formula for this when working with big numbers, but I'm willing to bet that with just 1E6 articles, a 128-bit hash would be more than enough. In fact, it was initially suggested that network MAC addresses (48 bits) should be determined by random, since this would make the likelyhood of any collisions low enough to be disregarded. (This suggestion was vetoed, with the result that manufacturing snafu's have produced many cards with unintentionally duplicated MACs.) With a 64-bit hash, you have approximately 1.6E19 different hash values, so the likelyhood of a collision (assuming uniform random distribution of hash values) should be 1 - ((1.6E19-1) / 1.6E19 * (1.6E19-2) / 1.6E19 * ... * (1.6E19-999999)/1.6E19) As a first (worst-case) approximation, I'll take the smallest value of those ratios (1.6E19-1E7) / 1.6E19 = 1 - 6.22E14 exp(ln((1.6E19-1E7)/1.6E19)*1E7) = 0.9999937494639061 So, the final value would be 1 - 0.9999937494639061 = 6.25E-6 which means that since the true value will be quite a bit smaller, your real odds against ever seeing a collision should be small enough to be disregarded. Back to your original question, which hash algorithm to select? Any of the well-known message digest functions would work, or you could use DES or some other encryption algorithm, xor'ing 64-bit result blocks together. Terje -- - <Terje.Mathisen@hda.hydro.com> Using self-discipline, see http://www.eiffel.com/discipline "almost all programming can be viewed as an exercise in caching"
Subject: Re: Extreme lossy text compression Date: 17 Apr 1999 06:32:54 GMT From: djb@koobera.math.uic.edu (D. J. Bernstein) Message-ID: <1999Apr1706.32.54.29838@koobera.math.uic.edu> References: <7f818h$b5n$1@nnrp1.dejanews.com> Newsgroups: sci.crypt Lines: 30 > The number of articles that might be held in this system could be as many as > 100,000 per year at first, growing to 200,000 per year within 5 years. Let's say you receive one billion different articles, each up to a megabyte long. You then choose a 38-digit number r and feed each article, along with your number r, to hash127. There's at least a 99.9999999999995% chance that the results will all be different: for any possible set of articles, at most 0.0000000000005% of the r's will produce a collision. > It would also be nice if the hash were fast, of course. hash127 on my old Pentium-133 runs at about 238 million bits per second from L2 cache; e.g., it hashes 40000 bytes in about 1.4 milliseconds. > I'm not clear why secrecy is an issue. If you reveal the hash for more than one article then a criminal can create new articles with any desired hash. Even if you don't think this is a problem now, there's no reason for you to allow it. You can, for example, feed the hash127 result through Rijndael_k with a secret key k. This is very fast---the hash127 result is only 16 bytes long. > Is there some website or resource that you regard as the "Hash Authority" > where I could learn about the different things you mention in your Email. No. The literature on universal hash functions is a mess. ---Dan
Subject: Re: Extreme lossy text compression Date: Sat, 17 Apr 1999 19:36:34 GMT From: Geoffrey Teabo <gteabo@my-dejanews.com> Message-ID: <7fans2$hot$1@nnrp1.dejanews.com> References: <1999Apr1706.32.54.29838@koobera.math.uic.edu> Newsgroups: sci.crypt Lines: 46 Dan, Criminals are just soooo NOT an issue for this application. Suppose I'm monitoring stories coming in off a newswire service (like Reuters), I'm just trying to make sure that I don't capture the identical news story TWICE in my database. A so-called attacker, or criminal, in this case would be trying to compose a news article to intentionally collide with an old news article, and supposedly to what purpose? The only effect would be that my system would IGNORE the new phony story anyway!!!!! How ironic! So clearly my ONLY concern is that a new VALID and REAL news story would appear with the same hash result as the hash result of ANOTHER DIFFERENT and OLDER news story. That would be really BAD because I'd be ignoring a VALID story. In light of my specific issues, another post to this thread suggested that a 128 bit CRC would be okay, and that a hash is more trouble than it's worth. What do you think? I have a hunch that a CRC isn't as good because it's not as random, and I might have to worry that similar plaintexts like "BBBBBBBB" and "BBBBBBDA" might have the same CRC. Geoff === Dan wrote: > > If you reveal the hash for more than one article then a criminal can > create new articles with any desired hash. Even if you don't think this > is a problem now, there's no reason for you to allow it. You can, for > example, feed the hash127 result through Rijndael_k with a secret key k. > This is very fast---the hash127 result is only 16 bytes long. > === Geoff wrote: > > I'm not clear why secrecy is an issue. -----------== Posted via Deja News, The Discussion Network ==---------- http://www.dejanews.com/ Search, Read, Discuss, or Start Your Own
Subject: Re: Extreme lossy text compression Date: 17 Apr 1999 15:20:49 -0600 From: vjs@calcite.rhyolite.com (Vernon Schryver) Message-ID: <7fatvh$1k9$1@calcite.rhyolite.com> References: <7fans2$hot$1@nnrp1.dejanews.com> Newsgroups: sci.crypt Lines: 73 In article <7fans2$hot$1@nnrp1.dejanews.com>, Geoffrey Teabo <gteabo@my-dejanews.com> wrote: > ... >Suppose I'm monitoring stories coming in off a newswire service (like >Reuters), I'm just trying to make sure that I don't capture the identical >news story TWICE in my database. > >A so-called attacker, or criminal, in this case would be trying to compose a >news article to intentionally collide with an old news article, and supposedly >to what purpose? The only effect would be that my system would IGNORE the new >phony story anyway!!!!! How ironic! > >So clearly my ONLY concern is that a new VALID and REAL news story would >appear with the same hash result as the hash result of ANOTHER DIFFERENT and >OLDER news story. That would be really BAD because I'd be ignoring a VALID >story. In that case, why not do the standard, ancient, obvious thing? Pick a fast, reasonable hash function, and keep an online database of all messages keyed by their hash values. Then as each message arrives, compute its hash value, and look up the hash value in your database. If there is a record with that hash value, then compare the new message with the record. If appropriate, let the comparison be "fuzzy' (and if so, remember to make the hash function use the fuzzy value of the message). If the new message matches the existing record, you have a duplicate, so do whatever is appropriate. Use whatever database is reasonable give the number of messages you need to deal with. Given modern disks, you can easily keep everything that a newswire carries on line for long enough. That worked great for a spam filter on a large corporate firewall. The system received about 200,000 email messages/day, of which between 10,000 and 14,000 were copies of 50 to 200 different streams of spam, some with the usual spammer customizing such as "Dear sucker" or random junk prepended or appended. Every email message passing through the system was hashed and checked against an automatically maintained "database" of previously seen spam. The hash function I chose was a simple, 32-bit byte-sum of the alphabetic characters in the message, excluding whitespace, numbers, punctuation, etc. In practice, the hash function had a collision rate of much better than 10e-6 messages. The "database" was a single UNIX field directory, with the "key" consisting of the hash value used as a file name containing the message. Hash collisions were handled by appending a "-%d" string to the name. >In light of my specific issues, another post to this thread suggested that a >128 bit CRC would be okay, and that a hash is more trouble than it's worth. > >What do you think? I think it's sad that people are unwilling to think about hashes. There is as much snake oil for hash functions as for encryption. An early response in this thread was typical of the common, silly notion that 'secure' hash functions are somehow less subject to hash collisions than hash functions. That secure hash functions are intended to be hard to invert does not let them magically map domains of 2^10000 or more messages 1-to-1 onto ranges of 2^128 hash values. >I have a hunch that a CRC isn't as good because it's not as random, and I >might have to worry that similar plaintexts like "BBBBBBBB" and "BBBBBBDA" >might have the same CRC. On the contrary, CRC-32 and other error detecting and correcting hash functions are expressly designed to detect such small changes. There are good reasons why your disk drive as well as your network links do not use SHA or MD5 to detect bit rot. If you care more about detecting small changes than making it hard for bad guys to compute naughty messages that have the same hash value as a good message, then your only rational choice is one of the many error detecting hash functions. -- Vernon Schryver vjs@rhyolite.com
Subject: Re: Extreme lossy text compression Date: 17 Apr 1999 23:26:16 GMT From: djb@koobera.math.uic.edu (D. J. Bernstein) Message-ID: <1999Apr1723.26.16.3111@koobera.math.uic.edu> References: <7fans2$hot$1@nnrp1.dejanews.com> Newsgroups: sci.crypt Lines: 24 Geoffrey Teabo <gteabo@my-dejanews.com> wrote: > I have a hunch that a CRC isn't as good because it's not as random, Actually, a 128-bit CRC _with a random polynomial_ guarantees an extremely low collision probability for any fixed set of inputs. (Just like hash127, except that hash127 is faster and provides a slightly better bound on the collision probability.) > A so-called attacker, or criminal, in this case would be trying to compose a > news article to intentionally collide with an old news article, and supposedly > to what purpose? What if the criminal spots a legitimate article before you do, and quickly feeds you a fake article with the same hash, preventing the legitimate article from entering your database? ``We're in sci.crypt. We're supposed to be paranoid.'' An attacker can easily do this if he knows your CRC polynomial; and he can easily figure out the polynomial given a few CRC results. That's why secrecy is an issue. ---Dan
Subject: Re: Extreme lossy text compression Date: 17 Apr 1999 19:24:38 -0700 From: daw@blowfish.isaac.cs.berkeley.edu (David Wagner) Message-ID: <7fbfp6$hi9$1@blowfish.isaac.cs.berkeley.edu> References: <1999Apr1723.26.16.3111@koobera.math.uic.edu> Newsgroups: sci.crypt Lines: 7 In article <1999Apr1723.26.16.3111@koobera.math.uic.edu>, D. J. Bernstein <djb@koobera.math.uic.edu> wrote: > Actually, a 128-bit CRC _with a random polynomial_ guarantees an > extremely low collision probability for any fixed set of inputs. I'm confused. Isn't it true that CRC(x) = CRC(0||x), for any polynomial? ("||" represents concatenation of bits.)
Subject: Re: Extreme lossy text compression Date: Sun, 18 Apr 1999 20:45:27 +0200 From: Terje Mathisen <Terje.Mathisen@hda.hydro.com> Message-ID: <371A2847.FAE@hda.hydro.com> References: <7fbfp6$hi9$1@blowfish.isaac.cs.berkeley.edu> Newsgroups: sci.crypt Lines: 21 David Wagner wrote: > > In article <1999Apr1723.26.16.3111@koobera.math.uic.edu>, > D. J. Bernstein <djb@koobera.math.uic.edu> wrote: > > Actually, a 128-bit CRC _with a random polynomial_ guarantees an > > extremely low collision probability for any fixed set of inputs. > > I'm confused. Isn't it true that CRC(x) = CRC(0||x), for > any polynomial? ("||" represents concatenation of bits.) Only if your CRC function is initialized with zero! That's why you often see CRC functions that start with (unsigned) (-1) instead: This way all leading zero bits are significant. Terje -- - <Terje.Mathisen@hda.hydro.com> Using self-discipline, see http://www.eiffel.com/discipline "almost all programming can be viewed as an exercise in caching"
Subject: Re: Extreme lossy text compression Date: 19 Apr 1999 04:45:47 GMT From: djb@koobera.math.uic.edu (D. J. Bernstein) Message-ID: <1999Apr1904.45.47.8690@koobera.math.uic.edu> References: <7fbfp6$hi9$1@blowfish.isaac.cs.berkeley.edu> Newsgroups: sci.crypt Lines: 35 David Wagner <daw@blowfish.isaac.cs.berkeley.edu> wrote: > I'm confused. Isn't it true that CRC(x) = CRC(0||x), for > any polynomial? Not if you initialize the CRC correctly. Given input bits m[1], m[2], ..., m[n-128], compute the polynomial x^n + m[1] x^(n-1) + m[2] x^(n-2) + ... + m[n-128] x^128 and divide by a secret irreducible degree-128 polynomial p. Then add 128 secret bits to the remainder. The result is is what people call ``almost strongly universal,'' i.e., almost uniformly distributed on any pair of distinct inputs. It's a secure single-message authenticator. Common modifications: * Use a public value for the 128 bits added at the end: typically 0 or all-1s. The result is ``almost xor universal,'' i.e., has an almost uniform output difference for any two distinct inputs. * Omit the factor x^128 from the polynomial. The result is ``almost universal,'' i.e., almost never collides on any two distinct inputs. This is good enough for many applications. * Omit the leading x^n term from the polynomial. This only works for fixed-length inputs. (It's a common mistake in the literature.) * Use a public value for p. This doesn't pose any risk if the input is chosen independently of p. All of these variants, and many more, are called CRCs in practice. ---Dan
Subject: Re: Extreme lossy text compression Date: Sat, 17 Apr 1999 12:33:02 GMT From: Scott Fluhrer <sfluhrer@ix.netcom.com> Message-ID: <7f9uhk$91k@sjx-ixn9.ix.netcom.com> References: <7f5vnu$hj1$1@nnrp1.dejanews.com> Newsgroups: sci.crypt Lines: 40 In article <7f5vnu$hj1$1@nnrp1.dejanews.com>, gteabo@my-dejanews.com wrote: >I have an idea to reduce redundancy in a large data store. When a plaintext >is submitted to storage, I want to be able to easily determine whether that >exact plaintext already exists in my storage. > >Briefly, my idea is to process some plaintext (which might be from 1KB to 10KB >in length), and create, say, a 256 byte "key" that is derived from that >plaintext that will be unique with a high degree of certainty. > >It would be kind of like an extreme "lossy" compression scheme. We'd be >taking up to 10000 bytes of plaintext and creating a 256 byte "key" useful for >matching identical plain texts. > >Then when a new plaintext is submitted, I can generate my 256 byte "key" and >quickly search my database to determine if that 256 byte "key" already exists. > >Unlike most things discussed in sci.crypt: >1. I don't need to be able to recreate the plaintext from the key. >2. I don't care about security, so headers and things are perfectly fine. > >Let's assume that I won't be able to compare the original plaintexts to verify >that they are actually identical. I need a high degree of confidence based >upon the keys alone. > >Has anyone ever heard or seen anything like this before? Several people have already given you responses that work quite nicely if you have to worry about attackers trying to generate collisions. If you really don't care about that, then you might as well use a 128 bit CRC. This can be computed much faster than any known hash function, and provably will change on small differences, and on random changes, it's maximally good (any 128 bit result is equally likely, which is more than we know about most hash functions). Its only drawback is that it's pretty trivial for an attacker to generate collisions, but you said you didn't have to worry about that. -- poncho
Subject: Re: Extreme lossy text compression Date: Sat, 17 Apr 1999 19:22:24 GMT From: Geoffrey Teabo <gteabo@my-dejanews.com> Message-ID: <7fan19$h59$1@nnrp1.dejanews.com> References: <7f9uhk$91k@sjx-ixn9.ix.netcom.com> Newsgroups: sci.crypt Lines: 36 Poncho... That's right. I really don't care about "attackers" at all. I'm monitoring an independent system that doesn't even know it's being monitored. Imagine if my program were collecting news articles over the wire services. I just want to be able to use a hash (or CRC) of each incoming plaintext to search a historical "list" of past hashes (or CRCs) to determine whether that particular IDENTICAL newsarticle was ever received previously. As I pointed out in another post, news writers won't compose their news stories to try to hack my system! They're just reporting the news. They don't even know I have a system!!! Can you point me toward source code for 128 bit CRC ??? Thanks a huge bunch. I don't want overkill. Scott Fluhrer <sfluhrer@ix.netcom.com> wrote: > > Several people have already given you responses that work quite nicely if you > have to worry about attackers trying to generate collisions. If you really > don't care about that, then you might as well use a 128 bit CRC. This can be > computed much faster than any known hash function, and provably will change on > small differences, and on random changes, it's maximally good (any 128 bit > result is equally likely, which is more than we know about most hash functions). > Its only drawback is that it's pretty trivial for an attacker to generate > collisions, but you said you didn't have to worry about that. > -----------== Posted via Deja News, The Discussion Network ==---------- http://www.dejanews.com/ Search, Read, Discuss, or Start Your Own
Subject: Re: Extreme lossy text compression Date: Sat, 17 Apr 1999 22:26:52 -0400 From: Boris Kazak <bkazak@worldnet.att.net> Message-ID: <371942EC.19CB@worldnet.att.net> References: <7fan19$h59$1@nnrp1.dejanews.com> Newsgroups: sci.crypt Lines: 450 Geoffrey Teabo wrote: > That's right. I really don't care about "attackers" at all. I'm monitoring > an independent system that doesn't even know it's being monitored. > > Imagine if my program were collecting news articles over the wire services. > I just want to be able to use a hash (or CRC) of each incoming plaintext to > search a historical "list" of past hashes (or CRCs) to determine whether that > particular IDENTICAL newsarticle was ever received previously. > > As I pointed out in another post, news writers won't compose their news > stories to try to hack my system! They're just reporting the news. They > don't even know I have a system!!! > > Can you point me toward source code for 128 bit CRC ??? > --------------------------- This should be adequate for your purposes, you can use it or throw it away, as you wish... I am not sure that you need the technical information, but included it for completeness. For your application I would reduce the ROUNDS constant in the header to 2, just to gain speed. BTW, your greatest concern will be in making the probability of "birthday paradox" as little as possible. For rough estimate, the probability of 2 different messages hashing to the same value rises to 0.5 when the number of messages rises to 2^(n/2), where "n" is the number of bits in the hash. Thus, with a 128-bit hash you must accumulate about 2^64 messages before you will start really worrying about hash collisions. With 10 million messages a year it will take about 16 trillion years... and you can always doublecheck the messages, should such an unlikely event happen once in a decade. The statistical tests indicate a uniform distribution of hex digits in the final hash with no noticeable biases (6 runs by 1 million hashes each). This proposed function is based on multiplication (mod 2^32+1) with two "twists". First, however, some background information. ************************* The program takes the filename from the simple dialog and writes the hash into a new file with the name of the original file and extension "h32". Original message is divided into segments of the size twice that of the final hash, e.g. if the final hash will be 160 bit, the segments will be 320 bit each. The size of the hash and segments is #defined by the HASHBLOCKSIZE constant and can be altered according to need. For each block, there are 3 rounds (for those paranoid among us, there is a #define, where one can change the ROUNDS constant to whatever desired). ************************* The function uses 256 different modular multipliers derived from one initial number "SEED". This number is 0x6A09E667, which happens to be the first 8 hex digits of SQRT(2), less initial 1. The use of this number should dispel any suspicions about a "trap door" built into SEED. The multipliers themselves are simply the powers of SEED, so that Mult[k] = SEED^(k+1), this way SEED^0 = 1 is avoided, we don't want any trivial multipliers. Also, this number produces a 33,502,080 long cycle of its powers until it comes back to 1, so there are no duplicate multipliers in the array. The multiplier array can be precomputed in advance or can be filled at the program start. The latter case saves some code (no need for a big data table), but implies a time penalty of 255 modular multiplications. Now about the "twists". The first twist boils down to the plaintext-dependent choice of the multiplier. If a certain 4-byte number is going to be hashed, the corresponding multiplier is chosen from the table with the index I = XOR(all 4 bytes of the number). Second twist regards the progression along the block. After the multiplication, when the bytes of the product are returned to their places and it is time to proceed with the subsequent multiplication, the two LS bytes of the previous product become the two MS bytes of the new number to be multiplied, and two adjacent bytes of the plaintext are appended to them in order to make this new 32-bit number. When the end of the block is reached, the index wraps cyclically over the BlockLength. Graphically a round can be visualized in the following sketch: ( the index is circular mod "BlockLength" ) ___________________________________________________________ 1-st multiplication XXXXxxxxxxxxxxxxxxxxxxxxxxxxxxxx ( XXXX - 32-bit number to multiply, multiplier index is computed as X^X^X^X) ___________________________________________________________ 2-nd multiplication ppPPXXxxxxxxxxxxxxxxxxxxxxxxxxxx (ppPP - product of the first multiplication, PPXX - 32-bit number to multiply, multiplier index is computed as P^P^X^X) ___________________________________________________________ 3-d multiplication ppppPPXXxxxxxxxxxxxxxxxxxxxxxxxx ( PPXX - 32-bit number to multiply, multiplier index is computed as P^P^X^X) ........................................................... Last multiplication XXppppppppppppppppppppppppppppPP ( The index is cyclically wrapped over - PPXX - 32-bit number to multiply, multiplier index is computed as P^P^X^X) ___________________________________________________________ This arrangement allows an ultra-rapid propagation of all changes across the entire block - there is complete avalanche after the 2-nd round and then after each subsequent round. It should be also noted that this arrangement effectively molds the entire hash into an inseparable entity, where the junctions between the adjacent bytes and words don't exist any more, they are all blurred by overlapping and by cyclical wrapover of the index. This strongly discourages any attacks based on different behavior of the distinct bytes and words during hashing. In this function the entire hash is just a circular sequence of bits without beginning or end, without the junctions between bytes or words, without any realistic possibility to trace the output changes back to the changes in any particular word or byte. The cryptanalyst would either be forced to attack the hash as a whole (128 or even 256 bits), or just give up, which I hope will precisely happen. After 3 rounds, the new subsequent block is read from the message and XOR-ed with the hash obtained in the previous cycles. Then the hashing procedure repeats again for 3 rounds, and so on until the last message block is read. If the last message block is shorter than needed, it is padded with 0-s. Thereafter one more block is added to the hash, containing the full message length as a 64-bit binary number padded with 0-s. The final operation XOR-s the two halves of the hash buffer, producing the output hash value. Since in each multiplication step the multiplicand is replaced with the product, there is no practical way of reconstructing the multiplier index (other than computing multiplicative inverses for all 256 multipliers and testing the product to find out the match for the index used, however, the final XOR-ing defeats this option). Initially the hash buffer contains either 0's or an appropriate IV. No hashing is done on this value, the first message block is just XOR-ed into the hash buffer. If the secret key will be used as the IV, the function will produce a MAC. The source code was compiled under Borland C++ 4.52 and should compile under any ANSI-compliant C compiler. The routines for breaking and making the 32-bit integers from 4 bytes should assure identical results on both big- endian and little-endian platforms (not tested). No attempt has been made in order to optimize the program in any way. After all, it is supposed to be just a working model, not a racing car. One simple possible optimization might reverse the direction of scanning over the "text" block for little-endian Intel processors. Any comments would be appreciated. /********************* Start of code ************/ #include <stdio.h> #include <string.h> #include <stdlib.h> #define byte unsigned char #define word16 unsigned int #define word32 unsigned long int #define LO 0x0000FFFFUL #define HI 0xFFFF0000UL #define HASHBLOCKSIZE 16 #define ROUNDS 3 #define SEED 0x6A09E667UL /* Modular Multiplier seed */ /* First 8 hex digits of SQRT(2), less initial 1 */ /* variables */ word32 mult[256]; /* multiplier array */ word32 message_length; /* size of input file */ int end_of_file; byte inbuffer[2*HASHBLOCKSIZE], outbuffer[HASHBLOCKSIZE]; byte text[2*HASHBLOCKSIZE]; /* the working array */ /*file handling functions*/ char read_char_from_file(FILE *fp); void read_block_from_file(FILE *fp); void write_char_to_file(char data,FILE *fp); /* Functions */ word32 Mul (word32 x, word32 y) /* Returns (x * y) mod (2^32+1) */ /* For the purposes of this algorithm 0 = 2^32 */ /* Modular Multiplication "CORRECT BUT SLOW" (mod 2^32+1) If your compiler does not handle the "double long" integers (64 bit), you have to split the 32-bit words into 16-bit halves, multiply them, then juggle the intermediate products until you get the correct result. */ { word32 t[4]; word16 u[2], v[2]; if (x == 0) return (1 - y); if (y == 0) return (1 - x); u[0] = (word16)(x & LO); v[0] = (word16)(y & LO); u[1] = (word16)((x >> 16) & LO); v[1] = (word16)((y >> 16) & LO); t[0] = (word32) u[0] * (word32) v[0] ; t[1] = (word32) u[1] * (word32) v[0] ; t[2] = (word32) u[0] * (word32) v[1] ; t[3] = (word32) u[1] * (word32) v[1] ; /* intermediate 32-bit products */ t[3] = t[3] + ((t[1] >> 16) & LO) + ((t[2] >> 16) & LO); /* no overflow possible here */ t[1] = (t[1] & LO) + (t[2] & LO) + ((t[0] >> 16) & LO); /* collect them all in t[1]. Overflow into the 17-th bit possible here */ t[0] = (t[0] & LO) + ((t[1] << 16) & HI); t[3] = t[3] + ((t[1] >> 16) & LO); /* add eventual overflow */ return (t[0] - t[3] + (t[3] > t[0])); } /* Mul */ #define MUL(x,y) (x=Mul(x,y)) void MultSetup (word32 *mul) { int k; *mul = SEED; /* Initialize multiplier array */ for (k = 1; k < 256; k++) { *(mul+k) = *(mul+k-1); /* Copy next <- previous */ MUL (*(mul+k),SEED); /* Subsequent power */ } } /* MultSetup */ void DissH1( word32 H, byte *D ) /* Disassemble the given word32 into 4 bytes. We want it to work on all kinds of machines, Little-Endian or Big-Endian. */ { word32 T ; T = H ; *D++ = (byte)((T & 0xff000000UL) >> 24) ; *D++ = (byte)((T & 0x00ff0000UL) >> 16) ; *D++ = (byte)((T & 0x0000ff00UL) >> 8) ; *D = (byte) (T & 0x000000ffUL) ; } /* DissH1 */ word32 MakeH1( byte *B ) /* Assemble a word32 from the four bytes provided. We want it to work on all kinds of machines, Little-Endian or Big-Endian. */ { word32 RetVal, temp ; temp = *B++ ; RetVal = (temp << 24) ; temp = *B++ ; RetVal += (temp << 16) ; temp = *B++ ; RetVal += (temp << 8) ; RetVal += *B ; return RetVal ; } /* MakeH1 */ void Scramble (void) /* This subroutine scrambles the "text" block */ /* It uses the global variables */ /* mult[], text[] and inbuffer[] */ { int i,k,m; word32 temp1, temp2; /* XOR-ing the input into the text array */ for (m=0; m < 2*HASHBLOCKSIZE; m++) text[m] ^= inbuffer[m]; /* Now we can start squashing the block */ for (m=0; m<ROUNDS; m++) { for (k=0; k<2*HASHBLOCKSIZE; k+= 2) { /* Build a 32-bit multiplicand and multiplier index */ /* We want this to produce same results on big-endian */ /* and little-endian machines alike */ temp2 = text[k % (2*HASHBLOCKSIZE)] ; /* Essentially this procedure */ i = (int)temp2; /* is identical to MakeH2 function */ temp1 = (temp2 << 24) ; /* However, it is impossible to use */ temp2 = text[(k+1)%(2*HASHBLOCKSIZE)] ; /* MakeH2 function directly, because */ i ^= (int)temp2; /* the loop index must be wrapped */ temp1 += (temp2 << 16) ; /* mod 2*HASHBLOCKSIZE, and also the */ temp2 = text[(k+2) % (2*HASHBLOCKSIZE)] ; /* multiplier index is computed, */ i ^= (int)temp2; /* which is not provided in MakeH2 */ temp1 += (temp2 << 8) ; temp1 += text[(k+3) % (2*HASHBLOCKSIZE)] ; i ^= (int)temp2; /* Get the modular product into "temp1" */ MUL(temp1, mult[i]); /* Return the bytes where they belong */ text[k % (2*HASHBLOCKSIZE)] = (byte)((temp1 & 0xff000000UL) >> 24) ; text[(k+1) % (2*HASHBLOCKSIZE)] = (byte)((temp1 & 0x00ff0000UL) >> 16) ; text[(k+2) % (2*HASHBLOCKSIZE)] = (byte)((temp1 & 0x0000ff00UL) >> 8) ; text[(k+3) % (2*HASHBLOCKSIZE)] = (byte) (temp1 & 0x000000ffUL) ; } } } /* Scramble */ char read_char_from_file(FILE *fp) { char temp=0; message_length++; if ((fread(&temp,sizeof(char),1,fp))!=1) { end_of_file=1; message_length--; } return (temp); } /* read_char_from_file */ void read_block_from_file(FILE *fp) { int i; for (i=0; i<2*HASHBLOCKSIZE; i++) { inbuffer[i]=read_char_from_file(fp); } } /* read_block_from_file */ void write_char_to_file(char data,FILE *fp) { if ((fwrite(&data,sizeof(char),1,fp))!=1) { printf("Fatal Error writing output file!!!\n"); exit(-1); } } /* write_char_to_file */ void main() { FILE *in,*out; char infile[100], outfile[100], *dot; int k; /* Opening files */ printf("\nEnter input filename:"); gets(infile); if ((in=fopen(infile,"r+b"))==NULL) { printf("\nError opening input file: %s\n",infile); exit (-1); } strcpy(outfile,infile); dot=strrchr(outfile,'.'); dot++; *dot++='h'; *dot++='3'; /* Changing file extension */ *dot++='2'; *dot='\0'; /* to .h32 */ if ((out=fopen(outfile,"w+b"))==NULL) { printf("\nError opening output file: %s\n",outfile); exit(-1); } printf("\nOutput file is: %s\n",outfile); /* Setting up the multipliers */ MultSetup(mult); /* Clearing the text buffer */ for (k=0; k<2*HASHBLOCKSIZE; k++) text[k]=0; /* Reading and squashing the input file */ while(end_of_file != 1) { read_block_from_file(in); Scramble(); } /* Building up and squashing the final block */ for (k=0; k<4; k++) inbuffer[k]=0; DissH1(message_length, (inbuffer+4)); for (k=8; k<2*HASHBLOCKSIZE; k++) inbuffer[k]=0; Scramble(); /* Building and writing the final hash */ for (k=0; k<HASHBLOCKSIZE; k++) outbuffer[k] = text[k]^text[k+HASHBLOCKSIZE]; for (k=0; k<HASHBLOCKSIZE; k++) { write_char_to_file(outbuffer[k],out); } fclose(in); fclose(out); }
Subject: Re: Extreme lossy text compression Date: 17 Apr 1999 22:20:41 GMT From: djb@koobera.math.uic.edu (D. J. Bernstein) Message-ID: <1999Apr1722.20.41.2875@koobera.math.uic.edu> References: <7f9uhk$91k@sjx-ixn9.ix.netcom.com> Newsgroups: sci.crypt Lines: 27 > you might as well use a 128 bit CRC. hash127 is much faster than a 128-bit CRC. Try it for yourself: http://pobox.com/~djb/hash127.html In spirit these functions are really all the same: * reduce (Z/2)[x] modulo some prime x^128-... (CRC/``division''); * reduce Z modulo some prime 2^128-... (basically Karp-Rabin); * reduce F_(2^32)[x] modulo some prime x^4-... (Shoup); * reduce (Z/(2^31-1))[x] modulo some prime x^4-...; * reduce F_(2^128)[x] modulo some prime x-... (``evaluation''); * reduce (Z/(2^127-1))[x] modulo some prime x-... (hash127); * etc. All of them have about the same natural hardware speed. However, in software, arithmetic in large characteristic benefits from CPU support for fast multiplication---often buried inside the floating-point unit, but still accessible---whereas arithmetic in characteristic 2 doesn't. I hope that CPUs will someday provide fast polynomial arithmetic mod 2; but there's no reason that it'll ever be faster than integer arithmetic. I haven't found any cryptographic applications where finite fields of small characteristic turned out to be a good idea. ---Dan
Subject: Re: Extreme lossy text compression Date: Sat, 17 Apr 1999 23:39:51 GMT From: ritter@io.com (Terry Ritter) Message-ID: <37191bb5.2504101@news.io.com> References: <1999Apr1722.20.41.2875@koobera.math.uic.edu> Newsgroups: sci.crypt Lines: 53 On 17 Apr 1999 22:20:41 GMT, in <1999Apr1722.20.41.2875@koobera.math.uic.edu>, in sci.crypt djb@koobera.math.uic.edu (D. J. Bernstein) wrote: >> you might as well use a 128 bit CRC. > >hash127 is much faster than a 128-bit CRC. Try it for yourself: > > http://pobox.com/~djb/hash127.html > >In spirit these functions are really all the same: > > * reduce (Z/2)[x] modulo some prime x^128-... (CRC/``division''); > * reduce Z modulo some prime 2^128-... (basically Karp-Rabin); > * reduce F_(2^32)[x] modulo some prime x^4-... (Shoup); > * reduce (Z/(2^31-1))[x] modulo some prime x^4-...; > * reduce F_(2^128)[x] modulo some prime x-... (``evaluation''); > * reduce (Z/(2^127-1))[x] modulo some prime x-... (hash127); > * etc. > >All of them have about the same natural hardware speed. However, in >software, arithmetic in large characteristic benefits from CPU support >for fast multiplication---often buried inside the floating-point unit, >but still accessible---whereas arithmetic in characteristic 2 doesn't. > >I hope that CPUs will someday provide fast polynomial arithmetic mod 2; >but there's no reason that it'll ever be faster than integer arithmetic. >I haven't found any cryptographic applications where finite fields of >small characteristic turned out to be a good idea. Surely we don't implement CRC's like this. Addition mod 2 is exclusive-OR which is well supported in CPU's. "Division" mod an irreducible mod 2 poly is just a test and conditional exclusive-OR, which is, again, well supported. For production use, CRC's are table-driven. Typically, we generate a 256-entry table, then process data byte-by-byte. Each step typically requires a byte exclusive-OR, an 8-bit "shift" of the CRC (which might involve simply changing a pointer to the head), and an exclusive-OR from table into the CRC register. That said, 128 bit (16-byte) values are not usually supported in CPU registers or instruction sets. So I would instead probably use 4 CRC's using different deg-32 primitives (or perhaps 5 deg-31's) to produce a similar effect. See, for example: http://www.io.com/~ritter/KEYSHUF.HTM --- Terry Ritter ritter@io.com http://www.io.com/~ritter/ Crypto Glossary http://www.io.com/~ritter/GLOSSARY.HTM
Subject: Re: Extreme lossy text compression Date: 18 Apr 1999 08:27:02 GMT From: djb@koobera.math.uic.edu (D. J. Bernstein) Message-ID: <1999Apr1808.27.02.4647@koobera.math.uic.edu> References: <37191bb5.2504101@news.io.com> Newsgroups: sci.crypt Lines: 43 Terry Ritter <ritter@io.com> wrote: > For production use, CRC's are table-driven. Indeed. The standard CRC-32 implementation reduces an n-bit polynomial modulo a degree-32 polynomial p using * n/8 additions of 32-bit polynomials, * n/8 multiplications of 24-bit polynomials by x^8, and * n/8 table lookups of f |-> (x^32 f) mod p, where f has 8 bits. The analogous CRC-128 implementation would use * n/8 additions of 128-bit polynomials, * n/8 multiplications of 120-bit polynomials by x^8, and * n/8 table lookups of f -> (x^128 f) mod p, where f has 8 bits. Now, if you could magically set up a size-2^32 table instead of a size-2^8 table, you could get away with far fewer operations: * n/32 additions of 128-bit polynomials, * n/32 multiplications of 96-bit polynomials by x^32, and * n/32 table lookups of f -> (x^128 f) mod p, where f has 32 bits. That's basically the speed achieved by hash127, with integers replacing polynomials mod 2, and the CPU's fast multiplication (inside the FPU) replacing table lookups. > "Division" mod an > irreducible mod 2 poly is just a test and conditional exclusive-OR, > which is, again, well supported. I said ``support for fast multiplication.'' Note the word ``fast.'' Bit-by-bit multiplication, doing a conditional 128-bit addition for each bit of input, is not fast. > So I would instead probably use 4 > CRC's using different deg-32 primitives That's a very bad idea for authentication of long messages. The maximum collision probability is around b^4/2^128 where b is the message length. Anyway, hash127 is much faster. ---Dan
Subject: Re: Extreme lossy text compression Date: Mon, 19 Apr 1999 07:00:09 GMT From: ritter@io.com (Terry Ritter) Message-ID: <371ad344.43382112@news.io.com> References: <1999Apr1808.27.02.4647@koobera.math.uic.edu> Newsgroups: sci.crypt Lines: 69 On 18 Apr 1999 08:27:02 GMT, in <1999Apr1808.27.02.4647@koobera.math.uic.edu>, in sci.crypt djb@koobera.math.uic.edu (D. J. Bernstein) wrote: >Terry Ritter <ritter@io.com> wrote: >> For production use, CRC's are table-driven. > >Indeed. The standard CRC-32 implementation reduces an n-bit polynomial >modulo a degree-32 polynomial p using > > * n/8 additions of 32-bit polynomials, Yes, this is the add of the table value. > * n/8 multiplications of 24-bit polynomials by x^8, and Note that this is just a single shift, and not a bit-by-bit multiplication, as one might think from the comment below. > * n/8 table lookups of f |-> (x^32 f) mod p, where f has 8 bits. >[...] >> "Division" mod an >> irreducible mod 2 poly is just a test and conditional exclusive-OR, >> which is, again, well supported. > >I said ``support for fast multiplication.'' Note the word ``fast.'' >Bit-by-bit multiplication, doing a conditional 128-bit addition for each >bit of input, is not fast. The only multiplication we have in CRC is a shift, but I expect an integer multiply may be just as fast, in the registers. Outside the registers, it looks like a multiple-precision operation, and 128 bits sounds outside the registers. But my comment was made in the context of the equations you showed: >>> * reduce (Z/2)[x] modulo some prime x^128-... (CRC/``division''); >>> * reduce Z modulo some prime 2^128-... (basically Karp-Rabin); >>> * reduce F_(2^32)[x] modulo some prime x^4-... (Shoup); >>> * reduce (Z/(2^31-1))[x] modulo some prime x^4-...; >>> * reduce F_(2^128)[x] modulo some prime x-... (``evaluation''); >>> * reduce (Z/(2^127-1))[x] modulo some prime x-... (hash127); >>> * etc. Everywhere here I see "modulo some prime." Now, that is hardly an issue with CRC, but it sure can be an issue when dividing big integers which here seem to be 128 bits. How do you deal with this? >> So I would instead probably use 4 >> CRC's using different deg-32 primitives > >That's a very bad idea for authentication of long messages. You're right. >The maximum >collision probability is around b^4/2^128 where b is the message length. >Anyway, hash127 is much faster. I still find this hard to believe. What happened to the multi-precision divide? --- Terry Ritter ritter@io.com http://www.io.com/~ritter/ Crypto Glossary http://www.io.com/~ritter/GLOSSARY.HTM
Subject: Re: Extreme lossy text compression Date: 19 Apr 1999 08:38:36 GMT From: djb@koobera.math.uic.edu (D. J. Bernstein) Message-ID: <1999Apr1908.38.36.9722@koobera.math.uic.edu> References: <371ad344.43382112@news.io.com> Newsgroups: sci.crypt Lines: 9 Terry Ritter <ritter@io.com> wrote: > I still find this hard to believe. That's why they pay me the big bucks. :-) Get the code from http://pobox.com/~djb/hash127.html and try it for yourself. On a Pentium it takes about 4.5 clock cycles per input byte. ---Dan
Subject: Re: Extreme lossy text compression Date: Mon, 19 Apr 1999 17:31:51 GMT From: ritter@io.com (Terry Ritter) Message-ID: <371b6854.1313556@news.io.com> References: <1999Apr1908.38.36.9722@koobera.math.uic.edu> Newsgroups: sci.crypt Lines: 42 On 19 Apr 1999 08:38:36 GMT, in <1999Apr1908.38.36.9722@koobera.math.uic.edu>, in sci.crypt djb@koobera.math.uic.edu (D. J. Bernstein) wrote: >Terry Ritter <ritter@io.com> wrote: >> I still find this hard to believe. > >That's why they pay me the big bucks. :-) > >Get the code from http://pobox.com/~djb/hash127.html and try it for >yourself. On a Pentium it takes about 4.5 clock cycles per input byte. That's fine, but I'd rather have an explanation in plain English: How do you avoid multi-precision operations, multiply and divide, when dealing with a 128-bit integer hash? And, after thinking about it for a while, I would like to add to my response in the previous message: >>> So I would instead probably use 4 >>> CRC's using different deg-32 primitives >> >>That's a very bad idea for authentication of long messages. > >You're right. Indeed, *any* linear system is normally a bad idea for authentication (unless, perhaps, it is inside the cryptographic envelope). But that would seem to rule out the whole list of hash methods which you posted. Does your hash fit in there somewhere? Do you thus recommend a linear hash for authentication? When a linear system is acceptable, I see no serious theoretical consequences from splitting it into smaller subsystems with about the same total state. --- Terry Ritter ritter@io.com http://www.io.com/~ritter/ Crypto Glossary http://www.io.com/~ritter/GLOSSARY.HTM
Subject: Re: Extreme lossy text compression Date: 19 Apr 1999 20:48:18 GMT From: djb@koobera.math.uic.edu (D. J. Bernstein) Message-ID: <1999Apr1920.48.18.12961@koobera.math.uic.edu> References: <371b6854.1313556@news.io.com> Newsgroups: sci.crypt Lines: 46 Here's an analogy in base 10. Suppose you want to compute 2718281828 r^4 + 4590452353 r^3 + 6028747135 r^2 + 2662497757 r modulo p, where p = 10^30-1 and r = 314159265358979323846264338327. You do some precomputations with r: r^1 mod p = 3141592653 10^20 + 5897932384 10^10 + 6264338327 r^2 mod p = 2631235735 10^20 + 1304915222 10^10 + 3466068927 r^3 mod p = 7883992070 10^20 + 5909899612 10^10 + 2554294263 r^4 mod p = 9595247344 10^20 + 1716075106 10^10 + 5252936926 Now you can compute the (approximately 40-digit) number 2718281828 (r^4 mod p) + 4590452353 (r^3 mod p) + 6028747135 (r^2 mod p) + 2662497757 (r mod p) with 3 dot products of ten-digit numbers, totalling 12 multiplications; and then you can easily reduce the result modulo p. Terry Ritter <ritter@io.com> wrote: > How do you avoid multi-precision operations, multiply and divide, when > dealing with a 128-bit integer hash? I don't know what distinction you're trying to draw. These _are_ multiprecision operations. The same techniques can be used in many other multiprecision arithmetic problems. > Indeed, *any* linear system is normally a bad idea for authentication On the contrary. See the hash127 man pages, or my previous postings, for several safe authentication methods. > When a linear system is acceptable, I see no serious theoretical > consequences from splitting it into smaller subsystems with about the > same total state. Again: The maximum collision probability for four independent CRC-32s on a b-bit message is roughly b^4/2^128. That grows disastrously with b. With today's networking technology an attacker can break that system in a matter of minutes. In contrast, CRC-128 has maximum collision probability around b/2^128, and hash127 has maximum collision probability around 3b/2^133. ---Dan
Subject: Re: Extreme lossy text compression Date: Wed, 28 Apr 1999 14:25:42 +0200 From: Mok-Kong Shen <mok-kong.shen@stud.uni-muenchen.de> Message-ID: <3726FE46.7BD6044A@stud.uni-muenchen.de> References: <37191bb5.2504101@news.io.com> Newsgroups: sci.crypt Lines: 14 Terry Ritter wrote: > > For production use, CRC's are table-driven. Typically, we generate a > 256-entry table, then process data byte-by-byte. Each step typically > requires a byte exclusive-OR, an 8-bit "shift" of the CRC (which might > involve simply changing a pointer to the head), and an exclusive-OR > from table into the CRC register. Could you say what speed you can achieve with that on a typical processor (in assembler and in C or other languages)? Thanks in advance. M. K. Shen
Subject: Re: Extreme lossy text compression Date: Wed, 28 Apr 1999 22:48:08 -0600 From: jcoffin@taeus.com (Jerry Coffin) Message-ID: <MPG.1191686f9d841f78989a3b@news.rmi.net> References: <3726FE46.7BD6044A@stud.uni-muenchen.de> Newsgroups: sci.crypt Lines: 13 In article <3726FE46.7BD6044A@stud.uni-muenchen.de>, mok- kong.shen@stud.uni-muenchen.de says... [ a table-driven CRC ] > Could you say what speed you can achieve with that on a typical > processor (in assembler and in C or other languages)? Thanks > in advance. I just did a quick check with a version I have lying around and it did around 5 megabytes a second a Pentium II/400. In all honesty, I suspect that a faster disk would improve that more than a faster CPU though.
Subject: Re: Extreme lossy text compression Date: Sat, 1 May 1999 04:24:15 -0400 From: "John A. Limpert" <johnl@radix.net> Message-ID: <7gedl0$sj2$1@news1.Radix.Net> References: <MPG.1191686f9d841f78989a3b@news.rmi.net> Newsgroups: sci.crypt Lines: 25 Jerry Coffin <jcoffin@taeus.com> wrote in message news:MPG.1191686f9d841f78989a3b@news.rmi.net... > In article <3726FE46.7BD6044A@stud.uni-muenchen.de>, mok- > kong.shen@stud.uni-muenchen.de says... > > [ a table-driven CRC ] > > > Could you say what speed you can achieve with that on a typical > > processor (in assembler and in C or other languages)? Thanks > > in advance. > > I just did a quick check with a version I have lying around and it did > around 5 megabytes a second a Pentium II/400. In all honesty, I > suspect that a faster disk would improve that more than a faster CPU > though. I ran benchmarks of various CRC schemes on a Pentium a few years ago. Most ran in the range of 5-10 megabytes/second. The limiting factor appeared to be main memory bandwidth, not CPU speed. They all seemed to blow out the L2 cache.
Subject: Re: Extreme lossy text compression Date: 1 May 1999 21:22:44 GMT From: djb@koobera.math.uic.edu (D. J. Bernstein) Message-ID: <1999May121.22.44.24765@koobera.math.uic.edu> References: <7gedl0$sj2$1@news1.Radix.Net> Newsgroups: sci.crypt Lines: 8 John A. Limpert <johnl@radix.net> wrote: > The limiting factor appeared to be main memory bandwidth, not CPU speed. hash127 handles input straight from DRAM at over 200 million bits/second on my old Pentium-133. Anything slower than that is clearly not limited by memory bandwidth. ---Dan
Subject: Re: CRC32 Date: Thu, 3 Jun 1999 14:41:48 +0200 From: "Amit Ghosh" <ghosh@gmx.de> Message-ID: <7j5t5i$l60$1@news.online.de> References: <19990603075422.4548.qmail@nym.alias.net> Newsgroups: sci.crypt Lines: 35 Hi, I'm not a crypto/compression guy, so there may be better answers to your questions, but I do my very best ;-) 1. If I remember it right, crc32 is described in ISO 3309. 2. If you are looking for a faster yet powerfull algorithm, you may want to try adler32 instead of crc32. You will find a small description of adler32 in RFC 1950. You may find ZLib interesting if you're looking for an implementation. 3. Since I'm not an expert on this field of science (see above), I may be wrong but as I understand, checksum algorithms are optimized to detect errors in data blocks and not to generate optimized hash values. You may want to look for hash value generation algorithms to solve your problem (if I get your question right). crc32 will most likely not solve your problem! Hope this helps! P.S.: Please update your newsreader configuration. Your reply address is malformed. -- Amit Ghosh maito:ghosh@gmx.de
Subject: Re: CRC32 Date: Thu, 03 Jun 1999 14:46:54 GMT From: "Rosco Schock" <snarf@ptdprolog.net> Message-ID: <yzw53.3088$ea.336874@nnrp1.ptd.net> References: <19990603075422.4548.qmail@nym.alias.net> Newsgroups: sci.crypt Lines: 53 Hello, Try this url. The paper it links to is very readable. It gives a good basic understanding of CRCs. http://www.ross.net/crc/crcpaper.html As far as odds go For example, CRC-32 gives a 1 / 2^32 chance of failure. That equates to odds of 1 / 4,294,967,295 of getting the same checksum. However, if you change to a filesize / CRC pair of numbers then the only way you will get the same checksum is if the files are the same. HTH, Rosco iLLusIOn wrote in message <19990603075422.4548.qmail@nym.alias.net>... >-----BEGIN PGP SIGNED MESSAGE----- > >Hello, > > I'm looking for any texts/urls with algorithms and descriptions > of CRC-32 (like used in zip files for example), I have found a few > sites providing source code but none which provide really useful > descriptions of how/why it works. anything appreciated :) > > are there any ways to speed crc32 up a bit? or are there any > faster (yet as "secure") algorithms as crc32? how high is the > possibility that i'd get the same checksum twice? my > implementation would be used to generate checksums of many > files, getting the same checksum on two different files would be > bad. > > tia > >~~~ >This PGP signature only certifies the sender and date of the message. >It implies no approval from the administrators of nym.alias.net. >Date: Thu Jun 3 07:54:19 1999