Newsgroups: sci.crypt
Path: illuminati.io.com!uunet!gatech!howland.reston.ans.net!usc!crash!straits
From: straits@crash.cts.com (Stewart Strait)

Subject: Re: Doing Better than XOR in RC4-like Algorithms
Organization: CTS Network Services (CTSNET), San Diego, CA
Date: Sat, 19 Nov 1994 06:48:15 GMT
Message-ID: 
X-Newsreader: TIN [version 1.2 PL2]
References:  <3a7llc$kej@netaxs.com>  <3acu3n$36j@netaxs.com>
Sender: news@crash.cts.com (news subsystem)
Nntp-Posting-Host: crash.cts.com
Lines: 42

Steve O'Neill (soneill@unix3.netaxs.com) wrote:
...
: To me, this means that the "plaintext" can be binary,
: compressed text, Fijian text, or anything else. If an attacker doesn't know
: in advance what the nature of the original messages is, having the XOR of
: two messages in front of him doesn't give him _any_ information. If it did,
: then every one-time pad ever devised would have been broken, since XORing
: one unknown message with another unknown message is equivalent to a one-time
: pad.
...
A vital quantitative point is missed here. All the attacker needs is for
the sum of the redundancies of the two messages to exceed
100% by a large margin.  XORing one unknown message with
another is _not_ equivalent to a one-time pad unless
'unknown' means 'so unknown that all possible messages are
roughly equally likely'.

English has about one bit per character of information.  You
would need compression of about 4:1 to get the redundancy
down to 50% in each message. PKZIP only gets about 2.2:1, for
example. Removing all non-alpha characters and using all caps
compresses by about 2:1 but is lossy. If you want to compress
much better that this, your compression algorithm needs to
know cryptographic-style detailed statistics of the language
used. This is equivalent to designing a code (in the
old-fashioned codebook sense) that compresses enough to
resist analysis.  Similar arguements hold for other
human and machine languages.

If the attacker knows that the message is in one of a few
hundred languages, some of which are human and some machine
code, and knows that the message is either ascii, machine
binary, or compressed with one of the few commonly used
compression programs, he knows enough. Just because amateurs
like me can't try a few thousand hobbyist-level attacks,
lacking the language knowledge and the time, doesn't mean a
whole lot in the real world.  A real crypto organization
could hardly exist without having statistics on all these
possibilities. Not knowing which possibility is actually
realized only costs the attacker about 10 to 15 bits of
information out of a message length which is most likely
thousands of bits.