PUBLISHED: Ritter, T. 1994. Estimating Population from Repetitions in Accumulated Random Samples.

Cryptologia 18(2): 155-190.


ADDRESS: Ritter Software Engineering, 2609 Choctaw Trail, Austin, Texas 78745, (512) 892-0494

ABSTRACT: It was desired to estimate the number of unique values or codes produced by a physically-random number generator of the sort used to form cryptographic message keys; only the codes produced by the generator were available for analysis. Subsequently, the classical "birthday paradox" was reversed to estimate statistical population from the average number of repetitions found in trials containing a substantial number of random samples. Although similar in premise to previous work, a new relationship was found which includes combinatorial contributions from higher repetition levels. The resulting equations proved to be simple, intuitive, exact, and novel in that they were easily reversed to produce unique population estimates. The distribution of repetitions in different trials was investigated experimentally and found not inconsistent with a hypothesis of Poisson distribution. In practice, multiple trials establish the mean number of repetitions for a trial of a given size; the new equations then estimate population directly. These reversed birthday methods are able to detect population variations despite the presence of inverse probability effects which can hide variations from common distribution tests. Thus, reversed birthday methods could prove valuable in the certification of physically-random number generators. Reversed birthday methods also could be useful for estimating the effective number of keys in cipher certification tests as well as more general statistical applications.

KEYWORDS: Population estimation, birthday paradox, physically random generators, population deviation testing, augmented repetitions, augmented doubles, cryptography, statistics

Cryptographic systems often need "really random" numbers for
various applications such as message keys and authentication
values. Really random numbers are valuable because they follow
no pattern, so no pattern can be predicted by cryptanalysis.
Unfortunately, computing machinery is fully deterministic and
so cannot be used to generate really random numbers; algorithmic
processes are inherently
*pseudo*-random. But really random numbers
*can* be generated by special "physically random" hardware
designs.
Normally, such devices measure some complex physical effect such
as thermal noise, diode electron leakage, quantum particle decay,
chaotic fluid flow, etc.

Various tests of distribution can (and should) be applied to physically random generators, but such tests are not conclusive. In marked contrast to most pseudorandom designs, a physically random generator gives no guarantee that all values in the population are available with equal probability. Thus, there is a need for some technique to measure the population (or show that it exceeds some measurable value) and also to show that the population is evenly distributed. Notice that any such technique is limited to using only the random values produced by the generator.

When the issue of testing a really random generator came up in the Usenet sci.crypt group (see Appendix A), it was suggested that "the birthday paradox" could be used, in reverse, to estimate population. As it turns out, the birthday paradox has been discussed in cryptography for the past decade (for example, see Meyer and Matyas (1982) [14: 672-673], and Davies and Price (1984) [4: 116-117, 278-280]). The birthday paradox has been proposed for public-key distribution in Merkle [13: 13-21], used for DES analysis in Kaliski, Rivest and Sherman [6] (also described in Patterson [17: 156-166]), and forms the basis for authentication attacks described in Seberry and Pieprzyk [21: 157]. Birthday methods were apparently used by Letham, Hoff and Folmsbee (1986) [11] during tests of the hardware random number generator on the Intel 27916 "Keprom" (see Appendix B); results were given, but the theory of the test itself was not described.

The method of the investigation in this paper was to find
easily-reversed equations for predicting the average number of
repeated values found in trials containing substantial numbers of
random samples. Some well-known simple equations, previously
described as approximations
[9],
were found to be exact predictors of a new relationship here called
*augmented repetitions*.
This relationship was exhaustively analyzed
on some small cases, and then tested experimentally with good results.
Experiments indicated that the number of augmented repetitions in
separate trials apparently follows a Poisson distribution. Multiple
trials established a mean value for the augmented repetitions in a
trial of a given size, which gave a good estimate of the parent
population.

During this development, it became apparent that certain types of population deviation which could be detected by reversed birthday methods probably would not be detected by conventional statistical tests. Because of the importance of the cryptographic objects created from physically random devices, the possibility that some hardware designs may have unsuspected problems is more than just interesting.

There are fewer than 30 students in a particular classroom; is it likely that at least two of the students have the same birthday? (See, for example, Goodman and Ratti [5: 139-140].)

At first the question may seem preposterous: There are 365
possible birthdays, and only 30 students. The paradox is that it is
nevertheless probable (probability exceeding 0.5) that some pair of
students
*does* have the same birthday, provided that we have at
least 23 students.

How can this be? Well, let's think about it: Suppose, as students enter the classroom, they call out their birthday, and any other student in the room with the same birthday calls out "match":

- The first student has no possibility of a match.
- The second student has a 1/365 probability of having a birthday which matches the first student. But if the birthdays do not match -- probability (1 - 1/365) -- there are now two students for possible matching.
- The third student gives us a 1/365 probability of matching the first student, and also a 1/365 probability of matching the second student, for a match probability of 2/365. But if there is no match -- total probability (1 - 1/365)(1 - 2/365) -- there are now three students for possible matching.

The non-intuitive part of the paradox is that, while the number
of students increases
*linearly*, the number of pairs -- the number of possible
matches -- increases
*combinatorially*. When there are 23 students, there are
(23 C 2) or 253 possible pairs; the next student gives us a total
of 276 possible pairs. Whenever we add a single student, we
can have a possible match with each student already in the room.


**Fig. 1.1 Birthday Probability Logic.**
On its own, the first sample, W, has no probability
of forming a double. With the second sample, X, there is just one possible match and
therefore just one chance in N (for population N) of a match. The third sample, Y,
presents two chances for a match (W and X), but the total probability of a match in three
samples also includes all previous possible matches, for a total of three chances out of N.
The logic for subsequent samples continues similarly.

The situation can be described by the flowchart in figure 1.1,
where each sample (each different student) is labeled with a
different letter (W, X, Y and Z). Each elemental test (each
diamond-shaped box) is a comparison to a single previous value;
the branch to the right is "match" or "success," while the path
downward is "no match." For a single comparison, there is one
possible match out of the entire population of N; thus, probability
of a match is just 1/N. If we have two comparisons (both will be
different, or we would have had a match with the previous sample),
the probability of success is 2/N. This leaves 1 - 2/N (of the
cases requiring a new sample!) to continue as the current
"no match" probability. The expression for "success" gets
complex; a good way to think about the problem is to follow the
path down; this is the cumulative probability of having no matches.

By following the path down in figure 1.1, we can write the probability expressions shown in table 1.1.

Table 1.1  Unmatched Probability Expressions

   Samples   Probability of No Match
      1      1
      2      1 - 1/N
      3      (1 - 1/N)(1 - 2/N)
      4      (1 - 1/N)(1 - 2/N)(1 - 3/N)

Clearly, we have a well-defined sequence, and once we recognize this, it is relatively easy to compute the values for the birthday problem. We can formalize this slightly as the cumulative probability (from 0 to 1) of no match (q) for population N and the number of samples (students) s:

(1.1) Pq(N,s) = (1 - 1/N)(1 - 2/N) .. (1 - (s-1)/N)

and, since the total probability is 1, the probability of a match (p) is just 1 minus (1.1):

(1.2) Pd(N,s) = 1 - (1-1/N)(1-2/N)..(1-(s-1)/N).

For a small population like 365, equation 1.2 allows us to compute the birthday probabilities shown in table 1.2.

Table 1.2  Numerical Birthday Probabilities

   Students   Probability
       5        0.0271
      10        0.1169
      15        0.2529
      19        0.3791
      20        0.4114
      22        0.4757
      23        0.5073
      24        0.5383
      25        0.5687
      27        0.6269
      30        0.7063
      35        0.8144
      40        0.8912
      50        0.9704
      60        0.9941
      70        0.9992
      80        0.9999

But for a large population, while (1.2) may be term-by-term computable, there will be too many terms to compute. Fortunately, we can write (1.2) as

                  N  N-1  N-2       N-s+1
Pd(N,s)  =  1 -  --- ---- ----  ..  ----- ,
                  N   N    N          N

then recognize the numerator as N!/(N-s)!, and have

                       N!
(1.3) Pd(N,s) = 1 - ----------- .
                    (N-s)! N^s

This is exact, although it quickly becomes unruly as the factorials exceed any given precision, and the lower factorial limits the number of samples s to the population N. (Kullback [9: 29] promotes e^-s(s-1)/2N as a "good approximation" to the right-hand part of (1.3) for large N; we will see this exponent again in section 2.) Fortunately, we have a good approximation to Ln(x!) for at least modest x (say, x > 9) using Stirling's formula (e.g., Knuth [7: 111]):

(1.4) LnFact(x) =(approx.) (x + 1/2) Ln(x) - x + Ln(TwoPi)/2

                     1       1          1          1
                  + ---- - ------- + -------- - --------
                    12x    360 x^3   1260 x^5   1680 x^7

And so equation 1.3 can be tamed with logs (for s > 9, and s <= N) as:

(1.5) Pd(N,s) = 1 - e^( LnFact(N) - LnFact(N-s) - s Ln(N) )

Equation 1.5 is fairly convenient for the machine evaluation of birthday probabilities (although it will not handle s larger than N) because only the log factorials need be evaluated, and these are easy to find with high accuracy. But (1.5) is not easy to reverse for use in estimating population N when given values for doubles d and samples s. Accordingly, let us pursue an alternate development:
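
Equation 1.5 can be sketched in a few lines; this example (mine, not from the paper) uses the library log-gamma function in place of a hand-coded Stirling series, since Ln(x!) = lgamma(x + 1):

```python
# Sketch (not from the paper): evaluating equation 1.5 with log-factorials.
# math.lgamma(x + 1) gives Ln(x!) directly, so Stirling's formula (1.4)
# need not be hand-coded.
from math import lgamma, log, exp

def birthday_match_probability(N: int, s: int) -> float:
    """Probability of at least one match among s samples from population N.

    Implements Pd(N,s) = 1 - e^( LnFact(N) - LnFact(N-s) - s Ln(N) ),
    valid for s <= N.
    """
    if s > N:
        return 1.0  # pigeonhole: a match is certain
    return 1.0 - exp(lgamma(N + 1) - lgamma(N - s + 1) - s * log(N))

# The classic case: 23 students, 365 possible birthdays.
print(round(birthday_match_probability(365, 23), 4))  # 0.5073, as in table 1.2
```

The function name is an assumption of this sketch; any implementation built on log-factorials shares the same limitation as (1.5), namely that it cannot handle s larger than N (here handled by the pigeonhole shortcut).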

Consider a trial of two samples (s=2): One sample is whatever it is, and, if the next sample is really independent, there should be one chance in population N that exactly the same value will repeat, so the expected number of doubles Ed for two samples is: Ed(N,2) = 1 / N.

Consider a trial of three samples (s=3): There are three possible pairs of values, and for each pair, one value is whatever it is, and the probability that the other sample will match is one in N, so the expected number of doubles Ed(N,3) = 3 / N.

Consider a trial of four samples (s=4): There are six possible pairs of values, so the expected number of doubles will be Ed(N,4) = 6 / N.

By now it should be obvious that we are walking down Pascal's triangle and finding the number of combinations of s things taken two at a time: (s C 2), also known as "n choose r" or "the binomial coefficient."

We can relate the number of samples s to the population N
through the expected number of
*augmented doubles* Ead we observe:

(2.1) Ead(N,s) = (s C 2) / N .

But (s C 2) = s(s-1)/2, so

(2.2) Ead(N,s) = s(s-1) / 2N .

As we shall see, in this development the expected value has
an unusual meaning. Here the expected value represents not just
the number of
*exact* doubles, but also contributions from triples and
higher-order repetitions (if any). We will call this
*augmented doubles*.

Consider the cases of N=2 and s=3, and N=3 and s=2 in tables 2.1 and 2.2 (note that equation 2.2 works just fine with s > N). In table 2.1 equation 2.2 is used to predict the number of expected augmented doubles for these cases.

Table 2.1  Expected Augmented Doubles for N=2 and s=3, and N=3 and s=2

              3 * 2                        2 * 1
   Ead(2,3) = ----- = 1.5      Ead(3,2) = ----- = 1/3
              2 * 2                       2 * 3

Table 2.2 illustrates every possible trial for both of these cases, and gives totals for comparison to the predicted values.

Table 2.2  All Possible Trials for N=2 and s=3, and N=3 and s=2

        N=2  s=3                  N=3  s=2

   Trial   Aug. Doubles      Trial   Aug. Doubles
   0 0 0        3             0 0         1
   0 0 1        1             0 1         0
   0 1 0        1             0 2         0
   0 1 1        1             1 0         0
   1 0 0        1             1 1         1
   1 0 1        1             1 2         0
   1 1 0        1             2 0         0
   1 1 1        3             2 1         0
               ---            2 2         1
               12                        ---
                                          3

   Ead(2,3) = 12 aug. doubles / 8 cases = 1.5
   Ead(3,2) =  3 aug. doubles / 9 cases = 1/3

Note that the equations in this new development predict the
number of doubles *exactly*, *if* we interpret a triple
as three different pairs of doubles. Thus, the augmentation from
triples to doubles is (3 C 2) or 3. We could express a particular
augmented result as the sum of the contributions from higher
repetition levels as:

              n
(2.3) Ar[k] = SUM (i C k) r[i] .
             i=k

In (2.3), Ar[k] is the augmented repetition count at the k-th level (k=2 for doubles; k=3 for triples, etc.), and r[i] the exact repetition count at the i-th level. Notice that we are using the idea of combinations twice: once to give us the number of pairs (and, thus, the number of possible matches) in s samples, and then again to represent the number of doubles hidden in any higher-order exact repetitions.
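
As a sketch (the function name is mine, not the paper's), equation 2.3 translates directly into code, given a table of exact repetition counts:

```python
# Sketch (not from the paper): computing augmented repetition counts from
# exact repetition counts, per equation 2.3.  `exact` maps a repetition
# level i to the number of values occurring exactly i times.
from math import comb

def augmented(exact: dict[int, int], k: int) -> int:
    """Ar[k] = SUM over i >= k of (i C k) * r[i]."""
    return sum(comb(i, k) * r for i, r in exact.items() if i >= k)

# One exact triple counts as (3 C 2) = 3 augmented doubles...
print(augmented({3: 1}, 2))                        # 3
# ...and a level-6 repetition as 15 augmented doubles but 20 augmented triples.
print(augmented({6: 1}, 2), augmented({6: 1}, 3))  # 15 20
```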

In
Section 6
we will encounter equations which predict
*exact*
repetitions and thus avoid augmentation calculations. But when we
use exact repetitions, if we have a double, and then find that same
value again, we will have a new triple, and one
*fewer* doubles.
Thus, if we are using exact doubles to predict population, finding
another match can actually
*decrease* the number of exact doubles, and so change our
prediction in exactly the wrong direction. In contrast, augmented
values retain repetition information instead of throwing it away.

These simple examples have (so far) confirmed the equations. With a beginning confidence in the analysis, we can estimate the population N from the number of samples s and augmented doubles ad:

(2.4) Nad(s,ad) = s(s-1) / 2 ad .

With s positive and ad greater than zero, (2.4) produces a unique, unambiguous positive result.

We can even create an equation for predicting samples s needed from population N to produce (on average) a given number of augmented doubles ad. Using (2.2), we can put s into normal quadratic form, use the quadratic formula, and end up with:

                      ___________
                1 + \/ 1 + 8 ad N
(2.5) s(N,ad) = ----------------- .
                        2

(As we shall see in Section 5, the expectation value ad for a 50 percent probability of finding an augmented double is -Ln(0.5) or 0.693.)
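
Equations 2.4 and 2.5 might be sketched as follows; the function names are assumptions of this illustration, and the sample figures are drawn from section 5 and a later experiment (table 7.1):

```python
# Sketch (not from the paper): the reversed estimates of equations 2.4
# and 2.5.  Function names are mine, not the paper's.
from math import sqrt

def population_from_doubles(s: int, ad: float) -> float:
    """Equation 2.4: Nad(s, ad) = s(s-1) / (2 ad)."""
    return s * (s - 1) / (2 * ad)

def samples_for_doubles(N: float, ad: float) -> float:
    """Equation 2.5: s(N, ad) = (1 + sqrt(1 + 8 ad N)) / 2."""
    return (1 + sqrt(1 + 8 * ad * N)) / 2

# From table 7.1: 400 samples averaging 8.034 augmented doubles...
print(round(population_from_doubles(400, 8.034)))  # 9933
# ...and the samples needed for a 50% chance of a double among 365 birthdays
# (ad = -Ln(0.5) = 0.693, per section 5).
print(round(samples_for_doubles(365, 0.693)))      # 23
```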

Now consider the situation with triples: Given any set of three samples, one is whatever it is, and one time out of N the second is the same value (a double), and of these, one time out of N the third value is also the same (a triple). So we have:

(3.1) Eat(N,s) = (s C 3) / N^2 .

But (s C 3) = s(s-1)(s-2)/6, so

(3.2) Eat(N,s) = s(s-1)(s-2) / 6N^2 .

Now we can estimate the population N from samples s and augmented triples at:

                      _____________
                     / s(s-1)(s-2)
(3.3) Nat(s,at) =   /  -----------  .
                  \/      6 at

Consider the cases of N=2 and s=4, and N=3 and s=3, as estimated in table 3.1:

Table 3.1  Expected Augmented Doubles and Triples for N=2 and s=4, and N=3 and s=3

              4 * 3                        3 * 2
   Ead(2,4) = ----- = 3        Ead(3,3) = ----- = 1
              2 * 2                       2 * 3

              4 * 3 * 2                    3 * 2 * 1
   Eat(2,4) = --------- = 1    Eat(3,3) = --------- = 1/9
                6 * 4                       6 * 9

with every possible trial shown in table 3.2.

Table 3.2  All Possible Trials for N=2 and s=4, and N=3 and s=3

            N=2  s=4                          N=3  s=3

            Augmented  Augmented              Augmented  Augmented
   Trial    Doubles    Triples      Trial     Doubles    Triples
   0 0 0 0     6          4         0 0 0        3          1
   0 0 0 1     3          1         0 0 1        1          0
   0 0 1 0     3          1         0 0 2        1          0
   0 0 1 1     2          0         0 1 0        1          0
   0 1 0 0     3          1         0 1 1        1          0
   0 1 0 1     2          0         0 1 2        0          0
   0 1 1 0     2          0         0 2 0        1          0
   0 1 1 1     3          1         0 2 1        0          0
   1 0 0 0     3          1         0 2 2        1          0
   1 0 0 1     2          0         1 0 0        1          0
   1 0 1 0     2          0         1 0 1        1          0
   1 0 1 1     3          1         1 0 2        0          0
   1 1 0 0     2          0         1 1 0        1          0
   1 1 0 1     3          1         1 1 1        3          1
   1 1 1 0     3          1         1 1 2        1          0
   1 1 1 1     6          4         1 2 0        0          0
              ---        ---        1 2 1        1          0
               48         16        1 2 2        1          0
                                    2 0 0        1          0
   Ead(2,4) = 48 doub / 16 = 3      2 0 1        0          0
   Eat(2,4) = 16 trip / 16 = 1      2 0 2        1          0
                                    2 1 0        0          0
                                    2 1 1        1          0
   Ead(3,3) = 27 doub / 27 = 1      2 1 2        1          0
   Eat(3,3) =  3 trip / 27 = 1/9    2 2 0        1          0
                                    2 2 1        1          0
                                    2 2 2        3          1
                                                ---        ---
                                                 27          3

Note that the values from (2.2) and (3.2) are precisely the values we find by exhaustive enumeration.
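
Such an exhaustive check is easy to mechanize; this sketch (mine, not the paper's program) enumerates every possible trial and averages the augmented repetition counts, reproducing the hand totals of tables 2.2 and 3.2:

```python
# Sketch (not from the paper): checking equations 2.2 and 3.2 by
# exhaustive enumeration, exactly as tables 2.2 and 3.2 do by hand.
from collections import Counter
from itertools import product
from math import comb

def mean_augmented_reps(N: int, s: int, k: int) -> float:
    """Average augmented k-reps over all N^s equally likely trials."""
    total = 0
    for trial in product(range(N), repeat=s):
        for count in Counter(trial).values():
            total += comb(count, k)  # e.g. a triple adds (3 C 2) doubles
    return total / N ** s

print(mean_augmented_reps(2, 3, 2))  # 1.5 = Ead(2,3)
print(mean_augmented_reps(3, 2, 2))  # 0.333... = 1/3 = Ead(3,2)
print(mean_augmented_reps(2, 4, 3))  # 1.0 = Eat(2,4)
print(mean_augmented_reps(3, 3, 3))  # 0.111... = 1/9 = Eat(3,3)
```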

For some parameters, when the number of samples s exceeds the population N, we can expect more triples than doubles; this is just the realization that (n C 3) is often greater than (n C 2). For example, suppose we have N=2 and s=6, a 6-bit binary value, and suppose all six bits are the same. In this case we have only 15 doubles (6 C 2), but 20 triples (6 C 3). Of course, there are only two such cases (all zeros and all ones) at this length, and these are rare; it takes more such cases to affect the expected value (see table 3.3).

Table 3.3  Expected Augmented Doubles and Triples in Random Bit Arrays

   Bits   Aug. Doubles   Aug. Triples
     2        0.5             0
     4        3               1
     6        7.5             5
     8       14              14
    10       22.5            30

A similar effect occurs on observed counts: An observed count of six occurrences of the same value (a level-6 exact repetition) would be taken as 15 augmented doubles (6 C 2), but 20 augmented triples (6 C 3). Augmentation from higher repetitions can have more effect on triples than doubles.

Every repetition size r (r=2 for doubles; r=3 for triples) gives us an expected augmented repetition value Ear :

(4.1) Ear(r,N,s) = (s C r) / N^(r-1) ,

and from this we get an estimation of population N:

(4.2) Nr(r,s,ar) = ( (s C r) / ar )^(1/(r-1))

where ar is the augmented repetition count.

To bypass the limitations of (s C r) to parameters s >= r, we note that

(s C r) = s(s-1)..(s-r+1) / r(r-1)..2 = s! / r! (s-r)!

so factorial versions are:

                          s!
(4.3) Ear(r,N,s) = ------------------ ,
                   r! (s-r)! N^(r-1)

                        s!
(4.4) Nr(r,s,ar) = ------------ ^ (1/(r-1)) ,
                   r! (s-r)! ar

and, in logs:

(4.7) LnEar(r,N,s) = LnFact(s) - LnFact(r) - LnFact(s-r) - (r-1) Ln(N) ,

                     LnFact(s) - LnFact(r) - LnFact(s-r) - Ln(ar)
(4.8) LnNr(r,s,ar) = -------------------------------------------- .
                                        (r-1)
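
As a sketch (the function names are mine), equations 4.7 and 4.8 evaluate comfortably with a library log-gamma function, so that large s, r, and N stay within floating-point range:

```python
# Sketch (not from the paper): equations 4.7 and 4.8 via math.lgamma,
# where LnFact(x) = lgamma(x + 1).
from math import lgamma, log, exp

def ln_expected_aug_reps(r: int, N: float, s: int) -> float:
    """Equation 4.7: LnEar(r,N,s)."""
    return lgamma(s + 1) - lgamma(r + 1) - lgamma(s - r + 1) - (r - 1) * log(N)

def population_estimate(r: int, s: int, ar: float) -> float:
    """Equation 4.8: population from augmented r-reps, computed in logs."""
    ln_n = (lgamma(s + 1) - lgamma(r + 1) - lgamma(s - r + 1) - log(ar)) / (r - 1)
    return exp(ln_n)

# For doubles (r=2) these reduce to equation 2.2: s(s-1)/2N.
print(round(exp(ln_expected_aug_reps(2, 10000, 100)), 3))  # 0.495
print(round(population_estimate(2, 400, 8.034)))           # 9933
```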

Table 4.1 and table 4.2 illustrate two checks on equation 4.7 and higher-order repetitions. In table 4.1 we have all 64 possible trials of six bits (binary samples). The first column lists each particular trial, the second shows the number of repetitions found (from 6-reps through doubles), the third column gives the augmented repetitions, while the last column accumulates totals. At the bottom, equation 4.7 produces values for comparison; note that augmented 6-reps, 5-reps, 4-reps, 3-reps and doubles all occur exactly as equation 4.7 predicts. (Exact repetitions are also predicted, using (6.4).) Similarly, table 4.2 presents all 81 possible trials of four samples from a population of three, with similar success. Table 4.3 gives other tests, in summary version only.

The new development yields formulas for the expected number of augmented doubles, but sometimes we would prefer to have probability, a value from 0 to 1 (although this is unnecessary for population estimation). Let us assume, for the present, that the number of augmented doubles found in a fixed number of samples obeys a Poisson distribution. A Poisson distribution has only a single parameter, the quantity np (samples n times probability of success p); this is the mean u or expected value (see, for example, Parratt [16]). Given u we can calculate the probability that k augmented doubles will be found by using the Poisson cumulative distribution function shown in equation 5.1:

       n            n   (u^k)(e^-u)
(5.1) SUM P(k;u) = SUM  ----------- = 1
      k=0          k=0      k!

                               u^2 e^-u          u^n e^-u
          = e^-u  +  u e^-u  + --------  + ... + --------
                                   2                n!

Each term in (5.1) gives the probability of k successes. In our case, (5.1) gives us the probability of k doubles given a particular value of u, and a trial of a particular size will imply some value for u. Since the total probability (the sum of all terms) is 1, the probability of having at least one double is 1 minus the probability of zero doubles:

(5.2) Pd(u) = 1 - e^-u .

Now, since we already have an equation for expected augmented doubles, we might use Ead from (2.2) as an approximation to u, and then (5.2) should give us the approximate probability of an augmented double. Right?

Unconvinced? Well, how about this (from Meyer and Matyas [14: 672-673]): We start with (1.1), the probability of no match, from the original development for "at least one match":

Pq(N,s) = (1 - 1/N)(1 - 2/N) .. (1 - (s-1)/N)

and take the log of each side:

Ln(Pq(N,s)) = Ln(1 - 1/N) + Ln(1 - 2/N) + ... + Ln(1 - (s-1)/N) .

Now (assuming s << N), each of these terms can be expressed as a series:

                    1     1      1
(5.3) Ln(1-1/N) = - - - ---- - ---- - . . .
                    N   2N^2   3N^3

but if N is large (over, say, 10^3), the first term will dominate, and the rest can be discarded. This leaves us with:

Ln(Pq(N,s)) = -[ 1/N + 2/N + 3/N + ... + (s-1)/N ]

and this series is exactly the average of all the terms (the first plus the last, divided by two), times the number of terms, after factoring out N:

Ln(Pq(N,s)) = -((1 + (s-1)) / 2) (s-1) / N

or

Ln(Pq(N,s)) = - s(s-1) / 2N .

But look at this equation: the right-hand side is identical with equation (2.2) for expected augmented doubles, except for the sign! So the approximate probability of "at least one match" can be calculated from our development for augmented repetitions. Thus,

Pq(N,s) = e^-Ead

and

(5.4) Pd(N,s) = 1 - e^-Ead ,

which is exactly what we expect from a Poisson distribution of augmented doubles (in separate trials each with the same fixed number of samples). (Kaliski, et al. [6: 87] point out that the right-hand part of (1.3) can be approximated by essentially this same exponential.)

To finish up, we can easily write:

(5.5) Ed = -Ln( 1 - Pd ) .

This is why, in equation 2.4, we could not simply use 0.5 or 1.0 as the number of expected doubles. The number of doubles corresponding to a 50 percent probability of success is -Ln(0.5) or 0.693.
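
Equations 5.4 and 5.5 are a one-line pair; in this sketch (the function names are mine) they recover the 0.693 figure just mentioned:

```python
# Sketch (not from the paper): the Poisson link between expected augmented
# doubles and the probability of at least one double (equations 5.4, 5.5).
from math import exp, log

def match_probability(ead: float) -> float:
    """Equation 5.4: Pd = 1 - e^-Ead."""
    return 1.0 - exp(-ead)

def expected_doubles_for(pd: float) -> float:
    """Equation 5.5: Ed = -Ln(1 - Pd)."""
    return -log(1.0 - pd)

# A 50 percent chance of a match corresponds to 0.693 expected doubles,
# not 0.5:
print(round(expected_doubles_for(0.5), 3))               # 0.693
# And the birthday case: Ead(365, 23) = 23*22 / (2*365) = 0.693...
print(round(match_probability(23 * 22 / (2 * 365)), 3))  # 0.5
```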

As it turns out, Kullback [9] provides formulas which can be used directly on the birthday problem. (To conform to the previous notation, and to prevent confusion between n and N, we change Kullback's N to s (samples), and Kullback's n to N (population).) Starting on p. 29 we find: "For random text of s elements, where there are N different possible elements, the average number of elements occurring twice each is given by"

                s(s-1) N (1 - 1/N)^(s-2)
(6.1) Ed(N,s) = ------------------------
                         2 N^2

"the average number of elements occurring r times each is given by"

                  s(s-1)..(s-r+1) N (1 - 1/N)^(s-r)
(6.2) Er(r,N,s) = --------------------------------- .
                               r! N^r

For machine evaluation, we may express these in logs:

(6.3) LnEd(N,s) = Ln(s(s-1)/2) + (s-2) Ln(1 - 1/N) - Ln(N) ,

(6.4) LnEr(r,N,s) = LnFact(s) - LnFact(s-r) - LnFact(r) + (s-r) Ln(1 - 1/N) - (r-1) Ln(N) .

Although these equations do predict expected exact repetitions, they are going to be difficult to re-arrange to predict population. It would of course be possible to use any of these equations with an indirect root-finding numerical technique (for example, bisection [18]), but because the equations typically include powers higher than the first, any particular root may not be unique.

Interestingly, we see that (6.4) has five terms, and four of those are identical with the ones in (4.7). Thus, the additional term in (6.4) is the log of the fraction which converts the larger augmented repetition value ar into the smaller exact repetition value (or vice versa):

(6.5) LnExactReps(r,N,s) = (s-r) Ln(1 - 1/N) + Ln(ar) ,

or without logs:

(6.6) ExactReps(r,N,s) = (1 - 1/N)^(s-r) * ar .

The expression (1 - 1/N) is always fractional, so there can never be more exact repetitions than augmented ones. This expression is close to one when N is large, and as long as s is much smaller than N, the overall factor is still close to one. Thus, under normal circumstances, the number of exact repetitions is not too different from the number of augmented repetitions. (With N=10,000, the ratio is 0.99 at s=100, and 0.96 at s=400.) On the other hand, small N and large s maximize the differences. With N=2 and s=3 (the situation illustrated in table 2.2) there are only half as many exact doubles as augmented doubles, and the ratio increases exponentially with higher s. Of course, in order to use (6.5) or (6.6), we first must know the population N, and this is often what we hope to find.

The numerical evaluation of Ln(1 - 1/N) is trickier than it looks. For large N, (1 - 1/N) is going to be very close to 1, and Ln(1) is 0. Consequently, the evaluation of the log may well underflow to zero with a serious loss of precision. But we have seen Ln(1 - 1/N) expanded in equation 5.3 (for large N):

(6.7) Ln(1 - 1/N) =(approx.) -1/N ,

and, so, for large N (from 10^3 to 10^6, depending on the local numerical precision):

(6.8) LnExactReps(r,N,s) =(approx.) (r-s)/N + Ln(ar) ,

and the same could be done for (6.3) and (6.4).
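
As one way to sidestep that underflow (a sketch, not the paper's code; the function name is mine), the library function log1p evaluates Ln(1 - 1/N) accurately even for very large N, giving the exact-to-augmented ratio of (6.6) directly:

```python
# Sketch (not from the paper): the exact-to-augmented ratio (1 - 1/N)^(s-r)
# of equation 6.6, using math.log1p to evaluate Ln(1 - 1/N) without the
# precision loss of forming (1 - 1/N) first.
from math import exp, log1p

def exact_to_augmented_ratio(N: float, s: int, r: int) -> float:
    """(1 - 1/N)^(s-r): the fraction of augmented r-reps that are exact."""
    return exp((s - r) * log1p(-1.0 / N))

# The figures quoted above for N = 10,000:
print(round(exact_to_augmented_ratio(10000, 100, 2), 2))  # 0.99
print(round(exact_to_augmented_ratio(10000, 400, 2), 2))  # 0.96
# And the small case of table 2.2 (N=2, s=3): half as many exact doubles.
print(round(exact_to_augmented_ratio(2, 3, 2), 2))        # 0.5
```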

Kullback does not explain how he came up with (6.1) and (6.2), but they turn out to be very nearly binomial equations (see, for example, Parratt [16]). The binomial cumulative distribution function is:

       n             n
(6.9) SUM B(k;n,p) = SUM (n C k) p^k q^(n-k) = 1 .
      k=0           k=0

Each term in the sum gives the probability of finding exactly k successes in n independent tries when each success has probability p. In our terminology, consider the probability of an exact double (k=2) in s samples (n=s) from population N (p=1/N):

(6.10) B(2;s,1/N) = (s C 2) ((1/N)^2) (1 - 1/N)^(s-2)

Observe that the first factor (s C 2) is s(s-1)/2 and that a factor of N is necessary to yield the expected number of doubles, and the link to (6.1) is immediate. (Note that the Poisson and binomial distributions have two different applications here: the binomial approaches are simply alternatives to (1.3) or (2.2), while the distribution of multiple trials requires the Poisson equations to estimate the resulting probability.)

Now that we see how the binomial distribution fits in, we can see that it also supports yet another approach: the probability of doubles or better. Since the sum of all terms (the probability of each repetition-level) is 1, the probability of having at least one match is 1 minus the probability of zero successes minus the probability of one success:

(6.11) Ebd(N,s) = N( 1 - B(0;s,1/N) - B(1;s,1/N) ) .

Alas, this is not just a new expression for (1.3), but is instead yet another different concept of "doubles." We now have four such concepts, here arranged in decreasing magnitude:

      (1.3)      >     (2.2)     >        (6.11)         >    (6.1)
 "at least one"     "augmented"     "binomial multiples"     "exact"

The classical birthday development in section 1 is for "at least one match." The new development in section 2 is also for two or more occurrences, albeit in the special and numerically enhanced form of "augmented doubles." And the development in this section predicts "binomial multiples" as well as exactly two occurrences of a value. These different viewpoints make direct result comparisons difficult, but augmented repetitions remain by far the easiest to reverse for predicting population.

Computer programs were implemented to simulate random sampling from a known population. That population was then estimated by sampling the population in trials having various numbers of random samples and using those results in the augmented repetition equations.

Ideally, each time we perform a trial we would like to be
confident that the result reflects the larger reality.
Unfortunately, that is not the way sampling works. Instead, the
set of all possible trials forms a
*probability distribution*, so some
individual trials will always yield misleading results. On the
other hand, if we take
*many* trials, we have some reason to hope
that we can develop both an accurate distribution, and an
accurate mean for use in population prediction.

The question of the distribution of trial values was addressed experimentally. A statistical random number generator was used to generate some number of samples in each trial, and this was repeated for a large number of trials:

- First, storage was allocated for a symbol count associated with every possible symbol. Then, every time a symbol was drawn from the random number generator, the count associated with that symbol was incremented. Eventually this produced the number of times each symbol occurred in that particular trial.
- Next, storage was allocated for a count associated with each repetition level. Depending on the number of samples taken, there will normally be empties, singletons, doubles, triples, and so on; we want to count the number of doubles, triples, etc., in each trial. So for each possible symbol, the associated symbol count value was found, and the repetition count for that value was incremented. This produced the number of times each repetition level occurred in that particular trial.
- Then, for each trial, counters were incremented which were associated with the number of times each repetition level occurred in that trial. Consider doubles: Various trials found different numbers of doubles; each was a valid result. But a large number of trials produced a general distribution of doubles counts (and distributions of other repetition levels as well). We can collect the experimental doubles distribution by incrementing a count associated with the number of doubles found in each trial.
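
The three counting steps above might be sketched as follows; this is my reconstruction of the procedure, not the author's program, with Python's statistical RNG standing in for the generator under test:

```python
# Sketch (my reconstruction, not the paper's program): count augmented
# doubles in each trial, accumulate the distribution over many trials,
# and estimate the population from the mean via equation 2.4.
import random
from collections import Counter
from math import comb

def augmented_doubles_in_trial(population, samples, rng):
    """Augmented doubles in one trial: a value occurring c times
    contributes (c C 2) pairs, per the augmentation rule of section 2."""
    counts = Counter(rng.randrange(population) for _ in range(samples))
    return sum(comb(c, 2) for c in counts.values())

def run_trials(population, samples, trials, seed=1):
    """Return the mean augmented doubles, the experimental distribution,
    and the population estimate from equation 2.4."""
    rng = random.Random(seed)
    results = [augmented_doubles_in_trial(population, samples, rng)
               for _ in range(trials)]
    mean = sum(results) / trials
    histogram = Counter(results)  # trials found per doubles count
    estimate = samples * (samples - 1) / (2 * mean) if mean else float("inf")
    return mean, histogram, estimate

# e.g. 2000 trials of 100 samples each from population 10,000 (figure 7.1):
mean, hist, est = run_trials(10000, 100, 2000)
print(round(mean, 3), round(est))  # mean near 0.495; estimate near 10,000
```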


**Fig. 7.1 Augmented Doubles in Sqrt(N) Samples.**
*2000 trials of 100 samples each from population 10,000.*
Each bar
represents a particular number of augmented doubles, and the height
of each bar represents the number of experimental trials which found
that many augmented doubles. The fat white line connects each
number of trials predicted by a Poisson distribution having an
expected value calculated as e to the power given by equation 4.7
(log of expected augmented repetitions), for augmented doubles,
assuming the same overall number of trials.


**Fig. 7.2 Augmented Doubles, 1.5 Sqrt(N) Samples.**
*1333 trials of 150 samples each from population 10,000.*

The graphs in figures 7.1 through 7.6 illustrate the distribution
of the number of augmented doubles found in large numbers of trials
using six different quantities of random samples. Each graph
represents essentially the same collection effort. The
vertical bars represent the number of trials which produced the
associated number of augmented doubles.


**Fig. 7.3 Augmented Doubles in 2 Sqrt(N) Samples.**
*1000 trials of 200 samples each from population 10,000.*

The thick white line on each graph illustrates the number of trials
predicted to have the given number of doubles, as computed by a
Poisson distribution for expected doubles developed from
(4.7)
and the same total trials.


**Fig. 7.4 Augmented Doubles, 2.5 Sqrt(N) Samples.**
*800 trials of 250 samples each from population 10,000.*

The author finds these graphs compelling
evidence that large numbers of trials do find augmented doubles
in a distribution which is essentially Poisson in nature (and this
happy result nicely supports the probability development in
section 5).


**Fig. 7.5 Augmented Doubles in 3 Sqrt(N) Samples.**
*667 trials of 300 samples each from population 10,000.*


**Fig. 7.6 Augmented Doubles in 4 Sqrt(N) Samples.**
*500 trials of 400 samples each from population 10,000.*

Table 7.1 lists the conditions under which each experiment occurred.

Table 7.1. Means of Augmented Doubles over Many Trials and Other Results from Distribution Experiments

    Samples   Trials   Mean    Chi-Sq   Est. N
      100      2000    0.492     -       10061
      150      1333    1.093     -       10224
      200      1000    1.962    0.336    10143
      250       800    3.111    0.261    10004
      300       667    4.493    0.573     9982
      400       500    8.034    0.033     9933

This was a single contiguous set of experiments; the samples were taken from a fast statistical RNG and a "population" of 10,000 symbols. The mean is the arithmetic average of augmented doubles over all trials. The chi-square column gives the chi-square "p" statistic (the area of the chi-square density function below the experimental value) comparing the experimental results to the Poisson distribution (but only for experiments which had more than five bins exceeding five counts). The chi-square value of 0.033 in the last row indicates that the results in the last set were unexpectedly close to the Poisson distribution (for random samples); this was unusual. The predicted population values were relatively accurate.
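The Est. N column can be reproduced by reversing the expected augmented doubles. Assuming the relation E[augmented doubles] = s(s-1)/(2N), which is consistent with every row of Table 7.1, the estimate is:

```python
def estimate_population(s, mean_aug_doubles):
    # Reverse E[augmented doubles] = s(s-1)/(2N) to solve for N.
    return s * (s - 1) / (2 * mean_aug_doubles)

# Reproduce the Est. N column of Table 7.1 from Samples and Mean:
table_7_1 = [(100, 0.492), (150, 1.093), (200, 1.962),
             (250, 3.111), (300, 4.493), (400, 8.034)]
for s, mean in table_7_1:
    print(s, round(estimate_population(s, mean)))
```

The computed values match the table to within a unit of rounding.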

Whenever we randomly sample, we can expect to get various result values. While it is useful to know the expected distribution of those values, we must remember that even very improbable values can and will occur occasionally. We will need enough experimental trials to see the form the distribution is taking. But then the experimental distribution should be fairly obvious, so the theoretical distribution may be of somewhat less import than one might think.

Obtaining samples is expensive in terms of time, while storing
and comparing sampled values is expensive in both storage
*and* time.
There is ample motive to use the fewest possible samples. However,
those who would understand the accuracy of reversed birthday
techniques should refer again to the graphs in figures
7.1 through
7.6. Every one of the repetition counts in each bar was formed as
a valid result of a separate trial on
*the exact same population*.
An experiment measuring population could have produced any of
these results in any single trial.

A zero result, common in trials with under 2 Sqrt(N) samples,
will give no estimate at all (by itself). Results of 1 and 24,
both of which occur in figure
7.6
(with 4 Sqrt(N) samples),
estimate populations which differ by a factor of 24. Since we are
measuring population, we will not know, beforehand, the shape of
the particular distribution we are measuring, and which estimates
are most likely. Consequently, the reversed birthday approach is
really useful only
*after* we have enough trials to be fairly sure
of the mean number of repetitions.

When we develop a mean value across many trials, we expect that
the more trials we have, the more accurate the result will be. But
*how much* more accurate? (See, for example, Boas
[3: 732-735].)
*Without*
assuming a particular underlying distribution, we can use
the *central limit theorem*
as grounds for assuming that errors in
the mean will be distributed normally (over many trials). Using
the sample standard deviation sd, and trials t, the
*standard deviation of the mean* sd(m) is expected to be:

    (8.1)   sd(m) = sd / Sqrt(t)

(Four times as many trials should give us half the standard deviation and thus twice the accuracy in the sample mean.) For a 50 percent confidence interval (the "probable error"), the normal distribution gives the factor 0.6745.

If we assume the underlying distribution is Poisson, we might think to avoid computing the sample variance sd^2 since that should be equal to the mean u. However, in the experimental data, while the variance and mean do track fairly well, it is not at all unusual to see them differ by ten percent or more. Thus, we probably should compute the variance explicitly from the data. But for a quick-and-dirty answer given only mean u and trials t, the probable error r should be:

    (8.2)   r = 0.6745 Sqrt(u) / Sqrt(t)

The normal distribution demands large t and produces a wider confidence interval factor (0.6745) than we really need (especially for our expected small values of the mean u [16: 206]). Nevertheless, using the normal distribution for the confidence interval appears to be standard practice. Excessive concern about accuracy in these statistical predictions may be misplaced anyway, since each experiment will be different.
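Equations 8.1 and 8.2 translate directly into code; this is only a sketch, with the 0.6745 factor being the 50 percent point of the normal distribution as described above:

```python
from math import sqrt

def sd_of_mean(sd, t):
    # (8.1): averaging over t trials shrinks the standard deviation
    # of the mean by a factor of sqrt(t).
    return sd / sqrt(t)

def probable_error(u, t):
    # (8.2): quick-and-dirty probable error, taking the Poisson
    # variance (equal to the mean u) in place of the sample variance.
    return 0.6745 * sqrt(u) / sqrt(t)
```

Quadrupling the number of trials t halves both quantities, as the text notes.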

During statistical tests of the reversed birthday approach, it became apparent that these methods could reveal problems which are virtually invisible to other statistical tests. There are many statistical tests of distribution (see, for example, Knuth [8] and Marsaglia [12]), but most of these can be fooled by simultaneous inverse relationships between population and probability.


**Fig. 9.1 Even Distribution without Population Deviation.**
Each possible symbol has exactly the same probability. The ideal
distribution for a discrete random number generator (RNG).

Consider figure 9.1, which is intended to represent the even distribution we expect from a good RNG. The separate spikes represent the discrete individual values (or codes or symbols) that the RNG can produce (any digital result can give us only a finite number of representations, no matter how the values are produced). The height of each spike represents the probability of obtaining that particular value. Figure 9.1 represents the ideal in which any possible value can be obtained with equal probability.


**Fig. 9.2 Even Distribution with Population Deviation.**
A distribution in which half of the range has only half of the
expected values, yet each of these values is twice as probable
as it should be. Any common test which groups even a couple of
adjacent symbols is unlikely to find much wrong with this
distribution.

Now consider figure 9.2; strangely, this is *also* an even
distribution. While half of the range contains only half the
symbols of the other half (and one-quarter of the symbols are thus
unused), each symbol in the reduced side is twice as probable.
Consequently, this distribution has essentially the same range,
mean, deviation, and comparison to flat distribution as 9.1, yet
only three-quarters the
*uniqueness*. (The means will be very close
when there are a large number of symbols; a chi-square comparison
to a flat distribution will indicate similarity if, as usual, the
comparison bins cover multiple symbols.)
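The effect can be quantified: repetition tests respond to the pair-collision probability, the sum of the squared symbol probabilities, so a reversed birthday estimate sees an effective population of one over that sum. A sketch for the figure 9.2 situation follows; N and the exact probability layout are illustrative assumptions. Note that while the deviant distribution has three-quarters the uniqueness (750 of 1000 symbols used), its birthday-effective population is even smaller, about 2N/3.

```python
N = 1000

# Ideal (fig. 9.1): every symbol equally probable.
ideal = [1 / N] * N

# Deviant (fig. 9.2): half the range holds only half the expected
# values, each twice as probable; a quarter of the symbols are unused.
deviant = [1 / N] * (N // 2) + [2 / N] * (N // 4)

def collision_prob(probs):
    # Probability that two independent samples coincide: sum of p_i^2.
    return sum(p * p for p in probs)

# Both are complete distributions, but the deviant one collides 1.5
# times as often, so a doubles-based estimate (1 / collision_prob)
# reports a smaller effective population.
```

A binned chi-square test that lumps a high-probability symbol with its unused neighbor sees the expected flat counts; the collision probability does not.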

Table 9.1 illustrates all possible trials for two cases of
N=4 and s=2: First, the expected
*ideal* distribution with four
symbols, and then a
*deviant* distribution in which symbol 1 never
occurs, and symbol 0 occurs twice as often to compensate for the
loss.

Table 9.1 All Possible Trials for N=4, s=2, for Both Ideal and Deviant Populations

    Ideal Population        Deviant Population
    Trial   Aug. Doubles    Trial   Aug. Doubles
    0 0          1          0 0          1
    0 1          0          0 0          1
    0 2          0          0 2          0
    0 3          0          0 3          0
    1 0          0          0 0          1
    1 1          1          0 0          1
    1 2          0          0 2          0
    1 3          0          0 3          0
    2 0          0          2 0          0
    2 1          0          2 0          0
    2 2          1          2 2          1
    2 3          0          2 3          0
    3 0          0          3 0          0
    3 1          0          3 0          0
    3 2          0          3 2          0
    3 3          1          3 3          1
                ---                     ---
                 4                       6

Note that reversed birthday methods do detect something strange about the population. (With N=4, the half-symbol difference in means between the two cases is obvious without birthday computations, but for large N, a huge number of samples would be necessary to confirm any such difference.) Table 9.1 shows that repetition counts are sensitive to symbols of higher than expected probability, even if those symbols are nominally compensated by low-probability symbols. When low-probability symbols and high-probability symbols are in close proximity, binned tests are going to have a difficult time detecting a deviant population.
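Table 9.1 can be checked by brute force. This sketch enumerates all sixteen two-sample trials for both populations; the deviant population is modeled by listing symbol 0 twice and omitting symbol 1:

```python
from collections import Counter
from itertools import product
from math import comb

def aug_doubles(trial):
    # C(k,2) augmented doubles per symbol seen k times in the trial.
    return sum(comb(k, 2) for k in Counter(trial).values())

ideal = [0, 1, 2, 3]     # four equally probable symbols
deviant = [0, 0, 2, 3]   # symbol 1 never occurs; symbol 0 twice as probable

ideal_total = sum(aug_doubles(t) for t in product(ideal, repeat=2))
deviant_total = sum(aug_doubles(t) for t in product(deviant, repeat=2))
print(ideal_total, deviant_total)   # 4 and 6, as in Table 9.1
```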


**Fig. 9.3 Even Distribution with Smaller Population
Deviation.**

Figure 9.3 continues the development of a distribution which
is nominally
*flat* statistically, but which has local population
deviations. Although a
*large* number of high-probability values
would clearly dominate repetition results, a
*small* number of
high-probability values could be hard to find even using
repetition methods.

Actual probability variations could be continuous and
arbitrary. One could
*easily* imagine that some "physically random"
noise-based designs might well have an inherent inverse
relationship between local population and probability.
A cryptanalyst who knew of this relationship would naturally
try the most-probable values first.

At this point, we can apply the development in this paper to the Keprom results mentioned in the introduction. Table 10.1 illustrates the use of equation 4.8 to estimate population from augmented repetitions calculated from the Keprom results [11].

Table 10.1 Estimating Keprom Population from Augmented Repetitions

    s        r    Ear      Aug Pop
    900000   18       1    17.9 bits
    900000   17      18    17.7 bits
    900000   16     153    17.7 bits
    900000   15     816    17.6 bits
    900000   14    3060    17.6 bits
    900000   13    8568    17.6 bits
    900000   12   18564    17.7 bits
    900000   11   31824    17.7 bits
    900000   10   43758    17.8 bits
    900000    9   48621    18.0 bits
    900000    8   43768    18.2 bits
    900000    7   31868    18.5 bits
    900000    6   18676    19.0 bits
    900000    5    8751    19.7 bits
    900000    4    3262    21.0 bits
    900000    3     972    23.4 bits
    900000    2     362    30.1 bits

Of course, these results should be considered a single huge trial, which means that the sampling distribution has not been resolved. Thus, the reported results may not accurately represent the true experimental means.
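The Aug Pop column follows from reversing the expected augmented r-reps. Assuming E[augmented r-reps] = C(s,r) / N^(r-1), a form consistent with the table's values (the paper's equation 4.8), the population in bits is:

```python
from math import comb, log2

def aug_pop_bits(s, r, ear):
    # Solve ear = C(s,r) / N**(r-1) for N, and report log2(N) in bits.
    return (log2(comb(s, r)) - log2(ear)) / (r - 1)

print(round(aug_pop_bits(900_000, 18, 1), 1))     # top row of Table 10.1
print(round(aug_pop_bits(900_000, 9, 48621), 1))  # middle row
print(round(aug_pop_bits(900_000, 2, 362), 1))    # bottom row
```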

Table 10.2 illustrates the indirect use of equation 6.4 in a numerical root-finding procedure to estimate the population based on exact repetitions.

Table 10.2 Estimating Keprom Population from Smallest Root of Exact Repetition Equations

    s        r    Er     Xct Pop
    900000   18     1    17.4 bits
    900000    9     1    19.8 bits
    900000    8     1    20.3 bits
    900000    5     1    23.0 bits
    900000    4     1    24.8 bits
    900000    3     2    27.9 bits
    900000    2   123    31.6 bits

There is some agreement with table 10.1, as one would expect with a large population. However, the numerical procedures can find more than one root, and selecting a particular root makes a big difference. Here the root-finding procedure was intended to find the smallest root (although, without specific analysis of each case, we cannot be sure it did). This sort of indirect root-search is slow, and because the roots need not be unique, it is far less satisfactory than the direct estimates available from the augmented equations.

Note that both tables predict a much-reduced effective population based on the existence of an 18-rep. If the reported data are actually representative of the device, we seem warranted in entertaining serious doubts about the effectiveness of the design. Note that higher-order repetitions can provide the clearest indication of local distribution irregularities.

The ability to estimate population from mean augmented repetitions rests on two questions:

1. Does the new development really predict augmented repetitions, and

2. Are augmented repetitions approximately Poisson-distributed over multiple similar trials?

These questions have been investigated experimentally (to some extent), and it appears that both answers are "yes." If so, multiple trials can be used to estimate the mean augmented repetitions (in trials with a fixed number of samples), and that value used in the new reversed equations to estimate the population. (Note that birthday techniques do not address common software linear-congruential or shift-register RNG sequences in which any particular result occurs only once before the system repeats. Birthday techniques require the possibility that multiple samples may produce the same value.)

Reversed birthday techniques are useful in that they allow estimation of population based only on random samples from that population. Useful estimation is possible (over tens or hundreds of trials) with as few as Sqrt(N) samples per trial; with "only" Sqrt(N) samples (per trial), we can begin to estimate N. The difficulty is that we must in some way take, save and compare Sqrt(N) distinct samples, and this can be a lot of samples. Nevertheless, population analysis of 32-bit physically-random RNG's should be well within the range of a modern personal computer.

Triples and higher repetitions do not help us much with the sampling size problem; in fact, they make it worse (see table 11.1).

Table 11.1 Expected Augmented Doubles and Triples for Various Populations

                  2^16 Samples          2^18 Samples
    Population   Doubles   Triples     Doubles   Triples
    2^16           32767     10922      524284    699042
    2^20            2048        43       32768      2731
    2^24             128         0        2048        11
    2^28               8         0         128         0
    2^32             0.5         0           8         0

For repetitions to be useful in population estimation, they must occur in numbers sufficient to support a reliable mean. For any given population, it is far easier to collect data on doubles than triples. But the occasional triple or higher repetition will make a fine contribution to the total of augmented doubles.
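Table 11.1 can be regenerated from the same assumed expectations, E[augmented doubles] = C(s,2)/N and E[augmented triples] = C(s,3)/N^2; the computed values agree with the table to within a unit of rounding.

```python
from math import comb

def expected_aug_doubles(s, N):
    # Assumed expectation: C(s,2) pairs, each colliding with probability 1/N.
    return comb(s, 2) / N

def expected_aug_triples(s, N):
    # Assumed expectation: C(s,3) triples, each coinciding with probability 1/N^2.
    return comb(s, 3) / N ** 2

for N in (2**16, 2**20, 2**24, 2**28, 2**32):
    for s in (2**16, 2**18):
        print(N, s, expected_aug_doubles(s, N), expected_aug_triples(s, N))
```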

On the other hand, higher-order repetitions can be sensitive indicators of deviant populations. If we limit all trials to a size which only produces a few doubles, we could miss indications that our doubles come from a subset of the overall population. While this may not matter if we are interested only in confirming a minimum population, it could be a mistake if we wish to identify problems so they can be fixed.

In practice, it is important first to survey the situation with a few large trials and follow up on any suspect indications. Assuming the initial survey goes well, we should have enough data to fix the size of each trial for record; we might like to see three or more augmented doubles per trial, which implies at least 2.5 Sqrt(N) samples. We would also like to perform as many trials as possible (tens, hopefully; hundreds, if possible), find the augmented doubles in each, and examine the experimental distribution. Then we can come up with a reasonable estimate of the mean, and plug that value into (2.4) to estimate the population. We can do a lot of this automatically, of course, but no simple computation is likely to be as effective as an experimenter who is actively involved with the data.
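The procedure just described might be sketched end-to-end as follows. The parameters are illustrative (trial size 2.5 Sqrt(N), a couple hundred trials), a uniform software RNG stands in for the generator under test, and the final reversal assumes the doubles expectation s(s-1)/(2N) used earlier:

```python
import random
from collections import Counter
from math import comb, isqrt

def trial_aug_doubles(rng, N, s):
    # One trial: s random samples, then C(k,2) doubles per repeated symbol.
    return sum(comb(k, 2)
               for k in Counter(rng.randrange(N) for _ in range(s)).values())

def estimate(N_true=10_000, trials=200, seed=7):
    rng = random.Random(seed)
    s = int(2.5 * isqrt(N_true))  # aim for about three augmented doubles per trial
    mean = sum(trial_aug_doubles(rng, N_true, s) for _ in range(trials)) / trials
    return s * (s - 1) / (2 * mean)  # reversed doubles estimate of N

print(estimate())
```

With 200 trials of 250 samples each, the estimate lands within a few percent of the true population of 10,000.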

An easy and intuitive development has produced simple, exact, and easily reversed equations for estimating population from the number of repeated values found in multiple trials. For a reasonable hope (probability 0.63 or more) of finding at least one augmented double, each trial needs a number of random samples at least equal to the square root of the actual population. Because the numbers of repetitions found in separate trials follow a broad distribution, multiple trials are necessary to establish a mean augmented repetition value, which can then be used to estimate the parent population.

Earlier equations predict the number of exact repetitions expected from a known population, but those equations are difficult to reverse to predict population. Even if we find numerical solutions to those equations (using inherently slow numerical root-finding techniques), the resulting solutions generally are not unique. The equations from the new development eliminate these problems, at the expense of relatively simple augmentation computations.

The raw estimation of population has obvious applications in the design of physically random number generators. The ability to identify actual element values which appear in higher than expected quantities is an obvious benefit of repetition methods. But the fact that unexpectedly probable elements can be detected despite being intermixed with unexpectedly improbable elements seems to provide a new analysis capability. Since population variations would weaken any really random generator, testing for these defects could be an important tool in the future development of physically random generators.

Another potential application is the analysis of cipher designs. Most cipher systems function by hiding a message "needle" in a "haystack" of other decipherings. Ordinarily we assume that the number of different cipherings is equal to the number of possible keys, but in a defective cipher this may not be the case. If many keys produce the same result, the cipher may be unexpectedly weak. Reversed birthday methods may provide an opportunity to test for such a problem.
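As an illustration only, the idea might look like the sketch below. The cipher here is a toy stand-in whose key space is deliberately collapsed to 256 effective keys; `effective_keyspace`, `toy_encrypt`, and all constants are hypothetical, not any real cipher's API.

```python
import random
from collections import Counter
from math import comb

def effective_keyspace(encrypt, keys, plaintext):
    # Count augmented doubles among the cipherings of one plaintext under
    # many random keys, then reverse the birthday expectation for N.
    cts = [encrypt(key, plaintext) for key in keys]
    doubles = sum(comb(k, 2) for k in Counter(cts).values())
    s = len(keys)
    return s * (s - 1) / (2 * doubles) if doubles else None

# Toy "cipher": only the low 8 key bits matter, so just 256 distinct cipherings.
def toy_encrypt(key, pt):
    return (pt * 2654435761 + (key % 256)) % (1 << 32)

rng = random.Random(1)
keys = [rng.getrandbits(32) for _ in range(256)]
est = effective_keyspace(toy_encrypt, keys, 0x1234)
print(est)  # near 256 rather than 2**32, exposing the collapsed key space
```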

Reversed birthday techniques presumably could be useful also in more general statistical applications.

Thanks to Nico de Vries (100115.2303@compuserve.com) for proposing (on Usenet sci.crypt) a random number generator computer program which he promoted as physically random. Normally, any computer RNG is a deterministic mechanism [20], and simple examination will easily disclose the "amount of internal data" or "state" which it uses. Nico's design was unusual in that it was based on real-time program interactions with IBM PC-type hardware, so it was difficult to know how much internal state there was, or where that state was located. Such a design raised the question of how to measure RNG population when only the output values themselves are available.

Thanks to Ross Anderson (Ross.Anderson@cl.cam.ac.uk) for suggesting (also on Usenet sci.crypt) that a birthday test could be used to estimate population, and apologies to him also, for my obsession with this idea. Ross sent me email preprints of his birthday test papers [1,2], but then I developed the concept of augmented repetitions, which produced simple equations and exact unique results. All errors in this development are, of course, mine alone.

1. Anderson, R., R. Gibbens, C. Jagger, F. Kelly and M. Roe. Measuring the Diversity of Random Number Generators. In progress.

2. Anderson, R. Testing Random Number Generators. Submitted
for publication in *Electronics Letters*.

3. Boas, M. 1983. *Mathematical Methods in the Physical
Sciences*. John Wiley & Sons.

4. Davies, D. and W. Price. 1984. *Security for Computer
Networks*. John Wiley & Sons.

5. Goodman, A. and J. Ratti. 1971. *Finite Mathematics with
Applications*. Macmillan.

6. Kaliski, B., R. Rivest and A. Sherman. 1985. Is the Data
Encryption Standard a Group? (Preliminary Abstract)
*Advances in Cryptology -- Eurocrypt '85*. 81-95.

7. Knuth, D. 1973. *The Art of Computer Programming*.
Vol. I, *Fundamental Algorithms*. Addison-Wesley.

8. Knuth, D. 1981. *The Art of Computer Programming*.
Vol. II, *Seminumerical Algorithms*. Addison-Wesley.

9. Kullback, S. 1976 (first published 1938). *Statistical
Methods in Cryptanalysis*. Aegean Park Press, P.O. Box 2837,
Laguna Hills, CA 92654-0837.

10. Holtzman, J. 1987. The KEPROM: Sinking the Software Pirates.
*Radio-Electronics*. June. 100-104.

11. Letham, L., D. Hoff and A. Folmsbee. 1986. A 128K EPROM Using
Encryption of Pseudorandom Numbers to Enable Read Access.
*IEEE Journal of Solid-State Circuits*. SC-21(5): 881-888.

12. Marsaglia, G. 1984. A Current View of Random Number
Generators. *Proceedings of the Sixteenth Symposium on the
Interface Between Computer Science and Statistics*. 3-10.

13. Merkle, R. 1982. *Secrecy, Authentication, and Public Key
Systems*. UMI Research Press, University Microfilms
International, Ann Arbor, Michigan 48106.

14. Meyer, C. and S. Matyas. 1982. *Cryptography: A New
Dimension in Data Security*. John Wiley & Sons.

15. Morita, H., K. Ohta and S. Miyaguchi. 1991. A switching
closure test to analyze cryptosystems. *Advances in
Cryptology -- CRYPTO '91*. 183-193.

16. Parratt, L. 1961. *Probability and Experimental Errors in
Science*. Dover.

17. Patterson, W. 1987. *Mathematical Cryptology*. Rowman &
Littlefield, 81 Adams Drive, Totowa, New Jersey 07512.

18. Press, W. et al. 1986. *Numerical Recipes*.
Cambridge University Press.

19. Quisquater, J. and J. Delescaille. 1989. How easy is
collision search? Applications to DES. *Advances in
Cryptology -- Eurocrypt '89*. 429-434.

20. Ritter, T. 1991. The Efficient Generation of Cryptographic
Confusion Sequences. *Cryptologia*. 15(2): 81-139.

21. Seberry, J. and J. Pieprzyk. 1989. *Cryptography: An
Introduction to Computer Security*. Prentice Hall.

22. Zollo, S. 1985. IC Keeps Pirates Out. *Electronics
Week*. February 11: 73, 76.

Sci.crypt is the name of a news group on Usenet, a bulletin-board network of computers which use dial-up lines for network communication. Usenet was originally developed using the UUCP (Unix-to-Unix Copy) package distributed with Unix, plus free news software which collects messages in many hundreds of different topics or "groups." On Usenet, messages are passed from computer to computer (until the messages hopefully reach every connected computer), so there is no central facility and no single overall ownership. Electronic mail is supported in the same way.

Access to Usenet News is generally also available on the Internet, an exponentially growing interconnection of networks. Although once confined to higher education and research laboratories, the Internet is increasingly available to ordinary citizens who pay a fee for an account on a computer which is Internet connected.

Those unacquainted with the Internet and News could be disappointed in what they find. Huge numbers of individuals have access to these groups, and many of the groups (like sci.crypt) are unmoderated. Anyone can respond to any message, and much of what is authoritatively expounded is also wrong, or perhaps just incomplete. New rank novices appear almost every day. Not only is it hard to know what to believe, it can be hard simply to find time to read through the chaff (currently tens of messages per day in sci.crypt) to get to the occasional kernel of new information. In addition, the bulletin-board format seems to encourage vituperative responses or "flames," which naturally produce similar counter-responses. These "flame wars" can be disconcerting to those used to respectful intellectual discourse.

At one time sci.crypt supported mostly technical discussions, and these occasionally still appear. But with continuing exponential growth, the advent of an available public key package (Pretty Good Privacy or PGP) and the U.S. government's "Clipper Chip" proposal, most of the traffic in the past year has addressed political issues.

The 27916 Keprom is an integrated circuit memory device designed at Intel Corporation in the early 80's [22]. The Keprom is a "Keyed-Access EPROM" (Erasable Programmable Read-Only Memory) which performs an authentication handshake, during which physically-random values are encrypted under a programmed key [10]. In this way, the data in the EPROM are unlocked only if the associated equipment contains another Keprom with the correct key. Apparently the device was intended to be a technical solution to the problem of software piracy, perhaps especially in the games market. Unfortunately, the device appeared at about the same time that major software manufacturers were dropping copy protection.

The issue of interest here is that the Keprom device included an on-chip physically-random number generator, and a technical paper [11] reported results from evaluation tests on that generator. Briefly, the design consists of a 32-bit shift-register which accumulates bits sampled from a drifting on-chip oscillator, noise, and feedback from the register itself. Because 32-bit values are reported from the generator, we expect to estimate a 32-bit population. However, the higher-order repetitions in the reported data imply an effective population very much smaller than this, which may indicate problems in the physically-random design.

Table 4.1 All Possible Trials for N=2 and s=6 Trial Repetitions Aug. Reps. Accum. Aug. Reps. 0 0 0 0 0 0 1 0 0 0 0 1 6 15 20 15 1 6 15 20 15 0 0 0 0 0 1 0 1 0 0 0 0 1 5 10 10 1 7 20 30 25 0 0 0 0 1 0 0 1 0 0 0 0 1 5 10 10 1 8 25 40 35 0 0 0 0 1 1 0 0 1 0 1 0 0 1 4 7 1 8 26 44 42 0 0 0 1 0 0 0 1 0 0 0 0 1 5 10 10 1 9 31 54 52 0 0 0 1 0 1 0 0 1 0 1 0 0 1 4 7 1 9 32 58 59 0 0 0 1 1 0 0 0 1 0 1 0 0 1 4 7 1 9 33 62 66 0 0 0 1 1 1 0 0 0 2 0 0 0 0 2 6 1 9 33 64 72 0 0 1 0 0 0 0 1 0 0 0 0 1 5 10 10 1 10 38 74 82 0 0 1 0 0 1 0 0 1 0 1 0 0 1 4 7 1 10 39 78 89 0 0 1 0 1 0 0 0 1 0 1 0 0 1 4 7 1 10 40 82 96 0 0 1 0 1 1 0 0 0 2 0 0 0 0 2 6 1 10 40 84 102 0 0 1 1 0 0 0 0 1 0 1 0 0 1 4 7 1 10 41 88 109 0 0 1 1 0 1 0 0 0 2 0 0 0 0 2 6 1 10 41 90 115 0 0 1 1 1 0 0 0 0 2 0 0 0 0 2 6 1 10 41 92 121 0 0 1 1 1 1 0 0 1 0 1 0 0 1 4 7 1 10 42 96 128 0 1 0 0 0 0 0 1 0 0 0 0 1 5 10 10 1 11 47 106 138 0 1 0 0 0 1 0 0 1 0 1 0 0 1 4 7 1 11 48 110 145 0 1 0 0 1 0 0 0 1 0 1 0 0 1 4 7 1 11 49 114 152 0 1 0 0 1 1 0 0 0 2 0 0 0 0 2 6 1 11 49 116 158 0 1 0 1 0 0 0 0 1 0 1 0 0 1 4 7 1 11 50 120 165 0 1 0 1 0 1 0 0 0 2 0 0 0 0 2 6 1 11 50 122 171 0 1 0 1 1 0 0 0 0 2 0 0 0 0 2 6 1 11 50 124 177 0 1 0 1 1 1 0 0 1 0 1 0 0 1 4 7 1 11 51 128 184 0 1 1 0 0 0 0 0 1 0 1 0 0 1 4 7 1 11 52 132 191 0 1 1 0 0 1 0 0 0 2 0 0 0 0 2 6 1 11 52 134 197 0 1 1 0 1 0 0 0 0 2 0 0 0 0 2 6 1 11 52 136 203 0 1 1 0 1 1 0 0 1 0 1 0 0 1 4 7 1 11 53 140 210 0 1 1 1 0 0 0 0 0 2 0 0 0 0 2 6 1 11 53 142 216 0 1 1 1 0 1 0 0 1 0 1 0 0 1 4 7 1 11 54 146 223 0 1 1 1 1 0 0 0 1 0 1 0 0 1 4 7 1 11 55 150 230 0 1 1 1 1 1 0 1 0 0 0 0 1 5 10 10 1 12 60 160 240 1 0 0 0 0 0 0 1 0 0 0 0 1 5 10 10 1 13 65 170 250 1 0 0 0 0 1 0 0 1 0 1 0 0 1 4 7 1 13 66 174 257 1 0 0 0 1 0 0 0 1 0 1 0 0 1 4 7 1 13 67 178 264 1 0 0 0 1 1 0 0 0 2 0 0 0 0 2 6 1 13 67 180 270 1 0 0 1 0 0 0 0 1 0 1 0 0 1 4 7 1 13 68 184 277 1 0 0 1 0 1 0 0 0 2 0 0 0 0 2 6 1 13 68 186 283 1 0 0 1 1 0 0 0 0 2 0 0 0 0 2 6 1 13 68 188 289 1 0 0 1 1 1 0 0 1 0 1 0 0 1 4 7 1 13 69 192 296 1 0 
1 0 0 0 0 0 1 0 1 0 0 1 4 7 1 13 70 196 303 1 0 1 0 0 1 0 0 0 2 0 0 0 0 2 6 1 13 70 198 309 1 0 1 0 1 0 0 0 0 2 0 0 0 0 2 6 1 13 70 200 315 1 0 1 0 1 1 0 0 1 0 1 0 0 1 4 7 1 13 71 204 322 1 0 1 1 0 0 0 0 0 2 0 0 0 0 2 6 1 13 71 206 328 1 0 1 1 0 1 0 0 1 0 1 0 0 1 4 7 1 13 72 210 335 1 0 1 1 1 0 0 0 1 0 1 0 0 1 4 7 1 13 73 214 342 1 0 1 1 1 1 0 1 0 0 0 0 1 5 10 10 1 14 78 224 352 1 1 0 0 0 0 0 0 1 0 1 0 0 1 4 7 1 14 79 228 359 1 1 0 0 0 1 0 0 0 2 0 0 0 0 2 6 1 14 79 230 365 1 1 0 0 1 0 0 0 0 2 0 0 0 0 2 6 1 14 79 232 371 1 1 0 0 1 1 0 0 1 0 1 0 0 1 4 7 1 14 80 236 378 1 1 0 1 0 0 0 0 0 2 0 0 0 0 2 6 1 14 80 238 384 1 1 0 1 0 1 0 0 1 0 1 0 0 1 4 7 1 14 81 242 391 1 1 0 1 1 0 0 0 1 0 1 0 0 1 4 7 1 14 82 246 398 1 1 0 1 1 1 0 1 0 0 0 0 1 5 10 10 1 15 87 256 408 1 1 1 0 0 0 0 0 0 2 0 0 0 0 2 6 1 15 87 258 414 1 1 1 0 0 1 0 0 1 0 1 0 0 1 4 7 1 15 88 262 421 1 1 1 0 1 0 0 0 1 0 1 0 0 1 4 7 1 15 89 266 428 1 1 1 0 1 1 0 1 0 0 0 0 1 5 10 10 1 16 94 276 438 1 1 1 1 0 0 0 0 1 0 1 0 0 1 4 7 1 16 95 280 445 1 1 1 1 0 1 0 1 0 0 0 0 1 5 10 10 1 17 100 290 455 1 1 1 1 1 0 0 1 0 0 0 0 1 5 10 10 1 18 105 300 465 1 1 1 1 1 1 1 0 0 0 0 1 6 15 20 15 2 24 120 320 480 For all 64 possible trials of 6 samples from a population of 2 symbols: XctTot: 2 12 30 40 30 XctEr: 2 12 30 40 30 AugTot: 2 24 120 320 480 AugEar: 2 24 120 320 480

Table 4.2 All Possible Trials for N=3 and s=4 Trial Reps Aug. Reps Accum. Aug. Reps 0 0 0 0 1 0 0 1 4 6 1 4 6 0 0 0 1 0 1 0 0 1 3 1 5 9 0 0 0 2 0 1 0 0 1 3 1 6 12 0 0 1 0 0 1 0 0 1 3 1 7 15 0 0 1 1 0 0 2 0 0 2 1 7 17 0 0 1 2 0 0 1 0 0 1 1 7 18 0 0 2 0 0 1 0 0 1 3 1 8 21 0 0 2 1 0 0 1 0 0 1 1 8 22 0 0 2 2 0 0 2 0 0 2 1 8 24 0 1 0 0 0 1 0 0 1 3 1 9 27 0 1 0 1 0 0 2 0 0 2 1 9 29 0 1 0 2 0 0 1 0 0 1 1 9 30 0 1 1 0 0 0 2 0 0 2 1 9 32 0 1 1 1 0 1 0 0 1 3 1 10 35 0 1 1 2 0 0 1 0 0 1 1 10 36 0 1 2 0 0 0 1 0 0 1 1 10 37 0 1 2 1 0 0 1 0 0 1 1 10 38 0 1 2 2 0 0 1 0 0 1 1 10 39 0 2 0 0 0 1 0 0 1 3 1 11 42 0 2 0 1 0 0 1 0 0 1 1 11 43 0 2 0 2 0 0 2 0 0 2 1 11 45 0 2 1 0 0 0 1 0 0 1 1 11 46 0 2 1 1 0 0 1 0 0 1 1 11 47 0 2 1 2 0 0 1 0 0 1 1 11 48 0 2 2 0 0 0 2 0 0 2 1 11 50 0 2 2 1 0 0 1 0 0 1 1 11 51 0 2 2 2 0 1 0 0 1 3 1 12 54 1 0 0 0 0 1 0 0 1 3 1 13 57 1 0 0 1 0 0 2 0 0 2 1 13 59 1 0 0 2 0 0 1 0 0 1 1 13 60 1 0 1 0 0 0 2 0 0 2 1 13 62 1 0 1 1 0 1 0 0 1 3 1 14 65 1 0 1 2 0 0 1 0 0 1 1 14 66 1 0 2 0 0 0 1 0 0 1 1 14 67 1 0 2 1 0 0 1 0 0 1 1 14 68 1 0 2 2 0 0 1 0 0 1 1 14 69 1 1 0 0 0 0 2 0 0 2 1 14 71 1 1 0 1 0 1 0 0 1 3 1 15 74 1 1 0 2 0 0 1 0 0 1 1 15 75 1 1 1 0 0 1 0 0 1 3 1 16 78 1 1 1 1 1 0 0 1 4 6 2 20 84 1 1 1 2 0 1 0 0 1 3 2 21 87 1 1 2 0 0 0 1 0 0 1 2 21 88 1 1 2 1 0 1 0 0 1 3 2 22 91 1 1 2 2 0 0 2 0 0 2 2 22 93 1 2 0 0 0 0 1 0 0 1 2 22 94 1 2 0 1 0 0 1 0 0 1 2 22 95 1 2 0 2 0 0 1 0 0 1 2 22 96 1 2 1 0 0 0 1 0 0 1 2 22 97 1 2 1 1 0 1 0 0 1 3 2 23 100 1 2 1 2 0 0 2 0 0 2 2 23 102 1 2 2 0 0 0 1 0 0 1 2 23 103 1 2 2 1 0 0 2 0 0 2 2 23 105 1 2 2 2 0 1 0 0 1 3 2 24 108 2 0 0 0 0 1 0 0 1 3 2 25 111 2 0 0 1 0 0 1 0 0 1 2 25 112 2 0 0 2 0 0 2 0 0 2 2 25 114 2 0 1 0 0 0 1 0 0 1 2 25 115 2 0 1 1 0 0 1 0 0 1 2 25 116 2 0 1 2 0 0 1 0 0 1 2 25 117 2 0 2 0 0 0 2 0 0 2 2 25 119 2 0 2 1 0 0 1 0 0 1 2 25 120 2 0 2 2 0 1 0 0 1 3 2 26 123 2 1 0 0 0 0 1 0 0 1 2 26 124 2 1 0 1 0 0 1 0 0 1 2 26 125 2 1 0 2 0 0 1 0 0 1 2 26 126 2 1 1 0 0 0 1 0 0 1 2 26 127 2 1 1 1 0 1 0 0 1 3 2 27 130 2 1 1 2 
0 0 2 0 0 2 2 27 132 2 1 2 0 0 0 1 0 0 1 2 27 133 2 1 2 1 0 0 2 0 0 2 2 27 135 2 1 2 2 0 1 0 0 1 3 2 28 138 2 2 0 0 0 0 2 0 0 2 2 28 140 2 2 0 1 0 0 1 0 0 1 2 28 141 2 2 0 2 0 1 0 0 1 3 2 29 144 2 2 1 0 0 0 1 0 0 1 2 29 145 2 2 1 1 0 0 2 0 0 2 2 29 147 2 2 1 2 0 1 0 0 1 3 2 30 150 2 2 2 0 0 1 0 0 1 3 2 31 153 2 2 2 1 0 1 0 0 1 3 2 32 156 2 2 2 2 1 0 0 1 4 6 3 36 162 For all 81 possible trials of 4 samples from a population of 3 symbols: XctTot: 3 24 72 XctEr: 3 24 72 AugTot: 3 36 162 AugEar: 3 36 162

Table 4.3 Summary of All Possible Trials for N=10 s=3 and N=100 s=2

    For all 1000 possible trials of 3 samples from a population of 10 symbols:
        XctTot: 10 270    XctEr: 10 270    AugTot: 10 300    AugEar: 10 300

    For all 10000 possible trials of 2 samples from a population of 100 symbols:
        XctTot: 100    XctEr: 100    AugTot: 100    AugEar: 100
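The augmented totals in these appendix tables can be regenerated by exhaustive enumeration. This sketch follows the tables' column order, highest repetition level first:

```python
from collections import Counter
from itertools import product
from math import comb

def aug_totals(N, s):
    # Sum augmented r-reps over all N**s possible trials, for r = s down to 2.
    totals = Counter()
    for trial in product(range(N), repeat=s):
        for k in Counter(trial).values():
            # A symbol seen k times contributes C(k,r) augmented r-reps.
            for r in range(2, k + 1):
                totals[r] += comb(k, r)
    return [totals[r] for r in range(s, 1, -1)]

print(aug_totals(2, 6))   # [2, 24, 120, 320, 480], as in Table 4.1
print(aug_totals(3, 4))   # [3, 36, 162], as in Table 4.2
print(aug_totals(10, 3))  # [10, 300], as in Table 4.3
```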

Terry Ritter is a registered Professional Engineer who has spent the last five years conducting independent research and development on cryptographic systems. Mr. Ritter is a member of the IEEE and ACM, and holds the US patent on Dynamic Substitution technology; he has previously contributed articles on reversible nonlinear cryptographic combiners and a survey of cryptographic random number generators. Ritter Software Engineering offers fast cipher engines to hardware and software developers.