distributed.net
Checking all 2^64 keys would be much too big a job for any one computer to attempt, no
matter how fast or powerful. 2^64 is two multiplied by itself 64 times, which works
out at 18,446,744,073,709,551,616 or over eighteen quintillion. For comparison,
this is roughly the age of the universe - in minutes.
distributed.net's
aim was to complete this mammoth task by combining the power of many tens of
thousands of smaller computers, with each one working on a separate, more manageable
chunk of the overall task at the same time. A small client program was installed on
each computer taking part in the effort and the internet was then used to coordinate
and distribute work and results. Together, these computers formed a distributed
network of machines cooperating to achieve the same goal.
On 14th July 2002 the correct decryption key was finally found by a "relatively
characterless PIII-450 in Tokyo". The key was 0x63DE7DC154F4D039 and decoded the
secret message "The unknown message is: some things are better left unread".
Click here to read the full press release of the discovery (3Kb)
.
The RC5 Algorithm
The RC5 algorithm was designed by Ron Rivest and is
©RSA.
It is a block cipher with variable key and word lengths and a variable number of
encryption rounds. This particular competition uses a variant of RC5 with a 32 bit
word length, 12 encryption rounds and an 8 byte (64 bit) key, denoted as "RC5-32/12/8".
describes the algorithm in full.
However, the distributed.net client software does not require a full implementation
of this standard but merely a function that will encrypt a single block of data (64
bits in this case) using a given key and compare the encrypted output with a known
value. The return code of the core function then indicates if the key was the correct
one or not.
Applying the definition and restrictions of the RC5-64 contest to the reference
implementation given in the RFC, the core function may be described as shown in
Table 1 below.
 |
 |
Table 1: Base RC5-64 core algorithm |
 |
 |
u32 A, B;
u32 i, j, k;
u32 P = 0xb7e15163;
u32 Q = 0x9e3779b9;
u32 S[26];
// initialise the expanded key table S[]
S[0] = P;
for (i = 1; i < 26 ;i++) {
S[i] = S[i-1] + Q;
}
// mix in the secret key
A = 0; B = 0;
i = 0; j = 0;
for (k = 0; k < 26*3; k++) {
A = ROTL(S[i] + A + B, 3);
S[i] = A;
B = ROTL(L[j] + A + B, A + B);
L[j] = B;
i = (i + 1) % 26;
j = (j + 1) % 2;
}
// perform the encryption
A = P[0] + S[0];
B = P[1] + S[1];
for (i = 1; i <= 12; i++) {
A = A ^ B;
A = ROTL(A, B) + S[2*i];
B = B ^ A;
B = ROTL(B, A) + S[(2*i)+1];
}
// check the result
return ((A == C[0] && B == C[1]) ? 1 : 0); |
|
Notes:
- u32 defines an unsigned 32 bit integer type
- P[], L[] and C[] are arrays of u32's with two elements
- P[] is the plaintext to encrypt
- L[] is the encryption key to encrypt P[] with
- C[] is the ciphertext which the correct key will encrypt P[] to
- ROTL(v,s) performs a 32 bit left circular shift of v by s places
|
|
 |
 |
 |
 |
Optimisations
The reference RC5 core function described above will work, but its performance
can be drastically improved by making a few simple observations and applying the
optimisations these imply. Examining the reference algorithm, we can make the
following changes to improve performance:
- The key expansion loop can be expanded into 3 separate loops, removing the expensive modulo-26 operations
- L[] can be held as two u32 variables L0 and L1, removing the modulo-2 operations
- Observing that the expanded key table initialisation sets each S[n] = P+Q*n, the initialisation loop can be combined with the first key expansion loop
- The final key expansion loop and the encryption loop can be combined
Applying these optimisations results in an algorithm as shown in Table 2 below.
There are also further (smaller) algorithmic optimisations to be made as well
as any machine or CPU specific gains that may be possible, but these are left
as an exercise for the reader! Hints: (i) B is not required; (ii) the loops can
be unrolled; (iii) what does it mean if eA does not match C[0]?; (iv) the last
element of S[] won't be needed for the last stage most of the time.
 |
 |
Table 2: Partially optimised RC5-64 core algorithm |
 |
 |
u32 A, B, eA, eB;
u32 i, j;
u32 P = 0xb7e15163;
u32 Q = 0x9e3779b9;
u32 S[26];
// combined S[] initialisation and first key expansion loops
A = 0; B = 0;
j = P;
for (i = 0; i < 26; ) {
A = ROTL(j + A + B, 3);
S[i++] = A;
B = ROTL(L0 + A + B, A + B);
L0 = B;
j += Q;
A = ROTL(j + A + B, 3);
S[i++] = A;
B = ROTL(L1 + A + B, A + B);
L1 = B;
j += Q;
}
// second key expansion loop
for (i = 0; i < 26; ) {
A = ROTL(S[i] + A + B, 3);
S[i++] = A;
B = ROTL(L0 + A + B, A + B);
L0 = B;
A = ROTL(S[i] + A + B, 3);
S[i++] = A;
B = ROTL(L1 + A + B, A + B);
L1 = B;
}
// combined 3rd key expansion and encryption loops
eA = P[0] + S[0];
eB = P[1] + S[1];
for (i = 2; i < 26; ) {
A = ROTL(S[i++] + A + B, 3);
B = ROTL(L0 + A + B, A + B);
L0 = B;
eA = eA ^ eB;
eA = ROTL(eA, eB) + A;
A = ROTL(S[i++] + A + B, 3);
B = ROTL(L1 + A + B, A + B);
L1 = B;
eB = eB ^ eA;
eB = ROTL(eB, eA) + A;
}
// check the result
return ((eA == C[0] && eB == C[1]) ? 1 : 0); |
|
 |
 |
 |
 |
680x0 Cores
Taking the general algorithmic optimisations described in the previous section and
combining them with CPU specific optimisations, I developed several RC5-64 core
functions for the
Motorola MC68000 series of processors.
These CPUs are found in
all "Classic" Amiga computers as well as
several types of older Apple Macintosh and UNIX workstation.
Benchmark Results
Using the test harness I have measured the performance of the various 680x0 RC5-64
cores I have developed on a number of different configurations of Amiga computer.
The results obtained are shown in Table 3 below.
 |
 |
Table 3: 680x0 core benchmark results |
 |
 |
| |
reference |
original |
new |
| keyrate |
efficiency |
keyrate |
efficiency |
keyrate |
efficiency |
gain |
| cpu
| 7.14Mhz 68000 |
241 | 34 |
560 | 78 |
579 | 81 |
+3.4% |
| 7.14Mhz 68010 |
257 | 36 |
580 | 81 |
608 | 85 |
+4.8% |
| 14Mhz 68020 NF |
658 | 47 |
2,551 | 182 |
4,406 | 314 |
+72.7% |
| 14Mhz 68020 |
1,209 | 86 |
4,107 | 293 |
5,056 | 361 |
+23.1% |
| 50Mhz 68030 NF |
796 | 16 |
3,134 | 63 |
9,869 | 197 |
+214.9% |
| 50Mhz 68030 |
4,490 | 90 |
14,847 | 297 |
19,195 | 384 |
+29.3% |
| 25Mhz 68040 |
4,773 | 191 |
21,476 | 859 |
22,664 | 906 |
+5.5% |
| 28Mhz 68040 |
8,320 | 297 |
24,545 | 876 |
25,902 | 925 |
+5.5% |
| 40Mhz 68040 |
8,320 | 208 |
34,362 | 859 |
36,263 | 907 |
+5.5% |
| 50Mhz 68060 |
26,481 | 530 |
118,463 | 2,369 |
151,484 | 3,030 |
+27.9% |
| 66Mhz 68060 |
35,326 | 530 |
157,945 | 2,369 |
201,970 | 3,030 |
+27.9% |
Notes:
- the "reference" core is an optimised C implementation of the RC5-64 core compiled
(with maximum speed optimisations) using v6.59 of the SAS/C compiler package for the Amiga
- the "original" core is the the RC5-64 core implementation found in v2.7100.413 of the the
distributed.net Amiga client program
- the "new" core is one of the 680x0 RC5-64 core implementations developed by myself
- the "new" core tested is the one that is best optimised for the CPU being used
- "keyrate" is the number of RC5-64 keys that can be checked each second by the
given core on the given CPU
- "efficiency" is a measure of how efficient the core/CPU combination is at
checking RC5-64 keys in terms of keys checked per second per CPU Mhz
- "gain" is the percentage difference in keyrates between the core and
the "original" core. This is the important figure!
- keyrates are calculated by measuring the length of time it takes to check 100,000 keys
with the benchmark program getting a guaranteed 100% of CPU time for the duration of the test
- the overhead imposed by the benchmark harness is negligible
- a cpu marked "NF" is using No Fastram, i.e.: only the very slow
"chip" memory of the Amiga is being used
|
 |
 |
 |
 |
Benchmark Discussion
The majority of the keyrate gain demonstrated by the new
68020 and 68030 cores comes from reducing the size and rearranging the structure of the core code
to make more efficient use of the tiny 256 byte level 1 cache found on these
processors. This is clearly demonstrated by the results for the 68020 and 68030
machines using only the very slow "chip" memory of the Amiga (i.e.:
basic machines with no expansion memory added) as the keyrate gain with the
new core is much larger for these machines than for the otherwise identical machines
using faster memory.
The
68000,
68010,
68040 and 68060 machines use practically the same new core, yet the
68060 shows over 20% more improvement over the original core than the others. Why?
Instructions in the core are arranged to make maxiumum use of the 68060's superscalar
execution features (two largely independent integer execution pipelines), but the other
chips do not have this ability so therefore do not benefit from the optimisations. The
gain they do experience is due to other optimisations in the core.
At just over 3,000K/s/Mhz the 68060 was one of the most efficient CPUs at
RC5-64 key checking; for comparison, an Intel Pentium manages less than half
this efficiency! The 68060 compares favourably to the more modern PowerPC,
Athlon and Pentium III architectures in terms of efficiency, but of course
these processors run at much higher clock rates than the 68060 and so their
actual attained keyrate was proportionally much greater.
These new core implementations were integrated into the Amiga version of the RC5-64
client program as of v2.7100.418 BETA-2. Real life core performance in the client was
not far off the rates under ideal conditions given in the table above.
Fun Statistics
- 15,268,315,356,922,380,288 keys tested
- 1,726 days elapsed
- 102,385,059,633 keys tested per second
- equivalent to 864,278 50Mhz 68060 machines using the original core
- ...or 675,880 using the new core
- to check these keys one 50Mhz 68060 machine would take 4,103,009 years
- ...or only 3,206,325 years using the new core
- 2,044,528,782,668,201,984 (13.391%) keys tested by all Amiga teams
- 2,044,429,986,072,363,008 (13.390%) keys tested by the main Amiga team
- 11,574,649,368,346,624 (0.076%) keys tested by AmigaOS machines
- 3,738,792,385,052,672 (0.024%) keys tested by 680x0 processors
- 811,382,672,982,016 (0.005%) keys tested by my machines (0.040% of team total)
Related Links
Related Downloads
|