Asynchronous ARM, by Steve Furber

   The async ARM is probably the only ARM-related activity which isn't covered
   by an NDA! This is pure university research, with no plans for commercial
   exploitation at present. ARM Ltd is very supportive and interested, but
   as you would expect they are waiting to see what the technology does before
   building any business plans around it.

   There was a good article on our work in the January '93 issue of Byte,
   and I will be presenting a paper on the architecture of the design at
   VLSI '93. We haven't got much else onto paper yet, but material is
   beginning to come together. We will generate some full reports when we
   have seen silicon (in a couple of months). Below I append a summary
   submission I made to 'Hot Chips' at Stanford, which was accepted for a
   presentation this summer:

        AMULET1 - An Asynchronous ARM Processor
        =======================================

A fully asynchronous implementation of the ARM microprocessor has
been developed using Sutherland's "Micropipeline" approach. The
design incorporates a number of concurrent units which cooperate
to give instruction-level compatibility with the existing synchronous
part. These include an Address unit, which autonomously generates
instruction fetch requests and interleaves (non-deterministically)
data requests from the Execution unit; a Register file which sources
operands, queues write destinations and handles data dependencies;
an Execution unit which includes a multiplier, a shifter and an
ALU with data-dependent delay; a Data interface which performs byte
extraction and alignment and includes an instruction prefetch buffer;
and a control path which performs instruction decode. These units
all operate independently, only synchronizing at mutual interfaces
to exchange data.
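
As an illustration of the micropipeline control style mentioned above,
the following Python sketch models a chain of Muller C-elements with
two-phase (transition) signalling, in the manner of Sutherland's generic
micropipeline FIFO. It is not a description of the AMULET1 circuits: the
function names, the three-stage depth and the settle-until-stable
evaluation loop are all assumptions made for the example.

```python
def c_element(a, b, prev):
    """Muller C-element: output follows the inputs when they agree, else holds."""
    return a if a == b else prev


def settle(req_in, ack_out, state):
    """Re-evaluate the control chain until no C-element output changes.

    state[i] is the output of stage i's C-element. It acts as the request
    to stage i+1 and as the acknowledge back to stage i-1.
    """
    state = list(state)
    changed = True
    while changed:
        changed = False
        for i in range(len(state)):
            left = req_in if i == 0 else state[i - 1]
            right = ack_out if i == len(state) - 1 else state[i + 1]
            new = c_element(left, not right, state[i])
            if new != state[i]:
                state[i] = new
                changed = True
    return state


def occupancy(state, ack_out):
    """A stage holds data when its request differs from the acknowledge it sees."""
    rights = state[1:] + [ack_out]
    return sum(1 for s, r in zip(state, rights) if s != r)


# Each toggle of req_in is one event (one data item offered to the pipeline).
s = [False, False, False]
s = settle(True, False, s)    # first item runs through to the last stage
s = settle(False, False, s)   # second item queues behind it
s = settle(True, False, s)    # third item fills the pipeline
full = settle(False, False, s)      # fourth item is not accepted yet
drained = settle(False, True, full) # consumer acknowledges; everything shifts
```

With every stage full, the fourth request simply waits: no clock is
involved, and the item is accepted as soon as the consumer's acknowledge
frees a stage.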

The design demonstrates that all the usual problems of processor
design can be solved in this asynchronous framework: backwards
instruction set compatibility, interrupts and exact exceptions for
memory faults are all covered. It also demonstrates some unusual
behaviour, for instance non-deterministic prefetch depth beyond
a branch instruction (though the instructions which actually get
executed are, of course, deterministic). There are some unusual
problems for compiler optimization, as the metric which must be
used to compare alternative code sequences is continuous rather
than discrete, and the non-determinism in external behaviour must
also be taken into account.
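
To make the point about a continuous metric concrete, here is a minimal
Python sketch of a data-dependent addition delay. The abstract does not
say how the ALU delay depends on its operands, so the ripple-carry model
below is an assumption, and the constants BASE_NS and PER_BIT_NS are
invented purely for illustration.

```python
BASE_NS = 5.0      # hypothetical fixed cost of an addition
PER_BIT_NS = 0.5   # hypothetical cost per bit position a carry ripples through


def longest_carry_chain(a, b, width=32):
    """Longest run of consecutive bit positions that produce a carry-out
    belonging to one rippling carry, when adding a and b with no carry-in."""
    longest = run = 0
    carry = 0
    for i in range(width):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        if ai & bi:                # carry generated here: a new chain starts
            carry, run = 1, 1
        elif (ai ^ bi) and carry:  # incoming carry propagates one bit further
            run += 1
        else:                      # carry absorbed, or none present
            carry, run = 0, 0
        longest = max(longest, run)
    return longest


def estimated_add_delay_ns(a, b):
    """Assumed cost model: delay grows linearly with the longest carry chain."""
    return BASE_NS + PER_BIT_NS * longest_carry_chain(a, b)


short = estimated_add_delay_ns(0x0000000F, 0x00000001)  # 4-bit chain
worst = estimated_add_delay_ns(0xFFFFFFFF, 0x00000001)  # 32-bit chain
```

Under a model of this kind two code sequences with the same instruction
count can differ in run time by any amount, which is why the comparison
metric is a real-valued time rather than a cycle count.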

The chip (which is presently in fabrication) was designed using a
mixture of custom datapath and compiled control logic elements, as
was the synchronous ARM. The fabrication technology is the same as
that used for one version of the synchronous part, reducing the
number of variables when comparing the two parts.

The macrocell size (without pad ring) is 5.5mm by 4.5mm on a 1 micron
CMOS process, which is about twice the area of the synchronous part.
Some of the increase can be attributed to the more sophisticated
organization of the new part: it has a deeper pipeline than the
clocked version, and it supports multiple outstanding memory requests.
There is undoubtedly some overhead attributable to the asynchronous
control logic, but we estimate this to be closer to 20% than to the
100% suggested by the direct comparison.
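
The arithmetic behind these area figures can be sketched in a few lines
of Python. Two assumptions are made here that the text does not state:
"about twice" is treated as exactly 2x, and both the 20% and the 100%
are read as fractions of the synchronous part's area.

```python
def area_breakdown(width_mm, height_mm, ratio=2.0, control_fraction=0.20):
    """Split the asynchronous macrocell area into rough contributions.

    ratio and control_fraction are the assumed readings of "about twice"
    and "closer to 20%"; neither is a measured figure.
    """
    async_area = width_mm * height_mm
    sync_area = async_area / ratio
    control = control_fraction * sync_area
    organisation = async_area - sync_area - control
    return {
        "async_mm2": async_area,
        "sync_mm2": sync_area,
        "async_control_mm2": control,
        "organisation_mm2": organisation,
    }


breakdown = area_breakdown(5.5, 4.5)
for name, value in breakdown.items():
    print(f"{name}: {value:.2f}")
```

On this reading most of the extra area goes on the deeper pipeline and
the support for multiple outstanding memory requests, not on the
asynchronous control itself.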

The performance of the chip has been simulated at around 20K Dhrystones,
which is comparable to the synchronous part. This is based on compiler
output which takes no note of data dependencies between instructions
(the performance of the synchronous part is unaffected by instruction
order), so we expect to be able to improve on this considerably by
code re-ordering. The first design is very conservative in its timing,
as there is no equivalent to backing-off on the clock frequency if the
samples don't meet the design speed, so again we see considerable room
for improvement through reducing the engineering margins.
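
The kind of code re-ordering meant here can be illustrated with a small
Python sketch. The tuple encoding of instructions and the idea of simply
counting back-to-back dependencies are assumptions made for the example;
they are not a model of how the AMULET1 register file actually resolves
dependencies.

```python
def adjacent_dependencies(seq):
    """Count instructions that read a register written by the instruction
    immediately before them. Each instruction is (opcode, dest, sources)."""
    count = 0
    for prev, cur in zip(seq, seq[1:]):
        if prev[1] in cur[2]:
            count += 1
    return count


original = [
    ("ADD", "r0", ("r1", "r2")),
    ("SUB", "r3", ("r0", "r4")),  # reads r0 straight after it is written
    ("ADD", "r5", ("r6", "r7")),
    ("SUB", "r8", ("r5", "r9")),  # reads r5 straight after it is written
]

# The two ADD/SUB pairs are independent of each other, so interleaving
# them preserves the results while separating each producer from its consumer.
reordered = [original[0], original[2], original[1], original[3]]

before = adjacent_dependencies(original)
after = adjacent_dependencies(reordered)
```

A compiler aware of the asynchronous pipeline could prefer the second
ordering, whereas for the clocked part the two are equivalent.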

Tests on the first silicon should enable us to refine the above results
before the Symposium takes place. The work has taken place as part of
a broad ESPRIT-funded investigation into low-power technologies within
the European Open Microprocessor systems Initiative (OMI) programme,
where there is interest in low-power techniques both for portable
equipment and (in the longer term) to alleviate the problems of the
increasingly high dissipation of high-performance chips. This initial
investigation into the role asynchronous logic might play in the quest
for lower power has now demonstrated through simulation (and shortly
through silicon) that asynchronous techniques can be applied to problems
of the scale of a complete microprocessor.


I hope this gives you some of what you want.

---Steve

--------------------------------------------------------------------
S B Furber                              tel:   (+44) 61 275 6129
ICL Professor of Computer Engineering   fax:   (+44) 60 275 6202
The University                          email: sfurber@cs.man.ac.uk
Oxford Road
Manchester M13 9PL
UK
--------------------------------------------------------------------