Asynchronous ARM, by Steve Furber

The async ARM is probably the only ARM-related activity which isn't covered by an NDA! This is pure university research, with no plans for commercial exploitation at present. ARM Ltd is very supportive and interested, but as you would expect they are waiting to see what the technology does before building any business plans around it.

There was a good article in January '93 Byte on our work, and I will be presenting a paper at VLSI '93 on the architecture of the design. We haven't got much else onto paper yet, but material is beginning to come together. We will generate some full reports when we have seen silicon (in a couple of months).

Below I append a summary submission I made to 'Hot Chips' in Stanford, which was accepted for a presentation this summer:

AMULET1 - An Asynchronous ARM Processor
=======================================

A fully asynchronous implementation of the ARM microprocessor has been developed using Sutherland's "Micropipeline" approach. The design incorporates a number of concurrent units which cooperate to give instruction-level compatibility with the existing synchronous part. These include an Address unit, which autonomously generates instruction fetch requests and interleaves (non-deterministically) data requests from the Execution unit; a Register file which sources operands, queues write destinations and handles data dependencies; an Execution unit which includes a multiplier, a shifter and an ALU with data-dependent delay; a Data interface which performs byte extraction and alignment and includes an instruction prefetch buffer; and a control path which performs instruction decode. These units all operate independently, only synchronizing at mutual interfaces to exchange data.

The design demonstrates that all the usual problems of processor design can be solved in this asynchronous framework: backwards instruction set compatibility, interrupts and exact exceptions for memory faults are all covered.
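
As an illustration of the "Micropipeline" style of handshaking mentioned above, here is a minimal sketch in Python of the control chain of a micropipeline FIFO: one Muller C-element per stage, with two-phase (transition) signalling on the request and acknowledge wires. It models the control handshake only, not the data latches, and it is not taken from the AMULET1 design; the names `c_element`, `MicropipelineControl`, `push` and `pop` are hypothetical and chosen purely for this example.

```python
def c_element(a: int, b: int, prev: int) -> int:
    """Muller C-element: output follows the inputs when they agree, else holds."""
    return a if a == b else prev


class MicropipelineControl:
    """Control chain of a micropipeline FIFO (illustrative model, control only)."""

    def __init__(self, stages: int) -> None:
        self.c = [0] * stages  # C-element outputs, one per stage
        self.req_in = 0        # request wire driven by the producer
        self.ack_out = 0       # acknowledge wire driven by the consumer

    def _settle(self) -> None:
        # Re-evaluate every C-element until no output changes.
        last = len(self.c) - 1
        changed = True
        while changed:
            changed = False
            for i in range(len(self.c)):
                req = self.req_in if i == 0 else self.c[i - 1]
                ack = self.ack_out if i == last else self.c[i + 1]
                # Each stage sees its predecessor's request and the
                # inverted acknowledge from its successor.
                new = c_element(req, 1 - ack, self.c[i])
                if new != self.c[i]:
                    self.c[i] = new
                    changed = True

    def occupancy(self) -> int:
        # A stage holds an item when its output differs from its successor's.
        successors = self.c[1:] + [self.ack_out]
        return sum(1 for cur, nxt in zip(self.c, successors) if cur != nxt)

    def push(self) -> bool:
        """Producer signals one request event; True if the first stage took it."""
        if self.c[0] != self.req_in:
            return False  # an earlier request is still waiting: FIFO is full
        self.req_in ^= 1
        self._settle()
        return self.c[0] == self.req_in

    def pop(self) -> bool:
        """Consumer signals one acknowledge event; False if nothing to take."""
        if self.c[-1] == self.ack_out:
            return False  # last stage is empty
        self.ack_out ^= 1
        self._settle()
        return True


if __name__ == "__main__":
    fifo = MicropipelineControl(stages=3)
    for _ in range(4):
        accepted = fifo.push()
        print("push accepted:", accepted, "occupancy:", fifo.occupancy())
    fifo.pop()
    print("after pop, occupancy:", fifo.occupancy())
```

In this model a request that arrives when the chain is full is not lost: it stays pending on the input wire and is taken up as soon as the consumer acknowledges an item. No clock is involved; each stage moves only when its neighbours allow it, which is the sense in which the units synchronize only at their interfaces.
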
It also demonstrates some unusual behaviour, for instance non-deterministic prefetch depth beyond a branch instruction (though the instructions which actually get executed are, of course, deterministic). There are some unusual problems for compiler optimization, as the metric which must be used to compare alternative code sequences is continuous rather than discrete, and the non-determinism in external behaviour must also be taken into account.

The chip (which is presently in fabrication) was designed using a mixture of custom datapath and compiled control logic elements, as was the synchronous ARM. The fabrication technology is the same as that used for one version of the synchronous part, reducing the number of variables when comparing the two parts. The macrocell size (without pad ring) is 5.5mm by 4.5mm on a 1 micron CMOS process, which is about twice the area of the synchronous part. Some of the increase can be attributed to the more sophisticated organization of the new part: it has a deeper pipeline than the clocked version, and it supports multiple outstanding memory requests. There is undoubtedly some overhead attributable to the asynchronous control logic, but we estimate this to be closer to 20% than to the 100% suggested by the direct comparison.

The performance of the chip has been simulated at around 20K dhrystones, which is comparable to the synchronous part. This is based on compiler output which takes no note of data dependencies between instructions (the performance of the synchronous part is unaffected by instruction order), so we expect to be able to improve on this considerably by code re-ordering. The first design is very conservative in its timing, as there is no equivalent to backing-off on the clock frequency if the samples don't meet the design speed, so again we see considerable room for improvement through reducing the engineering margins.
Tests on the first silicon should enable us to refine the above results before the Symposium takes place.

The work has taken place as part of a broad ESPRIT funded investigation into low-power technologies within the European Open Microprocessor systems Initiative (OMI) programme, where there is interest in low-power techniques both for portable equipment and (in the longer term) to alleviate the problems of the increasingly high dissipation of high-performance chips. This initial investigation into the role asynchronous logic might play in the quest for lower power has now demonstrated through simulation (and shortly through silicon) that asynchronous techniques can be applied to problems of the scale of a complete microprocessor.

I hope this gives you some of what you want.

---Steve

--------------------------------------------------------------------
S B Furber                              tel:   (+44) 61 275 6129
ICL Professor of Computer Engineering   fax:   (+44) 60 275 6202
The University                          email: sfurber@cs.man.ac.uk
Oxford Road
Manchester M13 9PL UK
--------------------------------------------------------------------