From one Arm to the next!

ARM Processors and Architectures

Currently available processors

Arm 1
Arm 2
Arm 3
Arm 4 & 5
Arm 6
Arm 7
Arm 8
StrongARM
Arm 9
Arm 10

ARM Architecture

v1 - ARM1.
v2 - ARM2.
v2as - ARM3 & ARM250.
v3 - ARM6, ARM7 & Amulet 1.
v3M - various ARM6, 7 & 8 variants.
v4 - StrongARM, ARM8 & ARM9.
v5 - ARM10.
VFP1 - ARM10 (some variants of).
Thumb (T variants).
Long multiply instructions (M variants).
Enhanced DSP instructions (E variants).

Currently available processors

ARM1 This was the very first ARM processor and was very close in capabilities to the ARM2. This processor was used in a few evaluation systems for BBC and PC machines, but it was primarily a prototype chip and was superseded by the ARM2. The major differences between the ARM1 and the ARM2 are that the ARM1 has:

No banked R8 and R9 in FIQ mode.
No multiply instruction.
LDR/STR instructions with register-specified shift amounts.
No co-processor interface or co-processor instructions.

ARM2 The ARM2 chip features 27 registers of which 16 are accessible at any one time. Four processor modes are available -

USR : user mode
IRQ : interrupt mode (with private copies of R13 and R14.)
FIQ : fast interrupt mode (with private copies of R8 to R14.)
SVC : supervisor mode (with private copies of R13 and R14.)

Only non USR mode code may change the processor mode, which provides hardware security as long as the hardware and physical memory are only accessible from privileged code. Because the top six bits of the program counter hold the processor status flags, this chip was restricted to 26 bits of addressing, or a 64 Megabyte address space. In actuality there are eight bits of processor status held in the PC register. Because an ARM instruction is always four bytes long, the bottom two bits of the PC were always an implied zero when the register was being used as a PC. When the register is used for other operations, the bottom two bits reflect the mode the processor is operating in (00 - USR, 01 - IRQ, 10 - FIQ & 11 - SVC).

A three stage instruction pipeline allows the chip to execute instructions quickly with a fairly low transistor count. One side effect of the pipeline is the ability to get a 'free' rotation/shift on every instruction, as one stage of the pipeline deals exclusively with a barrel shift of a given register. Combined with the conditional execution of every instruction, long runs of code without branches (which stall the pipeline) can be achieved, allowing a fairly high instruction execution speed for the clock rate (about 0.6 instructions per clock cycle on average).

The ARM2 chip was clocked at 8 MHz giving an average performance of 4-4.7 MIPS.
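The relationship between clock rate, instructions per cycle and MIPS used throughout this page can be written down directly. The sketch below is only an illustration of that arithmetic; the function name is made up for this example, and the 0.6 figure is the rounded average quoted above.

```python
def estimate_mips(clock_mhz: float, instructions_per_cycle: float) -> float:
    """Millions of instructions per second for a clock rate and an average IPC."""
    return clock_mhz * instructions_per_cycle


# Figures quoted in the text: an 8 MHz ARM2 at roughly 0.6 instructions per cycle.
arm2_estimate = estimate_mips(8.0, 0.6)
print(f"ARM2 estimate: {arm2_estimate:.1f} MIPS")
```

The result lands just above the quoted 4-4.7 MIPS range, which is what you would expect if the real average is a little under the rounded 0.6 instructions per cycle.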

ARM3 This is an ARM2 core macrocell with a cache and dedicated coprocessor interface added. The register set was unchanged and no new processor modes were added. What was new in the ARM3 chip was an on chip cache (4 Kbyte, 64 way set associative, random replacement, 4 word lines, write through, mixed data and instructions) and much faster clock speeds. Also new were adjustments to the co-processor interface, including defining co-processor fifteen to be cache control and chip identification.
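The cache parameters quoted above fix how many lines and sets the cache has. The sketch below works that out; the way an address is split into tag, set and offset is the usual textbook scheme and is an assumption here, since the text does not say how the ARM3 actually selects a set. The function names are invented for this example.

```python
def cache_geometry(size_bytes: int, ways: int, words_per_line: int,
                   bytes_per_word: int = 4) -> dict:
    """Derive line size, line count and set count from the headline figures."""
    line_bytes = words_per_line * bytes_per_word
    lines = size_bytes // line_bytes
    sets = lines // ways
    return {"line_bytes": line_bytes, "lines": lines, "sets": sets}


def split_address(address: int, line_bytes: int, sets: int) -> tuple:
    """Textbook split of an address into (tag, set index, byte offset).

    Assumption: the set is chosen by the address bits just above the
    line offset. The real chip may do this differently.
    """
    offset = address % line_bytes
    set_index = (address // line_bytes) % sets
    tag = address // (line_bytes * sets)
    return tag, set_index, offset


# ARM3 figures from the text: 4 Kbyte, 64 way, 4 word lines.
arm3 = cache_geometry(4096, 64, 4)
print(arm3)  # 16 byte lines, 256 lines, 4 sets
print(split_address(0x1234, arm3["line_bytes"], arm3["sets"]))
```

With only four sets of 64 lines each, almost all of the address ends up in the tag, which is why such a highly associative cache behaves much like a fully associative one.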

Finally one new instruction was added: SWP, an atomic register to memory swap useful for multi-processor arrays.
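A swap that cannot be interrupted between its read and its write is enough to build a simple lock, which is why it matters for multi-processor use. The following is a toy model in Python, not ARM code: the Memory class, the lock address and the helper names are all hypothetical, and a threading.Lock stands in for the indivisible bus cycle.

```python
import threading


class Memory:
    """Toy word memory whose swp() reads and writes a word as one indivisible step."""

    def __init__(self):
        self._words = {}
        self._bus = threading.Lock()  # stands in for the locked read-write cycle

    def swp(self, address: int, new_value: int) -> int:
        """Store new_value at address and return the value that was there before."""
        with self._bus:
            old_value = self._words.get(address, 0)
            self._words[address] = new_value
            return old_value

    def store(self, address: int, value: int) -> None:
        with self._bus:
            self._words[address] = value


LOCK_ADDRESS = 0x1000  # hypothetical location of the lock word


def try_acquire(memory: Memory) -> bool:
    """Swap 1 into the lock word; we own the lock only if it previously held 0."""
    return memory.swp(LOCK_ADDRESS, 1) == 0


def release(memory: Memory) -> None:
    memory.store(LOCK_ADDRESS, 0)


memory = Memory()
print(try_acquire(memory))  # True: the lock was free
print(try_acquire(memory))  # False: already held
release(memory)
```

Because the old value comes back from the same operation that writes the new one, two processors can never both see the lock as free.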

Several speeds of ARM3 chips were produced. Initially 26 MHz varieties were released with the A540 machines, then 25 MHz versions were used in the A5000 and 24 MHz ones in the A4. Finally a 33MHz version was produced and used in the alpha variant of the A5000.

A second incarnation of the chip was as the ARM250 which was a 12MHz variant of the ARM3 cell and had the IOC1, VIDC1a and MEMC1 chips all integrated into the one chip but unlike the normal ARM3 it had no processor cache. The ARM250 delivered about 7 MIPS performance.

A 24 MHz ARM3 using a 12MHz main memory will produce an average speed of execution of 13.26 MIPS. At 33 MHz 17.96 MIPS is delivered.

ARM4 & ARM5 These were never made. In the changeover from Acorn to ARM Ltd designing the processors, the numbering scheme for the chips was changed. As such the numbers 4 and 5 were skipped.

ARM6 This processor cell is the first of the commercially available ARMs to have a full 32 bit addressing capability. Additionally the processor now has 31 registers in it, along with six new processor modes :-

User32 - 32 bit USR mode.
Supervisor32 - 32 bit SVC mode. (private SPSR register)
IRQ32 - 32 bit IRQ mode. (private SPSR register)
FIQ32 - 32 bit FIQ mode. (private SPSR register)
Abort32 - Memory fetch abort mode. (private SPSR register)
Undefined32 - Undefined instruction mode. (private SPSR register)

The SPSR is a Saved Processor Status Register and holds a copy of the CPSR (Current Processor Status Register) when the new mode is entered. The addition of the Abort32 mode and this change (although the CPSR/SPSR is really a corollary of the change to 32 bits) allow the ARM6 cell to handle virtual memory easily, without the contortions you had to go through on earlier ARM chips.

Two new instructions for reading and writing the CPSR and SPSR registers were added. The program counter is now fully 32 bit, with the CPSR being hardware shifted into position when the PC is read in 26 bit modes (for backwards compatibility). The ARM6 cell is fully binary compatible, in the 26 bit modes, with code for the earlier ARM cells. The chip is fully static: the clock can be slowed to any speed and the processor will maintain state. Finally, the cell can work in either big-endian or little-endian operation and can be hardware switched between the two. The total transistor count of the ARM6 cell (not chip) is 36,000.
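For readers unfamiliar with the two byte orders, the difference is simply which end of a 32 bit word is stored at the lowest address. The snippet below is a plain Python illustration, not ARM code, and the word value is an arbitrary example.

```python
WORD = 0x11223344

# Bytes as they would sit in memory, lowest address first.
little_endian = WORD.to_bytes(4, "little")
big_endian = WORD.to_bytes(4, "big")

print(little_endian.hex())  # 44332211
print(big_endian.hex())     # 11223344
```

Data written in one byte order and read back in the other comes out with its bytes reversed, so the switch has to match what the rest of the system expects.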

Several versions of the ARM6 cell have been produced. The ARM61 is a hardwired version of the ARM6 cell in ARM2/3 compatibility mode. This chip cannot enter the 32bit address/processor modes. The ARM600 range of chips is an ARM6 cell with an inbuilt MMU, on chip cache similar to the ARM3 chip's, an eight deep write back buffer with two independent addresses and a total transistor count of 360,000. The cache has had performance tweaks, is now controlled by the MMU and has been adjusted for 32 bit addressing. Three ARM610 chip speeds have been produced. One at 20 MHz delivering 17 MIPS, one at 30 MHz delivering 26 MIPS performance and finally one at 33MHz giving around 27-28 MIPS.

Also available are the ARM60 (an ARM6 cell as a chip, without anything else), ARM650 (an ARM6 with some RAM & peripheral controllers, designed for embedded control systems), ARM6l (lower power ARM6 cell) and the ARM60l (lower power version of the ARM6 cell as a chip).

ARM7 The ARM7 cell is functionally identical to the ARM6 cell in capabilities but may be clocked faster than the ARM6. A variant of the ARM7 cell offers an improved hardware multiply, suitable for DSP work.

Most of what is new in the ARM7 cell is internal changes to the timings of various signals. The ARM700 chip has a larger on chip cache than the ARM600 (8kb, and radically altered for power efficiency), improving cache hit rates. It also has twice the number of translation lookaside entries in the MMU and twice the number of addresses in the write buffer. (Presumably four addresses can now be written to before the buffer stalls.) At 40MHz the ARM710 delivers about 36 MIPS, or around a 40% improvement over the ARM610.

ARM7 series devices are ARM7 (chip cell core.), ARM7D (the chip core with debugging support.), ARM7DM ( an ARM7D with an enhanced multiply.), ARM7DMI (an ARM7DM with ICEbreaker (tm). ICEbreaker is on chip support for In-Circuit-Emulation.), ARM70DM (ARM7DMI as a chip.), ARM700 (ARM7 + MMU + cache + Writeback Buffer.) and the ARM7500 (ARM7 + MMU + cache + Writeback Buffer + IOMD + VIDC20). Nearly all of these cores can be offered with the Thumb core as well.

ARM8 The ARM8 cell is directly compatible with the ARM6 and 7 devices. However it includes a five stage pipeline (an idea duplicated in the StrongARM device), a speculative instruction fetcher and internal tweaks to the processor to allow a higher clock speed. The cache remains the same size but becomes a writeback cache, and a 64 bit multiply instruction is added.

Fabricated on a 0.5 micron process, the chip is listed as delivering 80 MIPS performance with a 3.3 Volt device at 80 MHz. This is over twice the performance of an ARM7 chip and lives up to the initial 'roadmap' promises made about the ARM family. However its performance is eclipsed by the StrongARM devices for raw processing power.

StrongARM This is the high speed variant of the ARM chip family, having been developed by ARM Ltd in conjunction with Digital. Architecturally it is similar to the ARM8 core, sharing the five stage pipeline with that processor. A further difference from earlier chips is the change from a unified data and instruction cache to a split (Harvard architecture) instruction and data cache. Each cache is 16kb in the SA110.

In terms of the instruction set there is one new instruction added, the halfword load/store for moving 16 bit data units. Complete code compatibility with earlier processors is not guaranteed, because of two factors. Firstly, the extended pipeline means stack calls that store the Program Counter will have a value of the PC a full sixteen bytes ahead of the currently executing instruction, rather than the more normal eight bytes. Secondly, the split cache introduces problems with self modifying code: code that is first executed, then treated as data and manipulated, may be executed again before the stale copy has been flushed from the instruction cache.

Such code fragments will break. Fortunately such code tends to be fairly rare and confined to the OS (SWI handlers in particular). Produced on a 0.35 micron process the SA110 part achieves 115 MIPS at 100 MHz, 185 MIPS at 160 MHz and 230 MIPS at 200 MHz. The SA1100 part is designed for portable applications and contains an SA core, MMU, read/write buffers (probably a Level 1 cache and write buffer akin to the SA110 part), PCMCIA support, colour/greyscale LCD controller and general purpose IO controller (including two serial ports and USB support). It can be clocked at 133 or 200 MHz and consumes less than 500 mW of power.
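The self modifying code problem described above for split caches can be shown with a very small model. This is a simplified sketch, not a description of the real SA110: it treats the data side as writing straight through to memory, holds instructions as strings, and all class and method names are invented for the example.

```python
class SplitCacheMachine:
    """Toy machine with main memory and a separate instruction cache."""

    def __init__(self):
        self.memory = {}
        self.icache = {}

    def write_data(self, address: int, value: str) -> None:
        # A data-side store updates memory but never touches the instruction cache.
        self.memory[address] = value

    def fetch_instruction(self, address: int) -> str:
        # An instruction fetch fills the instruction cache on a miss and
        # afterwards keeps returning the cached copy.
        if address not in self.icache:
            self.icache[address] = self.memory[address]
        return self.icache[address]

    def flush_icache(self) -> None:
        self.icache.clear()


machine = SplitCacheMachine()
machine.write_data(0x8000, "MOV R0, #1")
first = machine.fetch_instruction(0x8000)   # executes the original instruction

machine.write_data(0x8000, "MOV R0, #2")    # code modified through the data side
stale = machine.fetch_instruction(0x8000)   # still the old instruction

machine.flush_icache()
fresh = machine.fetch_instruction(0x8000)   # now the modified instruction
print(first, "|", stale, "|", fresh)
```

The fix in real code follows the same pattern: after writing instructions as data, the stale instruction cache contents have to be discarded before the new code is run.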

ARM9 An incremental improvement over the ARM8, this chip features the same five stage pipeline but is now a Harvard architecture chip, like the StrongARM. This probably means that the same restrictions on self modifying code apply as for the StrongARM.

It is initially going to be offered as two parts, the ARM9TDMI (Thumb, Debug support, 64 bit Multiply and ICEbreaker In Circuit Emulation), which is the base core part, and the ARM940T. The ARM940T offers, above and beyond the base core, 4kb Instruction/Data caches, a write buffer (8 words, 4 independent addresses), AMBA bus interface, external co-processor support and a protection unit for embedded applications (this requires no address translation and allows eight protected areas of memory, each with its own size and level of protection). Both parts are fabricated at 0.35 microns and clock at 150 MHz (producing 165 MIPS), with the ARM9TDMI consuming 225 mW and the ARM940T 675 mW.

ARM10 Designed to be fabricated on 0.25 and 0.18 micron processes, it is meant to run at 300 MHz, giving 400 MIPS performance while consuming less than 600 mW of power. A companion development to the core is a Vector Floating Point unit (VFP10) delivering 600 MFLOPS at 300 MHz and designed to be used by the ARM10. New features in the core include branch prediction, parallel instruction execution (but curiously it is not fully superscalar; presumably the trick is that multiple executions of the same pipeline stage are now possible if the instructions are independent of each other) and some method of continuing instruction execution on cache misses (perhaps this is only for data cache misses, seeing as the new processor appears to be a Harvard architecture like the StrongARM processor).

Initially planned versions include the ARM10TDMI core and the ARM1020T processor, which is built around this core but adds an MMU with demand paged virtual memory support, a 32Kb Harvard style level 1 cache (most likely 16Kb Instruction and 16Kb Data caches, as in the StrongARM), a write buffer and an enhanced AMBA bus interface. Exact power consumption figures haven't been released, but I expect the ARM1020T will consume between 0.6 and 1 Watt of power at 300 MHz.

ARM Architectures

The ARM Architecture is built around a programmer's model of sixteen general purpose registers and a variety of processor modes. Each processor mode offers differing levels of memory access, differing abilities to manipulate the PC & mode, and its own private registers.

Version 1 - ARM1

By default the programmer 'sees' 16 User mode registers, but when in other modes various registers are swapped out with registers particular to that mode. This table summarizes the various modes and registers.

  USR            IRQ       FIQ       SVC

  R0
  R1
  R2
  R3
  R4
  R5
  R6
  R7
  R8
  R9
  R10                      R10_fiq
  R11                      R11_fiq
  R12                      R12_fiq
  R13            R13_irq   R13_fiq   R13_svc
  R14            R14_irq   R14_fiq   R14_svc
  R15 (aka PC)

Where a register isn't named in the table, then the USR mode register is visible.
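The banking rule in the table can be expressed as a small lookup. The sketch below is an illustration only, with invented names; the "v1" row follows the table above and the "v2" row follows the ARM2 description earlier on this page (private copies of R8 to R14 in FIQ mode).

```python
# First banked register number for each mode (banking always ends at R14).
BANKED_FROM = {
    "v1": {"USR": None, "IRQ": 13, "FIQ": 10, "SVC": 13},
    "v2": {"USR": None, "IRQ": 13, "FIQ": 8, "SVC": 13},
}


def physical_register(mode: str, number: int, version: str = "v1") -> str:
    """Name of the register actually accessed when 'mode' refers to R<number>."""
    if not 0 <= number <= 15:
        raise ValueError("register number must be 0..15")
    first_banked = BANKED_FROM[version][mode]
    if first_banked is not None and first_banked <= number <= 14:
        return f"R{number}_{mode.lower()}"
    return f"R{number}"  # shared with USR mode


def count_physical_registers(version: str = "v1") -> int:
    """Total number of distinct registers across all modes."""
    names = {
        physical_register(mode, number, version)
        for mode in BANKED_FROM[version]
        for number in range(16)
    }
    return len(names)


print(physical_register("FIQ", 10))    # R10_fiq
print(physical_register("SVC", 5))     # R5
print(count_physical_registers("v1"))  # 25
print(count_physical_registers("v2"))  # 27
```

The count for "v2" agrees with the 27 registers quoted for the ARM2: sixteen visible at any one time plus eleven private copies.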

To help keep interrupt latency to a minimum, FIQ (Fast Interrupt Request) mode has a reasonably large set of private registers, allowing interrupt code to execute in registers as much as possible. If only one FIQ claimant is allowed at a time, a stricture RISC OS stipulates, a further optimization of pre-loading these registers can be performed.

By convention, and partially enforced by the instruction set, R14 is the 'link' register, commonly holding the return address of any subroutine call. The BL (Branch and Link) instruction automatically stores the correct return address in R14. All registers are general purpose, including R15, which is the Program Counter, status flags and mode register all in one. It holds a 26 bit, word aligned address (stored in bits 25 to 2, since the bottom two address bits are always zero), two bits of processor mode in bits 1 & 0 (00 - USR, 01 - IRQ, 10 - FIQ & 11 - SVC) and six bits of processor status (Negative, Zero, Carry, Overflow, Interrupt Request Disable and Fast Interrupt Request Disable).
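The combined layout of R15 is easiest to see by pulling a value apart. The sketch below is illustrative Python with invented names; the position of each status bit within the top six bits (N, Z, C, V, I, F from bit 31 down to bit 26) is taken from the standard 26 bit ARM documentation rather than from this page, and the sample value is made up.

```python
MODES = {0b00: "USR", 0b01: "IRQ", 0b10: "FIQ", 0b11: "SVC"}

# Status bits in the top six bits of R15 (bit positions are an assumption
# from the usual 26 bit ARM documentation, not stated on this page).
FLAG_BITS = {"N": 31, "Z": 30, "C": 29, "V": 28, "I": 27, "F": 26}

PC_MASK = 0x03FFFFFC  # bits 25..2: the word aligned address


def decode_r15(value: int) -> dict:
    """Split a 26 bit architecture R15 value into address, mode and flags."""
    value &= 0xFFFFFFFF
    return {
        "pc": value & PC_MASK,
        "mode": MODES[value & 0b11],
        "flags": {name: bool((value >> bit) & 1) for name, bit in FLAG_BITS.items()},
    }


# Made-up example: SVC mode, both interrupts disabled, N set, address &8000.
decoded = decode_r15(0x8C008003)
print(hex(decoded["pc"]), decoded["mode"], decoded["flags"])
```

Masking with PC_MASK is also why the address space stops at 64 Megabytes: there is simply no room in the register for more address bits.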

Instructions include Load/Store (Register, Multiple registers, Byte), Move (and Move NOT), Arithmetic (Add, Add with Carry, Subtract, Subtract with Carry, Reverse Subtract, Reverse Subtract with Carry), Comparison (Compare and Compare Negative), Boolean Logic (Test, Test Equivalence, And, Exclusive Or, Or, Bit Clear), Program Flow (Branch, Branch with Link) and the Software Interrupt.

Version 2 - ARM2

This architecture added a banked R8 and R9 in FIQ mode, the LDR/STR instruction with register specified shift amounts was withdrawn and two new 'classes' of instruction were added - these being Multiply (multiply and multiply accumulate) and co-processor control (Data operation, co-processor data to ARM register, ARM register to co-processor, Load & Store).

Version 2as - ARM3 & ARM250

Functionally identical to the v2 architecture, this variant added one extra instruction, SWP, and allocated co-processor fifteen to CPU identification and cache control.

Version 3 - ARM6, ARM7 & Amulet 1

This update to the ARM architecture removed the 26 bit restriction on the PC, allowing full 32 bit addressing for both data and code. (Previously only data could be addressed across the full 32bit address range.) As a result the dodge of storing processor flags mixed in with the PC in register 15 was no longer possible, and new registers were added to hold processor state: the CPSR (Current Processor Status Register) and, for each privileged processor mode, an SPSR (Saved Processor Status Register). Two new processor modes were added as well, Abort32 and Undefined32. For backwards compatibility the chip could be set to emulate the older 26 bit mode of operation. A further improvement was the ability to change the byte order of the chip from little-endian to big-endian operation.

All this required the addition of new Move instructions (SPSR to register, CPSR to register, register to SPSR, register to CPSR, immediate constant to SPSR and immediate constant to CPSR.) to communicate with the status registers for each processor mode.

Version 3M

This extension of the version three architecture gave extended Multiply opcodes including unsigned long, unsigned accumulate long, signed long and signed accumulate long multiplies.

Version 4 - StrongARM, ARM8 & ARM9

The new instructions first introduced in the 3M architecture now become part of the main architecture in version 4. Additionally a Halfword (16bit) load/store instruction was added.

Version 5 - ARM10

This version extends architecture 4 by adding instructions and slightly modifying the definitions of some existing instructions, to improve the efficiency of ARM/Thumb interworking in T variants and to allow the same code generation techniques to be used for non-T variants as for T variants.

Version 5 also adds a count leading zeros instruction, which allows more efficient integer divide and interrupt prioritization routines. A software breakpoint instruction and more instruction options for coprocessor designers have been added. Additionally, version 5 tightens the definition of how flags are set by multiply instructions.
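To see why a count leading zeros operation helps with interrupt prioritization, consider finding the highest set bit in a word of pending interrupts. The sketch below models the operation in Python; the function names are invented, and the convention that a higher bit number means a higher priority is an assumption made for the example.

```python
def clz32(value: int) -> int:
    """Number of leading zero bits in a 32 bit word (32 when the word is zero)."""
    value &= 0xFFFFFFFF
    return 32 - value.bit_length()


def highest_priority_interrupt(pending: int):
    """Bit number of the most significant pending interrupt, or None if idle.

    Assumes a higher bit number means a higher priority.
    """
    if pending & 0xFFFFFFFF == 0:
        return None
    return 31 - clz32(pending)


print(clz32(1))                             # 31
print(clz32(0x80000000))                    # 0
print(highest_priority_interrupt(0b10100))  # 4
```

Without such an instruction the same answer needs a loop or a table lookup; with it the search is a single operation followed by a subtraction. A divide routine can use the same count to line up the divisor with the dividend in one step.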

The Thumb Instruction set (T variants)

The Thumb instruction set is a re-encoded subset of the ARM instruction set. Thumb instructions are half the size of ARM instructions (16 bits compared with 32), with the result that greater code density can usually be achieved by using the Thumb instruction set instead of the ARM instruction set. The trade-off is that the Thumb instruction set loses conditional instruction execution and can, for the most part, only address the first eight registers of the processor.

The Thumb instruction set does not include some instructions that are needed for exception handling, so ARM code needs to be used for at least the top-level exception handlers. Because of this, the Thumb instruction set is always used in conjunction with a suitable version of the ARM architecture.

Long multiply instructions (M variants)

M variants of the ARM instruction set include four extra instructions which perform 32 x 32 -> 64 multiplication and 32 x 32 + 64 -> 64 multiply-accumulates. These instructions imply the existence of a multiplier that is significantly larger than minimum, and are sometimes omitted in implementations for which a small die size is very important and multiply performance is not very important.
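The arithmetic performed by the four long multiply instructions can be modelled as follows. This is a Python sketch of the results only, not of the instruction encodings; the function names follow the usual ARM mnemonics (UMULL, UMLAL, SMULL, SMLAL), and each function returns the 64 bit result as a (low word, high word) pair, mirroring the two destination registers.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def to_signed32(value: int) -> int:
    """Interpret a 32 bit pattern as a two's complement signed number."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def split64(value: int) -> tuple:
    """Wrap to 64 bits and return (low word, high word)."""
    value &= MASK64
    return value & MASK32, value >> 32


def umull(rm: int, rs: int) -> tuple:
    """Unsigned 32 x 32 -> 64."""
    return split64((rm & MASK32) * (rs & MASK32))


def smull(rm: int, rs: int) -> tuple:
    """Signed 32 x 32 -> 64."""
    return split64(to_signed32(rm) * to_signed32(rs))


def umlal(lo: int, hi: int, rm: int, rs: int) -> tuple:
    """Unsigned 32 x 32 + 64 -> 64, accumulating into the pair (lo, hi)."""
    accumulator = ((hi & MASK32) << 32) | (lo & MASK32)
    return split64(accumulator + (rm & MASK32) * (rs & MASK32))


def smlal(lo: int, hi: int, rm: int, rs: int) -> tuple:
    """Signed 32 x 32 + 64 -> 64, accumulating into the pair (lo, hi)."""
    accumulator = ((hi & MASK32) << 32) | (lo & MASK32)
    return split64(accumulator + to_signed32(rm) * to_signed32(rs))


print([hex(word) for word in umull(0xFFFFFFFF, 0xFFFFFFFF)])  # ['0x1', '0xfffffffe']
print([hex(word) for word in smull(0xFFFFFFFF, 2)])           # ['0xfffffffe', '0xffffffff']
```

The signed and unsigned forms give different high words for the same bit patterns, which is why both sets of instructions are needed.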

Enhanced DSP instructions (E variants)

E variants of the ARM instruction set include a number of extra instructions which enhance the performance of an ARM processor on typical digital signal processing (DSP) algorithms.

Vector Floating Point v1 - ARM10

Developed concurrently with ARM Architecture Version 5, this is a coprocessor extension to the ARM architecture designed for high floating-point performance on typical graphics and DSP algorithms. It provides single-precision and double-precision floating point arithmetic.

Finally, for the latest information and details regarding the ARM family of processors, why not visit ARM Ltd's homepages, where details on current and upcoming ARM processors are kept.
