No banked R8 and R9 in FIQ
No multiply instruction.
LDR/STR instructions with register-specified shift amounts.
No co-processor interface or co-processor instructions.
USR : user mode
IRQ : interrupt mode (with private copies of R13 and R14)
FIQ : fast interrupt mode (with private copies of R8 to R14)
SVC : supervisor mode (with private copies of R13 and R14)
Only non-USR mode code may change the processor mode, which provides hardware security as long as the hardware and physical memory are accessible only from privileged code. Because the top six bits of the program counter hold the processor status flags, this chip was restricted to addressing 26 bits of memory, or a 64 Megabyte address space. In actuality there are eight bits of processor status held in the PC register. Because an ARM instruction is always four bytes long, the bottom two bits of the PC are always an implied zero when the register is being used as a PC. When the register is used for other operations, the bottom two bits reflect the mode the processor is operating in (00 - USR, 01 - FIQ, 10 - IRQ & 11 - SVC).
A three stage instruction pipeline allows the chip to execute instructions quickly with a fairly low transistor count. One side effect of the pipeline is the ability to get a 'free' rotation/shift on every instruction, as one stage of the pipeline deals exclusively with a barrel shift of a given register. Combined with the conditional execution of every instruction, this allows long runs of code without branches (which stall the pipeline), giving a fairly high instruction execution speed for the clock rate (about 0.6 instructions per clock cycle on average).
The ARM2 chip was clocked at 8 MHz giving an average performance of 4-4.7 MIPS.
Finally one new instruction was added: SWP, an atomic register to memory swap command useful for multi-processor arrays.
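As a rough illustration of why an atomic swap matters when several processors share memory, the sketch below builds a minimal lock on C11's atomic_exchange, which stands in for the role SWP plays in hardware. The names swp_lock_t, swp_try_lock and swp_unlock are hypothetical, and the correspondence with SWP is an assumption made for illustration: a compiler is free to implement atomic_exchange with other instructions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock word: 0 = free, 1 = held. */
typedef struct { atomic_int word; } swp_lock_t;

/* Swap 1 into the lock word and inspect what was there before.
   Reading 0 back means this caller took the lock; reading 1 means
   someone else already holds it and nothing has changed. */
static bool swp_try_lock(swp_lock_t *lock) {
    return atomic_exchange(&lock->word, 1) == 0;
}

/* Release the lock; only the current holder should call this. */
static void swp_unlock(swp_lock_t *lock) {
    assert(atomic_load(&lock->word) == 1);
    atomic_store(&lock->word, 0);
}
```

Because the read and the write happen as one indivisible operation, two processors calling swp_try_lock at the same moment cannot both see the lock as free.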
Several speeds of ARM3 chips were produced. Initially 26 MHz varieties were released with the A540 machines, then 25 MHz versions were used in the A5000 and 24 MHz ones in the A4. Finally a 33MHz version was produced and used in the alpha variant of the A5000.
A second incarnation of the chip was as the ARM250 which was a 12MHz variant of the ARM3 cell and had the IOC1, VIDC1a and MEMC1 chips all integrated into the one chip but unlike the normal ARM3 it had no processor cache. The ARM250 delivered about 7 MIPS performance.
A 24 MHz ARM3 using a 12MHz main memory will produce an average speed of execution of 13.26 MIPS. At 33 MHz 17.96 MIPS is delivered.
User32 - 32 bit USR mode.
Supervisor32 - 32 bit SVC mode. (private SPSR register)
IRQ32 - 32 bit IRQ mode. (private SPSR register)
FIQ32 - 32 bit FIQ mode. (private SPSR register)
Abort32 - Memory fetch abort mode. (private SPSR register)
Undefined32 - Undefined instruction mode. (private SPSR register)
The SPSR is the Saved Processor Status Register; it holds a copy of the CPSR (Current Processor Status Register) taken when the new mode is entered. This change and the addition of the Abort32 mode (although the CPSR/SPSR is really a corollary of the change to 32 bits) allow the ARM6 cell to handle virtual memory easily, without the contortions you had to go through on earlier ARM cells.
Two new instructions for reading and writing the CPSR and SPSR registers were added. The program counter is now fully 32 bit, with the CPSR being hardware shifted into position when the PC is read in 26 bit modes (for backwards compatibility). The ARM6 cell is fully binary compatible, in the 26 bit modes, with code for the earlier ARM cells. The chip is fully static: the clock can be slowed to any speed and the processor will maintain state. Finally, the cell can work in either big-endian or little-endian operation and can be hardware switched between the two. Total transistor count in the ARM6 cell (not chip) is 36,000.
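The way a private SPSR lets a handler return with the interrupted state intact can be sketched with a toy model. This is a simplification under stated assumptions: the struct and function names are hypothetical, the mode numbers are the low five bits of the CPSR as given in ARM's architecture documentation (not in the text above), and everything else that happens on exception entry (saving the return address, disabling interrupts) is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Mode field values (low five bits of the CPSR); assumed from ARM's
   architecture documentation rather than from this document. */
enum {
    MODE32_MASK = 0x1F,
    MODE32_USR  = 0x10,
    MODE32_ABT  = 0x17,
    MODE32_UND  = 0x1B
};

/* Toy processor state: one CPSR plus one SPSR slot per mode number.
   The User slot is never used, since User32 has no SPSR. */
struct cpu_state {
    uint32_t cpsr;
    uint32_t spsr[MODE32_MASK + 1];
};

/* On entry to a privileged mode, the old CPSR is copied into that
   mode's SPSR and only then is the mode field changed. */
static void enter_exception(struct cpu_state *cpu, uint32_t target) {
    assert(target != MODE32_USR && target <= MODE32_MASK);
    cpu->spsr[target] = cpu->cpsr;
    cpu->cpsr = (cpu->cpsr & ~(uint32_t)MODE32_MASK) | target;
}

/* On return, the saved copy is restored, bringing back the flags and
   the mode of the interrupted code in one step. */
static void return_from_exception(struct cpu_state *cpu) {
    uint32_t mode = cpu->cpsr & MODE32_MASK;
    assert(mode != MODE32_USR);
    cpu->cpsr = cpu->spsr[mode];
}
```

In this model an abort handler can alter the flags freely while it repairs a page fault; the interrupted code still resumes with the status it had when the abort was taken.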
Several versions of the ARM6 cell have been produced. The ARM61 is a hardwired version of the ARM6 cell in ARM2/3 compatibility mode. This chip cannot enter the 32bit address/processor modes. The ARM600 range of chips is an ARM6 cell with an inbuilt MMU, on chip cache similar to the ARM3 chip's, an eight deep write back buffer with two independent addresses and a total transistor count of 360,000. The cache has had performance tweaks, is now controlled by the MMU and has been adjusted for 32 bit addressing. Three ARM610 chip speeds have been produced. One at 20 MHz delivering 17 MIPS, one at 30 MHz delivering 26 MIPS performance and finally one at 33MHz giving around 27-28 MIPS.
Also available are the ARM60 (an ARM6 cell as a chip, without anything else), ARM650 (an ARM6 with some RAM and peripheral controllers, designed for embedded control systems), ARM6l (lower power ARM6 cell) and the ARM60l (lower power version of the ARM6 cell as a chip).
Most of what is new in the ARM7 cell is internal changes to the timings of various signals. The ARM700 chip has a larger on-chip cache than the ARM600 (8kb, and radically altered for power efficiency), improving cache hit rates. It also has twice the number of translation lookaside entries in the MMU and twice the number of addresses in the write buffer. (Presumably four addresses can now be written to before the buffer stalls.) At 40MHz the ARM710 delivers about 36 MIPS, or around a 40% improvement over the ARM610.
ARM7 series devices are ARM7 (chip cell core), ARM7D (the chip core with debugging support), ARM7DM (an ARM7D with an enhanced multiply), ARM7DMI (an ARM7DM with ICEbreaker (tm), which is on-chip support for In-Circuit-Emulation), ARM70DM (ARM7DMI as a chip), ARM700 (ARM7 + MMU + cache + Writeback Buffer) and the ARM7500 (ARM7 + MMU + cache + Writeback Buffer + IOMD + VIDC20). Nearly all of these cores can be offered with the Thumb core as well.
Fabricated on a 0.5 micron process, the chip is listed as delivering 80 MIPS performance with a 3.3 Volt device at 80 MHz. This is over twice the performance of an ARM7 chip and lives up to the initial 'roadmap' promises made about the ARM family. However its performance is eclipsed by the StrongARM devices for raw processing power.
In terms of the instruction set there is one new instruction, the halfword load/store for moving 16 bit data units. Complete code compatibility with earlier processors is not guaranteed, for two reasons. First, the extended pipeline means that stack calls that store the Program Counter will store a PC value a full sixteen bytes ahead of the currently executing instruction, rather than the more normal eight bytes. Secondly, the split cache introduces problems with self-modifying code that is first executed, then treated as data and manipulated, and then executed again in its altered form before the old version has been flushed from the instruction cache.
Such code fragments will break. Fortunately such code tends to be fairly rare and confined to the OS (SWI handlers in particular). Produced on a 0.35 micron process the SA110 part achieves 115 MIPS at 100 MHz, 185 MIPS at 160 MHz and 230 MIPS at 200 MHz. The SA1100 part is designed for portable applications and contains an SA core, MMU, read/write buffers (probably a Level 1 cache and write buffer akin to the SA110 part), PCMCIA support, colour/greyscale LCD controller and general purpose IO controller (including two serial ports and USB support). It can be clocked at 133 or 200 MHz and consumes less than 500 mW of power.
It is initially going to be offered as two parts: the ARM9TDMI (Thumb, Debug support, 64bit Multiply and ICEBreaker In Circuit Emulation), which is the base core part, and the ARM940T. The ARM940T offers, above and beyond the base core, 4kb Instruction/Data caches, a write buffer (8 words, 4 independent addresses), AMBA bus interface, external co-processor support and a protection unit for embedded applications (it requires no address translation and allows eight protected areas of memory, each with its own size and level of protection). Both parts are fabricated at 0.35 microns and clock at 150 MHz (producing 165 MIPS), with the ARM9TDMI consuming 225 mW and the ARM940T 675 mW.
Initially planned versions include the ARM10TDMI core, with the ARM1020T processor built around this core but adding an MMU with demand paged virtual memory support, a 32Kb Harvard style level 1 cache (most likely 16Kb Instruction and 16Kb Data caches, as in the StrongARM), a write buffer and an enhanced AMBA bus interface. Exact power consumption figures haven't been released but I expect the ARM1020T will consume between 0.6 and 1 Watt of power at 300 MHz.
The ARM Architecture is built around a programmer's model of sixteen general purpose registers and a variety of processor modes. Each processor mode offers differing levels of memory access, differing ability to manipulate the PC and mode, and its own private registers.
By default the programmer 'sees' 16 User mode registers, but when in other modes various registers are swapped out with registers particular to that mode. This table summarizes the various modes and registers.
USR            IRQ        FIQ        SVC
R13            R13_irq    R13_fiq    R13_svc
R14            R14_irq    R14_fiq    R14_svc
R15 (aka PC)
Where a register isn't named in the table, then the USR mode register is visible.
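The register swapping can be modelled as a lookup from (current mode, register number) to the bank that actually supplies the register. The sketch below is a hypothetical model, not anything taken from ARM documentation: the enum and function names are invented, and it assumes the banking described for the ARM2 earlier in this document (R13 and R14 private to IRQ and SVC, R8 to R14 private to FIQ).

```c
#include <assert.h>

enum arm_mode { MODE_USR, MODE_IRQ, MODE_FIQ, MODE_SVC };

/* Returns the mode whose private copy of register `reg` (0..15) is
   visible while running in `mode`. MODE_USR means the ordinary user
   register is seen. */
static enum arm_mode bank_for(enum arm_mode mode, int reg) {
    assert(reg >= 0 && reg <= 15);
    switch (mode) {
    case MODE_FIQ:
        /* FIQ has the largest private set: R8 to R14. */
        return (reg >= 8 && reg <= 14) ? MODE_FIQ : MODE_USR;
    case MODE_IRQ:
    case MODE_SVC:
        /* IRQ and SVC only replace the stack pointer and link register. */
        return (reg == 13 || reg == 14) ? mode : MODE_USR;
    default:
        return MODE_USR;
    }
}
```

For example, bank_for(MODE_FIQ, 10) yields MODE_FIQ, while bank_for(MODE_SVC, 10) yields MODE_USR, which is why an FIQ handler can leave values in R8 to R12 between interrupts but an SVC handler cannot.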
To help keep interrupt latency to a minimum, FIQ (Fast Interrupt Request) mode has a reasonably large set of private registers allowing interrupt code to execute in register as much as possible. If there is only one FIQ claimant allowed at a time, a stricture RISC OS stipulates, a further optimization of pre-loading these registers can be performed.
By convention, and partially enforced by the instruction set, R14 is the 'link' register, commonly holding the return address of any subroutine call. The BL (Branch and Link) instruction automatically stores the correct return address in R14. All registers are general purpose, including R15, which is the Program Counter, status flags and mode register all in one. It holds a 26 bit word aligned address (stored as 24 bits, since the bottom two address bits are always zero), two bits of processor mode in bits 1 & 0 (00 - USR, 01 - FIQ, 10 - IRQ & 11 - SVC) and six bits of processor status (Negative, Carry, Overflow, Zero, Interrupt Request Disable and Fast Interrupt Request Disable).
Instructions include Load/Store (Register, Multiple registers, Byte), Move (and Move NOT), Arithmetic (Add, Add with Carry, Subtract, Subtract with Carry, Reverse Subtract, Reverse Subtract with Carry), Comparison (Compare and Compare Negative), Boolean Logic (Test, Test Equivalence, And, Exclusive Or, Or, Bit Clear), Program Flow (Branch, Branch with Link) and the Software Interrupt.
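The packing of address, mode and status into R15 can be made concrete with a handful of masks. This is a minimal sketch under stated assumptions: the macro and function names are hypothetical, and the exact bit positions of the status flags (N, Z, C, V in bits 31 to 28, the two interrupt disable bits in 27 and 26) come from the 26 bit ARM programmer's model rather than from the text above.

```c
#include <assert.h>
#include <stdint.h>

/* Status bits held in the top six bits of R15 (assumed positions). */
#define R15_FLAG_N    (1u << 31)   /* Negative */
#define R15_FLAG_Z    (1u << 30)   /* Zero */
#define R15_FLAG_C    (1u << 29)   /* Carry */
#define R15_FLAG_V    (1u << 28)   /* Overflow */
#define R15_IRQ_OFF   (1u << 27)   /* Interrupt Request Disable */
#define R15_FIQ_OFF   (1u << 26)   /* Fast Interrupt Request Disable */

/* Word aligned program counter in bits 25..2, mode in bits 1..0. */
#define R15_PC_MASK   0x03FFFFFCu
#define R15_MODE_MASK 0x00000003u

/* The highest instruction address plus one word is exactly 64 Megabytes. */
static_assert(R15_PC_MASK + 4u == 64u * 1024u * 1024u,
              "26 bit addressing spans 64 Megabytes");

/* Address of the instruction being fetched. */
static uint32_t r15_pc(uint32_t r15) { return r15 & R15_PC_MASK; }

/* Two bit processor mode field. */
static unsigned r15_mode(uint32_t r15) { return r15 & R15_MODE_MASK; }

/* Non-zero if the given status bit is set. */
static int r15_flag(uint32_t r15, uint32_t mask) { return (r15 & mask) != 0; }
```

Every bit of the 32 bit register is accounted for: six status bits, 24 address bits and two mode bits.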
This architecture added a banked R8 and R9 in FIQ mode, the LDR/STR instruction with register specified shift amounts was withdrawn and two new 'classes' of instruction were added - these being Multiply (multiply and multiply accumulate) and co-processor control (Data operation, co-processor data to ARM register, ARM register to co-processor, Load & Store).
Functionally identical to the v2 architecture, this variant added one extra instruction, SWP, and allocated co-processor zero to be CPU identification and cache control.
This update to the ARM architecture removed the 26 bit restriction on the PC, allowing full 32 bit addressing for both data and code. (Previously only data could be addressed across the full 32 bit address range.) As a result the dodge of storing processor flags mixed in with the PC in register 15 was no longer possible and a new set of registers was added to hold processor state. For each processor mode the registers CPSR (Current Processor Status Register) and SPSR (Saved Processor Status Register) were added. Two new processor modes were added as well: Abort32 and Undefined32. For backwards compatibility the chip could be set to emulate the older 26 bit mode of operation. A further improvement was the ability to change the byte order of the chip from little-endian to big-endian operation.
All this required the addition of new Move instructions (SPSR to register, CPSR to register, register to SPSR, register to CPSR, immediate constant to SPSR and immediate constant to CPSR.) to communicate with the status registers for each processor mode.
This extension of the version three architecture gave extended Multiply opcodes including unsigned long, unsigned accumulate long, signed long and signed accumulate long multiplies.
The new instructions first introduced in the 3M architecture now become part of the main architecture in version 4. Additionally a Halfword (16bit) load/store instruction was added.
This version extends architecture 4 by adding instructions and slightly modifying the definitions of some existing instructions to improve the efficiency of ARM/Thumb interworking in T variants and allow the same code generation techniques to be used for non-T variants as for T variants.
Version 5 also adds a count leading zeros instruction, which allows more efficient integer divide and interrupt prioritization routines. A software breakpoint instruction and more instruction options for coprocessor designers have been added. Additionally, version 5 tightens the definition of how flags are set by multiply instructions.
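To show why a count leading zeros operation helps with both tasks, here is a portable sketch in C. The function names are hypothetical, the loop in clz32 stands in for what would be a single instruction on a version 5 processor, and the divide is an ordinary shift-and-subtract routine chosen for illustration rather than any routine from ARM.

```c
#include <assert.h>
#include <stdint.h>

/* Number of zero bits above the most significant set bit.
   Returns 32 for an input of zero. */
static unsigned clz32(uint32_t x) {
    unsigned n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) { x <<= 1; n++; }
    return n;
}

/* Interrupt prioritization: index of the highest-numbered pending
   source in a 32 bit mask, or -1 if nothing is pending. */
static int highest_pending(uint32_t pending) {
    return 31 - (int)clz32(pending);
}

/* Unsigned divide: the leading zero counts give the starting shift
   directly, so no loop is needed to line the divisor up. */
static uint32_t udiv32(uint32_t n, uint32_t d) {
    assert(d != 0);
    uint32_t q = 0;
    if (n < d) return 0;
    for (int shift = (int)clz32(d) - (int)clz32(n); shift >= 0; shift--) {
        if ((d << shift) <= n) {
            n -= d << shift;
            q |= 1u << shift;
        }
    }
    return q;
}
```

Without such an operation, both routines would have to search for the top set bit one position at a time before doing any useful work.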
The Thumb instruction set is a re-encoded subset of the ARM instruction set. Thumb instructions are half the size of ARM instructions (16 bits compared with 32), with the result that greater code density can usually be achieved by using the Thumb instruction set instead of the ARM instruction set. The trade-off includes that the Thumb instruction set loses the conditional instruction execution and can only address the first eight registers of the processor.
The Thumb instruction set does not include some instructions that are needed for exception handling, so ARM code needs to be used for at least the top-level exception handlers. Because of this, the Thumb instruction set is always used in conjunction with a suitable version of the ARM architecture.
M variants of the ARM instruction set include four extra instructions which perform 32 x 32 -> 64 multiplication and 32 x 32 + 64 -> 64 multiply-accumulates. These instructions imply the existence of a multiplier that is significantly larger than minimum, and are sometimes omitted in implementations for which a small die size is very important and multiply performance is not very important.
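The arithmetic these four instructions perform can be written out in C using 64 bit integers. This is only a sketch of the results they compute: the function names borrow the ARM mnemonics UMULL, UMLAL, SMULL and SMLAL for readability, dot_product is a hypothetical usage example, and nothing here says how a given compiler maps the code to instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unsigned long multiply: 32 x 32 -> 64. */
static uint64_t umull(uint32_t a, uint32_t b) {
    return (uint64_t)a * b;
}

/* Unsigned long multiply-accumulate: 32 x 32 + 64 -> 64 (wraps at 64 bits). */
static uint64_t umlal(uint64_t acc, uint32_t a, uint32_t b) {
    return acc + (uint64_t)a * b;
}

/* Signed long multiply: 32 x 32 -> 64. */
static int64_t smull(int32_t a, int32_t b) {
    return (int64_t)a * (int64_t)b;
}

/* Signed long multiply-accumulate: 32 x 32 + 64 -> 64. The addition is
   done on unsigned values so that overflow wraps instead of being
   undefined behaviour in C. */
static int64_t smlal(int64_t acc, int32_t a, int32_t b) {
    return (int64_t)((uint64_t)acc + (uint64_t)((int64_t)a * (int64_t)b));
}

/* Hypothetical use: a dot product that keeps full 64 bit precision. */
static int64_t dot_product(const int32_t *x, const int32_t *y, size_t n) {
    assert(x != NULL && y != NULL);
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) acc = smlal(acc, x[i], y[i]);
    return acc;
}
```

The widening is the point: a plain 32 x 32 -> 32 multiply would lose the upper half of each product.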
E variants of the ARM instruction set include a number of extra instructions which enhance the performance of an ARM processor on typical digital signal processing (DSP) algorithms.
Developed concurrently with ARM Architecture Version 5, this is a coprocessor extension to the ARM architecture designed for high floating-point performance on typical graphics and DSP algorithms. It provides single-precision and double-precision floating point arithmetic.
Finally, for the latest information and details regarding the ARM family of processors, why not visit ARM Ltd's homepages, where details on current and upcoming ARM processors are kept.