Direct Arithmetic Operations on Small-sized Numbers in RISC Architectures


Are there any RISC architectures which allow arithmetic operations to be applied individually to bytes, half-words and other data cells, whose size is less than the size of the CPU general purpose registers?

On Intel x86 (IA-32) and x86-64 (also known as EM64T or AMD64) processors, not only the whole register is available: its smaller parts can be operated on as well. The ISA allows all the arithmetic operations to be performed on the whole register, its half, its quarter, and a byte (to be more precise, two byte-sized parts of a register are available, for example AL and AH in RAX). After the operation is performed, we can check for overflow and handle it easily if it occurred. No matter whether we operated on the whole word (32 bits wide for IA-32 and 64 bits wide for EM64T) or on data of a smaller size (half-word, quarter-word or byte), if the result exceeds the size of the chosen data cell, the corresponding flag (OF or CF) is set to 1. So on Intel architecture there is no need to detect such errors in small-sized operations with a chain of instructions that analyzes the higher bits of the result.

The question is: are there any RISC architectures in which direct arithmetic operations on small data are possible, where these operations are implemented in hardware (no software emulation is required to perform them), and where overflows, carries and borrows occurring in operations on bytes, half-words etc. are tracked by the hardware rather than checked in software? Or does this approach contradict the whole RISC philosophy, so that no RISC processor, past or present, has ever implemented it?


0
On

TL;DR: no, AFAIK there are no RISC ISAs with flag-setting partial-register ops narrower than 32 bits. But many 64-bit RISC ISAs (like AArch64) that have FLAGS at all can set them from the result of a 32-bit op.

See the last section: this is because of a general lack of demand for software integer overflow checking, or a chicken/egg problem. Usually you just need to compare/branch on 16-bit values, and you can do that just fine with them zero or sign extended to 32 or 64 bit.

Only a RISC where the register width is 8 or 16 bits can set flags from that operand-size. e.g. AVR 8-bit RISC with 32 registers and 16-bit instruction words. It needs extended-precision add/adc just to implement 16-bit int.
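To illustrate what "extended-precision add/adc" means there, here is a minimal C sketch of a 16-bit add built only from 8-bit adds plus a carry. The function name `add16_via_8bit` is made up for this example, and this is a model of the idea, not AVR code.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helper: add two 16-bit values one byte at a time,
   the way an add / adc instruction pair would on an 8-bit machine. */
uint16_t add16_via_8bit(uint16_t a, uint16_t b, bool *carry_out)
{
    unsigned lo = (unsigned)(a & 0xFFu) + (b & 0xFFu);    /* "add": low bytes */
    unsigned carry = lo >> 8;                             /* carry flag, 0 or 1 */
    unsigned hi = (unsigned)(a >> 8) + (b >> 8) + carry;  /* "adc": high bytes + carry */
    *carry_out = (hi >> 8) != 0;                          /* carry out of bit 15 */
    return (uint16_t)(((hi & 0xFFu) << 8) | (lo & 0xFFu));
}
```

The carry out of the low-byte add is exactly what the hardware flag provides for free on such a machine.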

This is mostly a historical thing: x86 has 16-bit operand-size for everything because of the way it evolved from 16-bit-only 286. When 80386 was designed, it was important that it be able to run 16-bit-only code at full speed, and they provided ways to incrementally add 32-bit ops to 16-bit code. And used the same mechanism to allow 16-bit ops in 32-bit code.

The x86 8-bit low/high register stuff (AX=AH:AL) is again partly due to how 8086 was designed as a successor to 8080, to make porting easy (and even possible to automate). See Why are first four x86 GPRs named in such unintuitive order?. (And also because it was just plain useful to have eight 1-byte registers and four 2-byte registers at the same time.)


Related: Which 2's complement integer operations can be used without zeroing high bits in the inputs, if only the low part of the result is wanted? For many calculations, you don't have to re-zero the high bits after each operation to get the same result. So lack of 8-bit / 16-bit operand size is not an obstacle to efficient implementation of most code that logically wraps its results to 8 or 16 bits.
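As a small illustration of that point, here is a C sketch (function names are invented for the example) that computes the same add/multiply chain two ways: truncating to 8 bits after every step, and computing in a wide unsigned register and truncating once at the end. For add, subtract and multiply the low 8 bits agree, because carries only propagate upward.

```c
#include <stdint.h>
#include <stddef.h>

/* Truncate to 8 bits after every step, as a narrow ALU would. */
uint8_t mix_truncate_each_step(const uint8_t *v, size_t n)
{
    uint8_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc = (uint8_t)(acc * 3u + v[i]);
    return acc;
}

/* Compute in a full-width unsigned register, truncate once at the end. */
uint8_t mix_truncate_once(const uint8_t *v, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc = acc * 3u + v[i];   /* high bits hold garbage we never look at */
    return (uint8_t)acc;
}
```

This does not hold for operations where high bits feed into low bits, such as right shifts, division, or comparisons; those need properly extended inputs.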

64-bit RISC machines often have a 32-bit version of at least some important instructions like add, so you can get a zero-extended add result for free without having to separately truncate it, e.g. to make code like array[i++] efficient with uint32_t i and 64-bit pointers. But never partial-register operand sizes narrower than 32-bit, on any RISC I've heard of.

DEC Alpha is interesting because it was a new design, 64-bit from the ground up, not a 64-bit extension to an existing ISA the way MIPS64 is. This table of Alpha mnemonics shows that add/sub/mul/div were all available in 32 and 64-bit forms, but shifts and compares weren't. (There are also byte-manipulation instructions that are basically SIMD shuffle/mask/insert/extract inside 64-bit integer registers, and a SIMD packed-compare for efficient string stuff.)

According to this official MIPS64 ISA doc (section 4.3 CPU Registers):

A MIPS64 processor always produces a 64-bit result, even for those instructions that are architecturally defined to operate on 32 bits. Such instructions typically sign-extend their 32-bit result into 64 bits. In so doing, 32-bit programs work as expected, though the registers are actually 64 bits wide rather than 32 bits.

You use special instructions for full 64-bit registers, like DADDU (doubleword-add unsigned) instead of ADDU. Note that the non-U versions, add and dadd, trap on 2's complement signed overflow (with 32-bit or 64-bit operand size), so you have to use the U version for wrapping signed math. (ISA reference links are on mips.com.) Anyway, MIPS doesn't have a special mode for 32-bit, but an OS would need to care about 32-bit programs vs. 64-bit, because 32-bit code will assume all pointers are in the low 32 bits of virtual address space.
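A rough C model of the behaviour the quoted paragraph describes, as I understand it: the 32-bit add keeps the low 32 bits of the sum and sign-extends bit 31 into the upper half, while the doubleword add is a plain wrapping 64-bit add. The names `addu_model` and `daddu_model` are made up, and the model assumes the inputs are valid register values for the instruction.

```c
#include <stdint.h>

/* Model of a 32-bit ADDU on a 64-bit register file (not real MIPS code):
   add, keep the low 32 bits, then copy bit 31 into bits 63..32. */
uint64_t addu_model(uint64_t rs, uint64_t rt)
{
    uint64_t low = (uint32_t)(rs + rt);
    if (low & UINT64_C(0x80000000))
        low |= UINT64_C(0xFFFFFFFF00000000);
    return low;
}

/* Model of DADDU: plain wrapping 64-bit add. */
uint64_t daddu_model(uint64_t rs, uint64_t rt)
{
    return rs + rt;
}
```

For example, adding 1 to 0x7FFFFFFF gives 0xFFFFFFFF80000000 in the 32-bit model but 0x0000000080000000 in the 64-bit one.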


On a RISC load/store machine, you'd usually just use zero-extending (or sign-extending) byte/halfword loads. When you're done, you'd use a byte / halfword store to get the truncated result. (Which, for unsigned base 2 or 2's complement signed, is typically what you want.) This is how a compiler (or human) would implement C source that used short or uint8_t.

Semi-related: C's integer promotion rules automatically promote everything narrower than int up to int when used as an operand to a binary operator like +, so it mostly maps nicely to this way of computing. (i.e. unsigned result = (a+b) * c in C doesn't have to truncate the a+b result back to uint8_t before the multiply, if a, b, and c are all uint8_t. But it's pretty bad that uint16_t promotes to signed int, so uint16_t a,b; unsigned c = a * b risks signed-overflow UB from promoting to signed int for the multiply.) Anyway, C's promotion rules sort of look like they're designed for machines without full support for narrow operand sizes, because that's common for a lot of hardware.
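A minimal sketch of the usual workaround for the uint16_t multiply problem, assuming a platform where int is 32 bits (the helper name `mul_u16` is invented here): convert one operand to an unsigned 32-bit type first, so the multiply is done in unsigned arithmetic and wraps instead of overflowing a signed int.

```c
#include <stdint.h>

/* Multiply two uint16_t values without signed-overflow UB.
   Without the cast, both operands would promote to (signed) int. */
uint32_t mul_u16(uint16_t a, uint16_t b)
{
    return (uint32_t)a * b;   /* b is converted to unsigned as well */
}
```

With a = b = 65535 the mathematical product is 4294836225, which does not fit in a 32-bit signed int but is fine as a uint32_t.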


But you're asking about overflow checking / flag-setting from narrow ops.

Not all RISC machines even have a FLAGS register. ARM does, but for example MIPS and Alpha don't. ARM doesn't set flags on every instruction: you have to explicitly use the flag-setting form of an instruction.

CPUs without FLAGS typically have some simple compare-and-branch instructions (often against zero, like MIPS bltz), and others that compare two inputs and write a 0 / 1 result to another integer register (e.g. MIPS SLTIU -- Set on less than immediate unsigned). You can use the Set instructions + a bne with zero to create more complex branch conditions.
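For example, on a flagless machine the carry out of an unsigned add is typically recovered with one extra set-on-less-than style compare. In C the same idiom looks like this (a sketch; `add_with_carry_out` is a made-up name):

```c
#include <stdint.h>

/* Carry out of a 32-bit unsigned add, in the style of "add, then
   set-on-less-than-unsigned": the sum wrapped iff it is below an input. */
uint32_t add_with_carry_out(uint32_t a, uint32_t b, uint32_t *carry)
{
    uint32_t sum = a + b;   /* wraps modulo 2^32 */
    *carry = sum < a;       /* 1 if the add wrapped, else 0 */
    return sum;
}
```

The carry ends up as a 0 / 1 value in an ordinary integer register, ready to be added into the next limb or branched on.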


Hardware and software support for efficient overflow-checking is a problem in general. Putting a jcc after every x86 instruction sucks quite a lot, too.

But partly because most languages don't make it easy to write code that needs overflow checking after every instruction, CPU architects don't provide it in hardware, especially not for narrow operand sizes.
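For reference, this is roughly what the software emulation the question wants to avoid looks like for an 8-bit add: do the add in a wider type, then derive carry and signed overflow from the extra bits. It is only a sketch; the struct and function names are invented.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint8_t result;   /* low 8 bits of the sum */
    bool carry;       /* unsigned wraparound (like x86 CF) */
    bool overflow;    /* signed wraparound (like x86 OF) */
} add8_flags;

add8_flags add8(uint8_t a, uint8_t b)
{
    unsigned wide = (unsigned)a + b;   /* 0..510, cannot wrap */
    add8_flags f;
    f.result = (uint8_t)wide;
    f.carry = wide > 0xFFu;
    /* Signed overflow: inputs have the same sign bit, result differs. */
    f.overflow = ((~(a ^ b) & (a ^ wide)) & 0x80u) != 0;
    return f;
}
```

That is a handful of extra ALU instructions per checked operation, which is the cost x86's narrow operand sizes avoid.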

MIPS is interesting with trapping add for signed overflow.

Ways to implement it efficiently might include having a "sticky" flag, the way FPU exception flags are sticky: the Invalid flag stays set after dividing by zero (and producing NaN); other FP instructions don't clear it. So you can check for exception flags at the end of a series of computations, or after a loop. This makes it cheap enough to actually use in practice, if there was a software framework for it.
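A software model of that sticky-flag idea might look like the following sketch. Nothing here corresponds to a real ISA; `sticky_flags` and `add_i32_sticky` are hypothetical names, and values are kept as 32-bit patterns the way a register would hold them.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical sticky status word: set on overflow, never cleared by adds. */
typedef struct { bool overflowed; } sticky_flags;

/* Wrapping add of two 32-bit patterns interpreted as 2's complement.
   Records signed overflow in the sticky flag and keeps going. */
uint32_t add_i32_sticky(sticky_flags *fl, uint32_t a, uint32_t b)
{
    uint32_t sum = a + b;
    /* Overflow iff inputs share a sign bit that the sum does not have. */
    if ((~(a ^ b) & (a ^ sum)) >> 31)
        fl->overflowed = true;
    return sum;
}
```

The caller clears the flag once, runs the whole computation, and tests `overflowed` a single time at the end instead of branching after every add.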

With FP code, usually you don't need to look at flags because NaN itself is "sticky" or "infectious". Most binary operators produce NaN if either input is NaN. But unsigned and 2's complement integer representations don't have any spare bit patterns: they all represent specific numbers. (1's complement has negative zero...)

For more about ISA design that would make overflow checking possible, have a look at discussion on Agner Fog's proposal for a new ISA that combines the best features of x86 (code density, lots of work per instruction) and RISC (easy to decode) for a high performance paper architecture. Some interesting SIMD ideas, including making future extensions to vector width transparent, so you don't have to recompile to run faster with wider vectors.

0
On

Are there any ...

Are you only asking about commercial CPUs on the market, or also about student projects at universities etc.?

I myself designed a RISC CPU for education purposes which can do 8-, 16- and 32-bit operations. So this shows that it is at least possible to do this.

64-bit embedded PowerPC architectures also have something similar: They can do 32-bit operations in the low 32 bits of the 64-bit registers.

That architecture does not have 8- and 16-bit operations. However, CISC CPUs do not support every width used by narrower machines either:

x86 neither supports 4-bit operations nor 12-bit operations although there are CPUs (Intel 4004 and DEC PDP-8) using these widths.

After the operation is performed, we can make an overflow check, and if an overflow has occurred during the previous operation, it can be easily handled.

The 64-bit SPARC architecture is interesting here:

To enable 32-bit software to be executed on 64-bit CPUs there are some special features.

One of them is that all flags (carry, zero, ...) are duplicated: one set for the low 32 bits and one set for the whole 64 bits.

So after doing an "ADD" operation (which can only be done 64-bit) you can either check for the 64-bit flags or the 32-bit flags.
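To make the duplicated flags concrete, here is a small C model of a single 64-bit add that produces two sets of condition codes. It only models carry and zero, the field names are made up, and it is a sketch of the idea rather than a description of the exact hardware.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical model: one flag set per width, both from the same add. */
typedef struct {
    bool c32, z32;   /* derived from bits 31..0 of the result */
    bool c64, z64;   /* derived from bits 63..0 of the result */
} cc_model;

uint64_t addcc_model(uint64_t a, uint64_t b, cc_model *cc)
{
    uint64_t r = a + b;                     /* the add itself is 64-bit */
    cc->c64 = r < a;                        /* carry out of bit 63 */
    cc->z64 = (r == 0);
    cc->c32 = (((a & 0xFFFFFFFFu) + (b & 0xFFFFFFFFu)) >> 32) != 0;  /* carry out of bit 31 */
    cc->z32 = ((uint32_t)r == 0);
    return r;
}
```

Adding 1 to 0xFFFFFFFF sets the 32-bit carry and zero flags while leaving the 64-bit ones clear, so 32-bit code can branch on its own view of the result.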

3
On

Most 64-bit RISC architectures also support a limited form of what you describe, by having instructions that operate on either 32-bit or 64-bit words. Many also support operations on bitfields, although I'm not sure whether any of them allow you to do arithmetic directly on bitfields.

But there is one such irregular RISC architecture named Blackfin, where the data registers can be accessed as a whole or used as multiple separate parts. From its documentation (formatted into bullets by me for ease of reading):

  • Accumulators: The set of 40-bit registers A1 and A0 that normally contain data that is being manipulated. Each Accumulator can be accessed in five ways:
    • as one 40-bit register
    • as one 32-bit register (designated as A1.W or A0.W)
    • as two 16-bit registers similar to Data Registers (designated as A1.H, A1.L, A0.H, or A0.L)
    • and as one 8-bit register (designated A1.X or A0.X) for the bits that extend beyond bit 31.
  • Data Registers: The set of 32-bit registers (R0, R1, R2, R3, R4, R5, R6, and R7) that normally contain data for manipulation. Abbreviated D-register or Dreg.
    • Data Registers can be accessed as
      • 32-bit registers
      • or optionally as two independent 16-bit registers.
    • The least significant 16 bits of each register is called the “low” half and is designated with “.L” following the register name. The most significant 16 bit is called the “high” half and is designated with “.H” following the name. Example: R7.L, r2.h, r4.L, R0.h.

Blackfin registers

It even has multiple independent carry and overflow flags in the Arithmetic Status (ASTAT) register, so it's easier to mix arithmetic operations.


Another interesting case is SuperH SH-5, which does SIMD operations inside the general purpose registers even though it has a separate set of 64 floating-point registers. So you can do arithmetic on real bytes/words/double words. In other words, it's doing the SWAR technique in hardware.
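For comparison, this is what the software version of SWAR looks like: a well-known trick for adding four packed bytes inside one 32-bit register without letting carries cross byte boundaries. The function name is invented for this sketch.

```c
#include <stdint.h>

/* Add four packed 8-bit lanes; each lane wraps modulo 256 independently. */
uint32_t swar_add_u8x4(uint32_t a, uint32_t b)
{
    /* Add the low 7 bits of every lane; the cleared top bits absorb
       the per-lane carry so it cannot reach the next lane. */
    uint32_t low7 = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    /* Fold the top bit of each lane back in with XOR (add without carry-out). */
    return low7 ^ ((a ^ b) & 0x80808080u);
}
```

For instance, `swar_add_u8x4(0x01FF7F80, 0x01010180)` gives `0x02008000`: the lanes 0xFF+0x01 and 0x80+0x80 wrap to 0x00 without disturbing their neighbours. Hardware with packed-add instructions does this in one operation.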

Multimedia data in SH-5 general purpose registers


OpenRISC also does SIMD (and even floating-point) operations in the GPRs:

4.4 General-Purpose Registers (GPRs)

The thirty-two general-purpose registers are labeled R0-R31 and are 32 bits wide in 32-bit implementations and 64 bits wide in 64-bit implementations. They hold scalar integer data, floating-point data, vectors or memory pointers. Table 4-3 contains a list of general-purpose registers. The GPRs may be accessed as both source and destination registers by ORBIS, ORVDX and ORFPX instructions.

For example

lv.add.h: Vector Half-Word Elements Add Signed

The half-word elements of general-purpose register rA are added to the half-word elements of general-purpose register rB to form the result elements. The result elements are placed into general-purpose register rD


The Intel i960 is also peculiar in its own way. It's an odd RISC architecture with 32 registers but without a zero register, and it has instructions to compare bytes and shorts even though it still can't do other arithmetic operations on bytes:

cmpi    Compare Integer
cmpib   Compare Integer Byte
cmpis   Compare Integer Short
cmpo    Compare Ordinal
cmpob   Compare Ordinal Byte
cmpos   Compare Ordinal Short
concmpi Conditional Compare Integer
concmpo Conditional Compare Ordinal