The MRISC32: a vector-first CPU design

Thom Holwerda 2018-08-24 Hardware 17 Comments

In essence, it’s a 32-bit RISC ISA designed from a holistic view on integer, floating point, scalar and vector operations. In addition there is a hardware implementation of a single issue, in order, pipelined CPU. The hardware implementation mostly serves as an aid in the design of the ISA (at the time of writing the CPU is still incomplete).

As happens with some articles I post here, this one’s definitely a bit over my head.

17 Comments

2018-08-25 7:13 am

Carewolf
I don’t know why he decided on mixing floating point and integer registers. Not mixing them is an easy way to double the amount of available registers without using another bit in the instruction format.

And using a separate vector length register means you can’t have different lengths for each register. And the long lengths would be a problem for power saving, so it is important that the CPU can put most of the registers’ bit width to sleep when not needed.

Finally, it is an old school (ca 1998) pipeline processor straight out of Patterson & Hennessy. That’s very cute.

Edited 2018-08-25 07:19 UTC

2018-08-25 1:19 pm

cb88
Apparently doing integer operations on floats…

Having a special overlapping register might work if you really wanted to do that though without having to copy between register files.

He seems to really like the Cray-1 architecture but doesn’t seem to realize the drawbacks that engineers have discovered in the years since. Float and integer operations are separated because their interoperation is very infrequent, so you can run them essentially in parallel. Like large register files are slow… which is why most architectures have much smaller / faster ones. The notable exception is DSPs… and his processor does seem to lean more toward being a DSP than a generic CPU.

By bunching integer and float operations together he’s actually slowing down his CPU.

Edited 2018-08-25 13:25 UTC

2018-08-25 2:43 pm

Iapx432
By bunching integer and float operations together he’s actually slowing down his CPU.

They are bunched in the ISA, which may simplify life for the coder, but does the hardware implementation have to do that slavishly? Could it use dedicated H/W for each type and do a hidden variable transfer depending on the operand?

I guess I am not sure of how closely a hardware implementation has to follow the ISA register map. Maybe it can get creative depending on the performance goals.

An inevitable feature of modern processor design seems to be that the serial minded coder is insulated from the much more parallel capable circuitry.

2018-08-25 11:47 pm

TheNorseWind

An inevitable feature of modern processor design seems to be that the serial minded coder is insulated from the much more parallel capable circuitry.

That’s a feature of STEM education in general. See the Peter Principle.

We spend 14 years grading students on their serial performance, and then in their sophomore or junior year we give them a few data-bound problems as an afterthought. The only thing that saves the whole class from getting flunked is that they’re graded on a curve and they’re all equally unprepared.

2018-08-26 1:37 pm

viton
Like large register files are slow… which is why most architectures have much smaller / faster ones.
High performance microarchitectures have large register files.

For example Ryzen has 168 integer registers and 160 FP registers.

By bunching integer and float operations together he’s actually slowing down his CPU.

Why?

2018-08-27 10:33 am

Carewolf
High performance microarchitectures have large register files.

For example Ryzen has 168 integer registers and 160 FP registers.

Yeah, but not an indexable one of that width. The problem is specifying all those registers in a compact 16- or 32-bit instruction. By separating FP and integer registers, you can double the number of registers you can index, and all that it costs you is two rarely used move instructions, FP->Int and Int->FP, to enable the funky integer-on-float stuff. Or you can do like Intel and have bitfield operations straight on float registers, which covers the integer operations you typically want to use on floats (for fast fabs and copysign).
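For readers who have not seen the "integer on float" trick mentioned above, here is a minimal sketch of fabs and copysign done purely with bit masks. It assumes IEEE 754 single precision, and the helper names (`float_to_bits`, `fast_fabs`, and so on) are made up for illustration. In Python the round trip through `struct` stands in for the register move or the direct bitwise instruction a real CPU would use.

```python
import struct

SIGN_MASK = 0x80000000  # sign bit of an IEEE 754 binary32 value
MAG_MASK = 0x7FFFFFFF   # exponent and mantissa bits

def float_to_bits(x: float) -> int:
    """Reinterpret a float as its 32-bit pattern (the FP->Int move)."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(bits: int) -> float:
    """Reinterpret a 32-bit pattern as a float (the Int->FP move)."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def fast_fabs(x: float) -> float:
    """Absolute value: clear the sign bit with an integer AND."""
    return bits_to_float(float_to_bits(x) & MAG_MASK)

def fast_copysign(magnitude: float, sign: float) -> float:
    """Combine the magnitude bits of one value with the sign bit of another."""
    mag_bits = float_to_bits(magnitude) & MAG_MASK
    sign_bit = float_to_bits(sign) & SIGN_MASK
    return bits_to_float(mag_bits | sign_bit)
```

With separate register files, each of these costs the two moves plus the integer operation. With a unified file, or with bitwise instructions that operate on FP registers, only the AND/OR remains.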

2018-08-27 7:45 pm

viton
The problem is specifying all those registers in a compact 16- or 32-bit instruction.

There are no problems with a 5-bit register number. It is commonly used.

By separating FP and integer registers, you can double the number of registers you can index

So why do you need 64 registers?

The code will usually be either int or float dominated, so you have more architectural registers at hand than the highest-performing x86 CPUs.

A unified register file imposes limits on pipeline design, i.e. dependency tracking, and the FP pipeline needs to be tightly integrated.

2018-08-25 10:38 pm

peskanov
Unified register banks are the common design chosen these days. I found them in the IBM Cell, in DSPs and GPUs.

It makes a lot of sense; in earlier CPUs, the FPU was an optional chip, and very often an afterthought in the design (ARM comes to mind), so a separated register bank was mandatory.

But many popular algorithms like video/audio codecs convert data between float and integer intensively, and architectures where the registers are separated (like PowerPC or x86) usually show strong penalties there.

In PowerPC, you were forced to pass the data through memory when going from int to float! The register banks were really isolated from each other.

I have not found this kind of extra cost in unified bank systems.

BTW, in my experience the size of the opcodes lacks any importance except in embedded programming, for very small flash memories (8KB-32KB).

2018-08-26 5:18 am

Kochise
But embedded MCUs are not powerhouses either, so the point is moot. For these, you can do fixed point just fine.

2018-08-26 11:48 am

Anonymous
My main motivation for a unified register file was that it makes life easier for the programmer. If it’s easier for the programmer, it should be easier for the compiler too, hence unlocking otherwise unused optimizations.

There are many tasks that require mixing and conversions between int and float that are very impractical and slow on x86, ARM etc. Bilinear sampling comes to mind.
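To make the bilinear sampling case concrete, here is a rough sketch (plain Python, not MRISC32 code; the function name and the list-of-rows image layout are assumptions for illustration). Note how each sample converts float coordinates to integer indices and then feeds those integers back into float arithmetic for the blend weights.

```python
import math

def bilinear_sample(img, u, v):
    """Sample img (a list of rows of floats) at pixel coordinates (u, v)."""
    height, width = len(img), len(img[0])

    # Clamp the float coordinates to the valid pixel range.
    u = min(max(u, 0.0), width - 1.0)
    v = min(max(v, 0.0), height - 1.0)

    # Float -> int: integer texel indices used for addressing.
    x0, y0 = int(math.floor(u)), int(math.floor(v))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)

    # Int -> float: the indices go straight back into FP math as weights.
    fx, fy = u - x0, v - y0

    top = img[y0][x0] * (1.0 - fx) + img[y0][x1] * fx
    bottom = img[y1][x0] * (1.0 - fx) + img[y1][x1] * fx
    return top * (1.0 - fy) + bottom * fy
```

On a split register file, `x0`, `y0`, `fx` and `fy` each imply a crossing between the integer and FP banks per sample; with a unified file they are ordinary register-to-register operations.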

I also have a very naive thought: if Intel/AMD can work around all the deficiencies of the x86 ISA, it should be possible to make fast HW for MRISC too 😉

2018-08-26 1:28 pm

viton
I don’t know why he decided on mixing floating point and integer registers. Not mixing them is an easy way to double the amount of available registers without using another bit in the instruction format.

All successful SIMD/vector instruction sets mix float and integer.

32 registers is enough. x86-64 is limited to 16 architectural registers of each type and can still achieve the highest performance.

And using a separate vector length register means you can’t have different lengths for each register.

For what? You only need one length for the current operation.

It is very likely that the register is not long enough to hold the entire data array.

Vector registers >512 bits are impractical.
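The "one length for the current operation" point is essentially strip-mining: a long array is processed in chunks, and a single vector length register is set once per chunk. A minimal simulation follows, where `MAX_VL` is a made-up hardware limit and not an MRISC32 figure, and each list slice stands in for one vector instruction.

```python
MAX_VL = 16  # hypothetical maximum number of elements per vector register

def strip_lengths(n):
    """Return the sequence of values the vector length register would take."""
    lengths = []
    remaining = n
    while remaining > 0:
        vl = min(MAX_VL, remaining)  # "set VL" for this chunk
        lengths.append(vl)
        remaining -= vl
    return lengths

def vector_add(a, b):
    """Element-wise add of two equal-length lists, one chunk of VL elements at a time."""
    out = []
    i = 0
    for vl in strip_lengths(len(a)):
        # One simulated vector instruction operating on vl elements.
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        i += vl
    return out
```

For a 40-element array this takes three iterations with lengths 16, 16 and 8; only the last chunk needs a shorter length, and no per-register length is involved.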

Finally, it is an old school (ca 1998) pipeline processor

Finally? RISC-V vector extension and ARM SVE are already here.

2018-08-26 2:12 pm

Anonymous
Yes, it’s an 80’s/90’s pipeline. I use it mainly as a reference implementation to verify my ISA ideas, and to learn how to make a CPU. Some day I hope to be able to run it on an FPGA.

If I ever get the time there’s a kind of plan: A1 = single issue, single element/cycle, A2 = single issue, multiple elements / cycle, A3 = in order, multiple issue, multiple elements / cycle. B* = out of order, multiple issue.

2018-08-27 2:14 am

viton
If you’re not following Colin Riley, you should

https://twitter.com/domipheus

2018-08-27 8:55 am

Anonymous
Thanks! Following now. Neat stuff. Colin does things that are very closely related to what I want to do.

2018-08-27 12:24 pm

Kochise
What about the Mill architecture?

https://en.wikipedia.org/wiki/Mill_architecture

https://www.youtube.com/playlist?list=PLFls3Q5bBInj_FfNLrV7gGdVtikeG…

Edited 2018-08-27 12:24 UTC

2018-08-27 12:47 pm

Anonymous
It’s patented, so not very interesting to work with. It will be interesting to see what comes out of it though.

2018-08-27 10:29 am

Carewolf

And using a separate vector length register means you can’t have different lengths for each register.

For what? You only need one length for the current operation.

Sub-function calls without having to dump all registers to memory?

I code a lot of SIMD, and variable length is pretty common; often you need a few temporary registers that hold a lot more values than the input and output registers.

Edited 2018-08-27 10:29 UTC