“Arm plans to add multithreading capabilities to future architectures as it tries to boost the performance of its processors, a company representative said on Tuesday.
The company is looking to include multithreading capabilities depending on application requirements in different segments, said Kumaran Siva, segment marketing manager at Arm, at the Linley Tech Processor conference in San Jose, California.”
Ok, how about 64-bit? Seriously, these ARM chips are getting as complex and powerful as desktop processors.
And that has me drooling.
These are only plans – whether actual silicon with these features ever comes out of a fabrication plant is a story still to be written.
64-bit-wide registers would ease the implementation of certain algorithms. A first implementation would not necessarily require 64-bit addressing. However, every processor with an addressing width smaller than the width of its general-purpose registers has either failed in the marketplace or introduced awkward addressing features that complicated coding later on.
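To make the register-width point concrete, here is a minimal C sketch (illustrative only, not from the original post): the same 64-bit addition that a 64-bit core handles in one instruction must be split across register pairs on today's 32-bit ARM cores.

    #include <stdint.h>

    /* Illustrative: a single 64-bit addition. On a 32-bit ARM core the
     * compiler splits this across two register pairs (roughly ADDS for
     * the low words, then ADC for the high words to propagate the
     * carry); with 64-bit general-purpose registers it is one add. */
    uint64_t add64(uint64_t a, uint64_t b)
    {
        return a + b;
    }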
Multithreading on a processor with multiple register sets – like the ARM – will likely look quite different to a programmer than on a single-register-set architecture like the x86.
The SuperH series from Hitachi/Renesas did so quite efficiently.
x86-64 works with addressing smaller than the registers’ width (it can be extended to use the full 64 bits, but right now only 48 bits are used, IIRC).
Nope. For some reason, the guys at AMD thought it would be more clever to limit themselves to 52-bit. It’s a definitive architectural limitation that cannot be extended: going to full 64-bit addressing would break compatibility with the AMD64 specification.
But well, it’ll take some time before RAM reaches the exabyte range anyway…
(Source: AMD Manual Vol. 2, rev. 3.15, p. 129)
Thanks for the correction, I did remember that they were not using the full 64 bits for addressing, but I had the wrong figure in mind.
It is still a valid example of an architecture with registers being wider than the addressing range.
If you do not need 64-bit addressing, you are wasting a lot of cache and bandwidth on addresses. Maybe not something Intel or AMD would push, but considering how varied ARM implementations are, it would not be bad for them to also offer a 64-bit core optimized for 32-bit addresses, for realms like the gaming space that do not need 64-bit addresses at all.
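As a toy illustration of that cache/bandwidth cost (the node structure below is hypothetical): under a 64-bit addressing model every pointer doubles in size, so pointer-heavy data bloats even when the extra address bits go unused.

    #include <stdio.h>

    /* Hypothetical linked-list node: two pointers plus a small payload.
     * With 32-bit pointers it takes 12 bytes; with 64-bit pointers,
     * 24 bytes after padding -- double the cache footprint and memory
     * traffic for the same information. */
    struct node {
        struct node *next;
        struct node *prev;
        int value;
    };

    int main(void)
    {
        printf("sizeof(struct node) = %zu bytes\n", sizeof(struct node));
        return 0;
    }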
Still, “it can be extended” would mean a change in the ISA (an x86-64 v2), of course.
Yup, but they would have to re-do paging in a fashion that’s incompatible with the current paging structures. And OSes would have to adapt by rewriting their whole paging code, a migration probably about as slow as their current move to x86_64. All that for a few extra bits. At that rate, we might just as well move to x86_128 right away, I think ^^
(Current estimates are that we won’t ever get more than 2^128 (~10^38) bytes of memory in computers, anyway…)
The reason why it is so is that the 52-bit limitation is present at the lowest level of the paging hierarchy… Here’s one of the possible lowest-level page tables (zomg, it’s pirated, the RIAA and MPAA are going to come and get me \o/) in the AMD spec:
http://yfrog.com/ndlowestlevelp
As you can see, bits 52–62 are declared available for the OS to store whatever it wants there, so they’re not available for addressing. And bit 63 is used for NX/DEP.
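For the curious, here is a rough C sketch of that entry layout (the masks are written from the field positions described above; treat it as a reading aid, not a reference implementation):

    #include <stdint.h>
    #include <stdbool.h>

    /* Sketch of the lowest-level (4 KiB) page-table entry discussed
     * above. Field positions follow the AMD64 layout: bits 12-51 hold
     * the physical page base (hence the 52-bit physical limit),
     * bits 52-62 are available to the OS, and bit 63 is NX. */
    #define PTE_PRESENT      (1ULL << 0)
    #define PTE_NX           (1ULL << 63)
    #define PTE_ADDR_MASK    0x000FFFFFFFFFF000ULL  /* bits 12..51 */
    #define PTE_OS_AVL_MASK  0x7FF0000000000000ULL  /* bits 52..62 */

    static inline uint64_t pte_phys_addr(uint64_t pte)
    {
        return pte & PTE_ADDR_MASK;  /* 52-bit physical address, 4K aligned */
    }

    static inline bool pte_no_execute(uint64_t pte)
    {
        return (pte & PTE_NX) != 0;
    }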
Oh, and by the way, your 48-bit number was not wrong, as you seem to think in your first post. Although x86_64 can only go as far as 52-bit, most current implementations are indeed limited to 48-bit, if I remember correctly…
Ok, how about 64-bit? Seriously, these ARM chips are getting as complex and powerful as desktop processors.
And how is that a negative thing, then? ARM was designed with low power consumption as the main goal and good performance as an afterthought, whereas x86 was designed for good performance with low power consumption as an afterthought. Low power consumption is beginning to matter as we start running low on energy sources like oil, and as such ARM has a good strategic advantage over x86.
Performance-wise… well, as stated several times before, a common home user doesn’t need all the power they have even now; common tasks like surfing the web, watching porn, office work etc. don’t require all that much CPU. Even gaming usually strains the graphics hardware and storage media more than the CPU. So a slight CPU-side performance drop in exchange for less power consumed is actually quite ideal in most cases.
As it goes, the original ARM chip to make it into a commercial product, the ARM2 (running at 8 MHz), was designed with performance as the main consideration (especially low I/O latency and fast interrupt handling), not power consumption. The competition at the time was the 80386 running at 16 MHz. Technically the 386 scored a slightly higher MIPS count, but since chips need to talk to memory and devices to do real work, ARM2 machines were quicker at most tasks.
It also helped that the team at Acorn designing the chip also had a hand in designing the OS. That pretty much sealed the deal.
The ARM was conceived as an antithesis to the most complex CISC of its time, the National Semiconductor Series 32000. The stated aim of the Series 32000 design was to put a VAX on a chip, and its designers did just that, complexity included.
Acorn had some experience with the hardware and software challenges of the NS32016, which they had used as a co-processor to a 6502-based workstation.
x86 is designed for backwards compatibility. No afterthoughts, or even thoughts. Just backwards compatibility with 30-year-old chips.
Indeed. And it was the only option available to Intel, too: when they tried to clean up the mess with IA-64, it was a major commercial failure.
This is an example of over-computer-sciencing it (cf. Linus’s Usenet conversation with Tanenbaum about microkernels). I have to admit that before Itanium came out, I was really excited about the design. Based on the literature, things like predication looked like really good ideas, even before Intel began working on Merced. The problem was that all of the evidence that it was a good idea came from rather idealized test cases. In reality, it’s incredibly hard to write a compiler that really takes good advantage of predication. Since the compilers didn’t meet the needs, Itanium performance was sub-par.

There were similar problems with packing the instruction slots to take advantage of the explicit parallelism (since Itanium couldn’t detect implicit instruction-level parallelism). There were also problems with power; the processor had to be underclocked (in terms of individual signal propagation meeting cycle time) to keep it from overheating.

One of the biggest problems with compiler-scheduled code is that the compiler can’t predict cache misses. An OOO design can absorb cache misses if it can find other work to do while waiting on the outstanding data, but an in-order design like Itanium will just stall. It’s just too hard to predict all the places you have to put in prefetches so that data will arrive when you need it.
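For readers unfamiliar with predication, here is a hedged C sketch of the idea (modern compilers may well lower both forms the same way, so this is purely illustrative): the branchless form is what a predicating compiler emits, trading a possible branch misprediction for always executing both sides of the choice.

    /* Illustrative: a branchy select vs. a branchless one. Predication
     * (cmov on x86, predicate-qualified instructions on Itanium)
     * removes the branch, but the work for both outcomes is always
     * issued -- profitable only when the compiler judges the trade-off
     * correctly, which proved hard in practice. */
    int max_branchy(int a, int b)
    {
        if (a > b)
            return a;
        return b;
    }

    int max_predicated(int a, int b)
    {
        return (a > b) ? a : b;  /* typically lowered to a conditional move */
    }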
I saw a talk not too long ago by a Sun engineer (back when they were still Sun), talking about Rock. He described a modern OOO superscalar design as having a bursty nature, where execution is a race between last-level cache misses. The instruction window can absorb most cache misses, continuing to fetch and execute unrelated work during the 40 or so cycles required to hit in the L2 or L3. But when you have an LLC miss, you have to wait hundreds of cycles, which quickly starves the processor of work, because the stall is much longer than the instruction window.

This is why FSB speed and on-die memory controllers contribute so much to performance; they and growing on-chip caches have been the biggest sources of recent performance gains. So, say you have a 500-cycle LLC stall time: on x86 your stall will be something on the order of 400 cycles (to over-simplify a bit), while on Itanium it’s the full 500 cycles. That’s a substantial difference.
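Putting rough numbers on that difference (a toy model with invented figures, using the 400-vs-500-cycle stalls from the post above and assuming one LLC miss per 1000 instructions at a 1-instruction-per-cycle baseline):

    #include <stdio.h>

    int main(void)
    {
        /* Toy model, illustrative numbers only: one LLC miss every
         * 1000 instructions, a 500-cycle memory latency, and an OOO
         * window that hides ~100 of those cycles (the ~400 vs. 500
         * figure from the post above). */
        double insns_per_miss = 1000.0;
        double miss_latency   = 500.0;
        double ooo_hidden     = 100.0;

        double cycles_inorder = insns_per_miss + miss_latency;
        double cycles_ooo     = insns_per_miss + (miss_latency - ooo_hidden);

        printf("in-order:     %.3f cycles/insn\n", cycles_inorder / insns_per_miss);
        printf("out-of-order: %.3f cycles/insn\n", cycles_ooo / insns_per_miss);
        return 0;
    }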
The other side of this story, though, is that AMD really threw a monkey wrench into those plans with AMD64. The main x86 limitation was memory, not registers or backwards-compatibility overhead. With AMD64 the memory limitation was gone, and you kept the major benefit of compatibility.
It’s probably for the best, since Intel wanted to get out of competition with AMD through IA-64.
That depends on the point of view you consider. x86_64 is really a pain for OS developers. Though more painful in the short term, a non-compatible and hence simple rewrite like IA-64 (along with a major cleanup of the rest of the standard PC architecture) would have been better for developers in the long run, believe me…
Unfortunately, Itanic is an even bigger mess for OS devs, and the worst nightmare for compiler devs. It turned out that Explicitly Parallel Instruction Computing was an EPIC fail, at least in Intel’s case.
Given that the biggest RAM in most ARM implementations is a whole whopping 512 megs, what exactly is 64-bit supposed to deliver, apart from making the code BIGGER by rounding up data sizes?
Well, if ARM targets the server market, it must address bigger RAM quantities. The latest ARM processors can access 1 TB (2^40 bytes) of memory using “LPAE”, the Large Physical Address Extension.
1.) ARM wants a piece of the “physicalization” pie: they want a share of the low-power server market, and for that they need a bigger address range.
2.) 64-bit does not necessarily mean 64-bit addresses. They could provide single-cycle 64-bit integer processing for next-next-generation smartphones and tablets, while also offering server-optimized variants with 64-bit addressing. The execution back-end would be shared by both implementations.
What is the purpose of 64-bit math?
I don’t see much use for it except some bignum/security libs.
Well, I don’t know about 64-bit integers, but 64-bit floating-point numbers are very common in scientific computation, so manipulating them at native speed can be nice…
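A small, self-contained illustration of why native double precision matters (the numbers here are just a demonstration): summing 0.1 ten million times drifts badly in single precision, while the double stays near the exact 1,000,000.

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative: accumulating 0.1 ten million times. The 24-bit
         * float mantissa drifts visibly; the 53-bit double stays close
         * to the exact 1,000,000 -- the kind of gap that makes native
         * 64-bit floating point matter in scientific code. */
        float  fsum = 0.0f;
        double dsum = 0.0;
        for (int i = 0; i < 10000000; i++) {
            fsum += 0.1f;
            dsum += 0.1;
        }
        printf("float:  %f\n", fsum);
        printf("double: %f\n", dsum);
        return 0;
    }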
Math? Ok, what do you call the 8087 then? 80-bit?
The 486DX is considered a 32-bit processor, but it can handle 80-bit floating point on-die.
x87 math types –
Single – 32 bit
Double – 64 bit
Extended – 80 bit.
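On a typical x86 C compiler these formats map onto float, double and long double; here is a quick check (sizes are ABI-dependent, so treat this as a sketch):

    #include <stdio.h>

    int main(void)
    {
        /* On a typical x86 compiler, float maps to the 32-bit single,
         * double to the 64-bit double, and long double to the x87
         * 80-bit extended format (stored padded to 12 or 16 bytes,
         * depending on the ABI). */
        printf("float:       %zu bytes\n", sizeof(float));
        printf("double:      %zu bytes\n", sizeof(double));
        printf("long double: %zu bytes\n", sizeof(long double));
        return 0;
    }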
That isn’t a common configuration for RISC CPUs… it is for CISC, though. RISC usually keeps all instructions and data uniform in size so that the design can be simpler and therefore clock higher.
These additions are very important for servers.
I’d love to see cheap Dell servers using ARM + Linux. Perfect match for Java and LAMP servers.
Multithreading on a single ARM core would be more effective than the same thing on an Intel CPU.
ARM doesn’t do as much optimization and as many heuristics as Intel products do. This is bad for raw single-core performance but very good for maximizing the processing/power ratio. The side effect is that an ARM CPU stalls its pipeline much more often than “desktop CPUs” do, and those stall cycles can be used to run one or more extra threads on the side.
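As a closing toy model (all numbers invented for illustration): if a thread stalls the pipeline some fraction p of the time and stalls across threads are independent, n hardware threads leave the pipeline idle only about p^n of the time, which is the intuition behind bolting multithreading onto a stall-prone core.

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        /* Toy model, invented numbers: a thread that stalls 40% of the
         * time idles the pipeline 40% of the time alone, but two such
         * threads (assuming independent stalls) idle it only
         * 0.4^2 = 16% of the time. Build with -lm. */
        double p = 0.4;
        for (int n = 1; n <= 4; n++)
            printf("%d thread(s): pipeline idle %.1f%% of the time\n",
                   n, pow(p, n) * 100.0);
        return 0;
    }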