Jon “Hannibal” Stokes is co-founder and Senior CPU Editor of Ars Technica. He has written for a variety of publications on microprocessor architecture and the technical aspects of personal computing. He recently published his first book, Inside the Machine – An Illustrated Introduction to Microprocessors and Computer Architecture. We interviewed him to discuss how hardware bugs are dealt with, the use of reserved bits, the performance and efficiency of console CPUs and GPUs, the possibility of building a PlayStation 3 cluster, and what future connection he sees between CPUs and GPUs.
Could you introduce yourself?
Jon Stokes: I’m a co-founder of Ars Technica, and a Senior Editor at the site. I typically cover microprocessors and graphics hardware, but over the years I’ve covered a pretty broad array of additional topics, including intellectual property, national security, privacy and civil liberties, outsourcing and the H1B visa program, and electronic voting. I have an undergraduate degree in computer engineering from LSU, two graduate degrees in the humanities from Harvard Divinity School, and I’m currently working on a Ph.D. at the University of Chicago. I am the author of Inside the Machine.
I found a list of bugs in the Intel Core Duo/Solo that was released just 20 days after the official release of these CPUs. It’s pretty shocking. Is this common? How do CPU manufacturers react? With new revisions? By letting the software guys find a workaround?
Jon Stokes: Generally speaking, many of the major errata are fixed with new steppings of the processor. Other errata can be worked around using BIOS and OS tweaks.
There were quite a few errata with the Core Duo/Solo line, and maybe even a higher number than usual for which no fix is planned. However, I think Intel’s reasoning behind not planning fixes for these is pretty clear in this particular case.
Specifically, Core Duo/Solo (aka “Yonah”) is a transitional design between the Pentium M and the more advanced Core 2 Duo/Solo (aka “Conroe”). There really wasn’t much of a time gap between when Yonah was released in January of 2006 and when Conroe was released in July of that same year. So Yonah was obsolete pretty quickly. This being the case, there was no reason for Intel to put a lot of effort into updating this transitional microarchitecture.
Overall, I think too much is typically made of these errata in some forums. As I said above, the important bugs are fixable with new steppings, BIOS tweaks, and OS updates.
Browsing CPU specs and datasheets I see a lot of reserved stuff. They are probably used for debugging the hardware during the development cycle, but I’m wondering what they could be used for once these CPUs hit the market. The paranoid android inside me could argue that maybe it’s just security through obscurity, and that some combinations of these reserved bits could be used to 0wn the system. Or maybe it’s the always-cool NSA conspiracy story, and there’s a backdoor somewhere… What can you say about the hardware development process and the use of reserved bits?
Jon Stokes: My understanding of reserved bits is that they’re often intended not for secret features, but so that the architects can add new features to the ISA at some future point. In other words, a reserved bit gives you a “place” to insert a new option or capability, without breaking legacy software.
If I recall correctly, AMD made use of at least one reserved bit in the x86 instruction format when they created their 64-bit ISA, x86-64.
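To make the idea concrete, here is a minimal C sketch, using an entirely hypothetical 32-bit feature register, of how a bit that was “reserved, must be zero” in one revision can later become a real feature flag without breaking old software:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical feature-register layout, for illustration only.
 * Bits 0-2 were defined in the original design; bit 3 was reserved
 * ("must be zero"). A later revision can give bit 3 a meaning without
 * breaking software that simply masks off the bits it knows about. */
#define FEAT_FPU      (1u << 0)
#define FEAT_MMU      (1u << 1)
#define FEAT_SIMD     (1u << 2)
#define FEAT_NEWTHING (1u << 3)   /* formerly a reserved bit */

static void report_features(uint32_t reg)
{
    if (reg & FEAT_SIMD)
        puts("SIMD supported");
    /* Old software never tested this bit, so it keeps working; new
     * software can opt in to the capability the bit now advertises. */
    if (reg & FEAT_NEWTHING)
        puts("new capability supported (bit was reserved before)");
}

int main(void)
{
    report_features(FEAT_FPU | FEAT_SIMD | FEAT_NEWTHING);
    return 0;
}
```

Old code that masks off unknown bits simply ignores the new flag, which is exactly the backward-compatibility property reserved bits are meant to preserve.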
What can you tell us about the CPU inside Microsoft’s XBox 360?
Jon Stokes: The Xenon CPU is a three-core, multithreaded PowerPC processor designed by IBM. Each of the three cores is very similar to the general-purpose CPU in the Cell BE (the PS3’s processor), but with some additional vector processing resources.
Probably the most important thing to note about Xenon is that it handles caching in a very special way that makes it more effective for media-intensive workloads like video decoding and gaming.
Streaming media applications tend to “dirty” the cache, which means that instead of storing a single working set in the cache and using that data for a while, they’re constantly moving data /through/ the cache. This kind of behavior makes very poor use of the cache, and in fact streaming data from one thread can result in non-streaming data from another thread being booted out of the cache needlessly.
Xenon’s fix for this is to “wire down” certain sections of the cache and dedicate them to a single thread. That way, a thread that is only moving data through the cache, and not storing it, can just dirty a small, dedicated part of cache. In a way, this “cache locking” mechanism enables the Xenon’s cache to function a little bit like the Cell’s “local store” memory.
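The hardware details of Xenon’s cache locking live in the console’s SDK, but the effect is easy to picture with a small, purely illustrative C sketch (the 32 KB “dedicated” region is a made-up number): the streaming thread pushes all of its intermediate data through one small reused buffer, so that is the only footprint it keeps resident.

```c
#include <stddef.h>
#include <stdio.h>

/* Illustration only, not the Xbox 360 API: Xenon's cache locking reserves
 * cache lines for a streaming thread in hardware. The rough software
 * analogue shown here is to keep the stream's intermediate data in one
 * small, reused buffer so its resident footprint stays bounded instead of
 * sweeping a large range of addresses through the cache. */
#define STAGING_BYTES (32 * 1024)   /* pretend this is the dedicated region */

static void emit(const unsigned char *buf, size_t len)
{
    /* Stand-in for handing a decoded chunk off to the GPU, audio, etc. */
    (void)buf; (void)len;
}

static void decode_stream(size_t total_out)
{
    static unsigned char staging[STAGING_BYTES];

    for (size_t produced = 0; produced < total_out; ) {
        size_t len = total_out - produced;
        if (len > STAGING_BYTES)
            len = STAGING_BYTES;
        for (size_t i = 0; i < len; i++)      /* fake "decode" work */
            staging[i] = (unsigned char)(produced + i);
        emit(staging, len);                   /* reuse the same buffer */
        produced += len;
    }
}

int main(void)
{
    decode_stream(1024 * 1024);   /* 1 MB of output, 32 KB at a time */
    puts("done");
    return 0;
}
```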
About the PS3 and Xbox 360: from a pure computational point of view, which system could be considered more powerful?
Jon Stokes: I think the PS3 has more raw computational horsepower on paper, but in practice the two consoles will probably come out about even for most game developers. However, there are some problems in high-performance computing where the Cell Broadband Engine that powers the PS3 is much more powerful than anything else out there. The problem is that programmers have to design their code from the algorithm level on up to fit Cell exactly in order to see the benefits. Again, this doesn’t seem to apply to games, but if some developer figures out that it does, then eventually they could get more performance out of the PS3.
Do you think that the PS3 could be used to build cheaper computational clusters, as happened with the PS2? I was thinking of places like Google…
Jon Stokes: I think this is an interesting idea, but ultimately IBM’s Cell-based products will be a better fit for clusters than a PS3 console. The advantage of the PS3 console is, of course, that it’s cheap because Sony subsidizes it. So it’s entirely possible that someone would want to use it for a cluster. It does have gigabit Ethernet, so I guess it could work.
How does their performance-per-watt compare to that of modern power-savvy CPUs such as Intel’s Core 2 Duo?
Jon Stokes: Although I don’t have any real numbers to back it up, I’d say that Core 2 Duo almost certainly has them both beat in performance/watt for ordinary workloads. But again, if you’re solving one of these exotic HPC problems using Cell, and you have code that’s custom-fitted to give you an outrageous performance delta vs. a traditional architecture, then those performance/watt numbers would skew pretty drastically in Cell’s favor for those applications.
And what about the performance-per-watt of consumer GPUs compared to that of the GPUs included in the PS3 and XBox 360?
Jon Stokes: That’s hard to estimate, I think, because “consumer GPUs” is such a broad category. I’m sure that for high-end GPUs that are comparable in horsepower to those in the PS3 and XBox 360, the performance/watt numbers are also comparable.
Why do you think this new generation of consoles abandoned the x86 instruction set in favor of RISC CPUs?
Jon Stokes: This is a hard one to really pin down. Honestly, I think that IBM just did a great job pitching them on their chip design competency. I don’t think it had much to do with the ISA, and I also think that IBM made the sale on a case-by-case basis to each of the console makers, appealing to different aspects for Sony, MS, and Nintendo.
With Nintendo, it was about the fact that IBM had already proven they could deliver a console product with the GameCube. Nintendo was clearly pleased with the GC, and in fact Wii is basically just a GC with a higher clockspeed and a new controller.
With Sony, IBM was able to sell them on this exotic workstation chip. Sony likes to really push the envelope with their consoles, as evidenced by both the PS2 (really exotic and hard to program when it first came out) and the PS3. So IBM was able to appeal to their desire to have something radically different and potentially more powerful than everyone else.
As for MS, I have no idea how they pulled it off. I think that if the Xbox’s successor had used a dual-core Intel x86 chip or even an Opteron, everyone would’ve been better off. This is especially true if Intel could’ve found a way to get a Core 2 Duo, with its increased vector processing capabilities, out the door in time for the console launch. Of course, even Core 2 Duo can’t really stand up to Xenon’s VMX-128 units, especially given VMX’s superiority to the SSE family of vector instructions, so Xenon does have that edge.
But regardless of the SSE vs. VMX (or AltiVec) issue, I’m not convinced that letting IBM design a custom PPC part for Xbox 360 was the best move, because now MS has to support two ISAs in-house, and I don’t think it really buys them much extra horsepower. But I acknowledge that I may be entirely wrong on this, and in the end you’re better off asking a game developer who codes for both platforms which one he’d rather have.
It seems that AMD (+ATI) is working on merging the CPU and GPU. At the same time, some projects, such as BrookGPU, try to exploit GPU power to crunch numbers. What is your point of view on the evolution of CPUs and GPUs?
Jon Stokes: I don’t really have much of an idea where this is really headed right now. I don’t think anyone does. I mean, you could do a coarse-grained merging, like AMD says they want to do with a GPU core and a CPU core on one die, but I’m not convinced that this is really the best way to attack this problem. Ultimately, a “merged CPU/GPU” is probably going to be a NUMA, system on a chip (SoC), heterogeneous multicore design, much like Cell.
I also think it’s possible to overhype the idea of merging these two components. Regular old per-thread performance on serially dependent code with low levels of task and data parallelism will remain important for the vast majority of computing workloads from here on out, so a lot of this talk of high degrees of task-level parallelism (i.e. homogeneous multicore) and data-level parallelism (i.e. GPUs and heterogeneous multicore, like Cell) is really about the high-performance computing market, at least in the near-term.
At any rate, right now we’re all sort of in a “wait and see” mode with respect to a lot of this stuff, because CPU/GPU and some of the other ideas out there right now look a lot like solutions in search of a problem.
Good interview. I have been a member of Ars for a long time, and Ars reviews are always top notch.
-Nex6
I enjoyed that interview too. I’m the sort of geek who can sit and read about microprocessor design for hours, even when the information has no practical use. I’ve got a great insomnia book that compares the 6502, 68000 and Z80 processor architectures.
I am seeking medical help 🙂
It is interesting to see the demarcation between PCs and consoles right now – x86 for (most) PCs and PowerPC for the consoles. With the previous generation of consoles, we had x86 Xboxes and PowerPC Macs. With the current generation, there is no overlap (forgetting for now, of course, that Linux/BSD can run on PowerPCs).
Just think if the Mac had not abandoned the PowerPC chip. Since Linux and BSD run just fine on the PowerPC, that would leave poor little Windows stranded in the x86 world.
I guess two very big surprises – the Mac switching to x86 and the Xbox switching to PowerPC – have created this polarized PC vs. console world.
[EDIT: can’t spell]
True, but I think that Intel really came out with some good stuff with Core 2.
-Nex6
“But I think that Intel really came out with some good stuff with Core 2.”
I agree! It used to be that, for me, AMD was the obvious choice for price/performance. The Core 2 Duo really reversed that for me. It wins not just in performance, but in power usage. It seems to me AMD is only a choice for the home user at the lower end.
EDIT: Forgot to put on my flame-retardant suit. ;}
And for server users on the high end. The Opteron is still a good chip and does a fine job of scaling. Also, AMD has some advantages in its cHT design that allow co-processors to be linked efficiently to its chips. Thus, the Opteron becomes a great controlling processor for a bunch of Cells, or for the ClearSpeed chip, which is basically a chip with a grid of SPEs.
AMD has good chip designers. It looks to me like the main reason Intel can beat them right now is that Intel has more resources and expertise for better manufacturing processes. Intel simply has smaller/faster transistors. I think that if AMD could get their chips produced in Intel’s fabs, you’d have an extremely close race.
Yeah, AMD has some good designers too, and the competition of having both is making for better chips, faster.
The Core 2 is a nice chip, but it should be borne in mind that the K8 architecture is pretty long in the tooth. K8 was over three years old when Core 2 was introduced, and the underlying design dates to the K7 about eight years ago. Core 2 is a pretty fresh start for Intel. Not a complete redo like K7 was relative to K6, but a much bigger overhaul than K7 to K8.
Overall, K8 is probably one of my favorite designs ever. Core 2 is known for being a particularly aggressive hybrid of CISC and RISC, but K7/K8 was the one that started the whole “next-generation CISC” thing by not only accepting its x86 heritage, but actively embracing things like memory operands, variable-length instruction encodings, byte registers, etc., and using them to improve overall performance.
When it really comes down to it, an ISA is just an ISA. It’s like the APIs of an OS. Many operating systems are (more or less) POSIX-compliant, but they differ wildly under the covers. CPUs are the same way. Whatever ISA gets used is broken down into micro-ops by sophisticated decoders so that the execution pipeline can work in a more convenient language. Assembly is the language for compilers, not for CPUs.
x86 became successful because the first generations of CPUs that used this ISA were the world’s first general-purpose microprocessors at a mainstream price point. Over the years, micro-op architectures, decoder logic, micro-op reordering, and speculative execution have evolved to the point where today’s x86 processors have vastly more in common with today’s PPC processors than they do with x86 processors from just 3 years ago. The biggest difference between x86 and PPC is that software compiled for one won’t run on the other. The ISA is the language of the compiler.
There are other important differences, though. Most notably, x86 only has 8 general purpose registers (16 on 64-bit processors), while PPC has 32. x86 tends to involve more decode logic and more sophisticated compilers, but then again optimizing compiler technology for x86 has advanced beyond that of any other ISA. PPC has more humorous instruction names such as eieio.
More mass market software is compiled for x86 than for any other ISA. That’s the primary reason why x86 continues to be so dominant. When a console comes out, the vendor is generally responsible for ensuring software (game) availability/compatibility. Backwards compatibility with third-party software isn’t generally a top concern. If the more compelling processor speaks PPC, tell the game developers to make their games speak PPC. In the console market, this is perfectly acceptable, whereas this doesn’t fly in the PC market.
Good points, but one quibble:
“x86 tends to involve… more sophisticated compilers”
I think the opposite is probably true. The nature of the x86 ISA, and the sheer amount of binary-only code already existing for it, has forced x86 implementations to be particularly accommodating of simplistic code generation. You can get good performance out of a K8 or Core 2 without fancy scheduling, aggressive reordering, instruction bundling, predication, sophisticated control-flow transformations, etc.
You have to be careful about instruction selection and encoding (particularly on newer CPUs in 64-bit mode), and you probably want a decent register allocator, but at the end of the day I’d much rather have to deal with a code-generator for K8 than one for PPC 970!
I agree on some of your points, but this statement
“…today’s x86 processors have vastly more in common with today’s PPC processors than they do with x86 processors from just 3 years ago.”
is pulled completely out of thin air.
Three years ago we had basically three (very different) x86 implementations alive: the Athlon (64), the Pentium 4, and the Pentium M.
First, in order to compare these to “today’s” x86 processors, you have to specify which one you’re referring to. They are much too different from each other to be meaningfully grouped into a “generation”, or anything like that.
Second, “today’s” processor implementations are very much the same designs (minus the Pentium 4), except for enhancements in physical manufacturing and various small tweaks in the logic. And the newer Core/Core 2 is really not much more similar to PPC than any of these somewhat older x86 processors are.
In fact, Core/Core 2 has a lot in common with Pentium M, technically, while it shares very little with PPC (of course with the exception of the basic SIMD idea, the von Neumann principles, and many other fundamental principles and technologies).
At the university in the town where I live, one of the CS profs is already building a PS3 cluster. I wonder how many other projects like this are out there. Regardless… it will be interesting to see how the Cell BE works out.
After currency conversion, that book costs £16.17 in the US; it’s £32 to buy at Amazon.co.uk.
They just lost a sale!
Jon, you are one of my heroes.
for reals…
The only other comment is…
How could you let them use that scarily wasteful design for Ars?
It should be clear and concise, just like your words…
Prior to execution, both AMD’s K5/K6/K7/K8 and Intel’s Pentium Pro/PII/PIII/PIV/PM/Core/Core 2 convert variable-length (x86) instructions into fixed-length instructions.
A typical RISC processor executes fixed-length instructions.
It’s a lot more complicated than that. While uops in modern x86 processors are indeed fixed length, they are usually tracked in variable-sized groups. See for example the Pentium-M’s micro-op fusion, and the K8’s double-dispatch instructions.
RISC doesn’t just mean fixed-length instructions, though. x86 has conventional CISC features like fancy addressing modes, a hardware-managed stack, memory operands, string instructions, etc. Though there was originally a move toward getting rid of all those abstractions at the uop level in previous processors, modern x86s handle those things even within the “RISC” core. Core 2 and K8 track LOAD+OP instructions as a bundle throughout the pipeline, and offer full performance for all of x86’s addressing modes. Core 2 and K8L have dedicated hardware for managing the stack. Even on a recent x86 processor, the very CISC-y x86 string instructions are still often the fastest way to move memory around.
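For anyone curious what the string-instruction case looks like in practice, here is a small sketch using GCC/Clang inline assembly on x86/x86-64 (the function name and buffer sizes are just examples, and whether this actually beats a tuned memcpy depends entirely on the specific CPU):

```c
#include <stddef.h>
#include <stdio.h>

/* Copy n bytes with the classic CISC string instruction. REP MOVSB copies
 * from [RSI] to [RDI], RCX times (ESI/EDI/ECX in 32-bit mode), so we pass
 * the three values in those registers via the D/S/c constraints. */
static void copy_rep_movsb(void *dst, const void *src, size_t n)
{
    __asm__ volatile ("rep movsb"
                      : "+D"(dst), "+S"(src), "+c"(n)
                      : /* no other inputs */
                      : "memory");
}

int main(void)
{
    char src[64] = "string instructions are still around";
    char dst[64] = {0};

    copy_rep_movsb(dst, src, sizeof src);
    puts(dst);
    return 0;
}
```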
I agree (while encia’s point is also basically right).
Some thoughts:
(1) RISC designs (or, to use the more sensible name, load-store architectures) are not inherently better (or worse) than non-RISC designs (as your “rep movsb” example illustrates). However, one or the other may be significantly faster, depending on purely physical factors such as the speed ratios between main memory, cache memory (if any), microcode ROM, decode PLAs, and random logic, as well as currently available component densities (1945–2007) and other low-level factors.
Of course, this is nothing new, but it is nonetheless ignored in 99% of all the flammable “RISC/CISC” discussions I’ve seen on the Internet.
(2) What you are describing (LOAD+OP etc) reflects the fundamental principle that it is normally a bad idea to discard information, or context, that is already (freely) available.
(3) I find it a little amusing that older non-RISC designs (such as the 8085) are often retroactively labeled complex (“CISC”), regardless of whether they fit the description or not. They may well have both less complex and fewer instructions (and addressing modes) than modern RISC processors.
One of the real reasons for exotic architectures is the difficulty they cause independent open-source (“pirate”) emulator developers… this way, you can keep charging for your intellectual property, à la the Wii’s “Virtual Console” channel.
Choosing PPC as the Xbox CPU has nothing to do with the ISA.
It is about custom design and a cheap architecture license.