Wired has an interesting story on a PCI card from ClearSpeed Technologies which contains “a processor capable of performing 25 billion floating-point operations per second, or 25 gigaflops. An ordinary desktop PC outfitted with six PCI cards, each containing four of the chips, would perform at about 600 gigaflops (or more than half a teraflop).” Such a PC would qualify as one of the 500 most powerful supercomputers in the world, but cost only about $25,000.
But the problem, as I read elsewhere on the net, is bandwidth. They may be capable of doing that many calculations, but will the (for this purpose) slow PCI bus be able to keep up with the chip? We already have trouble feeding today’s massive CPUs from main memory, even over high-speed buses. I’m not sure the aging PCI bus would be able to keep up. I mean, isn’t that why AGP was invented for video cards?
It’s not exactly PCI bandwidth that’s the limiting factor. With PCI-X you get good enough bandwidth; memory bandwidth is the thing that matters. Will DDR be sufficient for that much data?
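Quick back-of-the-envelope check on that, as a throwaway C snippet. All of the bus and memory figures below are my own assumed round numbers for hardware of this era, not anything from the article; the point is just the order of magnitude.

```c
/* Rough sketch: how badly bus and memory bandwidth would lag if every operand
 * for one of these cards had to be streamed in from host memory. All figures
 * are assumptions, not ClearSpeed's. */
#include <stdio.h>

int main(void)
{
    double chip_gflops = 25e9;                 /* claimed per chip */
    double card_gflops = 4 * chip_gflops;      /* 4 chips per card */
    double pc_gflops   = 6 * card_gflops;      /* 6 cards -> the 600 GFLOPS in the story */

    double bytes_per_flop = 8.0;               /* assume one 64-bit operand fetched per flop */
    double pci_bw  = 133e6;                    /* 32-bit/33 MHz PCI, ~133 MB/s */
    double pcix_bw = 1.06e9;                   /* PCI-X 133, ~1 GB/s */
    double ddr_bw  = 3.2e9;                    /* single-channel DDR400, ~3.2 GB/s */

    double demand = card_gflops * bytes_per_flop;   /* per card, worst case */

    printf("whole PC: %.0f GFLOPS\n", pc_gflops / 1e9);
    printf("worst-case operand traffic per card: %.0f GB/s\n", demand / 1e9);
    printf("PCI is short by   %6.0fx\n", demand / pci_bw);
    printf("PCI-X is short by %6.0fx\n", demand / pcix_bw);
    printf("DDR is short by   %6.0fx\n", demand / ddr_bw);
    return 0;
}
```

In other words, the chip only makes sense if nearly every operand lives in on-chip or on-card memory; no bus of the day is going to feed it directly.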
Just guessing, of course, but most likely you send executable code to the card and it returns a result when it’s done, or queues up results and sends them back in batches, etc. It seems pretty self-contained, so there are probably a few ways to get around the bandwidth limitation. Of course, it would depend on what you’re doing and how. Surely, being in the business of making money, they’ve figured this big detail out.
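As a sketch of that model (the function below is purely illustrative and invented for this post; nothing here is ClearSpeed’s actual API): ship a whole batch of inputs to the card, let it grind out of its own memory, and only pull whole batches of results back across PCI.

```c
/* Illustrative batched-offload pattern. run_on_card() stands in for work that
 * would actually happen out of the card's own memory; the host only crosses
 * the PCI bus once per batch in each direction. Names are made up for this
 * sketch. */
#include <stdio.h>

#define BATCH_SIZE 4096

/* Pretend this executes on the card, entirely out of on-card RAM. */
static void run_on_card(const double *in, double *out, int n)
{
    for (int i = 0; i < n; i++) {
        double x = in[i];
        /* flop-heavy inner work, so compute dwarfs the two bus transfers */
        for (int k = 0; k < 1000; k++)
            x = x * 0.999 + 0.001;
        out[i] = x;
    }
}

int main(void)
{
    static double inputs[BATCH_SIZE], results[BATCH_SIZE];

    for (int i = 0; i < BATCH_SIZE; i++)
        inputs[i] = (double)i;

    /* host -> card: one batch over the bus */
    run_on_card(inputs, results, BATCH_SIZE);
    /* card -> host: one batch back over the bus */

    printf("results[0]=%f results[last]=%f\n",
           results[0], results[BATCH_SIZE - 1]);
    return 0;
}
```

The ratio of on-card flops to bytes moved per batch is what decides whether the PCI link matters at all.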
Somehow I doubt we’ll ever see this thing ship…
If a SETI@home client could be written for it, how much would it benefit from the hardware? What would be the average time to process a work unit?
“We already have trouble feeding today’s massive CPUs from main memory, even over high-speed buses. I’m not sure the aging PCI bus would be able to keep up. I mean, isn’t that why AGP was invented for video cards?”
AGP was invented to give video cards faster access to system RAM when the card needed more than was on board, and to isolate the video card from the rest of the PCI bus (in reality the video card probably hindered the other PCI cards just as much as it was hindered by them).
These CPUs have on-chip RAM (though not much), and you could most likely put more RAM on the PCI card if your specific application needed it. You also have to, at the very least, compile your application specifically for this processor (and more than likely you’d gain a lot by writing it specifically for the processor).
The idea is that anything you’d farm off to this chip would be computationally intensive, or would need a great deal of bandwidth between the chip and a small amount of RAM during computation (swapping a lot of information in and out of chip registers). Most of the processing time would be spent on the card, and then the result would be sent over the PCI bus to the main CPU, which would decide what to do with it based on the application.
I doubt this technology will be as useful as the article hypes it up to be. First off, the chip isn’t even in the prototype stage. Second, the chip performs well at floating-point operations, so its main purpose is crunching serious numbers; it won’t do much to help Mozilla load faster. Third, this obviously wouldn’t be an automatic speed increase from simply inserting a card: an OS would need drivers to interface with the cards. In fact, I bet synchronization between the cards and the OS would make it pretty difficult to use in real-time applications like games. Not that anybody would pay $25,000 for a game machine… or would they?
When it finds its niche though, this will be a really cool technology. It will reduce the need for massive computation farms and give more power to the small guys.
-W
You’re not going to see something like this used often in the desktop space… at least not for a long time. There are companies that make entire computer systems you can use to upgrade your box, all off a PCI card, but they’re expensive and often don’t perform too hot.
This sounds like a different beast, though: a processor with some RAM on a PCI card. It should actually work in much the same vein as an IBM mainframe, but I do think you’d run into some serious memory bandwidth issues. There’s no way the PCI bus would be fast enough for those kinds of memory accesses; mainframes have an entirely separate bus for that. This thing might rule at computing pi, and possibly a few other calculations where a massive amount of data can be broken up into very small chunks, but then again, why couldn’t you use a cluster of $500 Shuttle PCs to do the same amount of work?
Actually, the technology is in every Apple processor: AltiVec and this chip both use vector processing.
It’s a good idea. Energy efficient. Fast.
Vector processing was the leading U.S. supercomputer processor tech in the ’80s, thanks to Cray.
This card is targeted to replace/augment the computing clusters that are popular right now. PCI bandwidth is not going to be as big an issue as you think. These clusters commonly use Gigabit Ethernet right now, and PCI is a huge improvement over that interconnect. The problems these clusters are solving are designed with the “slow” link in mind.
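Rough peak numbers for the interconnects in question (assumed round figures, and ignoring latency, which favors the local bus even more):

```c
/* Rough peak bandwidths of the interconnects mentioned; all assumed round numbers. */
#include <stdio.h>

int main(void)
{
    double gige = 125e6;    /* Gigabit Ethernet: ~125 MB/s, plus much higher latency */
    double pci  = 133e6;    /* PCI 32-bit/33 MHz: ~133 MB/s, shared across the bus */
    double pcix = 1.06e9;   /* PCI-X 133: ~1 GB/s */

    printf("PCI vs GigE:   %.1fx\n", pci / gige);
    printf("PCI-X vs GigE: %.1fx\n", pcix / gige);
    return 0;
}
```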
Okay, this sounds like a dream come true for pro audio (and other applications), if only it were cheaper.
If you run a 24-36 track mix with anything other than spartan effects, you will run into problems. Nothing, and I mean nothing, sucks as much as running out of power while you are tracking or, more likely, mixing. For project studios it is a pain in the rear, but for studio owners it is a nightmare.
Now, some people also run software samplers in addition to multi-track sequencers and effects on the same computer. That eats up even more resources; the load is extensive. Plus, some pro audio apps/effects are pushing 192 kHz sampling rates on the A/Ds and D/As.
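For scale, some quick arithmetic with assumed round numbers: even a worst-case raw audio stream is small next to plain PCI bandwidth, which is why the bottleneck in a big mix is the per-sample DSP, not the I/O.

```c
/* Rough audio data-rate arithmetic, assumed round numbers: even 36 tracks of
 * 24-bit/192 kHz audio are a modest stream compared to PCI bandwidth; it is
 * the per-sample effects processing that eats the CPU. */
#include <stdio.h>

int main(void)
{
    int tracks = 36;
    double rate = 192000.0;          /* samples per second */
    double bytes_per_sample = 3.0;   /* 24-bit */

    double stream = tracks * rate * bytes_per_sample;   /* bytes per second */
    printf("raw mix stream: %.1f MB/s\n", stream / 1e6);
    printf("share of plain PCI (~133 MB/s): %.0f%%\n", 100.0 * stream / 133e6);
    return 0;
}
```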
This is going to blow the rendering market silly. Instead of WETA needing hundreds of machines for LotR, a handful of desktops…
Awesome. For math and science applications, this chip is a must-have!
Is anyone reminded of this Slashdot article:
http://slashdot.org/articles/00/07/23/2158226.shtml
The article didn’t really start setting off alarms in my head until I got to the second half:
ClearSpeed said the new chip is also very low-power, operating at about 2 watts, which would allow it to run off a laptop battery and wouldn’t require special cooling.
“At 3 watts, you could put it in a PCMCIA card,” said McIntosh-Smith. “With two chips on a PC Card, you can have 50 gigaflops on a laptop, running off a battery. That’s equivalent to a small Linux cluster on your notebook.”
So we’re supposed to believe there’s some magical vector processor out there capable of 25 gigaflops of computational throughput while consuming only 2-3 watts?
Just to contrast with a more conventional low power supercomputer, the MIPS N0 delivers 1.4 gigaflops of computational throughput at 20 watts.
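Putting numbers on that comparison, taking both claims at face value:

```c
/* Claimed efficiency of the CS301 versus the conventional low-power part
 * mentioned above, taking both sets of numbers at face value. */
#include <stdio.h>

int main(void)
{
    double cs_flops = 25e9,  cs_watts = 2.5;    /* midpoint of the quoted 2-3 W */
    double mips_flops = 1.4e9, mips_watts = 20.0;

    double cs_eff   = cs_flops / cs_watts;      /* ~10 GFLOPS per watt */
    double mips_eff = mips_flops / mips_watts;  /* ~0.07 GFLOPS per watt */

    printf("ClearSpeed claim: %.1f GFLOPS/W\n", cs_eff / 1e9);
    printf("MIPS part:        %.2f GFLOPS/W\n", mips_eff / 1e9);
    printf("ratio:            ~%.0fx\n", cs_eff / mips_eff);
    return 0;
}
```

A jump of two orders of magnitude in flops per watt is exactly the kind of claim that deserves skepticism.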
This certainly falls into the “I’ll believe it when I see it” category.
I was also curious about compilers for this processor, but this is all they had to say:
Strauss warned that writing software for the chip’s complex architecture might be a stumbling block, but the company has assured him that its compiler makes it easy to program.
“It’s a refreshing new approach to high-powered chips, and they seem to be pretty well ahead with it, too,” Strauss said. “I’m pretty impressed. I’ve seen lots of things like this over the years, but this breaks new ground.”
So, will we see a Fortran 90 compiler for this processor? Guess we’ll have to wait and find out…
Does anyone remember the Weitek math coprocessor from the 1980s? This article reminded me of that.
It also reminds me of the i960-based SBus coprocessor cards someone sold for SPARCstations in the mid ’90s. The problem there, perhaps like here, was application support. However, this ClearSpeed card is probably more affordable, and things are very different now as to who needs FPU power, so who knows…
On PCI? No thank you. But, if you implemented a Mobo and CPU combination that allowed you to use HyperTransport, I would be very impressed. That would be one behemoth of a vector math coprocessor.
ClearSpeed Announces CS301 Multi-Threaded Array Processor
http://www.supercomputingonline.com/article.php?sid=4753
“…The CS301 is based on a multi-threaded array processing (MTAP) architecture and includes 64 processing elements, 384 Kbytes of on-chip SRAM and I/O ports interconnecting through ClearSpeed’s ClearConnect® bus. Each processing element has its own floating point units, local memory and I/O capability, making the CS301 ideally suited for applications which have high processing or bandwidth requirements. The ClearConnect bus is a packet switched network that provides high bandwidth and low power consumption, supporting multiple concurrent transfers giving even higher aggregate bandwidth.”
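A toy model of what that looks like from the programmer’s side (my own host-side sketch; the real CS301 programming model isn’t described in the article): the same operation applied across all 64 processing elements, each working out of its own small local memory, with results collected afterwards.

```c
/* Toy model of an array processor: 64 processing elements, each with its own
 * small local memory, all executing the same operation in lockstep. This is a
 * host-side simulation only, not ClearSpeed's programming model. */
#include <stdio.h>

#define NUM_PE  64
#define LOCAL_N 512   /* 512 doubles = 4 KB, under the ~6 KB of SRAM per PE (384 KB / 64) */

static double local_mem[NUM_PE][LOCAL_N];

int main(void)
{
    /* "DMA" data into each PE's local memory */
    for (int pe = 0; pe < NUM_PE; pe++)
        for (int i = 0; i < LOCAL_N; i++)
            local_mem[pe][i] = pe + i * 0.001;

    /* one operation broadcast to all PEs: multiply-accumulate out of local memory */
    double partial[NUM_PE] = {0};
    for (int pe = 0; pe < NUM_PE; pe++)          /* conceptually parallel */
        for (int i = 0; i < LOCAL_N; i++)
            partial[pe] += local_mem[pe][i] * 2.0;

    /* partial results collected back over the on-chip (ClearConnect-like) bus */
    double total = 0.0;
    for (int pe = 0; pe < NUM_PE; pe++)
        total += partial[pe];

    printf("sum across all PEs: %f\n", total);
    return 0;
}
```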
“Okay, this sounds like a dream come true for pro audio (and other applications), if only it were cheaper.”
You can already get a lot of DSP power into your machine with dedicated PCI cards, much in the same way ClearSpeed are proposing…
The UAD-1 is a processor on a card that runs plugins; you can use multiple cards for more DSP. They won’t say what the chip is, but I bet it’s an Analog Devices SHARC.
Also, the TC PowerCore card fills the same niche, to say nothing of Pro Tools TDM cards or a Kyma system.
Using multiple PCI cards for extra processing power has been commonplace in the audio industry for years.
Incidentally, I’ve seen systems that get around the bandwidth problem by adding a ribbon cable plugged into the top of each card to give them a dedicated bus of their own.
It also reminds me of those Transputer boards that Microway put out. Isn’t a future PlayStation supposed to have an array processor?
The PACT XPP Processor is worthy of a good serious look by anyone doing massive computations.
“The eXtreme Processing Platform (XPP) architecture presents a new runtime reconfigurable data processing technology that replaces the concept of instruction sequencing by configuration sequencing.”
It’s an 8×8 array of processing units that you can reconfigure on the fly.
The following two links will take you there:
http://www.theregister.co.uk/content/3/20576.html
http://www.pactcorp.com/
If you want to learn more about this cool tech read the white papers at: http://www.pactcorp.com/xneu/px_paper.html
If I recall correctly, a device similar to the PACT processor is rumored for the next PlayStation. Also, a graphics chip maker is rumored to be making a similar chip for its next generation of video processors.
Obviously, as process geometries shrink, more transistors fit on these chips, which gives us more and more computational units. The approach with these “array” chips is that they provide the raw computational units, and your program “reconfigures” the communication pathways between the units so that “programs” are “hardwired on the fly.” Once “configured,” these chips fly, since they are now a “custom” circuit representing your program. No instructions are needed to move the data; it just flows through the configured pathways.
Obviously not all software can make use of these chips, but a surprising amount of software that matters to us can. The best examples are 3D graphics, compression, encryption, and video and audio playback/recording, plus much more.
The maximum speed for these chips may never be reached in practice, since that would require all the units to be busy on every cycle, and not every program that can use these processors will keep all the units busy at once. Your mileage will vary with your applications.
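A crude software analogy for “configure once, then let the data flow” (my own sketch, nothing to do with PACT’s actual tools): the pipeline of stages is wired up once, and after that each element just streams through the fixed graph with no further control decisions.

```c
/* Crude software analogy for configuration-sequenced processing: wire up a
 * fixed pipeline of stages once (the "configuration"), then stream data
 * through it. The stage functions and pipeline layout are invented for this
 * sketch. */
#include <stdio.h>

typedef double (*stage_fn)(double);

static double scale(double x)  { return x * 0.5; }
static double offset(double x) { return x + 3.0; }
static double square(double x) { return x * x; }

int main(void)
{
    /* "Configuration" step: decide the pathway between units once. */
    stage_fn pipeline[] = { scale, offset, square };
    int stages = sizeof pipeline / sizeof pipeline[0];

    /* Data then flows through the configured pathway with no further
     * per-element control decisions. */
    for (int i = 0; i < 8; i++) {
        double x = (double)i;
        for (int s = 0; s < stages; s++)
            x = pipeline[s](x);
        printf("in=%d out=%f\n", i, x);
    }
    return 0;
}
```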
May these chips enter the mainstream market soon so that our 3D graphics take off even more. Cheers.
Have any of these people heard of a data dependency? This is not a general purpose processor. I’ll believe 25 gigaflops or 60,000 MIPS when I see it.
I think these people are in the same category as Dennis Lee.
“Have any of these people heard of a data dependency?”
Please explain to me what the difference is with a “data dependency” on a single x86 box with 6 ClearSpeed cards in it versus a “data dependency” on a huge Linux cluster of x86 boxes.
Gil Bates> The problem is with using just one chip.
The chip contains 64 elements, each with one MAC FPU and 4 KB of SRAM. What happens if 4 KB isn’t enough, and what happens if each result must be reused in a lot of different places on the chip?
Presumably an application using ClearSpeed boards can still access the general-purpose RAM (at least 3-4 GB, right?) of the box it is running on, and that would be no slower than what goes on in a Linux cluster, no? A Linux cluster is the only thing I’m interested in comparing/contrasting with in my argument.
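To make the “data dependency” point concrete (my own toy example, not anyone’s benchmark): the first loop below splits cleanly across 64 processing elements or across a cluster’s nodes; the second has a loop-carried dependency, so neither 600 on-card gigaflops nor a rack of Shuttle boxes helps much with it.

```c
/* Data-dependency illustration: the elementwise loop parallelizes across PEs
 * or cluster nodes; the recurrence does not, because each step needs the
 * previous result before it can start. */
#include <stdio.h>

#define N 1000000

static double a[N], b[N];

int main(void)
{
    for (int i = 0; i < N; i++)
        a[i] = i * 1e-6;

    /* Independent work: chop into 64 (or 600) pieces and hand them out. */
    for (int i = 0; i < N; i++)
        b[i] = a[i] * a[i] + 1.0;

    /* Loop-carried dependency: step i needs step i-1, so it runs serially
     * no matter how many processing elements or cluster nodes exist. */
    double x = 0.0;
    for (int i = 0; i < N; i++)
        x = 0.5 * x + a[i];

    printf("b[N-1]=%f  x=%f\n", b[N - 1], x);
    return 0;
}
```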