Researchers at the University of Maryland’s A. James Clark School of Engineering claim to have developed a computer system that is 100 times faster than today’s desktops. The research group, led by Uzi Vishkin, developed a system based on parallel processing technology. The team built a prototype with 64 parallel processors and a special algorithm that lets the chips work together and makes programming them simple.
Now when the ultra-super-duper-mega core chips come out, we’ll know that at least someone will be on top of it.
It’s interesting, but it’s a fairly fluffy piece.
They say they will make parallel programming easier, but they don’t say how they will do it.
With “rich algorithmic theory”. 😉
Dunno if that’s a feature of the chips, the way they’re interconnected, a special compiler, or programming design patterns.
– chrish
I’d have to say that parallel programming was the most difficult class I took at school, so anything that makes it easier would be nice. However, the article seems short on details. It sounds like an add-in board for an existing desktop, or was that done just to be able to fit it into some kind of case?
More information here:
http://www.eng.umd.edu/news/news_story.php?id=2289
Namely:
http://www.umiacs.umd.edu/~vishkin/XMT/
http://www.umiacs.umd.edu/users/vishkin/XMT/spaa07paper.pdf
http://www.umiacs.umd.edu/users/vishkin/XMT/spaa07talk.pdf
And (elsewhere) a paper about “XMT Applications Programming”:
http://www.cs.umd.edu/Honors/reports/lee.pdf
The comments on that article thread reveal more detail.
The only practical way to think about programming 100s, 1000s, or possibly 100,000s of nodes on a particular compute-intensive task is to think of concurrent communicating processes (CSP) as modeling hardware, either at a very detailed level like chip designers do (very, very difficult and almost infinitely expensive) or at a much more general level using CSP-like languages that allow parallel and sequential structures that look more familiar.
One such lightweight language, occam, was developed 20+ years ago and ran on a special processor (the Transputer) that supported message passing and context swapping in hardware; context swaps were on the order of 20 cycles. occam programs could be developed on one processor node and recompiled onto large N-node machines with little effort. It is interesting to see, in the Singularity paper just before this story, how many cycles x86 currently takes to context swap, more in the 1000s; it really needs to be close to a dozen instructions for concurrency to be pervasive.
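For anyone who never saw occam, the style maps pretty directly onto any language with channels. Here’s a rough sketch in Go (purely a stand-in; nothing to do with the Transputer toolchain or the XMT system) of two communicating processes wired together the way you’d wire hardware blocks:

    // CSP-style communicating processes: each "process" is a goroutine and
    // the channels are the only shared state, like wires between blocks.
    package main

    import "fmt"

    // producer emits a stream of values on its output channel, then closes it.
    func producer(out chan<- int) {
        for i := 0; i < 10; i++ {
            out <- i
        }
        close(out)
    }

    // square reads from in, transforms each value, and forwards it on out,
    // like one pipeline stage in a block diagram.
    func square(in <-chan int, out chan<- int) {
        for v := range in {
            out <- v * v
        }
        close(out)
    }

    func main() {
        a := make(chan int)
        b := make(chan int)

        // Run the stages in parallel (occam's PAR); main is the sink.
        go producer(a)
        go square(a, b)

        for v := range b {
            fmt.Println(v)
        }
    }

The nice property, and the one occam exploited, is that nothing in the two process bodies cares whether they end up time-sliced on one node or spread across many.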
Amdahl’s law is often mentioned as a limiting factor, and indeed it can be when you think in terms of very coarse processes that really want to run sequentially. The trick is to think with a hardware hat on but not get stuck at the lowest levels of logic. It doesn’t really matter if most nodes are idle; in fact, when you have 100,000 processes running on an FPGA, you really want 90%+ of the nodes not to toggle, to save f·C·V² switching power, so Amdahl is sort of a moot point. If you fill an FPGA with many general-purpose CPUs it will run pretty hot.
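For reference, Amdahl’s law in its usual form, with p the fraction of the work that parallelizes and N the processor count (the plug-in numbers are just an illustration, not anything from the article):

    S(N) = \frac{1}{(1 - p) + p / N}

Even with p = 0.95, 64 processors only buy roughly a 15x speedup, which is the trap the coarse-process view falls into; if most nodes are meant to sit idle anyway, that bookkeeping matters much less.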
The languages used to design FPGAs and ASICs are completely parallel; they model the nested hierarchy of a chip and fill it in with continuous statements that transform signals from inputs to outputs and store them in registers at module boundaries on clock edges.
In an ASIC, the configuration is fixed but very fast; in an FPGA it may be 20x slower but reconfigurable in ms time frames, so you can consider a scheduler that swaps hardware tasks in and out.
Now if I actually had a 64-processor CPU array I would casually leave most logic and database code near single-threaded and only push graphics rendering and any vector tasks onto the CPU nodes, i.e. tile the screen over the array. The latency would be much lower for complex draws, and the human-interface code could still live with single threading.
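Something like the following is all I mean by tiling; a minimal Go sketch, assuming nothing about the actual XMT programming model, just one worker per tile and a barrier at the end before the single-threaded UI code carries on:

    // Tile a frame over an array of workers: fan out one job per tile,
    // then wait for all of them. shade() is a stand-in for whatever
    // per-pixel work a real renderer would do.
    package main

    import "sync"

    const (
        width    = 1024
        height   = 768
        tileSize = 128 // 8x6 grid of tiles = 48 independent jobs
    )

    var frame [height][width]uint8

    // shade is a placeholder per-pixel computation.
    func shade(x, y int) uint8 {
        return uint8((x ^ y) & 0xff)
    }

    // renderTile fills one rectangular tile of the frame buffer.
    func renderTile(x0, y0 int) {
        for y := y0; y < y0+tileSize && y < height; y++ {
            for x := x0; x < x0+tileSize && x < width; x++ {
                frame[y][x] = shade(x, y)
            }
        }
    }

    func main() {
        var wg sync.WaitGroup
        for y := 0; y < height; y += tileSize {
            for x := 0; x < width; x += tileSize {
                wg.Add(1)
                go func(x0, y0 int) {
                    defer wg.Done()
                    renderTile(x0, y0)
                }(x, y)
            }
        }
        wg.Wait() // all tiles done; single-threaded code takes over from here
    }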
One other aspect of massively parallel processors is that you can trade one really bad feature of sequential processors, the infamous Memory Wall, for a Thread Wall. If you already want lots of threads, then the Thread Wall is a non-issue. Removing the Memory Wall would in itself seriously speed up many tasks that have poor locality and for which the cache can never be big enough. If a task can be multithreaded too, then some of the cost of going multithreaded gets paid back by removing the Memory Wall. How do you do that? Well, you have to use a rather different sort of DRAM that can start a full random memory cycle every single clock, at a rate of about 500MHz today (Micron’s RLDRAM). The DRAM still has some cycles of latency, but those are hidden by multithreading the processor core (like Niagara does). Then, for each memory issue, you can assume the instruction throughput is about 5x the DRAM clock rate, since loads & stores occur about every 5 instructions.
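To make the latency-hiding point concrete, here’s a toy Go sketch (pure software, made-up numbers, not a model of RLDRAM or Niagara internals) where every simulated memory access has the same fixed latency, yet aggregate throughput grows with the number of threads issuing them:

    // Hiding a fixed access latency with concurrency: each simulated
    // access sleeps for memLatency, but with N concurrent requesters the
    // aggregate rate scales roughly with N, since the waits overlap.
    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    const (
        memLatency = 100 * time.Microsecond // pretend DRAM access time
        accesses   = 200                    // accesses per thread
    )

    func run(threads int) time.Duration {
        start := time.Now()
        var wg sync.WaitGroup
        for t := 0; t < threads; t++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for i := 0; i < accesses; i++ {
                    time.Sleep(memLatency) // stand-in for a cache-missing load
                }
            }()
        }
        wg.Wait()
        return time.Since(start)
    }

    func main() {
        for _, n := range []int{1, 4, 16, 64} {
            d := run(n)
            fmt.Printf("%2d threads: %9.0f simulated accesses/sec\n",
                n, float64(n*accesses)/d.Seconds())
        }
    }

And the arithmetic behind that last sentence: a memory issue slot every cycle at ~500MHz, with a load or store roughly every 5th instruction, works out to on the order of 2.5 billion instructions/sec sustained without any cache at all.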
A search for Transputer, Occam, CSP, might reveal how many of the problems were already solved 20 years ago on very primitive hardware.
“A search for Transputer, Occam, CSP, might reveal how many of the problems were already solved 20 years ago on very primitive hardware.”
Hum… no. A lot of these problems have been addressed, not solved. That is a big difference; had they been solved 20 years ago, we would not be in the pickle we are in today trying to figure out what to do with the new generation of multicore systems.
Regarding the system in the article, it is just another PRAM prototype running on a bunch of FPGAs. The lead researcher is making quite bold statements without much to back them up in terms of produced results or the realities of their system.
The shape and form of the multicores we have now could never have been predicted 20 years ago, but if more CSP-based designs had been put on the same process track back then, we would be in better shape. The original Transputer had only 2KB of store per device, end of line. If you want to compute concurrently, you don’t want to slap together an ever-larger number of very complex sequential processors that do not even run light threads well in a predictable way. If you can’t predict how long a path will take, you can’t do any of the sorts of optimizations that chip designers depend on. It makes sense to design a processor to support threads directly and take the benefits directly.
The FPGA world comes from the opposite direction: the concurrency is at such a low level that it is largely impractical for all but the deepest pockets. FPGAs even running at 100MHz often outperform standard processors, but the speedup is often outweighed by the much higher cost.
The speedup has to be really enormous to justify the work and cost disadvantage, and that further limits the applications. And it gets worse: the easier it is made to program these FPGAs with more familiar C-like languages, the less performance you get. Real performance comes from pedal-to-the-floor grunt work, hand-placing macros not unlike tweaking assembler code and trying many placement combinations to find what runs fastest.
Plenty of people who were around 20 years ago do know what to do with multiple cores and multiple threads, but I fear these systems will not deliver the expected results because the costs of communication and context swapping are so high. If you want concurrency supported in hardware, you need to go all the way with it and ignore the usual ways of designing sequential processors; it’s the throughput per watt that matters. The Niagara and some of the other MT designs are probably as close as we will get for now. The old multiplexed DRAM model we have in abundance helps prop up the ever-larger cache models everyone wants.
Personally, I see a continuum of computing between pure software and pure hardware, and favor a unified C-like process language that could be synthesized as hardware and also run as standard code on a suitable processor, or on any processor with a runtime library to support concurrency. A little bit of Verilog plus a simplified C with classes & processes seems about right to me.