“[…] a new white paper by VMWare that comes to the surprising conclusion that hardware-assisted x86 virtualization oftentimes fails to outperform software-assisted virtualization. My reading of the paper says that this counterintuitive result is often due to the fact that hardware-assisted virtualization relies on expensive traps to catch privileged instructions while software-assisted virtualization uses inexpensive software substitutions.” Read more at Slashdot.
http://developers.slashdot.org/article.pl?sid=06/08/12/2028223
A new white paper by VMware that comes to the not-so-surprising conclusion that their competition doesn’t work as well as their solution.
Right.
Their competition uses software virtualization as well, and both have been trying to add hardware virtualization. This is not FUD, it’s actual research.
The only reason there is any demand for hardware-assisted paravirtualization is because Microsoft won’t play nice.
Listen people, you know how with Xen, you run a patched version of your host (dom0) kernel and patched versions of your guest (domU) kernels? Maybe not, but that’s what happens. This is what they call “software-assisted” paravirtualization, since the software is modified to understand that it is running on a modified virtual machine.
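To make that concrete, here is a toy sketch in C of what “the software is modified” means in practice (this is not real Xen or Linux code; every identifier is invented for illustration): the guest kernel routes its privileged operations through a table of function pointers and swaps in hypervisor-aware versions at boot, so the hot paths become plain calls instead of trapped instructions.

/* Toy model of paravirtualization (no real Xen or Linux identifiers):
 * privileged operations go through a function-pointer table, and a guest
 * that knows it is virtualized swaps in hypercall-backed versions. */
#include <stdio.h>

struct pv_ops {
    void (*disable_irqs)(void);
    void (*flush_tlb)(void);
};

/* Native versions would execute privileged instructions directly. */
static void native_disable_irqs(void) { puts("native: cli"); }
static void native_flush_tlb(void)    { puts("native: reload cr3"); }

/* Paravirtualized versions ask the hypervisor explicitly instead of
 * executing an instruction that would have to be trapped and emulated. */
static void guest_disable_irqs(void) { puts("hypercall: mask my virtual irqs"); }
static void guest_flush_tlb(void)    { puts("hypercall: flush my virtual tlb"); }

static struct pv_ops ops = { native_disable_irqs, native_flush_tlb };

int main(void)
{
    int running_on_hypervisor = 1;    /* a real kernel detects this at boot */

    if (running_on_hypervisor) {      /* the "patch" described above */
        ops.disable_irqs = guest_disable_irqs;
        ops.flush_tlb    = guest_flush_tlb;
    }

    ops.disable_irqs();               /* plain call, no trap into the monitor */
    ops.flush_tlb();
    return 0;
}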
Well before the big Xen 3.0 release, the Xen developers had a modified version of Windows running in paravirt mode on Xen. They asked Microsoft whether they could release the binary patch for Windows XP to enable this mode, and they said no way.
Instead, the Xen developers implemented two alternative ways to support Windows guests. The first is a full hardware virtualization mode, which will never be as fast as paravirtualization. The next requires Intel VT or AMD SVM technology to run Windows in hardware-assisted paravirt mode.
This whitepaper shows that hardware-assisted paravirt will always be slower than software-assisted paravirt. You don’t really need to do benchmarks to prove this, it is rather obvious from a theoretical standpoint.
The proper way to interpret this whitepaper is that both Xen and VMware want to push forward with paravirtualization, and they’re warning proprietary OS vendors that they need to cooperate in order to achieve maximum performance on their virtualization platforms. In other words, Microsoft needs to release yet another SKU for Vista: Windows Vista Virtual Client, or whatever.
The only reason there is any demand for hardware-assisted paravirtualization is because Microsoft won’t play nice.
Well, that and the nice demonstration, provided by the IBM 370 architecture, of how effective hardware virtualization is when done right.
This whitepaper shows that hardware-assisted paravirt will always be slower than software-assisted paravirt. You don’t really need to do benchmarks to prove this, it is rather obvious from a theoretical standpoint.
So how does this theory jibe with the fact that the 370 hardware virtualization was faster than the Amdahl software virtualization on otherwise equivalent systems?
If you meant to say that the semantics of x86-derived ISAs makes it very difficult to efficiently virtualize for such hardware and that software virtualization is better understood for that architecture, then sure, that’s what we figured out in the mid 90s, the last time I investigated the problem.
If you meant to say that it’s not possible to put together an ISA for which hardware virtualization is faster than software, I strongly recommend studying the 360/370 I/O architecture.
As always, Cloudy, a very good point.
Virtualization on x86 and other “new-school” architectures is a relatively new idea, and they were never designed for this purpose. SPARC has very fast call stack semantics as I recall, but I don’t think that helps here.
The s/360/370/390 (mainframe) architectures were designed from the ground up to provide full resource isolation for all processes, not just for containers or operating systems. It’s a whole different paradigm, not just a superior ISA for virtualization.
If you’re saying that enterprise IT should take another look at mainframes to solve the management, utilization, and reliability headaches of “modern” distributed infrastructure, then I completely agree. Modern mainframe systems are incredibly capable and remarkably cost effective.
Although, I do think that a discussion of mainframe systems is a little off-topic for a thread on PC virtualization technology…
Virtualization on x86 and other “new-school” architectures is a relatively new idea, and they were never designed for this purpose.
Yup. Tried to talk Intel into taking certain steps back in ’95, but they declined.
Although, I do think that a discussion of mainframe systems is a little off-topic for a thread on PC virtualization technology…
Oh sure. I just wanted to remind people that it hasn’t always been thus.
Put one under every desk, now everybody’s happy!
Agree. The x86 is a truly horrid platform for doing any form of virtualisation, never mind in hardware. Platforms like mainframe environments and Alphas were truly built for this sort of thing.
Butters, I agree with your sentiment, but there is one point you missed. The paper is not about paravirtualization, it is about full virtualization.
Paravirtualization will be faster than full virtualization every time, regardless of whether the virtualization is through software or hardware, but it does require some changes to the client kernel. Full virtualization doesn’t require changes to the client because it makes those changes itself at runtime. Those runtime changes are what slows virtualization down, and what the hardware is supposed to assist.
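As a rough illustration of those runtime changes, here is a toy binary-translation pass in C (the opcodes are invented and there is no real decoder): the monitor rewrites privileged instructions into cheap substitutes before the guest code ever runs, so it never has to pay for a trap on each one.

/* Toy binary translation pass: invented one-byte opcodes, no real x86
 * decoding.  The monitor scans a guest code block before running it and
 * replaces privileged instructions with cheap calls into the monitor. */
#include <stdio.h>
#include <stddef.h>

enum {
    OP_ADD          = 0x01,  /* harmless instruction, runs unchanged      */
    OP_PRIV         = 0x0F,  /* would trap if executed by the guest       */
    OP_CALL_MONITOR = 0xFE   /* inexpensive substitute handled in software */
};

static void translate_block(unsigned char *code, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (code[i] == OP_PRIV)
            code[i] = OP_CALL_MONITOR;
}

int main(void)
{
    unsigned char guest[] = { OP_ADD, OP_PRIV, OP_ADD, OP_PRIV };

    translate_block(guest, sizeof guest);

    for (size_t i = 0; i < sizeof guest; i++)
        printf("%02X ", guest[i]);       /* prints: 01 FE 01 FE */
    putchar('\n');
    return 0;
}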
There is no such thing as hardware-assisted paravirtualization yet, and I’m not sure how there could be, or how it would help.
And my guess as to why Microsoft doesn’t want to play nice is that they have their own virtualization software, and they want to push it using their hold on things.
So if Microsoft can show that Windows runs better in virtual form on top of a Windows server when using Microsoft virtualization, they have the potential to sell even more licences. That’s because you will probably need a licence for the server OS, the virtualization software, and at least one licence per virtual Windows running. And if any of those virtual Windows instances are used to support more than one user at the same time, guess what, you need even more licences. What a fun world…
Are you smoking CRACK, you copycat editor? Take off this story, you plagiarist.
If you could read you would’ve noticed that this is a linked article.
If you had read you’d notice this was a link to Slashdot, which linked directly to the whitepaper.
But you were too busy insulting me to do that.
Read the Slashdot link, it’s worth it. There are posts in the comments from VMware developers, and apparently one of the developers of Intel’s VT hardware virtualization (or perhaps AMD’s Pacifica, they don’t say). These posts provide a lot of useful information.
Your reply is on the wrong post, and so is my mod-down.
In fact, I was replying to the guy who made the insults.
Sorry about that. I’ve modded you up to hopefully compensate.
I’d wait for the results of other (independent) companies.
Besides that, software virtualization has existed much longer than hardware virtualization, so I’m sure hardware virtualization will catch up in the future and be faster by far.
Software-based can never be faster than hardware-based, unless you’re doing something wrong.
I’m not really all that informed on this virtualization technology, so bear with me …
I was planning to buy a new PC sometime next year, with which I hoped to use some virtualization software to run both Windows and Linux at the same time.
Are they saying here that doing the above would actually be slower than running Linux in VMware under Windows (or vice versa)?
In fact, they are right. Nothing on earth can beat Xen’s performance when running a paravirtualized operating system like Linux or NetBSD.
Software-based can never be faster than hardware-based, unless you’re doing something wrong.
Did you read the paper? I think it clearly disproved that.
I hoped to use some virtualization software to run both Windows and Linux at the same time. Are they saying here that doing the above would actually be slower than running Linux in VMware under Windows (or vice versa)?
VMware is virtualization software, so it can’t be slower than itself. The lesson of this paper is that you should choose your virtualization software (VMware, Virtual PC, Parallels, or Xen) based on real performance, not how it works under the hood.
Hardware assisted paravirtualisation would use some features of the virtualisation-aware instruction sets to accelerate things or even fully virtualise components of the system, whilst still exposing some parts of the paravirtualised API.
The hybrid would reduce guest OS modifications whilst still providing performance benefits over full virtualisation. It’s likely to be explored in Xen eventually, but right now I think folks are busy improving the existing full virtualisation support 😉
The paper confirms my thoughts that binary translation will be taking a larger role in the future of computing. By minimising costly hardware events (traps, page faults, and IO accesses) and dynamically reorganising code on the fly for more streamlined execution (less unnecessary IO, fewer inefficient branches), we can get better execution speeds on a reasonable range of programs.
I expect that if VMware (and other binary translators) were willing to cache results to a separate disk area and spend more time optimising offline, that would mitigate the initial translation speed hit. Working with the OS would make it possible to get “fat” binaries with pretranslated code ready to load and execute without a translation delay.
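Something along those lines could look like this toy C sketch (invented data structures, not VMware’s actual design): translate a block the first time it is seen, cache the result keyed by its guest address, and reuse it on every later execution so the translation cost is paid only once per hot block.

/* Toy translation cache: invented data structures, nothing like VMware's
 * real implementation.  A block is translated the first time its guest
 * address is seen; later executions reuse the cached result. */
#include <stdio.h>
#include <string.h>

#define CACHE_SLOTS 64
#define BLOCK_BYTES 32

struct cached_block {
    unsigned long guest_addr;        /* key: block address in the guest  */
    unsigned char code[BLOCK_BYTES]; /* value: the translated host code  */
    int valid;
};

static struct cached_block cache[CACHE_SLOTS];
static unsigned long translations;   /* times we paid the full translation cost */

/* Stand-in for the expensive decode-and-rewrite step. */
static void translate(unsigned long guest_addr, unsigned char *out)
{
    translations++;
    memset(out, 0x90, BLOCK_BYTES);  /* pretend output: a block of NOPs */
    (void)guest_addr;
}

static unsigned char *lookup_or_translate(unsigned long guest_addr)
{
    struct cached_block *slot = &cache[guest_addr % CACHE_SLOTS];

    if (!slot->valid || slot->guest_addr != guest_addr) {
        translate(guest_addr, slot->code);   /* slow path, once per block */
        slot->guest_addr = guest_addr;
        slot->valid = 1;
    }
    return slot->code;                       /* fast path every time after */
}

int main(void)
{
    /* The same four hot blocks get "executed" over and over. */
    for (int i = 0; i < 10000; i++)
        lookup_or_translate(0x1000 + 0x10 * (i % 4));

    printf("block executions: 10000, translations paid: %lu\n", translations);
    return 0;
}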
Well, hopefully binary translation will get more mature and we can start using it to tackle more interesting problems (getting away from the ia32 architecture, converting software -> hardware FPGA designs on the fly, other interesting things…)
The OS code that is being binary translated is some of the most highly optimized, hand-tuned code in existence. It’s already the expensive path for the OS, and no binary translator is going to improve upon that without a huge amount of upfront overhead. No, simple translation is quite sufficient in this case.
Binary translation is only an optimization for unoptimized code. I admit most code is truly unoptimized … but most code doesn’t need to be optimized, either. Good optimization can always do as well as the binary translation, and sometimes better with the larger domain knowledge available. (Mostly because the really high end of optimization starts to become binary translation!)
The OS code that is being binary translated is some of the most highly optimized, hand-tuned code in existence.
Yes, that’s true. However, even in hand-crafted assembly, tradeoffs are required due to differences in execution on supposedly similar CPUs of the same architecture. For example, optimisations for one ia32 CPU (Athlon) can be harmful when executed on another ia32 CPU (Pentium 4). Ultimately, unless the programmers included a lot of hand-tuned code for a lot of CPUs, the final result is not as fast as it could be, in order to execute well across a range of compatible CPUs.
…no binary translator is going to improve upon that without a huge amount of upfront overhead.
That’s why I recommend caching the results of translation to disk to avoid subsequent overheads. In fact, a distributed approach and regular updating of program code (tuned for your particular CPU, with results obtained from other users) would reduce the overhead considerably.* (Also, a feedback loop from real-life execution to the compilers would enable more optimisations to be done across the board.)
* This would require a rethink of how stable program code is. If we consider program code to be the definitive expression of the program, and the optimised binary-translated code to be your faster, ‘tuned’ version, then in theory you’ll probably be updating your tuned code for quite a while. Ideas of self-modifying code give most programmers nightmares.
For example, a lot of CPU silicon real estate is devoted to optimising code at run time, especially on ia32 CPUs, which can be considered a form of binary translation (from ia32 -> microcode). Nobody complains about the overheads inherent in this form of binary translation; however, it is clear that the CPU will run hotter and consume more power in order to translate ia32 code into something that is optimised for the CPU.
Good optimization can always do as well as the binary translation, and sometimes better with the larger domain knowledge available.
Actually compilers do a pretty bad job at run time optimisation. That’s where binary translation fits in and can squeeze out some extra speed if needed, and unlike hardware, software can be updated frequently to include even more exotic forms of optimisation. Even the standard gcc -O3 can be rethought; check out a program called Acovea as an exercise. Admittedly, human brains still get the top marks for rethinking the entire problem (like Transmeta’s approach to ia32 support on low-power CPUs).
So, is binary translation worth it? Well, if the results were redistributed to everyone else, then yes, most likely. If everyone were able to take full advantage of the code optimisations and reduction in power requirements (sadly unlikely given current CPU architectures), then with proper analysis you’d probably find a noticeable impact overall. It’s more interesting to think what would happen (environmentally and economically) if you could get a 5% more efficient car with slightly modified fuel at no extra cost to anyone.
Ultimately, unless the programmers included a lot of hand-tuned code for a lot of CPUs, the final result is not as fast as it could be, in order to execute well across a range of compatible CPUs.
This isn’t the problem you make it out to be: the code paths that have significant per-processor differences tend to be either math routines already contained in an optimized math library OR are complex operations done in microcode (e.g. CPUID); the whole reason we are discussing hand-tuned assembly in the first place is that the compiler isn’t generating efficient code. Switching code paths for different processors is a very easy optimization.
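For example, the per-processor dispatch can be as simple as this (assuming GCC or Clang on x86; the “tuned” routine is only a stand-in for real SSE/AVX code):

/* Per-CPU code-path dispatch, assuming GCC or Clang on x86.  The "tuned"
 * routine is just a stand-in; real code would use SSE/AVX intrinsics.
 * The point is the dispatch mechanism, not the math. */
#include <stdio.h>

static void sum_generic(const float *a, const float *b, float *out, int n)
{
    for (int i = 0; i < n; i++)       /* portable baseline */
        out[i] = a[i] + b[i];
}

static void sum_tuned_for_avx(const float *a, const float *b, float *out, int n)
{
    for (int i = 0; i < n; i++)       /* imagine vectorised code here */
        out[i] = a[i] + b[i];
}

int main(void)
{
    void (*sum)(const float *, const float *, float *, int) = sum_generic;

    __builtin_cpu_init();                 /* probe the CPU once at startup */
    if (__builtin_cpu_supports("avx"))    /* then pick the path up front   */
        sum = sum_tuned_for_avx;

    float a[4] = {1, 2, 3, 4}, b[4] = {4, 3, 2, 1}, out[4];
    sum(a, b, out, 4);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
    return 0;
}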
Re: self-modifying code: rejecting self-modifying code is a sign of someone more interested in restricting the problem space to get good results than in actually solving the problem. It makes for pretty PhD papers but falls flat in the real world, because good programmers already use controlled amounts of self-modifying code for optimization; thus the PhD papers have restricted themselves to working on unoptimized code (for which it is easy to show improvements). And I point out that self-modifying code is REQUIRED of any OS-level binary translation.
Actually compilers do a pretty bad job at run time optimisation
Assuming I meant only compiler optimization is fallacious. Compiler optimization is good for removing the cruft caused by converting C into assembly into machine language, and nothing more. Algorithmic improvements and problem-specific improvements are more important, and both of these occur at the C language level at least.
To get back to the original topic: VMware uses binary translation for correctness. There are no speed gains to be had from binary translation in this problem space. And with present technology, binary translation is a faster means of achieving correctness than hardware virtualization.
I know for a fact that Parallels on my Mac Mini runs faster with VT-x off than on.