As a recent ACM Queue article observes, the evolution of computer languages is toward later and later binding and evaluation. So while one might quibble about the virtues of Java or the CLI (the infrastructure underlying Microsoft .NET), it seems inevitable that more and more software will be written for, or at least compiled to, virtual machines. While this trend has many virtues, not the least of which is compatibility, current implementations have several drawbacks. However, by cleverly incorporating VM and JIT support into the OS, or at least including support for them, we can overcome these limitations and in some cases even turn them into strengths.
So as to head off any confusion or a premature dismissal of my ideas, let me make it clear that by operating system I DO NOT mean the kernel. So when I talk about moving support for virtual machines or JIT compiling into the OS, I don't necessarily mean putting it into the kernel. Some kernel support may be necessary, but how much, and what goes into the kernel, will depend on the implementation and the costs associated with switching between user and kernel space. I would expect most of the ideas I suggest below to occupy a space similar to that of a dynamic linker: a fundamental part of the OS, but not necessarily part of the kernel.
Current Virtual Machine Technology
The first, and most obvious, drawback to the use of VMs (virtual machines) is performance. I don't want to get involved in the religious debate about the speed of Java compared to C++, but the current technology for executing VM code has an inherent performance disadvantage. While simple to program, interpreters are far too slow. Just-in-time (JIT) compiling is much faster than interpretation but still adds the compilation time to the execution time. If one could execute the same code without the overhead of compilation, one would have a significant performance gain. Even if impressive development effort and the use of run-time profiling have temporarily made JIT compilers as fast as ahead-of-time (AOT) compilation, it is only a matter of time until these advances make their way into AOT compilers and they regain the performance advantage.
There is also another performance issue facing VM code: start-up time. While this may be of limited concern to large applications or to servers which only load code once, it presents a serious difficulty for writing small utility applications in code which compiles to a virtual machine. This large load time is also a big contributor to the end user's impression of VM code as slow. While faster processors and other tricks, like leaving parts of the VM in memory, might make this manageable for most applications, they still don't let us use this code in commonly used libraries and other system code, denying us much of the tantalizing portability promised by VMs.
This load-time overhead brings us to the second drawback of current VM technology. The issue is well explained in this article in Java Developers Journal. I won't repeat the article, but the short explanation is that to deal with the large load-time overhead, what should be separate tasks end up inside the same OS process. As a result, programs written for a VM do not gain the benefits of memory protection and the other process/thread isolation features of modern operating systems. This also poses great difficulties for any attempt to write code in a VM which acquires permissions or privileges from the operating system.
Towards a Solution
One of the first things that should occur to someone when they learn how modern JIT compilers work is how remarkably inefficient the process is. The most obvious inefficiency is that every time the program is executed we spend time recompiling the exact same code. Despite what present software offerings suggest, we don't need to make a black-and-white choice between recompiling the program on every run and binary incompatibility. It is completely possible to cache snippets of compiled code for use in later executions without sacrificing the benefits of using a VM.
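To make the caching idea concrete, here is a minimal sketch in Java of a snippet cache keyed by a hash of the VM bytecode. The SnippetCache class and its Compiler interface are invented for illustration and stand in for the real JIT back end; nothing here is an existing API.

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a cache of compiled snippets keyed by a hash of the VM bytecode,
// so that re-running (or re-loading) the same code skips recompilation.
public class SnippetCache {
    private final Map<String, byte[]> nativeByBytecodeHash = new ConcurrentHashMap<>();

    /** Return cached machine code for this bytecode, compiling and caching on a miss. */
    public byte[] getOrCompile(byte[] vmBytecode, Compiler jit) {
        return nativeByBytecodeHash.computeIfAbsent(hash(vmBytecode), k -> jit.compile(vmBytecode));
    }

    private static String hash(byte[] data) {
        try {
            StringBuilder sb = new StringBuilder();
            for (byte b : MessageDigest.getInstance("SHA-256").digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Stand-in for the actual VM-bytecode-to-machine-code compiler. */
    public interface Compiler {
        byte[] compile(byte[] vmBytecode);
    }
}

In a persistent version the map would of course be stored alongside the program rather than held in memory, which is exactly what FX!32 (described next) did on disk.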
The FX!32 binary translator/emulator from DEC is an amazing example of the power of this method. In this case the 'virtual machine' the code was written for was actually the x86 machine language, which was being run on an Alpha chip. Instead of simply emulating the code inefficiently or forcing the user to wait while code was recompiled, FX!32 would begin by emulating the code and then compile frequently used code snippets into native binary form, which would be stored on disk for later execution. This technique of mixed emulation and stored translated code was so powerful that after enough program executions the translated code would sometimes run faster than the original code. A detailed description of the FX!32 system can be found here.
While this might be the most obvious inefficiency in JIT compilation, it is not the only one. Various sorts of code analysis and intermediate representations are often recreated every time the JIT compiler is run. In some cases these are structures already computed, in a much easier fashion, during the prior compilation from a high-level language to the VM code. For instance, when compiling from C# to CIL a CFG (control flow graph) is computed and then discarded, but the CFG must then be computed again by the JIT compiler to convert CIL to machine code. Furthermore, having to analyze a lower-level representation, and having less time to do it in, may very well produce a less detailed analysis, e.g. believing the entire result of an operation is needed even though part of it may be unnecessary.
Multi-level Binaries
There is an elegant solution to both of these issues. The basic idea is to avoid discarding information in the course of the compilation process. Instead of having a binary which has only machine code, or only VM code, or a text file with just source code, we combine them together. When compiling from a high-level language we annotate the source with a CFG (or perhaps an SSA tree) and the manner in which the high-level language corresponds to the VM code. The same idea applies to our JIT compilation from VM code to machine code: we annotate the VM code with the snippet of machine code the JIT compiler generates, so the next time the binary is run we need not recompile that snippet. Of course, corporate software developers may not wish to include their full source code, but they can still benefit from the additional information contained in the multi-level binary and the performance benefits of reducing redundant work. Meanwhile, developers and users gain the convenience of treating source files, at any level, as if they were executables.
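As a rough illustration of what one entry in such a multi-level binary might carry, here is a hedged sketch. The MultiLevelFunction class and its fields are invented purely for illustration and do not correspond to any real binary format.

import java.util.function.Function;

// Hypothetical sketch of one function entry in a "multi-level binary": every level of
// representation produced by the tool chain travels with the program, and a stale lower
// level is voided and regenerated rather than lost.
public class MultiLevelFunction {
    private final String sourceText;    // optional high-level source; null if the vendor withholds it
    private final byte[] vmCode;        // portable VM bytecode, always present
    private final String cfgAnnotation; // CFG/SSA information computed by the front-end compiler
    private byte[] machineCode;         // cached JIT output for one target architecture
    private String targetArch;          // architecture the cached machine code was built for

    public MultiLevelFunction(String sourceText, byte[] vmCode, String cfgAnnotation) {
        this.sourceText = sourceText;
        this.vmCode = vmCode;
        this.cfgAnnotation = cfgAnnotation;
    }

    /** Used by the loader: reuse the cached snippet only if it matches the current machine. */
    public byte[] machineCodeFor(String arch, Function<byte[], byte[]> jit) {
        if (machineCode == null || !arch.equals(targetArch)) {
            machineCode = jit.apply(vmCode);   // recompile from the VM level...
            targetArch = arch;                 // ...and cache the result for next time
        }
        return machineCode;
    }
}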
The performance benefits of this approach are manifold. By invoking the JIT compiler only for code that is not already cached, we remove the fundamental advantage natively compiled code has over VM code while retaining binary compatibility. When a file is transferred to a different machine, the JIT compiler simply voids the machine-specific part of the multi-level binary. Because the higher-level code is retained, future improvements and optimizations can increase the speed of existing programs. Since the code is not fully compiled, sensitive operations can still be passed to external emulation functions so as to provide the sandbox-style features VM code often possesses. Furthermore, we can run programs with dynamic function creation (like nearly every Lisp program) without the overhead of compiling an interpreter into the program. Finally, since the multi-level binary retains more information about program structure, it is easier to make use of profiling information, which we could also save in the multi-level binary.
OS Integration
Of course, the benefits of multi-level binaries might be implemented at a purely application level. We might simply create a cache file for each source code file and have this loaded by the JIT compiler. This approach runs into all sorts of difficulties with copying files and the like, but perhaps we just need a special file format. However, it makes things quite difficult if we want to write programs in more than one language without the overhead of RPC or CORBA. This difficulty is no doubt one of the reasons integrating source and binary information isn't more common. Still, we might address this issue by some system of standards and shared libraries used by the various compilers and JIT systems.
If we only want to address the issue of performance, this might be a workable solution. However, if we want all the benefits of process/thread isolation without the overhead of loading a new VM/JIT compiler for each execution context, we have no choice but to add operating system support or re-implement all the features of task management in each JIT compiler. By implementing the various compiler functions as specialized OS modules, each execution context written in a VM can share the same JIT/VM code while gaining all the benefits of process isolation. Essentially the JIT code would be a shared read-only code segment which may be executed in several different processes, so the OS may be executing the same JIT code in several different processes at the same time. OS support could also aid profiling with little overhead: whenever the scheduler interrupts a process inside a cached snippet of machine code, it could record the execution location.
Background Optimization
While FX!32-style saved code snippets provide great performance after a program has been run several times, it seems silly for the program to under-perform the first several times it is run. Moreover, one of the performance problems with JIT compiling is that only computationally cheap optimizations can be performed. Why not solve both of these problems with one or more background compilers which scan the multi-level binaries on your system in free CPU cycles, applying optimizations, interpreting profiling data and generating snippets of machine code to minimize time spent in the JIT compiler? Of course, a JIT compiler or interpreter is still needed to handle sensitive operations or to run code which the background optimizer hasn't gotten to yet, but it can be much smaller and faster, counting on the background pass to perform many optimizations or at least tell it which optimizations should be carried out.
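A minimal sketch of how such a background pass might be scheduled follows. The Optimizer interface and the cache directory are invented for illustration; the only point being made is that the work runs at minimum priority, in otherwise-free cycles.

import java.nio.file.*;
import java.util.concurrent.*;

// Hypothetical sketch: a low-priority daemon that walks multi-level binaries during idle
// time and fills in their machine-code level, so the launch-time JIT finds most snippets
// already compiled.
public class BackgroundOptimizer implements Runnable {
    private final Path binaryRoot;
    private final Optimizer optimizer;   // stand-in for the heavyweight optimizing compiler

    public BackgroundOptimizer(Path binaryRoot, Optimizer optimizer) {
        this.binaryRoot = binaryRoot;
        this.optimizer = optimizer;
    }

    @Override public void run() {
        try (DirectoryStream<Path> binaries = Files.newDirectoryStream(binaryRoot)) {
            for (Path binary : binaries) {
                optimizer.optimizeInPlace(binary);   // apply slow optimizations, store the results
            }
        } catch (java.io.IOException e) {
            // idle-time work: skip this pass and try again later rather than failing
        }
    }

    public interface Optimizer { void optimizeInPlace(Path multiLevelBinary); }

    /** Repeat the pass at minimum thread priority, i.e. only in otherwise-free cycles. */
    public static ScheduledExecutorService scheduleHourly(BackgroundOptimizer job) {
        ScheduledExecutorService idle = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "background-optimizer");
            t.setPriority(Thread.MIN_PRIORITY);
            return t;
        });
        idle.scheduleWithFixedDelay(job, 1, 60, TimeUnit.MINUTES);
        return idle;
    }
}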
This raises the tantalizing possibility that OS updates would actually speed up programs you already have on disk. The background optimizer could download clever new ways to compile the source code and update your programs while you sleep. Since the multi-level binary keeps so much program structure, this could mean real efficiency enhancements. For instance, if someone figures out a much more efficient way to implement the printf function, the background optimizer could void the VM code corresponding to printf in all your binaries, replacing it with the more efficient version without the need to recompile the rest of the code.
Of course, the prospect of strangers changing the code on your system is a scary one from the perspective of security or even stability. Thus it is only reasonable for our background optimizer to require a proof of equivalence. Each line of code, or code snippet, would be characterized by some manner of contract, or even just a piece of reference code at the next level down. In either case the effect of the code would be captured in some formal system which allows for proofs of equivalence, and only provably equivalent optimizations would be accepted.
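To show where that check would sit, here is a hedged sketch. EquivalenceChecker and OptimizationPackage are stand-ins for whatever formal proof system is actually used; actually deciding equivalence is the hard part and is not shown.

// Hedged sketch: a downloaded rewrite is installed only if an independent checker accepts
// the equivalence certificate that ships with it; otherwise the trusted original stays.
public class OptimizationGate {
    /** Stand-in for a checker that validates a machine-checkable equivalence certificate. */
    public interface EquivalenceChecker {
        boolean verify(byte[] oldCode, byte[] newCode, byte[] certificate);
    }

    /** A downloaded rewrite: the original snippet, its replacement, and the proof. */
    public static final class OptimizationPackage {
        final byte[] originalSnippet;
        final byte[] optimizedSnippet;
        final byte[] certificate;
        public OptimizationPackage(byte[] original, byte[] optimized, byte[] certificate) {
            this.originalSnippet = original;
            this.optimizedSnippet = optimized;
            this.certificate = certificate;
        }
    }

    /** Install the optimized snippet only if the certificate actually proves equivalence. */
    public static byte[] accept(OptimizationPackage pkg, EquivalenceChecker checker) {
        if (checker.verify(pkg.originalSnippet, pkg.optimizedSnippet, pkg.certificate)) {
            return pkg.optimizedSnippet;   // provably equivalent, safe to substitute
        }
        return pkg.originalSnippet;        // otherwise keep the code we already trust
    }
}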
A Vision of the Future
So in my vision of the future the distinction between the compiler and the OS blurs or disappears completely. In so doing we approach the holy grail of code reuse. If someone anywhere discovers a clever optimization or algorithm, and they can prove it equivalent to the standard algorithm, everyone benefits. If some clever assembly language hacker out there discovers a really fast way to implement some VM instruction, he can use a specialized tool to prove his implementation is equivalent, and it will spread across the internet, speeding up every program using that VM instruction. The use of multi-level binaries even hints at the possibility of replacing entire algorithms by proving they are equivalent to some faster version, though our formal proof systems need some time to improve before this can happen.
While some of the ideas about downloading optimizations may be far off, I think our constant evolution towards later-binding languages and VMs makes the integration of the compiler and the OS inevitable. It may not happen the way I suggest, but hopefully by getting these ideas out there people can begin to think about it, and perhaps we can avoid the grim possibility of these ideas being patented by a mega-corporation, strangling a technology which could truly enable write-once-run-anywhere and free up the OS and hardware markets to real competition. At the very least I want to know why the technology of caching code snippets that FX!32 had long ago is absent from every JIT compiler I have ever seen.
—
The author, Peter Gerdes, or logicnazi as he is known on OSNews, is a graduate student in mathematics/philosophy, not computer science. So it is entirely possible that these ideas suffer from some simple pragmatic problem he doesn't know about. If you want to contact him to tell him about one of these problems, you can email him at [email protected], where he is bravely testing Gmail's spam filters.
Sorry to plug publications, but you may wish to look at:
http://www.cs.man.ac.uk/apt/projects/jamaica/index.html#Publication…
The Jamaica project has developed an OS (based on JNode) built around the Jikes RVM. We have also developed the PearColator DBT, which can be incorporated into it. This is all written in Java and downloadable; however, it's not yet fully featured.
LLVM (http://llvm.org) provides many of the capabilities that you want without the drawbacks you describe. In particular, it gives you portability and performance and the ability to adapt to changing hardware. Its compile time costs are very low: it provides a CFG, SSA form, and many other things directly in the representation. It also allows for compile-time, link-time, install-time, run-time and off-line (“optimizing screensaver”) optimization.
If you’re interested in this, check out these papers:
http://llvm.org/pubs/2004-01-30-CGO-LLVM.html
http://llvm.org/pubs/2003-10-01-LLVA.html
-Chris
http://www.ics.uci.edu/~franz/SlimBinaries.html
<disclaimer>
just an ignorant Java developer who has an idea of what's being talked about, but at a self-taught/hobby level
</disclaimer>
I would like to say this is one of the most interesting articles I've read here in a while. Managed languages are definitely the way of the future. But if they are destined to become such an integral part of the operating systems of the future, wouldn't it make more sense to put the low-level stuff into the kernel? Wouldn't that allow for stuff like load handling and process optimization? Maybe even a sandbox for the sandbox: have a generic, VM-independent layer that would allow the kernel to know more about what's going on with the VM. I would think that the fewer levels of abstraction between the VM and the CPU, the better. Would you mind explaining why it would be a bad idea?
One of the big problems with Java, at any rate, is that although platform independence sounds real nice, you still have platform-dependent bugs and platform-dependent optimization. Profiling a Java app on Windows can give wildly different results than on Linux. And if we take the cross-platform benefits off the table for a sec, don't we just end up with a slower implementation of the kernel?
There is no need for special hooks in the OS for sharing the code of the JIT between processes. Most modern OSes keep a single copy of the read-only code segment in memory and share it between processes. The JVM and its libraries will be shared.
This doesn’t apply to the code produced by the JIT. Instead of having special handling for code produced by the JIT, one solution is to use the existing shared library mechanism. This is done by gcj; it incrementally compiles Java bytecode into .so files. These can then be loaded and shared normally.
I think in general the idea is there. Caching native code is the major, easily doable part.
I wouldn't be so eager to jump in and try to force VMs (Java…) to be regular applications, though. To a certain extent, we 'trust' our OS and our hardware to be perfect while programming. In a similar sense, you should eventually be able to 'trust' your JVM. Things like memory protection… aren't really relevant in a VM like Java's.
You don't have random access to memory, so what is there to protect? Why bother with a context switch between Java applications? And there are additional VM extensions to security…
Now, moving some JVM stuff into the OS also makes some sense. But now you lose some of the isolation between the OS and the VM.
Interesting, but I'm not certain how well it would work with, say, Lisp or Scheme.
I do see the advantage of the above, and of other such efforts, as great for those not so enamored of porting systems code and libraries around.
Microsoft .NET has the CLR, for Common Language Runtime; CLI usually means command-line interface.
I’m not really sure that a VM is worth the overhead given that you can compile to native code using tools like gcj anyway these days – given the dominance of x86 I’m not even sure abstracting the CPU is worth it either ….
I wouldn't be so eager to jump in and try to force VMs (Java…) to be regular applications, though. To a certain extent, we 'trust' our OS and our hardware to be perfect while programming. In a similar sense, you should eventually be able to 'trust' your JVM. Things like memory protection… aren't really relevant in a VM like Java's.
Oh, but if there is one thing that history (and security research) tells us, it's that programs have bugs. They always do, and they always will, and nothing you do can change that.
Putting the VM into its own context is an additional security measure, that will greatly reduce the significance of any kind of exploit (or instability) that can be achieved via bugs in the VM.
CLI is, in dotnet land, the Common Language Infrastructure.
Essentially all the java “built-in” classes.
Theoretically a VM can achieve higher performance than static compilation. A VM can use runtime type feedback — that is, it can look at what parameter types functions are being called with and compile specialized versions of those functions for those parameter types, thus avoiding runtime type dispatch. A static compiler can also compile specialized versions of functions, but it does not have the luxury of runtime type feedback, so it doesn't really know how many specialized versions to compile, or for which types. Before you say that this only applies to dynamically typed languages, even some languages that we think of as statically typed, such as C++ and Java, have runtime type dispatch, for example in the implementation of virtual methods, and thus they can benefit from runtime type feedback.
Sun’s implementation of Java does this, but Java has a reputation of being slow so it is not a good example. Other languages that use this technique include Self and VisualWorks Smalltalk.
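A small hand-written Java illustration of the feedback idea (not how HotSpot or any real VM implements it): a call site records the receiver types it actually sees and, once one type dominates, takes a fast path guarded by a cheap type check. This is essentially what a monomorphic inline cache does; the Shape interface is an invented example type.

// Sketch of runtime type feedback at a single call site. In a real VM the "fast path"
// would be a specialized, inlined compilation of the dominant receiver's method.
public class CallSiteProfile {
    public interface Shape { double area(); }

    private Class<?> dominantType;   // the single type this call site has mostly seen
    private int hits;

    public double dispatch(Shape receiver) {
        if (receiver.getClass() == dominantType) {
            return receiver.area();  // fast path: a JIT would inline the specialized body here
        }
        observe(receiver.getClass()); // slow path: ordinary virtual dispatch plus bookkeeping
        return receiver.area();
    }

    private void observe(Class<?> seen) {
        if (seen == dominantType) {
            hits++;
        } else if (hits == 0) {
            dominantType = seen;     // adopt the newly observed type as the specialization target
            hits = 1;
        } else {
            hits--;                  // decay: the site may turn out to be megamorphic
        }
    }
}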
Java slowness = the virtual machine being a memory hog, if you have an old computer and don't have 256-500 megs of RAM.
Ahh. I should have RTFA, my mistake.
The JIT and its runtime will always know far, far more about the executing state of the machine than a static compiler and can adapt accordingly.
In a similar way, assembly actually runs slower than C in a lot of cases because the C compiler can produce code that places the processor in a specific state, and the processor can easily predict what goes on from there and execute ahead.
You can do more work AND run faster just by knowing more about the executing states of the machine. This isn’t 1990 anymore.
Java slowness = Java being slower and its GC performing worse than similar VM languages. Not to bash Java, but to point out that the JVM can probably be vastly improved.
OTOH, you hit the nail on the head – when people claim Java has gotten faster, they disregard the fact that it's their hardware that has gotten better.
(Note: I decided to address the C vs assembly comment first)
Anonymous (IP: —.tm.net.my): In a similar way, assembly actually runs slower than C in a lot of cases because the C compiler can produce code that places the processor in a specific state, and the processor can easily predict what goes on from there and execute ahead.
A human can do by hand the same thing as the compiler. The only difference is that it's easier to let the compiler do it. This is especially true if you are developing software for more than one platform, or are new to the platform, or the platform has had some recent changes, or if the software you're developing is especially complicated, or if you don't have all the necessary documentation for the platform, etc.
Anonymous (IP: —.tm.net.my): The JIT and its runtime will always know far, far more about the executing state of the machine than a static compiler and can adapt accordingly.
Wrong. You can write a "static" program manually that will know everything about the computer it's running on and adapt; all you need to know are the appropriate instructions and techniques. (In fact, there are some optimizations that can be done manually that to my knowledge are still not done by any JIT.) And if it can be done by hand, then it's quite possible that a "static" compiler can be developed to do the same (or a similar) thing.
The real advantages of a JIT are: 1) A program can be written a long, long time ago and the JIT can still use the latest optimization techniques on it, whereas if it was compiled with a "static" compiler, it's stuck with whatever optimization techniques were in use at the time. 2) A statically compiled program needs to have all of its optimizations built in, and that could take up a lot of space. 3) If the optimizations are done by hand, that's going to require a fair amount (to a lot) of effort, knowledge, and time just for the optimization, whereas with a JIT it's all done for you, so you can focus more on simply making the program work.
However, I can say that I feel doing some optimizations by hand can be a big help. I've known of situations where the programmers found tons of bugs (and higher-level optimizations which the compiler couldn't do) simply by converting the program to assembly language by hand. The reason this helped is really quite simple: it provided them with a totally different perspective on the program, and suddenly hard-to-find bugs (including ones which the programmers and debuggers had no idea existed) "suddenly appeared" and were "easily" eliminated.
I forgot to add something…
Anything you (or a JIT) might do to optimize a program while it is running is going to require some overhead of its own. Whether this overhead will outweigh the gains from the optimizations is a good question which you need to ask.
Personally, I don’t know if it does or not with the current JITs. Perhaps someone else does.
With such optimizations done by hand though, they are “easily” checked and eliminated through a sufficient level of testing. (I’d imagine that a JIT should be capable of doing the same thing automatically.)
First of all, I wanted to thank everyone for the thoughtful consideration and interesting responses. In particular I found the LLVM stuff fascinating. While it doesn't quite address all the things I had in mind, it does go a long way there.
Now a few comments in response to what people have said.
Why Use VMs
First of all, there are several good reasons to use virtual machines rather than static compilation. As several people here have accurately pointed out, there are some performance benefits to doing things at runtime. Additionally, there are many garbage collection benefits to working in a virtual environment, as additional information is available, letting one avoid the drawbacks of conservative GC. In particular, I think it would be very difficult to provide guaranteed finalization in a statically compiled environment.
Moreover, static compilation doesn't provide binary compatibility. While theoretically one could simply provide guaranteed source compatibility, the pragmatics of software development make it quite unlikely that this would really be effective. Even pure ANSI C programs usually aren't write once, run anywhere. Quite simply, as long as the development environment is focused around the execution of native binaries, the temptation for developers to take advantage of binary-level features incompatible across platforms is simply too great. Furthermore, without fat binaries or a solution like the one I am suggesting, it seems difficult to provide transparent binary copying between platforms and architectures.
Still, these issues may not in themselves provide a compelling justification for such a major change, and some hack like fat binaries or automatic recompilation might offer the user the appearance of perfect binary compatibility. However, VMs provide several features that simply can't be provided in statically compiled code.
Foremost are fine-grained permissions/sandbox features. By only allowing the JIT compiler to cache/create 'safe' code, we can force all sensitive operations to be performed virtually. While we might introduce coarse-grained permission features using ACLs or binary scanning, these simply can't provide the level of protection and the fine-grained distinctions a virtual environment can provide.
For instance, suppose you download a program from the internet which edits/updates your bootloader. This program needs direct access to your disk, but you don't want to allow an error to overwrite all your data, or a trojan to maliciously modify other executables on your system. In a virtual machine, all calls to the direct-disk system call would be sensitive and would pass through the emulator portion, which can enforce restrictions like requiring all reads and writes to be within a certain range. Since the sector being accessed may be determined by a complicated algorithm, one simply can't guarantee these restrictions at compile time.
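A hedged sketch of that mediation follows. RawDisk stands in for the privileged system call and the granted sector range is invented for illustration; the point is only that the check happens at the moment of the call, on the actual sector the program computed.

// Sketch: the VM routes the sandboxed program's raw disk writes through an emulation
// layer that enforces the sector range the user granted to this particular program.
public class DiskSandbox {
    public interface RawDisk { void write(long sector, byte[] data); }

    private final long firstAllowedSector;
    private final long lastAllowedSector;
    private final RawDisk disk;

    public DiskSandbox(long firstAllowedSector, long lastAllowedSector, RawDisk disk) {
        this.firstAllowedSector = firstAllowedSector;
        this.lastAllowedSector = lastAllowedSector;
        this.disk = disk;
    }

    /** Every sensitive write from the sandboxed program is routed through here by the VM. */
    public void writeSector(long sector, byte[] data) {
        if (sector < firstAllowedSector || sector > lastAllowedSector) {
            throw new SecurityException("write to sector " + sector + " is outside the granted range");
        }
        disk.write(sector, data);   // forward the permitted operation to the real device
    }
}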
So while grsecurity does demonstrate that we can add access controls one at a time to certain system calls, it requires dealing by hand with each function one wants to restrict. A virtual environment provides a general solution where *any* system call can be made subject to near-arbitrary restrictions. One might specify that a given program is only to send UDP packets to a particular IP address, or may not start IPC with a particular process, or any restriction imaginable, not only those which the security people thought about. You can also guarantee that the program never sees information it must nonetheless make use of: for instance, the program might need the information from uname but you don't want it to read the uname field, or the program might need the result of one syscall to feed to another, but the program itself should not be allowed to see the information. Finally, you can implement positive security, giving a particular list of all and only the calls the program is allowed to make, rather than the negative security which is mostly what binary security can offer.
While some of these features might be possible to implement for native binaries with clever hacks, the performance hit would be unacceptable. If we want to block IPC to a process with a particular name in one program, a binary solution would require an authorization check for *every* program seeking to do IPC, while what we really want is for system-level programs to have fast unrestricted access and for sandboxed programs to go through the security checks. So if we want these completely general security restrictions for binaries, we either must accept the overhead of every syscall checking for authorization or write a wrapper function for every system call. If we want the ability to replace arbitrary syscalls with our own code (perhaps all programs in a particular sandbox need to be given a modified list of running processes), the difficulty becomes even greater. Not to mention the inherent superiority of virtual security over binary security: since a pre-compiled binary runs directly on the system hardware, it is much easier for the slightest error in your security model to allow an arbitrary exploit.
Finally, using a virtual machine allows trusted computing and contract-type programming that are difficult to implement in pure binary. At heart this is similar to the security issue but different in intent. For instance, a particular program/plugin may need both to access the internet, say to check for updates or gather data, and to handle personal information, and a VM-based system can track references to allow both while guaranteeing that the personal information can't exit the local machine (yes, this is hard and would have to be conservative). While I don't necessarily like the idea, this could also work to protect copyrighted content while allowing the user to load their own tools to search or format the information. It also has the potential to improve grid computing by providing better guarantees that it is really the distributed code which was executed. Finally, it offers the possibility of function libraries of unknown origin with enforced contracts.
The rest of this message will be in the next post.
So in the previous post I hope I clarified the reasons I think the use of virtual machines in programming offers very compelling advantages. Now, many of you seemed to accept this proposition but were unconvinced that this support needs to be moved into the operating system.
My first reason is that if we don't move it into the operating system, the benefits of these new coding methodologies and languages will be forever locked in the ghetto of application-level programming. If we want to be able to use these fancy new languages in the OS itself or in device drivers, there simply isn't any way around moving things into the OS. While I realize some of you may balk at the idea of using things like Java in the guts of an OS, I would remind you that people once felt the same way about using things like C or C++. So while these may not move into the kernel itself, I can certainly see them in device drivers and system calls.
In particular, the safety features of virtualization are especially attractive in device drivers. Since device drivers are often downloaded from the internet from untrusted sources, yet given hardware-level access, they are particularly in need of security features. Furthermore, they are ill-suited to the all-or-nothing type of authorization present in binary-only security implementations, as it is not uncommon that we would want to give a driver the ability to do things like directly interface with a system bus, but only construct messages with a certain prefix (I'm out of my element here, so I apologize if the example is incorrect, but we do often want to give device drivers access to certain arbitrarily defined subsets of a device). While we could implement these protections by hand, the evolution of hardware makes it difficult to keep adding protections that cover everything one needs to access, allow new device drivers enough access to do what they need, and avoid unduly slowing down trusted system drivers which we want to bypass any security checks. A VM system allows the possibility of a general solution which lets device drivers ship with a complete, enforced specification of exactly what access they need.
Also, some people have suggested that we need not add OS support to gain many of the process protection and optimization features for our virtual machines. While it is certainly true that we can load much of the JIT compiler into a shared library and implement a small interface for each instance, this doesn't solve the thread-level protection problems (which is why I mentioned threads in my article instead of just processes). In particular, since threads share much of their execution context, it is not sufficient to simply handle each thread as its own OS process. So unless we want to just reimplement all the thread protections and scheduling in each virtual environment, this suggests OS-level support.
This support doesn't need to be very complex. I am thinking of something as simple as entering them as threads in the scheduler, but instead of waking and sleeping them normally, just calling wake and sleep functions in the virtual machine code. Actually, now that I think about this, it may very well be possible in some of the cooperative thread scheduling implementations which allow the kernel to do the scheduling and then just expose data to allow a user-space process to manage the threads.
Other than a few small kernel modifications like this, the rest of what I mean by OS integration is basically putting support for a common multi-level binary format into the binary loader, the same way the binary loader supports shared libraries. In other words, provide a common infrastructure to support the binary caching and multi-level features I suggest for JIT compilers, which apparently are even implemented in some situations. So rather than each virtual system doing these things itself, there would be uniform system libraries that provide functions like 'get SSA representation' or 'find code associated with SSA subtree'. This is needed, as I mentioned in my article, so that executables written for several different virtual machines might be executed together without the overhead of CORBA or the like.
Now, I agree that many of these things are forward-looking and somewhat speculative. It probably isn't time to start throwing virtual code into device drivers and the like. However, since I think it is inevitable that more and more of our applications will be written in high-level languages with virtual execution, it doesn't hurt to start thinking about it now.
Oh, and also, one of the other advantages of virtual machines is reflection-type services (maybe I got the term wrong, but I mean things like creating functions on the fly). For instance, it is nearly impossible to write good Lisp as a native binary (and I don't count putting the interpreter in the binary). For the person who asked, the virtual machine in question here is basically a machine which implements car, cdr and a few basic operations and has two stacks, similar to a Lisp machine.
There are some other nice benefits of VM-based architectures: generally better integrated security, for one. VMs make it much more difficult to create buffer overflows. And while it's still possible for coding errors to lead to privilege escalation within a VM (for example, when somebody misses a security demand prior to executing sensitive code), the sandboxed nature of a VM can provide yet another layer of security around an app (aka shell) to prevent it from horking your machine. Better portability. Easier coding and maintenance. Reflection (which makes late binding a lot easier than using RTTI in C++).
@Deletomn
It's true that you can manually create all the instructions C compilers produce by hand, but these 'optimizations' are actually extra instructions that are seen as 'useless' by many assembly programmers, and they choose to shortcut them. After all, why access that register, etc., when you don't have to? However, it's this particular order of instructions that actually places the processor in the right state such that it can accurately predict what to do next.
You're right that it takes in-depth knowledge to do assembly optimizations; however, the overwhelming majority who use assembly to 'optimize' don't have that knowledge and do it "because it must be faster."
And again I will say, the JIT and runtime will always know more about the executing state of the machine: not just its configuration, but all the data and instructions issued, and they can dynamically recompile select sections of code to adapt. Static compilers and runtimes have to rely on alternate code paths. Dynamic compilers and runtimes can 'see' what's going on at the moment and adapt the generated machine instructions; static compilers need to guess what will happen next and select a path. We're not talking about self-tuning of the program's internal variables and what-not; IIS and SQL Server have been doing that for years.
People like to cite the overhead of dynamic compilers and runtimes as affecting performance in a negative way. However, this 'overhead' is what makes them faster. A lot of people find this very hard to accept, as it seemingly goes against all conventional logic on the surface. It makes more sense when you go deeper and look at things like speculative execution and probabilities.
There is one very important trade-off with automatic compiler optimization, though. Source-level debugging becomes useless after the compiler's had its way, and you need to drop into the lower levels. Compiler and runtime writers dedicate a lot of effort to trying to make sure the optimizer doesn't introduce bugs of its own.
As far as I know, you can have your code AOT-compiled when installing a .NET executable. The Mono runtime also supports this capability.
I cannot exactly remember the extension (but let's call it .so). If you want to load Windows.Forms.Dll in Mono, it first checks for the precompiled version in Windows.Forms.Dll.so, and loads the normal version for JIT if it cannot find the precompiled one.
(For Mono check: mono --aot; however, I do not know how the MS runtime does it.)
Anonymous (IP: —.tm.net.my): (Assembly vs C Comment)
Yes… Yes… I know all that. However, there's one other thing to take into account: a human understands what the purpose (among other things) of the program (or function) is. This sometimes becomes important, because some optimizations can be "obvious" to some humans and yet evade the best efforts of the compiler.
Granted, that doesn’t always happen (perhaps not even often, I wouldn’t have any statistics) but it is a possibility. And certainly it still requires a human who actually has some sort of skill at this.
Personally, in my opinion, the nice thing about optimizations done by the compiler is that they are nice and easy and usually very good, so that you (the programmer) can focus on making the program work rather than worrying about every little point that might be inefficient. Optimizing a program properly and thoroughly can take quite some time. And then if you move it to another platform (or something else happens), you don't have to worry about having to rewrite all of your optimizations; the compiler will do it for you.
In addition, (as stated before) you need quite a bit of expertise in order to optimize a program well, and this requires research and experience. Once again, more time saved.
Anonymous (IP: —.tm.net.my): And again I will say, the JIT and runtime will always know more about the executing state of the machine.
And once again… I will say no, it won't. Not versus a properly written program. All options available to a JIT and runtime are available to a statically compiled program. ALL OPTIONS. There is not a single one which cannot be implemented with "some" effort. All techniques, all instructions, all data, everything is available. It could (possibly) be built into a static compiler; certainly it can be done by hand. The advantage over doing it by hand is obvious: time, effort, and expertise. The advantage over static programs in general (including a compiler) is, as I stated before, that the static program will need to lug around everything it needs with it, and of course it can't include optimizations which haven't been invented yet.
For example, a (simple) statically compiled program can check its status as it runs and see how some or all functions are used by keeping statistics. It can then modify function calls (or, for really simple programs, function pointers) and switch them over to different optimized versions of the function (you could have 100,000 different variants of one function, though this would be excessively impractical).
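As a toy illustration of this point (not taken from any real system), a statically compiled Java program could ship two variants of a routine, keep its own call statistics, and redirect a function reference once the observed usage pattern favours one variant. The threshold values here are arbitrary.

import java.util.Arrays;
import java.util.function.Consumer;

// Sketch: the program profiles its own calls and commits to one of two precompiled
// variants after enough samples, with no JIT involved.
public class AdaptiveSort {
    private Consumer<int[]> impl = AdaptiveSort::insertionSort;
    private long calls;
    private long totalElements;

    public void sort(int[] data) {
        calls++;
        totalElements += data.length;
        if (calls == 1000) {                        // enough samples: commit to a variant
            if (totalElements / calls > 64) {
                impl = Arrays::sort;                // mostly large inputs: library sort wins
            } else {
                impl = AdaptiveSort::insertionSort; // mostly small inputs: low-overhead variant wins
            }
        }
        impl.accept(data);
    }

    private static void insertionSort(int[] a) {
        for (int i = 1; i < a.length; i++) {
            int v = a[i], j = i - 1;
            while (j >= 0 && a[j] > v) { a[j + 1] = a[j]; j--; }
            a[j + 1] = v;
        }
    }
}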
More "advanced" static programs could actually go through and modify individual instructions. At this point, however, it becomes very impractical to do by hand (and thus things begin to fall apart), for obvious reasons: it will be different for each processor, it requires a high level of expertise, it is difficult to write, difficult to test, etc.
And so on…
As I stated, the advantage over doing it by hand is time, effort, and expertise (and obviously, after a certain point, practicality). However, it is possible, which was my point, and for all I know some innovative programmers out there might have solved these things statically and might not be talking (because it's some "secret").
One thing they can't do is have the program add optimizations to itself that weren't known about when it was compiled. And as I said, it's going to have to haul all of its machinery around with it, and that's going to take a fair bit of space.
Anonymous (IP: —.tm.net.my): We’re not talking about self-tuning of the program’s internal variables and what-not, IIS and SQLServer have been doing that for years.
Hmmm… As far as I know, modifying individual instructions is quite a bit different from self-tuning a program's internal variables. (BTW, I've done some of these things to a small extent, and I also know what I could have done but didn't have time to do.)
Anonymous (IP: —.tm.net.my): People like to cite the overhead of dynamic compilers and runtimes as to affecting performance in a negative way. However, this ‘overhead’ actually makes it faster.
Overhead does not make things go faster. That's like saying that by dropping money on the ground I've become richer. Dropping money on the ground clearly makes me poorer. However, you can still have a net gain, because the action that caused the overhead allows you to do things you wouldn't otherwise be able to do. In the case of dropping money on the ground, I might now be able to pick up that wad of money "over there". Whether I actually made a net gain or not is completely dependent on how much that wad is worth versus the wad I used to have. (Of course, we also need to take into account the walk over there and the amount of energy it took to drop the first wad, plus the amount of energy it took to pick up the second wad.)
Anonymous (IP: —.tm.net.my): It makes more sense when you go deeper and look at things like speculative execution and probabilities.
I know that; I just don't know what the actual probabilities are, because I've never done extensive real-world testing with it myself. Things can be different between theory and reality, you know. And things can vary from one system to another.
And I'd want extensive real-world tests; I've seen too many big announcements from local "Java supporters" which didn't even apply to their own clients/company. (One comes to mind: someone was talking about how vastly superior Java is to C/C++ as far as portability goes, and his latest project was implemented in Java. I had the "pleasure" of seeing the results: about half of the workstations this program was supposed to go on didn't have a compatible JVM at the time, but they all had C/C++ compilers. Later, as in a couple of years later, those workstations were upgraded to machines that did have JVMs. Note the "later" part.)
Note: I've got nothing against VMs and Java in general; I think in the long run (if nothing else) they will be great. But in my opinion a lot of people are overly excited.
Overall, I thought your article was interesting. However, I believe you missed what I consider the most important point in joining the VM with the OS: true plug-and-play for hardware.
Let me explain: if a standard was developed through which devices could store their own device drivers, and those drivers could be easily retrieved (via the standard) by the OS, then you could simply plug the device into the computer and the OS would "instantly" understand how to use it. The drivers would also be cross-platform, so if, say, Linux distributions and Mac OS implemented the VM, then the device would automatically work with those OSs.
I haven’t done much hardware design work, so I don’t really know how much this would add to the cost, complexity, etc. of the devices, but I feel that it wouldn’t be too hard to add and I also feel it would be well worth the effort.
There is no way JIT can ever be faster, because whatever you do, it's all opcodes in the end, and with the JIT you have the overhead of doing the compiling at runtime. It's revealing that I have yet to see a major Java app without a splash screen, and in my experience performance when running is also far lower, at least in a GUI. Anyway, while it's true that JIT code running on an Athlon can be faster than arch=i386 compiled code on the same machine, code natively compiled for your machine will always be faster than JIT.
Alright, so there has been much debate over whether JIT compilation can be faster than ahead-of-time compilation.
Now, in some *very theoretical* sense AOT compilation can match anything JIT compilation can accomplish. After all, one could regard the entire JIT compiler plus the executed instructions as one AOT-compiled program. In general, no matter what compilation technique you use, some sequence of machine instructions is being executed, and that sequence could be coded by hand or produced by a sufficiently good AOT compiler. So in theory AOT will always have the advantage over JIT, as whatever optimizations the JIT compiler produces with run-time data can be hard-coded into the program; i.e., in the worst case you might write a program that uses self-modifying code to duplicate whatever run-time optimizations the JIT makes use of while avoiding some of the JIT overhead.
However, whether or not some ideal AOT compiler could do a better job really isn't the question. Producing a perfect compiler is actually mathematically impossible (it would require solving the halting problem), so the correct question is whether a JIT compiler has practical advantages which make optimization easier than in an AOT compiler. Indeed, I think it does, for a couple of reasons.
First of all, profiling is simply easier with a JIT system, as it happens automatically without forcing the collection of real-world data and recompilation. Moreover, unless one expects users to recompile all their own binaries with their own profiling info, a JIT system has access to profiling info relevant to a particular user's usage pattern, which AOT compilers do not. This can make a real difference, as a user who calls a function on large data sets may benefit greatly from loop unrolling, while another who calls it often on small data sets may not. Similar considerations apply to optimizations for the processor the user is currently using.
I won't continue listing various runtime optimizations that are *easy* to make with a JIT compiler, but suffice it to say they are there. Those of you insisting that an AOT compiler could do this are correct in principle: one might just build a profiling feature into the compiled program and a function which modifies the code in response. However, we simply don't have good AOT algorithms to do this sort of thing, while they are easy to do in JIT code. Moreover, at this level the distinction between JIT code and AOT code starts to disappear, as one might reasonably allege you are just incorporating the JIT compiler into your binary.
So I think the advantages of JIT compilation with respect to performance are clear; the question is just whether they overcome the overhead of JIT compilation. I think the answer is clearly yes, if we make good use of instruction caching.
I guess I must misunderstand something. As I understand it, there appear to be two alternative models being discussed here.
1) Static Compilation *once* and for all in the development environment.
2) JIT Compilation *each time* the application is run, so that it can be optimized for a specific environment.
Surely the ideal is to compile an application *once* on a particular hardware configuration? So why not employ a third model:
3) Compile the whole application *once* when the application is installed. Recompile only when significant changes are made to the execution environment.
Or am I being too simplistic?
Kramii.
Some people have been praising AOT C-style compilers because they always put the processor in a known state. I think this is a mistake; it is actually a great deficiency in their operation, tolerated because it is too hard to do anything else.
In any program the processor should always be in a known state, i.e., the processor state is completely determined by the input and the prior instructions. What we really mean by a known state in this case is that the compiler has a convention about what the processor state must look like at particular points in the program. This means extra instructions are being used to bring the state into accordance with this convention even when it is not needed.
For instance, consider the convention that after a function call the return value is stored in some particular register. This may often make sense, but the return value may be entirely ignored by the calling code on every call, or always immediately stored back to memory and not used for some time, so it would make much more sense for the function to ignore the return value or place it in a temporary memory location itself so as to avoid spilling a register.
In fact, much of compiler optimization is about violating these conventions. The state of instruction selection theory today (I hope someday it will change to a more general algorithm) seems to be to start with strict conventions that are known to produce the correct result and then apply optimizations which abridge these conventions in a manner known not to break the program.
Unfortunately, most of these processor-level optimizations are performed using peephole analysis after code has been generated, i.e. the generated code is scanned with a relatively small window, and if the optimizer recognizes a series of instructions that can be replaced with a faster version, it does the replacement. As I understand it, a BURS system is an advanced way to accomplish this task: basically it organizes instructions into dependency trees and then scans for matching sets of instructions, which it transforms via rewrite rules.
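For illustration only, here is a tiny sketch of such a peephole pass over a made-up instruction representation. The single rewrite rule shown (dropping a load that immediately re-reads the location just stored) is a textbook case, not the DS-register example discussed next.

import java.util.ArrayList;
import java.util.List;

// Sketch: slide a small window over the instruction stream and apply rewrite rules whose
// replacement is known to be equivalent to the matched sequence.
public class Peephole {
    public record Insn(String op, String operand) {}

    public static List<Insn> rewrite(List<Insn> input) {
        List<Insn> output = new ArrayList<>();
        for (int i = 0; i < input.size(); i++) {
            Insn current = input.get(i);
            if (i + 1 < input.size()) {
                Insn next = input.get(i + 1);
                // rule: STORE x ; LOAD x  ->  STORE x  (the value is still in the register)
                if (current.op().equals("STORE") && next.op().equals("LOAD")
                        && current.operand().equals(next.operand())) {
                    output.add(current);
                    i++;               // skip the redundant load
                    continue;
                }
            }
            output.add(current);
        }
        return output;
    }
}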
As should be apparent, such a strategy depends heavily on finding efficient rewrite rules, and the longer the instruction sequences considered, the more optimizations are possible. As an example (which may or may not be real-world reasonable), imagine a bunch of code which sits between a call saving the DS register and a call restoring it, and which makes no use of the data segment in between these points. If this entire sequence is subjected to a rewrite at once, the optimizer may be able to store a variable in DS, while if it only considers sequences of a smaller length it can never know it is safe to overwrite DS. However, one can hardly go through all instruction sequences of even length 20, so if one wants this optimization to have a large window, it depends on successfully identifying commonly used code blocks and appropriate optimizations.
It was exactly this understanding and this problem which led me to suggest OS updates improving JIT compilation. As coders identify a commonly used code segment and an optimization of that segment, the resulting rewrite rule can be sent to JIT, or even AOT, users. Since it is easy to verify that two code sequences are equivalent, users can benefit from this optimization in all of their programs without any security risk.
JITs and VMs are system-level software; there's no getting around that. Application programmers have to rely on them doing their job accurately. Given the stability requirements for VMs, one has to assume that their code is very good and their bugs are few and far between. If so, why not push all of this code into the kernel? Eliminating user mode entirely can speed up programs, because all hardware accesses can be direct calls without the costly stack switch and access checks involved in a kernel-user transition.
In answer to your question, Kramii, I am proposing a combination of 2 and 3. Instead of fully JIT-compiling the application each time, cached code snippets would be used, but the program would still execute in a JIT/emulator-style environment so that sensitive calls can be emulated for security and other reasons. Furthermore, I am suggesting that the compilation process be continuous and in the background, so that the processor is always sitting around optimizing applications, or at least doing so whenever it finds a small improvement.
That's exactly what completely VM-based OSes (e.g. JNode) do. Such OSes can run device drivers, applications and even most if not all VM code in the VM, and thus in kernel mode.
(The trick of running the VM in itself is a bit more complex to explain and I don't have the time now, but JNode is doing that or will do it in the future AFAIK, so you can read up there.)