This article surveys a number of benchmarks and finds that Java performance on numerical code is comparable to that of C++, with hints that Java's relative performance is continuing to improve. The authors then describe clear theoretical reasons why these benchmark results should be expected.
Hm, now how can this be true…every day we hear “Java is just soooo slow”, and “java is dead”, so this article must be fake.
IMO, the startup time of desktop Java programs is still slower than that of the average C/C++ counterpart. This is what most people base their opinions on, and thus this is why people still think that Java is slow.
The second biggest reason that Java is 'slow' is that many people never upgraded their VMs and are running the old MS VM, or the first version of Java they ever downloaded. The new VMs are really good and have decent performance, but most people would never know, because the download is 10+ megs just for the VM.
You hear "Java is dead" from MS marketing reps or people who listen to them. You hear "Java is slow" from people who have experience with GUI apps written in Swing sometime in the last few years.
Here… Read the oodles of posts regarding this from about 3 weeks ago on Slashdot before duplicating them all again here please:
http://developers.slashdot.org/article.pl?sid=04/06/15/217239&mode=…
Don't know if it's applicable to the code they're using, but it's well known that using -O3 with GCC will result in slower code than -O2 in some scenarios, thanks to the resulting code being bigger.
Interesting read though.
Really? Any examples of cases where -O3 generates code that is slower than -O2?
If money and developer resources keep being poured into Java, there is no reason why it cannot be a contender against any natively compiled language.
There may be some invariant overheads (JIT compile time, VM startup, etc.), but in terms of run-time performance, there really should be little difference.
It is useful to remember that back in the late '80s, COBOL compilers generally produced faster code than the C compilers of the era. Why? Because COBOL had 30+ years of developer time and financing from financial institutions who were willing to pay for incremental improvements.
John
The bits about GC improving performance and pointers hindering analysis are quite true. His overview of the advantages of runtime compilation is a bit overblown: powerful static compilers can do most of those things already.
In general, the potential performance of Java is due to the fact that it uses the Lisp/Smalltalk machine model, and there are literally two decades of intense research available on making that model go fast. For a very long time, much of that research went unused in commercial compilers, because the results weren't very useful in C. Now that Java and C# both use the same machine model, however, a lot of that research is suddenly applicable. Indeed, compiler research seems to be picking up not too far from where it left off, and as a result, there are some new original contributions to the field that are useful for not just Java, but any language with a Lisp machine model.
@Wee-Jin Goh
check out http://www.coyotegulch.com in particular
http://www.coyotegulch.com/acovea/acovea_4.html
You will see that -O3 can be slower than even -O1 (Huffman compression).
John
Thanks for the link. I remember reading that page before, but I couldn’t remember what it was called 😉
I can think of two other reasons why Java is still perceived as slow:
1. Swing. It’s an ugly, slow behemoth that needs to be put out to pasture. SWT-based applications like Eclipse and Azureus (a very nice bittorrent client) feel much faster, and integrate much better as well. This doesn’t affect server applications of course, but that’s not where people ‘perceive’ Java…
2. Memory footprint. Java is still a huge memory hog and if it pushes the computer into swap you can forget about all the other reasons why Java feels slow, they pale into insignificance. An instance of Java running Azureus takes about 45MB on my machine, and that’s without actually doing anything.
How fortunate that several good GC implementations exist for use with C++ …
It's good, too, that C99 introduced "restrict", which alleviates the problems of possible pointer aliasing. Hopefully that makes it into C++ someday (gcc already supports it in C++ compilation).
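A minimal sketch of what it buys (hypothetical function; __restrict__ is GCC's spelling of C99's restrict in C++ mode):

// Without restrict, the compiler must assume dst and src may alias, so it
// cannot keep src[i] in a register across the store to dst[i] or vectorize
// the loop freely. The qualifier promises the pointers never overlap.
void scale(float* __restrict__ dst, const float* __restrict__ src,
           float factor, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * factor;   // loads can now be hoisted/vectorized
}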
Put simply, -O3 tends to slow down looping conventional code a little. Only a little, mind.
The trouble with Java benchmarks is that they are typically pure integer tasks with a few loops (and then, not too deep).
Even the most braindead compiler/VM should be able to produce decent running code from integer adds/multiplies and a few loops. All code should be more or less the same.
Of course, every time a Java fan does this, it's all pretty stupid, because once you add deep looping, non-integer types, etc., everything goes wrong.
Java goes bad as soon as you even look at objects (which is a problem in an OO language!) or complex data types. Instantiation is especially bad, which is why the classic Don’t-New-In-A-Loop optimisation improves Java performance so much.
In short, it's a useless benchmark; in real-world tests you wouldn't use Java like this at all, and it would run far slower than the C.
Fin.
Discussing compiler optimizations is rather abstract, so I thought I’d give some concrete examples:
1) Call-via-pointer:
Say you have a function qsort() that takes as a parameter a pointer-to-function used for comparison. Say you want to pass compare_strings() as that parameter. So you do:
qsort(…, compare_strings, …);
Even though compare_strings is a global constant, your C++ compiler will not inline that call in qsort. As a result, the call will be about 10x slower than a qsort written by hand. Many compilers for higher-level languages will do this optimization.
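For contrast, a minimal sketch (hypothetical names) of the template idiom C++ uses to get the inlining back: the comparator's type is part of the std::sort instantiation, so the compiler sees the exact target and can inline it, where qsort only ever sees a pointer.

#include <algorithm>
#include <string>
#include <vector>

// A comparator whose type carries the code: std::sort is instantiated per
// comparator type, so operator() can be inlined into the sort loop.
struct CompareStrings {
    bool operator()(const std::string& a, const std::string& b) const {
        return a < b;
    }
};

void sort_strings(std::vector<std::string>& v)
{
    std::sort(v.begin(), v.end(), CompareStrings());
}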
2) Virtual-call:
Say you have an abstract base class Iterator:
class Iterator {
…
virtual void next();
…
};
Ideally, you could do:
for(Iterator i = foo.begin(); i != foo.end(); i.next())
{
…
}
The problem with this is that it's not patently obvious to the compiler what the concrete type of the iterator is (ListIterator or VectorIterator, etc.), so it'll do a very slow virtual call. If i.next() is simple (e.g., a pointer increment), this will be 10-100x slower. On the other hand, if an advanced compiler can figure out the exact type of foo, it can infer the exact type of its iterator and inline the next() call.
This particular optimization makes the STL’s map<int, vector<char>>::iterator nonsense unnecessary.
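A minimal sketch of the two designs (AbstractIterator is hypothetical; the vector loop is the STL's actual approach): the virtual version pays an indirect call per step, while the template version exposes a concrete type whose operator++ inlines to a pointer increment.

#include <vector>

struct AbstractIterator {           // the convenient, polymorphic interface
    virtual void next() = 0;        // every step is an indirect call
    virtual bool done() const = 0;
    virtual ~AbstractIterator() {}
};

int sum(const std::vector<int>& v)
{
    int total = 0;
    // const_iterator's concrete type is known here, so ++it and *it
    // compile to a pointer increment and a load: no dispatch at all
    for (std::vector<int>::const_iterator it = v.begin(); it != v.end(); ++it)
        total += *it;
    return total;
}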
3) Range-checked iteration:
Say you're a good, careful programmer and use the STL's at(), which does range-checking, instead of operator[], which does not. So you do:
int sum = 0;
for (std::size_t i = 0; i < vec.size(); ++i)
    sum += vec.at(i);
Most C++ compilers won't optimize out the range-check, but compilers for languages that mandate range-checking will.
4) Stack allocation:
Consider the code:
big_struct bs;
bs.field = 10;
do_foo(bs);
This works, but consider what happens when you change your API and decide that your function should return bs instead of calling do_foo() on it. So you change it to:
big_struct* bs = new big_struct;
bs->field = 10;
return bs;
In Lisp and Smalltalk (and Java, discounting primitive types), that change would be unnecessary. Conceptually, everything is always allocated on the heap, so you can return a reference to a local variable. However, the compiler can use something called escape analysis, and stack-allocate objects that it knows cannot be accessed after the function returns. The optimization doesn't make the code any faster, but allows the programmer some convenience without trading away performance.
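A minimal sketch of the decision such a compiler makes (hypothetical types; today's C++ compilers don't do this, which is the point):

struct big_struct { int field; char payload[256]; };

void do_foo(const big_struct&) { /* ... */ }

void use_locally()
{
    big_struct* bs = new big_struct();  // never escapes this function, so an
    bs->field = 10;                     // escape-analyzing compiler may turn
    do_foo(*bs);                        // this into a stack allocation
    delete bs;
}

big_struct* make_struct()
{
    big_struct* bs = new big_struct();  // escapes via the return value, so it
    bs->field = 10;                     // must stay a heap allocation
    return bs;
}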
5) Automatic unboxing:
Say Vector is an old-style C++ vector class that stores everything as a void*. So you do:
Vector v;
for(int i = 0; i < 10; ++i)
v.append(some_func(i));
Your C++ compiler will happily store a void* to a heap-allocated value (essentially, a boxed representation). This will be the case even if some_func() always returns a floating-point value. Certain compilers for Scheme and ML can infer that v can only ever hold floats, and make the Vector store floats directly.
This particular optimization makes vector<float> mostly unnecessary, although you might still need to give a hint to the compiler's type-inference engine.
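A minimal sketch of the difference (hypothetical classes): the void*-based container must heap-allocate a box per element, while a container proven to hold only floats can store them inline, which is exactly what unboxing analysis automates.

#include <vector>

struct BoxedVector {
    std::vector<void*> slots;           // one heap box per element
    void append(float x) { slots.push_back(new float(x)); }
    float get(int i) const { return *static_cast<float*>(slots[i]); }
    // (the boxes are leaked here; cleanup omitted for brevity)
};

struct UnboxedVector {
    std::vector<float> slots;           // values stored inline, no boxing
    void append(float x) { slots.push_back(x); }
    float get(int i) const { return slots[i]; }
};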
The end result of all this is the following:
0) The fastest code will always be hand-written ASM. C is not as fast as ASM in the best case, but is in the *average* case.
1) Hand-optimized C code will always be faster than what even the most powerful high-level compilers can generate.
2) Almost no C programmers have the time to hand-optimize all of their code. Especially taking into account the silly things (manual memory management, debugging array out-of-bounds errors) that they must waste their time with.
3) This is especially true in the commercial world, where optimization is a luxury, one easily dismissed in the face of a looming deadline.
4) Because the compiler can micro-optimize each line of a high-level program, a high-level language might not be as fast as C in the best case, but might be in the average case.
What do people find faster, .Net or Java?
There are *decent* garbage collectors for C/C++, but nothing as good as what's available for languages that properly support GC. In particular, the fastest GC algorithms (e.g., generational) depend on the ability to move objects in memory, which conservative C++ GCs cannot do.
What about the startup time and memory usage of Java?
I agree with a lot of the previous messages.
Java is compiled now.
Any arguments about that?
My point is that Java is good for servers with tons of memory and processing power. But isn't a scripting language better than Java for a lot of uses?
I am not the biggest Java fan.
Are there any languages out there faster than Java?
The Stalin Scheme compiler is freakishly fast:
http://groups.google.com/groups?q=Stalin+vs+C&hl=en&lr=&ie=UTF-8&se…
For things other than microbenchmarks, SBCL, d2c, Bigloo, Clean, and Ocaml are likely to be faster, because they generally have more well-developed analysis.
No, nothing faster! hahahahahahaha<evil laugh> j/k
Really, assembly is. But just because Java isn't slow anymore doesn't mean you have to use it. Use the best tool for the job, depending on the application. Although I just saw a 3D game written in Java (http://www.oddlabs.com/), so maybe it's useful (if not perfect) for many things.
I find that it is a very good language for not shooting yourself in the foot with, while still being able to shoot. .Net has brought back a couple of pitfalls from C++ that Java (wisely) got rid of.
Java also has the advantage of being truly cross platform.
Didn’t we just have this conversation?
I enjoy your comments on this board.
And I have really tried to learn OCaml, but it is tough going.
OCaml's speed at Bagley's computer language shootout
http://www.bagley.org/~doug/shootout/
is impressive, so I thought I would try to learn it, but it is tough for me. It's totally different.
What do you think of this language?
http://www.rapideuphoria.com/
Fast but not object oriented.
http://osnews.com/story.php?news_id=5215
Good stuff. Check it out.
The first benchmark is concluded with this quote:
“On Intel Pentium hardware, especially with Linux, the performance gap is small enough to be of little or no concern to programmers.” [referring to a 20% performance gap].
I can’t help but feel that a 20% performance gap is nothing to sneeze at if you are doing scientific computing (as this benchmark was — fast fourier transforms, n-body problem, etc).
If you are running with a moderately sized dataset, a 20% performance difference could mean a difference of several hours in execution time.
The typical malloc()/free() approach to memory management is an awful one from a performance standpoint. The first blow comes from the fact that all threads making malloc()/free() calls hit a global mutex for each call, which can take, on certain architectures, up to 400 machine instructions.
Garbage collection, while still a leaky (sometimes literally) abstraction, can provide intelligent and adaptive usage of memory, and take several steps to avoid the typical gotchas such as memory leakage and fragmentation.
There are several lower-level approaches to avoiding the performance problems of manual memory management, such as memory partitioning on Win32 through the HeapCreate() API, and memory pooling. My preferred approach has been to create thread-local memory pools, which can be used to avoid hitting a global mutex on the process heap, as well as using other memory pools where necessary.
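A minimal sketch of the thread-local pool idea (a hypothetical design, not the poster's actual code): each thread bump-allocates small blocks out of a private chunk, so the global allocator, and its mutex, is only touched when a chunk runs dry. No free list or alignment handling; chunks are never returned.

#include <cstdlib>
#include <new>

class ThreadLocalPool {
    static const std::size_t kChunkSize = 64 * 1024;
    char*       cursor_;
    std::size_t remaining_;
public:
    ThreadLocalPool() : cursor_(0), remaining_(0) {}

    void* allocate(std::size_t n) {
        if (n > kChunkSize)
            return std::malloc(n);          // big blocks: global heap
        if (n > remaining_) {               // refill: the only path that
            cursor_ = static_cast<char*>(   // hits the global mutex
                std::malloc(kChunkSize));
            if (!cursor_) throw std::bad_alloc();
            remaining_ = kChunkSize;
        }
        void* p = cursor_;
        cursor_ += n;
        remaining_ -= n;
        return p;
    }
};

// one pool per thread, created lazily (GCC's __thread extension)
static __thread ThreadLocalPool* my_pool;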
We see *once again* a benchmark which takes a Java-oriented implementation approach (using virtual methods), despite the fact that any first year C++ student is taught that virtual methods are slow as they entail a log(n) vptr table lookup. Yes, Java is very fast at executing virtual methods thanks to run-time inlining, and C++ is slow. However, a simple move to templates instead of virtual methods will mitigate all performance issues.
In fact, the power of templates can be leveraged to make C++ an order of magnitude faster than Java in numerical code, especially in FFTs and other discrete transforms such as the cosine transform. Trigonometric functions can be implemented as fixed-length Taylor series expansions, and transforms can be partitioned into fixed-length blocks which execute in a loop. The fixed-length transform, when implemented as a template which invokes templated trigonometric functions, is reduced to a single mathematical expression. Mathematically redundant operations can then be eliminated at compile time, and on many compilers/architectures the Taylor series expansions will be reduced to a single machine instruction. The result is a fast, fixed-length transform with no branches or loops, usable in a readable, high-level manner.
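A minimal sketch of the Taylor-series half of that technique (hypothetical code, far simpler than a full transform): a fixed-length expansion of sin(x) written as a template recursion over a compile-time index, which unrolls into a single branch-free polynomial expression the optimizer can fold.

template <int K, int N>
struct SinTerm {
    static double eval(double x2) {
        // level-K factor of the Horner-nested Taylor series for sin
        return 1.0 - x2 / ((2 * K) * (2 * K + 1)) * SinTerm<K + 1, N>::eval(x2);
    }
};

template <int N>
struct SinTerm<N, N> {                  // recursion terminator
    static double eval(double) { return 1.0; }
};

// sin(x) ~= x * (1 - x^2/3! + x^4/5! - ...), truncated after five nested
// terms; the recursion unrolls completely at compile time
inline double sin_taylor(double x)
{
    return x * SinTerm<1, 6>::eval(x * x);
}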
I have still never seen a Java-versus-C++ test which utilizes the language features of C++ to their full potential; however, I have seen many touting the performance of dynamic optimization on very well written Java code.
In any benchmark which makes the claim that one language is “faster” than the other, even for the implementation of something as specific as an FFT, step one, especially if the claims seem outlandish, is to ensure that both implementations are optimal. In this case, clearly they are not.
What I would personally like to see, and I'm sure Rayiner Hashem would as well, is C++ with manual memory management pitted against compiled Lisp with garbage collection and macros. I would expect Lisp to beat C++, and I would certainly expect it to beat Java.
At any rate, some of you might remember these Java benchmarks I did a while ago, posted on OSNews and Slashdot, which show C walloping Java on procedural code:
http://fails.org/benchmark.html
A suboptimal implementation in any language is still suboptimal, and unless the implementations in each language are optimal the benchmark is unfair.
The only language benchmark I would trust would be one where it was coordinated by a 3rd party and different groups of language lawyers tried to construct the most highly optimized implementation they could for a given task.
All of that said, neither Java nor C++ are ideal languages for scientific computing. The atmospheric model used by my research group is written in Fortran 95, as are most other scientific applications in the known universe.
I thought that Java was much worse than C, but the article doesn't say anything about that.
If your system were largely composed of Java programs, it would be usable. The memory cost would weigh in about OK if you simply loaded it all at startup. The trouble, though, is that I have Java programs (Eclipse) that use 100-250MB of RAM. Anjuta uses about 70.
You can write decently fast code in any decent language, and Java is no exception. And like any "interpreted" language, it has advantages in certain areas. Now, since Java runs on a VM, its advantages may be a bit different from Perl's and Python's.
I think a lot of people are annoyed that it has taken until Sun is almost crying Chapter 11 before they make Java what it can be. The language is what, 9-10 years old now? Was C++ just becoming speedy in 1992 (I'm asking because I would have been too young to care)? And let me qualify that: properly written C++, not obsessively over-objectified C++.
Is the memory usage on 1.5 coming down to something more reasonable?
The lookup of a virtual method is not log(n). The index of the virtual method can be determined at compile time, so there is constant lookup speed no matter how large your vtable is or how many vtables you've got (leaving all cache issues aside, but if you've got vtables THAT large, you've got different problems). Templates are extremely cool, I agree here, but sometimes you just need virtual methods, and no matter how you tweak your templates, they won't do.
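A minimal sketch of why it's O(1): the slot index is a compile-time constant, so the call is two dependent loads plus an indirect call, regardless of how many classes or methods exist.

struct Shape {
    virtual double area() const = 0;
    virtual ~Shape() {}
};

double call_area(const Shape* s)
{
    return s->area();
    // compiles to roughly:
    //   vtable = *(void***)s;     // load the vptr
    //   fn     = vtable[SLOT];    // SLOT fixed at compile time
    //   return fn(s);             // indirect call
}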
In general, why don't these benchmarks use what is called a "working set" of memory, that is, anything that stretches beyond registers and first-level cache? For any reasonable task, I expect C++ to beat the crap out of any pseudo-compiled language that is so "in" these days.
For any reasonable task, I expect C++ to beat the crap out of any pseudo-compiled language that is so "in" these days.
It is a well known fact that a programmer’s intuition about the performance of programs is extremely inaccurate. Truth is running code through a profiler, not uttering some unfounded expectations.
(And I say that as a C++ programmer)
You're right that a virtual call is O(1). *However*, it's also a call via a pointer, which is phenomenally slow. If there are fewer than a couple of hundred possible targets for a given virtual call, it's actually faster to use a giant switch statement on the object's type field. The switch statement will be O(log(n)), but it'll be much easier for the processor's branch predictor to deal with, since the target addresses will be constants. See my post on comp.lang.asm:
http://groups.google.com/groups?q=Rayiner+group:comp.lang.asm.*&hl=…
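A minimal sketch of switch-on-type dispatch (hypothetical expression-tree types): the virtual call is an indirect branch through a just-loaded pointer, while the switch compares against compile-time constants, which the branch predictor handles far better.

struct Node {
    enum Kind { INT_LIT, ADD, MUL } kind;
    int   value;    // used by INT_LIT
    Node* lhs;      // used by ADD, MUL
    Node* rhs;
};

int eval(const Node* n)
{
    switch (n->kind) {              // compare tree or jump table, but every
    case Node::INT_LIT:             // branch target is a constant address
        return n->value;
    case Node::ADD:
        return eval(n->lhs) + eval(n->rhs);
    case Node::MUL:
        return eval(n->lhs) * eval(n->rhs);
    }
    return 0;                       // unreachable for well-formed input
}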
There is almost everything wrong with your argument.
Concerning 1) and 2): Code that uses generics, especially ones as powerful as C++'s, can become pretty ambiguous syntactically. That's why the STL requires you to specify types. The problems you describe don't exist. A templated sort is very likely to be non-virtual and inlined, and there is absolutely no need for iterators to be virtual, since the compiler knows at any time what type they are referencing.
3) Using the at() operator instead of the []-operator for vectors does not make you a good programmer. I would never hire a programmer who uses the at() operator where it isn't necessary, and to be honest, it almost never is. If you write loops where you're not sure that your access index is correct, your loop (and code in general) sucks. Besides that, your standard-conforming compiler is not ALLOWED to eliminate the index check, because any user of the at() operator is entitled to her personally thrown out_of_range exception.
4) This is not consistent. Of course, in a heap-only language you would not only not be required to make that change, you would not be ABLE to do it, since there is ONLY ONE way of allocating objects. C++ gives you more, not less. If you don't like it, don't use it.
5) Your sample is flawed. No sane programmer would use such a class. Besides, a vector always stores objects of the same type. That might be ints, void*s, or Object*s, or whatever. It can never store (at least not in my definition and that of 99% of other programmers) objects of a different size. If it "stores objects, like base and derived", then it can only do so by storing POINTERS to these objects, and believe me, those pointers have the same size. There also must be some way of knowing the type of the pointed-to memory: either you keep it, or the compiler/interpreter/language does (this leads to ugly Object*-based hierarchies, which fortunately are not necessary in C++). If neither you nor the compiler/interpreter stores it, you can't use it. Simple as that.
Yes, but this would only make sense if you actually had that many valid entries. The compiler is able to make these entries "non-spanning". If it uses IDs up to 62 thousand, that would only make sense if there actually WERE 62 thousand. The performance results of your method vary largely with the ratio of entries. If you have 62k entries in both cases, index lookup will be faster.
What you've explained is, btw, why sorted vectors of pairs are usually way faster than maps.
Okay, so when is Java's library implementation going to be designed as a multi-paradigm library?
And I mean the library implementation itself!
Concerning 1) and 2): Code that uses generics, especially ones as powerful as C++'s, can become pretty ambiguous syntactically.
Actually, it's completely unambiguous. After template expansion, the compiler knows exactly which functions will be referenced.
That's why the STL requires you to specify types.
No, the STL requires you to specify types because C++ compilers don’t do type inference. Say I declare ‘v’ to be a vector of int. Now, when I do v.begin(), the compiler can, through an analysis of ‘v’, infer the precise type of the iterator. There is no reason for the programmer to get bogged down doing it himself.
There is absolutely no need for iterators to be virtual since the compiler knows at any time what type they are referencing.
The whole point of iterators is that they allow you to not care about the type of the container. But template-based iterators completely defeat the purpose of that. Since each container has its own type of iterator, all code has to know what it's iterating over. This is not an ideal design, but was done because of performance concerns. However, type inference mitigates these performance concerns in the common cases.
Using the at() operator instead of the []-operator for vectors does not make you a good programmer.
Yes it does. operator[] is an optimization. It should only be used when your profiler tells you that at() is causing too much overhead. As our great sage Knuth tells us: “Premature optimization is the root of all evil.”
I would never hire a programmer who uses the at() operator where it isn't necessary
For the last several weeks at work, I’ve been hunting memory corruption bugs caused by that sort of thinking. The input data set to our program grew by 100x, and all sorts of assumptions got broken. The bottom line is that when you don’t do a range check, you create an implicit dependency in the code. By doing range and parameter checks, you make those implicit dependencies explicit. Now, either you can make the programmer waste his time writing those checks, or you can let the language handle it for you. Obviously, it’s not a good idea to let the programmer do it, because it’s farking 2004 and buffer-overflow exploits are still common.
Besides that, your standard-conforming compiler is not ALLOWED to eliminate the index check, because any user of the at() operator is entitled to her personally thrown out_of_range exception.
If you’d read up on how array-bounds-check elimination is done, you’d learn how they get around that. They break an array iteration into two phases, one that can be proven safe, and one that might not be safe. Thus, bounds-checks can be eliminated in the safe phase.
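A minimal sketch of the idea, with the split done by hand to show what the compiler proves (a real compiler derives this itself and keeps the checked semantics for the unproven phase):

#include <stdexcept>
#include <vector>

int sum_first(const std::vector<int>& vec, std::size_t q)
{
    if (q > vec.size())                     // one up-front test covers the
        throw std::out_of_range("sum_first"); // entire iteration space
    int sum = 0;
    for (std::size_t i = 0; i < q; ++i)
        sum += vec[i];                      // provably in bounds: no
    return sum;                             // per-element check needed
}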
This is not consistent. Of course, in a heap-only language you would not only not be required to make that change, you would not be ABLE to do it, since there is ONLY ONE way of allocating objects.
There is no point in distinguishing between heap allocation and stack allocation. The only argument for stack allocation is performance, and compiler analysis can eliminate the performance argument.
C++ gives you more, not less. If you don't like it, don't use it.
One good option is better than two bad options. I can stack allocate everything, but I have to rewrite the code if I want to return a reference to an object. I could heap allocate everything, but I’d pay the cost of the heap allocation even when it wasn’t necessary.
5) Your sample is flawed. No sane programmer would use such a class.
Um, the vector example is hardly unusual. Every generic data structure library in C uses essentially the same principle. Indeed, the Java containers work in the same manner. The problems with containers can be described at a more abstract level:
Containers can be heterogeneous or homogeneous. Heterogeneous containers are more flexible (a homogeneous container is just a special case of a heterogeneous container), but in the general case pay a performance penalty. A homogeneous container is faster, but is generally much less flexible. In particular, polymorphism is much less convenient in homogeneous containers.
The STL attacks the problem by making all containers homogeneous. This preserves performance, but limits polymorphism. In particular, the only way to store different types in a container is via a pointer (which eliminates the automatic memory-use benefits of the STL), and even that only works when all the types have a common base class.
Other languages attack the problem by making the observation that for most cases where a container is used in a monomorphic way, the compiler can infer that fact, or be made to infer that fact, and an unboxed container can be used.
If it uses IDs up to 62 thousand, that would only make sense if there actually WERE 62 thousand.
IDs are assigned to each class. Each target of a virtual function is associated with one such ID. That means for each call, there is an arbitrary subset of the whole ID space associated with it. There could be 62 thousand classes, and thus 62 thousand IDs, but only 100 arbitrary IDs associated with any given call. The switch would thus only contain those 100 IDs as its cases.
The performance results of your method vary largely with the ratio of entries. If you have 62k entries in both cases, index lookup will be faster.
Undoubtedly. After all, the limit condition is log(n) vs. O(1). However, it's a pathological case for a virtual function to have 62k possible targets. In the general case, there will be at most a few dozen possible targets, and the switch method will be faster.
What you´ve explained is btw, why sorted vectors of pairs are usually way faster than maps.
Actually, it’s nothing alike. Sorted vectors of pairs are usually faster because of cache effects and the cost of chasing pointers. Switch statements are often faster because of branch-prediction effects.
You all talk a lot of smack, but hardly any of you have any clue.
Hotspot's server compiler is extremely sophisticated. One optimization that it performs is called "polymorphic inline caching", which was developed more than ten years ago for the Self and Smalltalk systems. The idea is that the callsite of a virtual method often has high temporal locality: it often calls the same method several times, then it may in fact call a different method. No static analysis will detect such runtime behavior; to static analysis, the callsite is not monomorphic, and will not be devirtualized or inlined. But with polymorphic inline caching, the JIT will generate a guarded inline; it may choose to inline one of those possible receiver methods when it has determined that the invocation frequency at that call site is high enough. The guard is simply a one- or two-instruction test that determines whether the method that is currently inlined is the receiver. Each time that callsite is reached, if the "hot" method is the one being invoked, the inlined version will be run, saving the cost of a virtual dispatch. If it is a "cold" method, a full virtual dispatch will occur. The great thing about PIC is that this "hot" method can vary for EACH callsite DYNAMICALLY. Thus, if you have 1,000 invocations of method M1 at site A, and then 1,000 invocations of M2 at site A, at first M1 will be inlined at site A, and then, later, that code will be dynamically modified and M2 will be inlined in its place.
This ONE optimization is completely transparent to the program. The VM guarantees the correct semantics. And that ONE optimization TRIPLED the performance of my application.
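A minimal sketch, in C++ source form, of the guarded inline a JIT emits (hypothetical classes; a real PIC tests the vtable pointer and patches the guard in place as the hot receiver changes, here a class_id field stands in for that test):

struct Shape {
    int class_id;                       // stands in for the vtable pointer
    explicit Shape(int id) : class_id(id) {}
    virtual double area() const = 0;
    virtual ~Shape() {}
};

struct Circle : Shape {
    static const int ID = 1;
    double r;
    explicit Circle(double radius) : Shape(ID), r(radius) {}
    virtual double area() const { return 3.14159265 * r * r; }
};

double area_with_pic(const Shape* s)
{
    if (s->class_id == Circle::ID) {        // guard: one compare and branch
        const Circle* c = static_cast<const Circle*>(s);
        return 3.14159265 * c->r * c->r;    // inlined hot body, no dispatch
    }
    return s->area();                       // miss: full virtual dispatch
}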
Java made a believer out of me.
You guys have NO IDEA how advanced VM technology is these days. It’s astounding.
#developer is definitely faster than, say, NetBeans.
But it feels slower (on my machine) than Eclipse.
Hotspot’s server compiler is extremely sophisticated. One optimization that it performs is called “polymorphic inline caching” that was developed more than ten years ago for the Self and Smalltalk systems.
The basic technique is actually about 20 years old, polymorphic inline caching is an optimization to the basic inline caching model. And I’d hardly call it “extremely sophisticated.” The technique is pretty much SOP for Smalltalk environments that try to match the speed of native code.
No static analysis will detect such runtime behavior; to static analysis, the callsite is not monomorphic, and will not be devirtualized or inlined.
You’d be surprised how far you can go with a static analysis. See:
http://www.ai.mit.edu/people/jrb/Projects/pd.pdf
The only thing a static compiler cannot do is perform the actual inlining at runtime. This is only an issue if the target function is so small that the overhead of a regular function call (not a virtual function call, which is 10x more), overwhelms the cost of the target function. Even that might be largely mitigated if you take the logical step of inlining small targets into the generic functions dispatcher routine.
There seemed to be no ambiguity regarding C: it won every time. It is C++ that is being called into question, and it is clear that many of the high-level conveniences are…surprise! expensive.
Anybody who tries to compare modern C++ (templates, meta-programming, STL, boost etc) to Java is misinformed.
The two are different, and should be used for different things.
In all the C++/Java benchmarks I have read the C++ has been woeful, and the results often bulls**t.
To me, arguing about relative speeds is moronic. Is it the right tool for the job? That is the important question.
Is it fast enough for what YOU want to do? If so, who gives a s**t.
It’s been a tough day at the office.
You’d be surprised how far you can go with a static analysis.
Well, it just so happens that that is exactly my area of research. There are a lot of interesting analysis techniques that have been developed in the past 10 years. The paper you referenced is from 1995: the state of the art has advanced significantly since then.
I am not some runtime optimization bigot, but Java VMs have come a long way, kicking the shit out of (IMO) statically compiled languages, especially C++. Most people have no idea how advanced VMs are these days.
But as with many things, the pendulum has swung back the other way, and there are many more static optimizations available (pointer analysis, escape analysis, inferred regions, write-barrier removal, etc.) these days. Most have yet to see the light of day in a commercial compiler. I am more interested at the moment in using static analysis for verification, i.e., software model checking. I'm actually working on a software model checker for machine code.
But anyway, I really believe that the Java platform is extremely viable, notwithstanding the hoggish implementations out there now. I'm disappointed with Sun (even having worked for them at one point) that they never took Java on the desktop seriously, but who can blame them, since they made their bread and butter on servers for so long.
My two cents
http://tribaltrouble.com/
the FPS is pretty impressive
Seems like you know what you're talking about, unlike most people here 🙂 I see huge potential for dynamically optimizing VMs. Do you know if there is a way to inspect the machine code generated by Hotspot? This would help a lot when writing performance-critical code.
I would also like to add, to all the people who think Java is slow just because all methods are automatically virtual (unlike C++): a simple optimization in the JVM is that if a method is not overridden by any loaded class, it uses non-virtual calls to that method, so it can easily be inlined. That's why using the 'final' keyword doesn't affect performance.
>Concerning 1) and 2): Code that uses generics, especially ones as powerful as C++'s, can become pretty ambiguous
>syntactically.
>Actually, it's completely unambiguous. After template expansion, the compiler knows exactly which functions will be
>referenced.
Yes, of course it does. But it is sometimes impossible for the compiler to derive the correct type of the argument
from the source level in the first place. One example might be:
vector<int> i;
i.push_back( 5 );
vector<unsigned int> ui;
ui.push_back( 5 );
compared to:
vector q;
q.push_back( 5 ); // what is 5 ? signed, unsigned, long, char ?
You need to disambiguate this, unless there is a catch-all, worry-free type system that does it for you, with all the complexity that comes with it.
>That's why the STL requires you to specify types.
>No, the STL requires you to specify types because C++ compilers don't do type inference. Say I declare 'v' to be a
>vector of int. Now, when I do v.begin(), the compiler can, through an analysis of 'v', infer the precise type of
>the iterator. There is no reason for the programmer to get bogged down doing it himself.
>There is absolutely no need for iterators to be virtual since the compiler knows at any time what type they are
>referencing.
>The whole point of iterators is that they allow you to not care about the type of the container. But template-based
>iterators completely defeat the purpose of that. Since each container has its own type of iterator, all code has
>to know what it's iterating over. This is not an ideal design, but was done because of performance concerns.
>However, type inference mitigates these performance concerns in the common cases.
>Using the at() operator instead of the []-operator for vectors does not make you a good programmer.
>Yes it does. operator[] is an optimization. It should only be used when your profiler tells you that at() is
>causing too much overhead. As our great sage Knuth tells us: "Premature optimization is the root of all evil."
I knew you would say that. But operator[] is not an optimization. It's an element used to create fundamentals; compared to that, at() is an application.
>I would never hire a programmer who uses the at() operator where it isn't necessary
>For the last several weeks at work, I’ve been hunting memory corruption bugs caused by that sort of thinking. The
>input data set to our program grew by 100x, and all sorts of assumptions got broken. The bottom line is that when
>you don’t do a range check, you create an implicit dependency in the code. By doing range and parameter checks, you
>make those implicit dependencies explicit. Now, either you can make the programmer waste his time writing those
>checks, or you can let the language handle it for you. Obviously, it’s not a good idea to let the programmer do it,
>because it’s farking 2004 and buffer-overflow exploits are still common.
I prefer to have my playfield levelled before I play. I don't wear a helmet when I play table tennis just because the floor might break. Buffer overflows are a sign of weak checking of input or incorrect handling of the working set. Relying on the at() operator somewhere in the middle of your code to catch it is hardly an efficient way of preventing these kinds of attacks or finding broken assumptions. Both also come way more subtly than an unchecked []-operator. An index operator, be it at() or [], accessing an invalid element is a clear sign that the state of your program is in serious trouble. Don't get me wrong: during debug mode it is OK for [] to check. But in a release version, the difference is more philosophical than technical.
>Besides that, your standard-conforming compiler is not ALLOWED to eliminate the index check, because any user of
>the at() operator is entitled to her personally thrown out_of_range exception.
>If you’d read up on how array-bounds-check elimination is done, you’d learn how they get around that. They break an
>array iteration into two phases, one that can be proven safe, and one that might not be safe. Thus, bounds-checks
>can be eliminated in the safe phase.
Yes, but this can only be done at runtime. And it takes time.
>This is not consistent. Of course, in a heap-only language you would not only not be required to make that change,
>you would not be ABLE to do it, since there is ONLY ONE way of allocating objects.
>There is no point in distinguishing between heap allocation and stack allocation. The only argument for stack
>allocation is performance, and compiler analysis can eliminate the performance argument.
>C++ gives you more, not less. If you don't like it, don't use it.
>One good option is better than two bad options. I can stack allocate everything, but I have to rewrite the code if
>I want to return a reference to an object. I could heap allocate everything, but I’d pay the cost of the heap
>allocation even when it wasn’t necessary.
But where is the one good option?
>5) Your sample is flawed. No sane programmer would use such a class.
>Um, the vector example is hardly unusual. Every generic data structure library in C uses essentially the same
>principle. Indeed, the Java containers work in the same manner. The problems with containers can be described at a
>more abstract level:
I mean the self-written, type-unsafe vector class that you indicate. std::vector<double> stores doubles, not void*s or raw memory. There is no cast involved when accessing elements (of course, in the end all programming boils down to raw memory at the right time in the right place).
>Containers can be heterogeneous or homogeneous. Heterogeneous containers are more flexible (a homogeneous container
>is just a special case of a heterogeneous container), but in the general case, pay a performance penalty. A
>homogeneous container is faster, but is generally much less flexible. In particular, polymorphism is much less
>convenient in homogeneous containers.
>The STL attacks the problem by making all containers homogeneous. This preserves performance, but limits
>polymorphism. In particular, the only way to store different types in a container is via a pointer (which
>eliminates the automatic memory-use benefits of the STL), and even that only works when all the types have a common
>base class.
I have never seen an implementation of a heterogeneous container (by which I mean a container that stores binary objects of different sizes in contiguous memory). In the end they all store pointers (or equivalents) to objects. It does not make a difference whether the common-base-class/typeinfo feature is generated by the programmer (via a type-information-base-object mess) or is in the realm of the compiler/language (via type information/retrieval only visible to the compiler). That's what I meant with my previous post: if neither you nor the compiler knows what's in there and there is no way of finding it out, it's useless data. In theory a homogeneous container is indeed a special case of the heterogeneous container; in practice the heterogeneous container is usually a tweaked homogeneous container.
>Other languages attack the problem by making the observation that for most cases where a container is used in a
>monomorphic way, the compiler can infer that fact, or be made to infer that fact, and an unboxed container can be
>used.
Yes, but this means playing catch-up, not getting ahead. It is not really impressive that a particular language is smart enough to store data in contiguous memory without referring to "pointed" memory.
So, it looks like Sun is continuing to improve Java so that it writes better code than 99% of the C/C++ coders out there.
– It knows more about the code.
– It (the -server switch) profiles the code, which most C coders don't.
So, if you're a professional programmer who writes, tests, and profiles your code, C is still viable, this year. If you're like the rest of us, Java is the solution. ;^)
“Do you know if there is a way to inspect the machine code generated by Hotspot? This would help a lot when writing performance critical code.”
Sure. You can run HotSpot through any good debugger and step through the machine code generated. When I was at Sun (briefly) we used Forte’s debugger (now renamed at least twice), and could trace the execution through the interpreter, native code of the VM, native code of the JDK, OS libraries, and JIT-ed code.
“I would also like to add, to all the people who think Java is slow just because all methods are automatically virtual (unlike C++): a simple optimization in the JVM is that if a method is not overridden by any loaded class, it uses non-virtual calls to that method, so it can easily be inlined. That's why using the 'final' keyword doesn't affect performance.”
This optimization is called class hierarchy analysis (CHA). It is a particularly cheap way of getting most of the bang of type analysis without much of the cost. Most studies show that, averaged over 10 or 30 popular benchmark applications, 70%-80% of (the static count of) virtual call sites can be monomorphized with CHA. More powerful analyses generally only add 10%-20% more. The great thing about online CHA is that it is just that, online, meaning that the JVM performs CHA only on the set of classes that happen to be loaded, so subclasses that override methods but are not loaded on a run do not worsen the results.
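A minimal sketch of what online CHA concludes (hypothetical classes, written in C++ for illustration): while Button is the only loaded class implementing draw(), every call through a Widget* has exactly one possible target, so it can be devirtualized and inlined; loading another subclass later invalidates the assumption and forces deoptimization.

struct Widget {
    virtual void draw() const = 0;
    virtual ~Widget() {}
};

struct Button : Widget {
    virtual void draw() const { /* render the button */ }
};

void draw_all(Widget* const* widgets, int n)
{
    for (int i = 0; i < n; ++i)
        widgets[i]->draw();     // CHA: Button is the only loaded implementor,
                                // so this becomes a direct, inlinable call
}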
One thing about "final": final should almost NEVER be used for performance (i.e., to limit virtual dispatch). I have performed extensive performance tuning of very complicated applications (i.e., >100kloc), and superfluous use of final as a method modifier is invariably slower (at least on HotSpot). The reason for this is murky; it may be that declaring a method final distorts the inlining heuristic in HotSpot, leading to poorer inlining choices. I found a similar trend with using "javac -O", which does constant propagation and sometimes inlines private methods. Classes compiled with "javac -O" nearly always run slower than classes compiled with no optimization flags.
With things as they are now, I am very much of the opinion to compile Java code as it is, and let the VM kick the shit out of it at runtime.
You need to disambiguate this, unless there is a catch-all, worry-free type system that does it for you, with all the complexity that comes with it.
Yes, there are times you need to disambiguate things. That's why languages like ML have type declarations. However, you rarely need to insert type declarations in anything except function headers, because the compiler can infer the rest from there.
In any case, you're missing the point. I'm talking about heterogeneous containers, not homogeneous containers. Specifically, anything you can do with a homogeneous container, you can do with a heterogeneous container. But the reverse is not true. So why are STL containers homogeneous? Because of performance concerns. My point is that if C++ compilers did unboxing analysis, they could make all containers heterogeneous without unduly affecting performance.
I knew you would say that. But operator[] is not an optimization. It's an element used to create fundamentals; compared to that, at() is an application.
*Using* operator[] as opposed to at() is a premature optimization. Goto is the fundamental element upon which loops and function calls are built, but that doesn't mean I should program using goto.
I prefer to have my playfield levelled before I play. I don't wear a helmet when I play table tennis just because the floor might break.
But do you wear a helmet when you go biking? Because in the real world, buffer overflows are more common than car/bike collisions.
Buffer overflows are a sign of weak checking of input or incorrect handling of the working set.
Undoubtedly. But they are also something that can turn a 10-second fix into a 10-hour fix, because they are silent errors. That means that the actual cause of the problem is often far removed from the code that actually causes the segfault.
Relying on the at() operator somewhere in the middle of your code to catch it is hardly an efficient way of preventing these kinds of attacks or finding broken assumptions.
If C++ compilers did range-checking elimination, it *would* be an efficient way of catching these kinds of errors.
Don't get me wrong: during debug mode it is OK for [] to check. But in a release version, the difference is more philosophical than technical.
You tell the customer about philosophical differences. All he cares about is that with mandatory range-checking, a buffer overflow simply causes a program crash, instead of letting some hacker into his computer. Or, for those who have done contracting (I haven't, but I've heard stories): when your boss tells you to go fix a bug in some important customer's software and not to come back until it's done, you can bet that you'll be glad that range-checking makes the bug easier to find. The performance hit of range-checking is so small in practice that it is not worth it to turn it off in a release build. Indeed, the heavyweight mechanisms that have evolved to protect unsafe C programs from each other (memory protection, the kernel/user boundary) cost way more performance than just range-checking all pointers.
Yes, but this can only be done at runtime. And it takes time.
No, it is done at compile time. This is not a new technique, everybody and their momma has used it in a compiler.
But where is the one good option?
The one good option is to make everything conceptually heap-allocated, and then let the compiler stack-allocate things that can be stack-allocated. It results in the same performance as what C++ programmers do now, but when the code changes and a value can no longer be stack-allocated, it's the compiler that does the rewrite, not the programmer.
I mean the self-written, type-unsafe vector-class that you indicate.
Like I said, type-unsafe vector classes are very common in C and Java code.
std::vector<double> stores doubles, not void*s or raw memory.
The STL containers are not what I’m talking about. I want a heterogeneous container, not a homogeneous one.
I have never seen an implementation of a heterogeneous container (by which I mean a container that stores binary objects of different sizes in contiguous memory).
Vectors need not be stored in contiguous memory. The only distinguishing quality of a vector is that elements are accessed by limited-range integer index.
Yes, but this means playing catch-up, not getting ahead. It is not really impressive that a particular language is smart enough to store data in contiguous memory without referring to "pointed" memory.
You're completely missing the point. You yourself said that "in theory, a homogeneous container is a special case of a heterogeneous container." Unboxing analysis lets the programmer write to an API that logically follows the theory, without twisting the API for performance reasons. When the container is used in a homogeneous manner, the compiler automatically applies the performance optimization.
I was disappointed to not see any mention of GCJ. My guess is that AOT compiled Java would be faster than the same code run under a JVM, but I’d love to see numbers for programs larger than the smallish ones I’ve been working on.
Why would the article’s authors leave out something so obvious? Do they work for Sun or something?
The typical malloc()/free() approach to memory management is an awful one from a performance standpoint. The first blow comes from the fact that all threads making malloc()/free() calls hit a global mutex for each call, which can take, on certain architectures, up to 400 machine instructions.
One of the powerful concepts of C++ is that you can easily replace the heap implementation. There are several alternate heap implementations which target specific computing requirements. One such heap implementation, whose name I can't seem to remember, targets heavily multithreaded applications. It addresses heap contention by maintaining multiple heaps plus a common heap. Performance for some applications can be an order of magnitude better.
This is one aspect where Java benchmarks constantly attempt to hide the truth. Often, Java benchmarks go out of their way to hide collection and then force a benchmark whereby the C++ counterpart heavily pays. These types of benchmarks do an excellent job of showing how poorly educated the Java audience is and how little they understand what's going on under the covers.
Another aspect of Java is its huge memory footprint. It's not hard to create applications which will simply exhaust available VM memory, where a C or C++ application will happily chug along with a shadow of the footprint that the JVM exhausted.
Many Java guys point at benchmark X and say, "see, Java is faster." What they fail to realize is that, with C++, it can almost always be made way faster. With Java, you're stuck. So, the question becomes: will you allow me to fix it?
In the last round of C++ vs. Java benchmarks, I took about two hours and made all of the C++ benchmarks faster than the Java benchmarks. Generally speaking, if C++ is slower than Java, it's because someone horribly screwed up, rather than because Java is fast.
Vectors need not be stored in contiguous memory. The only distinguishing quality of a vector is that elements are accessed by limited-range integer index.
Rayiner, you make generally very knowledgeable posts, but this quote must somehow have slipped through. A vector does not only mean index access.
A vector is a container with certain performance characteristics. One of them is constant time element access. If you have a container where the size of elements varies, you can only meet that specific performance requirement if you take measures such as storing the base address of all elements somewhere contiguously. So you will somehow end up with something which uses pointers of some sort most certainly.
A vector is a container with certain performance characteristics. One of them is constant time element access. If you have a container where the size of elements varies, you can only meet that specific performance requirement if you take measures such as storing the base address of all elements somewhere contiguously. So you will somehow end up with something which uses pointers of some sort most certainly.
Sure. That's generally how it's done. Heterogeneous vectors are always implemented as a chunk of pointers to variable-sized pieces of memory. However, the person I was responding to claimed that a vector's data must be stored contiguously, which is not true. There is the method of using pointers to the data, and also methods that use multi-level vectors (like a page table), so sparse vectors don't take up so much memory. In all of these cases, the vector's data is not contiguous in memory.
but Java VMs have come a long way, kicking the shit out of (IMO) statically compiled languages, especially C++.
C++ is hardly a good example of a compiled language amenable to powerful optimizers. I’d rather compare the JVM’s to a good Lisp, Dylan, or Smalltalk implementation. The current batch of Java compilers don’t do a lot of nifty optimizations that some static compilers do on a regular basis.
Some analysis techniques that Stalin uses:
– Object lifetime analysis
– Reachability and escape analysis
– Points-to analysis and escape analysis
– Type inference (not completely applicable in Java)
As a result of these methods, Stalin can do certain optimizations not found in mainstream Java compilers:
– Region-based allocation of values with known lifetimes
– Stack allocation of non-escaping values
– Elimination of generic dispatch
– Closure conversion (applicable to Java in the case of inner classes).
– Inlining of calls to function-valued objects (not applicable to Java, but probably applicable to C#’s delegates)
– Unboxing of objects with known types (eliminates the need for Java’s primitive integers/floats and C#’s structs).
Most other compilers for Lisp-like languages don’t have quite such a laundry-list of optimizations, but they usually have several of the above. So I think it’s still safe to say that static optimizers are still comfortably ahead of the dynamic optimizers in mainstream Java VMs.
Actually, I know Java gurus who can tweak Java apps to run as fast as, if not faster than, their C/C++ counterparts, because they understand the VM like the back of their hand. But in reality, benchmarks only take you so far. Benchmarks are theoretical and analytical tools for scientific analysis that have little significance in everyday practical usage of common applications.
The issue, however, is that many of the existing Java applications aren't written by gurus or hackers, but rather by corporate entities, who want you to buy more RAM and the latest gigahertz in computer processing power because feature X, which I don't need, will be incorporated in the next version of their app.
Or by wannabe hackers who write code but have no clue what the feedback from the debugger or profiler is saying. The issue isn't really whether or not Java is slow, because for many general-purpose tasks, it isn't.
The issue is that I need terabytes of RAM, and sometimes processing power, to run decent Java applications. The issue is that Sun needs to invest a lot of brainpower into making Swing and other Java GUI components seem fast, at least to the end user.
Until Sun reduces system resource consumption and applies every trick in the book to make the end user think Swing is "fast", by hook or by crook, Java isn't desktop-ready. Sure, write your Java applications that will run on servers powered by terahertz processors and terabytes of RAM, but many end users don't have 512MB of RAM and 3GHz of processing power to read email, surf the net, listen to music, watch DVDs, or even partake in leisurely coding.
It's not about speed, Sun; it's about perception and efficient system resource management. When people complain about the misdeeds of garbage collectors, it's not necessarily about the performance hits when removing objects.
No. It's the fact that many garbage collectors aren't activated until the application runs out of memory or needs more memory. So you have these piles of needless objects clogging your memory, which might never get deleted if your Java app isn't starved for RAM. Hence, one of the reasons Java apps are humongous, need tons of RAM, and should preferably be run alone!
When your whole system starts thrashing your hard disk (using swap) because you have one Java application running (Eclipse), I think there's a problem.
I also need to emphasize that there are not as many good Java coders as there are C/C++ coders. For many Java developers it is satisfactory if their application runs, regardless of whether I need a 15GHz processor to run the bloated, crufty app. In the wishful minds of many Java developers, Sun's next VM will make it run smaller and faster.
As a result, many Java apps remain unprofiled and unoptimized, both for size and speed. After all, the touted advantage of Java isn't optimization and profiling, but rather fast development time.
What about Groovy?
The Java scripting "language"… what do you all think about it?
Greetings again John,
I'm writing an application that does "Optimal Image Subtraction" on several-megapixel astronomical images, hoping to find planets via gravitational microlensing. This image subtraction compensates for atmospheric effects and essentially involves doing autocorrelation and matrix inversion (straightforward numerical stuff). When I tried Sun Java 1.5 beta 2 on Linux vs. gcj 3.3.2, the JVM ran 1 second faster (55 seconds for the JVM vs. 56 seconds for gcj). I was *very* surprised by this. Several months earlier I had tried g++, gcj, and Sun 1.4.0 and found the ratio to be 1:2:3, which is much more like what I expected. I had switched my implementation from g++ as I got sick of the g++ headers changing on every minor release (particularly the continual switches between ios and ios_base for the iostream flags) when I compile with -Wall -ansi -pedantic. With Java I don't have that problem. However, I am absolutely hanging out for the gcj libraries to get done, because I think being able to provide users with a single binary is easier for them than making them download (and set the classpath for) a JRE.
So, surprisingly, gcj doesn’t beat JDK 1.5 for my application.