“During the 4th semester of my studies I wrote a small 3D spaceship deathmatch shooter in the D programming language. It was created within three months and allows multiple players to play deathmatch over a local area network. All of the code was written with a garbage collector in mind and made wide use of the D standard library, Phobos. After the project was finished I noticed how much time is spent every frame on garbage collection, so I decided to create a version of the game which does not use a GC, to improve performance.”
Seems like a smart student, but it cannot be said often enough, since new programmers are coming out of school all the time who haven’t learned it yet: garbage collection is for interpreted languages and short scripts, not for anything else.
There is no reason to try to argue with this, or waste your time proving it over and over again. We have all been there and have had to learn and accept it, and it is not something better garbage collectors are changing. The only thing changing is how big a program you can build these days with interpreted or semi-interpreted languages.
That’s not entirely true, though; there are some benefits to limited garbage collection for some data structures and certain language features.
I would certainly want one when implementing closures in a language, for example.
Well, I’ll argue.
– Garbage collection is not directly tied or “reserved” to interpreted languages, and with JIT compilation the line between interpreted and compiled blurs anyway.
– All sorts of operating system structures are doing special forms of garbage collection.
– There are all sorts of applications (which are often not games) where doing manual, precise memory deallocation is not practical. The only question is whether to build an ad-hoc garbage collector or use the general-purpose collector from your favorite language.
– There are some occasions where GC can be faster than manual memory management, because of memory fragmentation, cache thrashing…
Why would you build your own GC instead of improving your allocation mechanisms in manually managed memory? I consider the latter far simpler.
A combination of the two is usually needed for good performance. For example, when you have a lot of similar objects you may want to pre-allocate a memory pool; then allocation becomes fairly simple. De-allocation may not be needed, or if it is, it is done by simply setting a flag in a bitmap, as in the sketch below.
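For illustration, here is a minimal C++ sketch of such a pool, assuming a fixed capacity and a single object type (the names and sizes are made up for the example): allocation scans the bitmap for a free slot, and de-allocation is a single bit flip.

#include <bitset>
#include <cstddef>
#include <new>
#include <type_traits>
#include <utility>

// Minimal fixed-size pool: objects are constructed in pre-allocated slots,
// and "freeing" is just clearing a bit in the bitmap.
template <typename T, std::size_t N>
class ObjectPool {
public:
    template <typename... Args>
    T* create(Args&&... args) {
        for (std::size_t i = 0; i < N; ++i) {
            if (!used_[i]) {
                used_[i] = true;
                return new (&slots_[i]) T(std::forward<Args>(args)...);
            }
        }
        return nullptr;                      // pool exhausted; caller decides what to do
    }

    void destroy(T* p) {
        std::size_t i = reinterpret_cast<Slot*>(p) - slots_;
        p->~T();                             // run the destructor, keep the slot
        used_[i] = false;                    // de-allocation is one bit flip
    }

private:
    using Slot = std::aligned_storage_t<sizeof(T), alignof(T)>;
    Slot slots_[N];
    std::bitset<N> used_;                    // default-initialized to all free
};

A real pool would keep a free list or a scan hint instead of searching from zero every time, but the shape is the same.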
Garbage collection IS a smart thing to do, as long as you know what happens and it doesn’t kick in unexpectedly.
Bullshit. GCs are completely applicable in applications that are I/O bound or CPU bound; essentially GCs are great in any application where the GC drawbacks are acceptable.
Most people don’t write operating systems or drivers.
I’m sure you realize that your native code is running on an operating system which can block on memory allocations and unpredictably preempts your code to run other tasks and interrupt handlers. You’re not running on bare metal. You have a runtime as well.
But more importantly, if you take all the time you would spend working on memory management and instead spend it working on your data structures and algorithms, then you’re likely to end up with faster code. The high-level optimizations are more impactful than the low-level optimizations.
This guy seems to have done both, replacing generic garbage-collected data structures from a standard library with purpose-built data structures and manual memory management. Which accounts for a larger share of the benchmark delta: the memory management or the data structures?
CPU-bound tasks are few and far between these days, and many of these tasks perform much better on GPUs via runtime interpreters in the device drivers. Most applications are IO-bound — and also bound by the complexity of development and maintenance.
Interpreted languages are going to win in the long-run, because they are better positioned to help address the elephant in the room, which is the difficulty of writing code that scales on thread-parallel hardware.
Improvements in the Hotspot JVM, for example, extracted thread-level parallelism from existing code, and Google designed Go with garbage collection in large part because it greatly simplifies its concurrency primitives: goroutines and channels.
Also it should be noted that while garbage collection may increase the cost of freeing memory, it can significantly reduce the cost of allocating memory. The garbage collector runs asynchronously, so trading background collection for fast synchronous allocation is a good deal.
Most modern garbage collectors implement an algorithm in which reachable objects in a full memory region are copied to a free memory region in depth-first traversal order. The remaining objects are unreachable, so the whole region is freed, and the new region is populated with excellent spatial locality of reference.
Because the memory allocator understands the garbage collector, finding free memory is cheap and easy. Heap implementations like malloc have to do the hard work of scanning implicit free lists on allocation, whereas garbage-collected heaps do the hard work in the background.
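To make the "allocation is cheap" point concrete, here is a toy bump allocator in C++, roughly the fast path that a copying or compacting collector can offer (the class and numbers are invented for the example; reclamation, i.e. the actual collection, is not shown):

#include <cstddef>
#include <cstdint>

// Toy bump allocator: allocation is a pointer increment plus a bounds check.
// There is no free list to search; reclaiming the region wholesale would be
// the collector's job.
class BumpRegion {
public:
    BumpRegion(std::byte* base, std::size_t size)
        : cur_(base), end_(base + size) {}

    void* allocate(std::size_t bytes) {
        bytes = (bytes + 15) & ~std::size_t{15};   // round up to 16-byte alignment
        if (cur_ + bytes > end_) return nullptr;   // region full: time to collect
        void* p = cur_;
        cur_ += bytes;
        return p;
    }

private:
    std::byte* cur_;
    std::byte* end_;
};

Compare that with a malloc-style heap, which has to search size bins or free lists up front and later take each block back individually.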
Well, for ephemeral objects, a generational GC will speed up both allocation *and* deallocation. If an object is already dead by its first collection, there is zero overhead from the garbage collector.
It’s actually the objects that survive a collection that incur the GC overhead, as they must be copied into the older generation.
There’s a puzzling statement in the blog post that suggests the student may not have fully internalized this very counterintuitive fact:
But with a generational GC, it should not leak memory, and it should also not perform that badly. The temp strings should be ephemeral, and should die with no overhead. In other words, this is precisely the scenario that should perform faster with a GC than without a GC.
No wonder he ran into performance problems (and memory leaks) when he took out the GC!
Right, since high performance and non-leaking code can only be achieved by using a GC. I’d probably need a run to Rekall for that sentence not to make my head explode.
If so, then you’re making your own head explode. Because I never wrote that.
I pointed out that the student was complaining about “precisely the scenario” that performs better under GC than without GC, and then said that “no wonder” he ran into problems when he took out the GC.
In other words, you will write your code differently if you have a GC, compared to if you do not have a GC. If you have a GC, you do not worry too much about short-lived objects, but worry a great deal about medium-lived objects. If you do not have a GC, then you do not worry too much about medium-lived objects, but must take care to avoid excessive overhead from constructing and destructing short-lived objects.
Since the student directly ported code that assumes a GC, it is not surprising that he ran into problems when he took out the GC. Indeed, by tailoring the code to the non-GC platform, he was able to eliminate the performance penalty.
It’s puzzling how you make the leap from that to the straw man that “high performance and non-leaking code can only be achieved by using a GC.”
tanzam75,
“If you have a GC, you do not worry too much about short-lived objects, but worry a great deal about medium-lived objects. If you do not have a GC, then you do not worry too much about medium-lived objects, but must take care to avoid excessive overhead from constructing and destructing short-lived objects.”
I still don’t think it’s safe to assume these generalisations are accurate without benchmarking them in an application.
At the very least you need to qualify the quoted statement for objects which are on the heap, since objects on the stack won’t suffer any heap allocation overhead. In a GC language, it can be difficult for the compiler to determine whether an object can be stored on the stack or must go on the heap (escape analysis), since it doesn’t know what other functions will do with pointers to it. In a non-GC language, it’s possible to allocate objects on the stack and let them destroy themselves as they go out of scope, which is essentially free.
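A small C++ sketch of that distinction (the types and values are invented): the stack object costs nothing to allocate or free, while the heap-backed objects are still released at a deterministic point, namely the end of the scope.

#include <memory>
#include <string>
#include <vector>

struct Particle { float x, y, z; };

void update() {
    // Stack allocation: no allocator is involved at all, and the object is
    // destroyed for free when the function returns.
    Particle p{1.0f, 2.0f, 3.0f};

    // Heap allocation behind a scope-bound owner: one allocate/free pair, but
    // the release point is still deterministic (end of scope), not "whenever
    // the collector gets around to it".
    auto name = std::make_unique<std::string>("player-1");

    // A container owning heap storage, also released on scope exit.
    std::vector<Particle> batch(256, p);
}   // batch, name and p are all cleaned up right here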
I’m not trying to promote one side over the other, but the issue is much more complex than can be captured by generalisations. Programmers just have to benchmark their own code.
Of course, benchmarking is always the first step in attacking performance problems. Every problem is unique.
That does not mean that we can’t give rules of thumb. They may not always be true in every scenario, but they give you a starting point. They tell you what to pay attention to during the benchmarks, and what tweaks to try.
Benchmark, benchmark, and when you’ve established that there isn’t any possible bottleneck other than memory allocation, then it still isn’t the memory allocation system.
Some time ago, I designed and implemented an abstraction layer for an embedded OS, and when the app it was serving was reported as too slow by the customers, my system allocator wrapper was singled out as the problem by the “profiler”.
This was in a world of custom prematurely optimized everything. So my decision to keep it simple was met with little sympathy.
So, I rewrote it with a custom memory management algorithm, to the point where allocation and release were on average barely slower than an empty (library) function call.
Guess what? It still was the bottleneck.
Some good old printf profiling revealed that the OOP client application was hitting it like crazy.
Classic allocation, Reference Counting, Garbage Collection, it wouldn’t have made a difference.
The problem I see in all of GC, RAII (for system memory) or ARC is that they encourage the “everything is an object” mentality. And that mentality generally turns into programs that hit the allocator way too much.
sakeniwefu,
“Benchmark, benchmark, and when you’ve established that there isn’t any possible bottleneck other than memory allocation, then it still isn’t the memory allocation system.”
Well, GCC’s allocator is not very good, especially on SMP systems, where it can be a bottleneck. So it might make sense to tinker with new allocators, but I wouldn’t want to imply that allocation is typically a bottleneck in most applications.
“So, I rewrote it with a custom memory management algorithm to the point allocation and release were barely slower on average than a void (library) function call.”
I’m curious about the implementation; I’ve written my own too, where objects were pooled in thread-local storage and therefore incurred no SMP overhead until the global allocator was hit to replenish the pools.
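Not claiming this is how either implementation worked, but here is the shape of the scheme I mean, as a C++ sketch with invented names and sizes: the common alloc/free path touches only thread-local state, and only the batch refill reaches the global allocator.

#include <cstddef>
#include <cstdlib>
#include <vector>

constexpr std::size_t kBlockSize = 64;   // one fixed block size, for simplicity
constexpr std::size_t kBatchSize = 32;   // how many blocks to fetch per refill

thread_local std::vector<void*> tl_cache;   // this thread's private free list

void* pool_alloc() {
    if (tl_cache.empty()) {
        // Slow path: replenish from the global allocator in one batch.
        // (Error handling and returning blocks to the global heap are omitted.)
        for (std::size_t i = 0; i < kBatchSize; ++i)
            tl_cache.push_back(std::malloc(kBlockSize));
    }
    void* p = tl_cache.back();
    tl_cache.pop_back();
    return p;
}

void pool_free(void* p) {
    tl_cache.push_back(p);   // no lock, no free-list walk
}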
“Guess what? It still was the bottleneck. Some good old printf profiling revealed that the OOP client application was hitting it like crazy.”
This should have been uncovered much earlier. Didn’t the profiler give function counts?
“The problem I see in all of GC, RAII(for system memory) or ARC is that they encourage the everything is an object mentality. And that mentality generally turns into programs that hit the allocator way too much.”
Absolutely, one of the easiest ways to improve low-level performance is to reduce the number of objects we need to accomplish our tasks. If a process always allocates A+B+C, those could be combined into one monolithic object instead, reducing object overhead by a factor of 3 for both GC and non-GC languages. However, many people are going to be on edge about where to draw the line between OOP purity and overhead. Even though ideally one might want to manipulate pixels as objects, the OOP paradigm just breaks down due to the inefficiency of having so many objects. Some day we could have a language with virtual objects which don’t exist in memory, but allow us to manipulate the pixels as if they were real objects.
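As a trivial C++ illustration of the A+B+C point (the structs here are placeholders):

// Three separate heap objects per task: three allocations, three headers,
// three things for the allocator or the GC to track.
struct A { int x; };
struct B { double y; };
struct C { char buf[32]; };

// Collapsed into one object: a single allocation per task. The same idea
// applies unchanged in GC languages (fewer objects to trace) and in manual
// ones (fewer allocate/free pairs).
struct Task {
    A a;
    B b;
    C c;
};

The trade-off, as noted, is that the merged Task is less "pure" OOP and couples the three concerns together.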
You’re right!
That is why all major programming languages besides C are now offering some form of automatic memory management, be it GC or reference-counting based.
Garbage collection is a balance between convenience and performance. The non-deterministic nature of collections means that without clever engineering, you can hit a GC pause during something crucial like rendering.
I think the author, while writing a fascinating article, threw the baby out with the bathwater. He could likely have kept the GC and reduced collections by being more conscious of it, employing tactics like pooling to reduce collections or simply cleaning up his object hierarchies. You can go a long way before you abandon allocation safety.
Of course, I don’t know much about D’s GC or whether it’s even generational, but from a .NET point of view, I’ve been able to reduce its impact while writing XNA games for Windows Phone (where GC pauses can be even more devastating).
It’s an interesting read, but it’s a read about the performance of D’s GC mechanisms and about developing without garbage. It’s not a good comparison of GC vs. non-GC in all scenarios.
You can get to a real-time system on a GC’d platform, as long as you know what the longest part of the GC process is and minimize garbage in that area.
Although it is nice to see the numbers, there ain’t nothing most of us didn’t know.
Well, the author doesn’t tell us how much additional development time the manual memory management took. So I will take into account only performance:
Best GC Version: 128.6 FPS, 7.72 ms frametime
Manual Version: 142.8 FPS, 7.02 ms frametime
That is an 11% performance improvement for an unknown amount of additional work.
And here we’re talking about a worst-case scenario for a GC: a game is a 100% interactive, real-time application. Even so, we only lost 11% of performance.
In other words: the maximum price you have to pay for using a GC is an 11% performance loss.
How much performance improvement would we have had in a traditional online, non-real-time system? 2%? How many memory leaks would we “gain”? How many additional programmer hours would we have to pay?
Hey, I think this study is the perfect apologia for GC use. We should use a GC all the time, even with games!
I do not know D, but C++ provides a lot of tools to deal with memory that make its handling easier (the simplest: RAII; the nicest: unique_ptr, shared_ptr, weak_ptr; the most complex but most powerful: placement new).
Using RAII and smart pointers is almost as easy as writing anything in a GCed language, so I do not see why writing the version that manages memory manually should be hard.
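For readers who haven’t used them, here is a short C++ sketch of what that looks like in practice (the Texture/Sprite types and file names are invented):

#include <memory>
#include <vector>

struct Texture {
    explicit Texture(const char* path) { (void)path; }  // would load from path
    ~Texture() {}                                        // would release the resource
};

struct Sprite {
    std::shared_ptr<Texture> texture;            // shared ownership, reference counted
};

void scene() {
    auto tex = std::make_shared<Texture>("ship.png");    // hypothetical asset
    std::vector<Sprite> sprites{ Sprite{tex}, Sprite{tex} };

    auto hud = std::make_unique<Texture>("hud.png");
    // hud has exactly one owner; moving it transfers ownership,
    // and copying it is a compile error.
}   // everything above is released here, deterministically, with no GC involved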
Easier or not, the point is: memory management will be in the hands of the programmer. Why should we want that? 10% performance increase?
I prefer to outsource memory management to a bot and use the expensive programmer’s time to do something else.
That comment is almost totally accurate. I say ALMOST, because:
1. The programmer still must handle memory to some extent in managed languages: he needs to make sure that his unused objects are no longer referenced from anywhere, he needs to use weak references, or he needs to call the garbage collector explicitly. [I remember having a problem with behavior resembling a memory leak in C#, which occurred because my big array was stored as a 2nd-generation object, so the garbage collector did not clean it up in a timely fashion; I also remember having some “loitering objects” in Java because I forgot to remove some listeners after my frames were closed.]
2. Though memory is handled automatically by the garbage collector, other resources are not, so they still need to be managed by the programmer (file handles, registry handles, database connections, window handles, locks, etc.). This inconsistency drives me crazy, because you need to call .Dispose() explicitly or enclose your code in a “using” statement to free system resources (but not memory). Handling all resources well has the same complexity as handling memory. Using best practices in C++ avoids this problem because any system resource (including memory) is handled automatically and deterministically using the RAII pattern (see the sketch after this list).
3. Really good programmers will be productive no matter the language they code in.
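Regarding point 2: here is a minimal C++ sketch of RAII applied to a non-memory resource (the class and file names are invented). It is the counterpart of IDisposable plus “using”, except the same mechanism also covers plain memory.

#include <cstdio>
#include <stdexcept>

// RAII wrapper over a C FILE*: the destructor closes the handle no matter how
// the scope is left (normal return, early return, exception, ...).
class File {
public:
    File(const char* path, const char* mode) : f_(std::fopen(path, mode)) {
        if (!f_) throw std::runtime_error("cannot open file");
    }
    ~File() { if (f_) std::fclose(f_); }

    File(const File&) = delete;              // single owner, no double close
    File& operator=(const File&) = delete;

    std::FILE* get() const { return f_; }

private:
    std::FILE* f_;
};

void write_log() {
    File log("game.log", "w");               // hypothetical file name
    std::fputs("frame done\n", log.get());
}   // closed here, even if an exception had been thrown earlier in the scope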
This is generally a lifetime issue and not really a memory leak, though it can look and quack like one. Especially with event handlers, once the event source itself goes out of scope, any listeners with zero references thereafter will be collected.
So sometimes not unregistering event handlers manifests itself as a memory leak, and sometimes the memory just lives on a little longer than usual. That makes it quite annoying to track down at times.
If you neglect to dispose of an object, the finalizer is eventually called and the resources are freed. But “using” makes this mostly a non-issue.
Completely untrue.
All in all, I find dealing with the nuances of even modern C++ to be more of an annoyance than an empowerment.
Try C++11 – with the recent developments I was able to produce a benchmark for an embedded platform, and the development time was equal to the same code being written in C#.
And I’m talking about lambdas and lots of other goodies. Of course, I wasn’t doing any LINQ, but who in their sane mind would use LINQ for testing something or for production code? (Rhetorical, don’t answer.)
No, we’re talking here about a competent programmer who was able to think about the time garbage collection eats from his run times.
The programmer that takes garbage collection for granted usually gives much worse times. Think about 100% or 200%. Why? Because someone that doesn’t keep in mind the metal under the framework (s)he uses will never be able to understand what a good trade-off is.
Well, outsourcing will not make you stand out for long. And the quality of a programmer declines over time as he gets used to doing things the easy way. In the end, this ‘doing it the easy way’ backfires: with the lower level inaccessible, the programmer has to do fine-grained work with only the tools he knows. It’s like picking up needles with a boxing glove; the framework is the boxing glove, and it looks nice and shiny.
But keep in mind. 80% of any project is doing the last 20%. And the last 20% IS picking needles. Or you can deliver 80% of your product, and hope that you’ll fool your client next time too. That’s what most businesses do today in the western world, so I guess you may be Ok.
Way to miss the point.
The whole point, and he is right, is: is the extra performance really worth the extra development time?
If it isn’t, then it isn’t worth it; it’s a trade-off that should be considered part of the development process.
The other argument, that someone loses skills because they did stuff the easy way, is ridiculous.
Most of the code I work with is pretty much WTF, because somebody wanted to do it the “clever” way. The lost productivity due to this is massive when making minor modifications.
There’s always the question of whether the extra performance is really worth the time. That’s for everyone to decide; their business, not mine. But sometimes it’s too late to optimize.
A lot of people in software development forget about the psychological impact of the work that developers do. You might think that there’s no impact, but please, think about it one more time.
Give a senior developer menial tasks, and you shall have a mediocre experienced developer. Remember, a developer is as good as the job he does, not as the time he invested into the job.
No, I’m not defending ‘the clever way’ taken to hell and back. I’m talking about skipping some of the essentials. People who have never done a delete nor ever thought about memory consumption will have a hard time when such limits come into play. I recently had a WTF moment when someone was caching an entire table in memory, then copying the cache as a backup. His verdict was “sometimes it crashes, don’t know why”. There’s a ‘too stupid’ too, not only ‘too clever’.
The best programmers are lazy programmers (the essence of the DRY principle). Outsourcing memory management has always been the trend: first it was manual, then bare metal, then OSes managed it, and now runtimes manage it.
And needles today are mostly in the form of business requirements, not getting that extra 1ms. But then again, you can actually squeeze those 1ms out of almost any system.
Oh yeah. There have been occasions when I could’ve killed for another 10%. Also, in such occasions it’s the code’s coder who knows best how to squeeze out that additional 10%, not an outsourced code-monkey.
Well, it can be nice for that expensive programmer, being able to say he’s not responsible for memory management errors and leaks. But your bot might be your doom.
Except in most projects better hardware will be cheaper than one month’s salary for a top programmer.
So, sadly, in the corporate world no one cares about that 10% gain from better coders; they would rather invest in better hardware.
Except in most cases today the program you write starts out using 2% of total CPU, and then you spend a month getting it to 1.8% …
Performance outside of databases, games, and sometimes embedded systems and web servers is mostly not relevant today.
You might be amazed, but performance matters: on mobile phones, where it saves battery, and on the server, where it’s not always easy to add extra RAM or processing power. That 10% can be vital.
Performance matters, all the time.
On mobile, Objective-C is moving to ARC (reference counting), and WP8 has C++ with the C++/CX extensions, which use reference counting, alongside the usual .NET runtime.
On Android, while you can make use of the NDK, the whole userspace is actually Dalvik VM based, even when calling the APIs from C or C++. So the only way to avoid the GC is by using the NDK-only APIs, which restrict you to games that don’t interact with other applications.
Yes, but even with this in mind, performance matters. I wasn’t talking about GC anymore; GC can be a performance improvement for a lot of use cases, including most UI-bound mobile software.
Ah ok, there I agree with you.
One should always take care to use the more adequate set of algorithms and data structures for the problem at hand, while taking advantage of the language strengths.
Well, that’s an optimizing compiler with GC vs. a poorly optimizing compiler without GC. He couldn’t get the non-GC version to compile on the optimizing compiler. The argument is that if he’d gotten it to compile, it would show a similar performance difference.
Of course, if it doesn’t compile, then it doesn’t make his game perform any faster, does it?
Note that his figures are useless because I suspect they are averages; for real-time responsiveness, what you need is worst-case execution time, not averages.
I like C++[11] (and C99 onwards) in that it makes it much easier never to have to dynamically allocate memory in the first place.
Even in Java, objects with local lifetime tend to be optimized by stack allocation rather than dynamic allocation anyway.
I was wondering about that recently: don’t you run into stack overflow problems when you try to allocate everything on the stack, including large arrays? I know that stacks can be dynamically grown by some OSs, but surely at some point the top of that huge stack must hit something in the process’ virtual address space, right?
Now, if the language is smart enough to dynamically allocate large objects under the hood and still manage them like stack objects (i.e. silently liberate them when leaving the scope of the mother function), that’s a different story.
As I understand it, the theory and data suggest that most objects have a very short lifetime, in fact the lifetime of a function call, by which time the stack space is reclaimed by the function exiting.
C99 introduced variable length arrays, but the C++ equivalent (vector and bounded arrays) often does dynamic allocation under the hood. So a vector on a stack is quite small – probably just a little more than a pointer and a length. The vector object cleans up the dynamic array underneath when the stack unwinds, either through a normal function exit or an exception being thrown (either directly, or by the environment).
C++ library containers also allow you to supply your own allocator, so you can customize a vector to only allocate storage from a limited pool. That also provides another advantage in the form of memory allocation speedups.
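As a concrete (if anachronistic for this thread) example, C++17’s polymorphic allocators make the “vector drawing from a limited pool” idea almost a one-liner; the buffer size and contents below are invented for the sketch.

#include <array>
#include <cstddef>
#include <memory_resource>
#include <vector>

void frame() {
    // 64 KiB scratch buffer on the stack; the vector below draws from it
    // instead of hitting the global heap.
    std::array<std::byte, 64 * 1024> scratch;
    std::pmr::monotonic_buffer_resource pool(scratch.data(), scratch.size());

    std::pmr::vector<int> temp(&pool);   // header on the stack, storage in scratch
    temp.reserve(1000);
    for (int i = 0; i < 1000; ++i) temp.push_back(i * i);

    // When frame() returns, the vector header unwinds with the stack and the
    // buffer resource is released wholesale; no per-element deallocation.
}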
This is basically what RAII is, except it’s not the language itself doing it, but the language allowing libraries to be written that behave in the manner you’re talking about.
Well, variable-length arrays are indeed a new feature in C99, but RAII in library form is older than C++11, so I don’t see what the new release brings there.
I did put the “11” in brackets. In some forms of BNF, it means optional match.
Benchmarks are very tricky things. Details can make all the difference.
Thus, I would not necessarily take the student’s results at face value, without investigating further. I don’t know D, so I can only throw out some thoughts that come to mind from reading the blog post. Perhaps someone else could look through Github and check out the code?
First, doesn’t D have a generational garbage collector? (http://dlang.org/garbage.html doesn’t say — but it does approve of generational collectors.) The student says that he’s collecting on every frame. Is he doing an ephemeral collection, or a full collection? 7 milliseconds seems kind of long for an ephemeral collection …
Second, what if he collected every 5 frames instead? He’s collecting on every frame because he’s paranoid about making the 60 Hz cadence. But he made it — so he now has some room to experiment. He gets 7 ms of overhead by collecting on every frame, but what if he gets 8 ms of overhead by collecting every N frames? This would still make the cadence — but save a lot of battery! (Suppose he frequently creates objects that are referenced in the next frame — but that almost always die within N frames. Then he’s incurring a lot of unnecessary overhead by forcing the collection on every frame.)
Third, he just dove right in and completely removed GC. But did he really have to do that? Profiling-driven optimization tends to be amenable to the Pareto principle. 10% of the work might get you 90% of the benefit. Sometimes, tweaking 1 line of code gives you 90% of the benefit. The poor performance of the garbage collector might’ve been caused by just a few classes.
So, GC is slow when doing complex stuff with lots of memory handling and threads. In other news, the sky is blue. Unfortunately I see this all the time these days when we have new students or fresh graduates coming to work for us. Real progress. Now, I don’t dislike GC and the associated line of thinking, design and coding practices. But man, when they only know that and only use that, my hair’s grayness transfer function just gets steeper.
When one programs in a language which uses GC, one must keep in mind that the less garbage you create, the less garbage has to be collected.
For example, Java is one of the most widely used GC languages, and EVERY Java programmer should read the book Effective Java. It teaches you to spot the problems that can introduce a huge performance impact. A typical example is the strings and other temporary variables created behind the scenes and thrown away just a millisecond later. If you eliminated such things with constants, your program would already get a 25% performance boost.
So manual memory management is not a good answer to GC issues. It is the attitude of the programmer that makes the problems.
Agreed. Also, by targeting the points where most of your garbage is created, you can try to re-use existing data structures instead of re-allocating them on each call. That’s a form of manual memory management that is used even in GC languages like Java.
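The same idea as a C++ sketch (the class and the predicate are invented); in Java or C# it looks much the same with an ArrayList or List<T> kept as a field and cleared on each call.

#include <vector>

// Reuse one buffer across calls instead of allocating a fresh one each frame.
// clear() keeps the capacity, so after warm-up the per-call allocation count
// drops to zero.
class VisibilityQuery {
public:
    const std::vector<int>& visible(const std::vector<int>& objects) {
        result_.clear();                      // keeps capacity, frees nothing
        for (int id : objects)
            if (id % 2 == 0)                  // stand-in for a real visibility test
                result_.push_back(id);
        return result_;
    }

private:
    std::vector<int> result_;                 // grows once, then gets reused
};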
Another thing a GC-language programmer must be very familiar with is the GC mechanism itself. In Java there are several to choose from, and you really need to understand how they work in order to choose the appropriate one. Even more, you need to understand how to tune your GC settings in order to get the required performance out of it.
One example.
Java 6 HotSpot server defaults to the Parallel GC. That is, every once in a while the program freezes and memory is collected. Not good for web servers. Better is the Concurrent Mark Sweep (CMS) GC. There are a few tricks, however. The default settings for the CMS GC are quite poor. Basically your heap is divided into two regions, the Eden space and the Old Gen. All memory allocations are initially made in the Eden space, which is GC’d very often. The objects surviving the GC are moved to the Old Gen. Looks nice. The very small Eden space is very cheap to collect. The huge Old Gen is collected at a much, much lower interval (up to once an hour), because that is an expensive operation.
The problem is that if a requested memory allocation is bigger than what fits into the Eden space, the allocation is made directly in the Old Gen, which then fills up fast, requiring very frequent Old Gen cleanups, which are expensive, as I said. That will make the application run even slower than with the default GC.
The solution: increase the Eden space to a reasonable size so that ALL allocations happen there, eliminating the direct Old Gen allocations. On the other hand, try to keep the Eden space as small as possible, because it is cleaned often, and the smaller it is, the cheaper the cleanup.
One example from real life. I have a Tomcat server with 10 GB of RAM allocated to the heap. Initially I enabled the CMS GC and it wasn’t very fast. I discovered that 75% of allocations went directly to the Old Gen, which was swept very often. The default Eden space size was 32 MB. I initially increased it to 64 MB; not much better. Then 128 MB, but there were still some allocations in the Old Gen, and the Eden space was also collected too often: CPU load was too high. I increased the Eden space to 256 MB and hit the sweet spot. All allocations (except very extreme Hibernate searches) happen in the Eden space, collection happens around 50-75 times an hour, and the Old Gen grows very slowly. The JVM analyses the situation in the Old Gen, including the rate of increase, so even though I have 10 GB available, the GC kicks in already at around the 2 GB, at most 3 GB, level. On busy days the Old Gen is collected 3 times an hour, and on quieter days it is collected once an hour. Performance and app responsiveness are excellent.
The same app on the same server, but with the default Parallel GC, had poor performance and was getting OutOfMemory errors with a 6 GB heap. Now I don’t use more than a 3 GB heap and have excellent speed.
So you need to understand the GC algorithms and options to make the right choice and tune it correctly.
This is the whitepaper where I got all my knowledge – pure gold:
http://java.sun.com/j2se/reference/whitepapers/memorymanagement_whi…
Everyone please! There is no ONE solution that will always “win” under all conditions.
There’s nothing wrong with going with GC, and there’s nothing wrong with managing memory oneself. I’ve been allocating my own objects for over a decade and it’s honestly not been much trouble; it’s second nature. I can see how inexperienced programmers often get in trouble with memory, but let’s not exaggerate and claim that all programmers are bad at managing objects. I can also see why some experienced programmers prefer GC, and it’s not a case of laziness. They’re both valid options. Even the performance will vary with the specifics of the application; the benchmarks here will probably not apply directly anywhere else.
The best thing a programmer can do is to use the language that they are most comfortable with and best fits the task – don’t worry about what anyone else says about GC.