“Hoard is a scalable memory allocator (malloc replacement) for multithreaded applications. Hoard can dramatically improve your application’s performance on multiprocessor machines. No changes to your source are necessary; just link it in. Hoard scales linearly up to at least 14 processors. The supported platforms include Linux, Solaris, and Windows NT/2000/XP.”
I have downloaded and built it on my box, and set up the LD_PRELOAD environment variable, but how do I know that I'm using libhoard instead of the standard allocator?
1. Use man ld.so for hints on whether Hoard is being loaded at all (for example, it documents LD_DEBUG=libs, which prints each library as it is loaded).
2. Use a debugger and trace a malloc. You should at least see a Hoard function in the backtrace.
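If you would rather check from inside a program, here is a small sanity check (a sketch of my own, assuming glibc and its dlfcn extensions; the file name check.c is just an example): it asks the dynamic linker which shared object the malloc symbol resolves to. Run it with LD_PRELOAD pointing at libhoard.so and the printed path should be libhoard.so rather than the system libc.

/* check.c -- build with: gcc check.c -ldl */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

int main(void) {
    Dl_info info;
    /* first "malloc" in the global lookup order; with LD_PRELOAD set,
       that should be the one from libhoard.so */
    void *addr = dlsym(RTLD_DEFAULT, "malloc");
    if (addr && dladdr(addr, &info) && info.dli_fname)
        printf("malloc resolves to: %s\n", info.dli_fname);
    else
        printf("could not resolve malloc\n");
    return 0;
}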
On Solaris, just run: pldd <pid>
Where <pid> is the PID of the process you want to make sure is running with the Hoard library. pldd is just like ldd, but for use against running processes rather than just binaries.
I don’t know if Linux or other OSes provide a tool for doing this.
“I don’t know if Linux or other OSes provide a tool for doing this.”
Besides Solaris, I’ve seen the pldd command only on HP-UX.
Compile your program. Then run “ldd” on it to see which (dynamic) library it is linked to:
> ldd myapp
I'm not using it for development; I just want some programs on my box to use libhoard.so to improve performance.
I don't think there's a Linux program with the same function as pldd on Solaris, but I'm not sure.
pmap seems to work on Linux:
pmap <pid>
So you want to replace the existing library with that one? Interesting. Can't help with that, though.
“I'm not using it for development; I just want some programs on my box to use libhoard.so to improve performance.”
“So you want to replace the existing library with that one? Interesting. Can't help with that, though.”
libhoard.so is a dynamic shared library that provides many of the same functions as the standard GNU C library, such as malloc() and free(). If the LD_PRELOAD variable is set, any program that depends on shared libraries looks in libhoard.so first; whenever it finds the function it needs there, it never falls through to the standard GNU C library version.
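To make the mechanism concrete, here is a toy interposer (just a sketch of the LD_PRELOAD trick, not Hoard's code, and it skips the re-entrancy care a real allocator needs): because the preloaded library defines malloc(), the dynamic linker resolves every malloc call to it first, and it can still reach the libc version through dlsym(RTLD_NEXT, ...).

/* toy.c -- build: gcc -shared -fPIC -o libtoy.so toy.c -ldl
   use:   LD_PRELOAD=./libtoy.so ls */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>

static void *(*real_malloc)(size_t) = NULL;

void *malloc(size_t size) {
    if (!real_malloc) {
        /* find the next malloc in the lookup order, i.e. glibc's */
        real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
        if (!real_malloc)
            return NULL; /* lookup failed; give up */
    }
    /* a real replacement allocator (Hoard, nedmalloc, ...) does its own
       bookkeeping here instead of just forwarding */
    return real_malloc(size);
}

Hoard of course implements the allocation itself rather than forwarding, but the symbol lookup order that makes the preloaded copy win is the same.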
Another allocator, called “nedmalloc”, claims to thrash Hoard in real-world applications.
This article reminded me. When I get back to work on Monday or Tuesday I'll test Hoard and nedmalloc on our 8-core Dell workstation with our threaded ortho rectification and see if there are any noticeable performance improvements.
Hi,
I just downloaded and tried nedmalloc against a particularly brutal benchmark called “larson” on a dual-processor Linux box. With Hoard, its throughput was 790,614 memory operations per second, while with nedmalloc it was 188,706 ops/sec. Compare that to GNU libc (the default allocator), whose throughput was 192,485 ops/sec. In short, Hoard outperforms nedmalloc by more than 4X.
I’d be interested to hear what happens with your application.
Best,
— Emery Berger
I got to work and ran some tests.
Machine: 8-core Clovertown, 1.6GHz, 8GB RAM, Gentoo ~amd64
Process:
threaded ortho rectification.
3 pipelines, 8 threads each:
– Input IO
– Ray production & solid intersection
– Pixel rasterization and output tiling
Test dataset: 210MP input image (smallest I could find)
Run each data set twice, take second timing.
Process is 64bit optimized. Memory usage numbers are somewhat misleading because of the kernel paging, but hover between 3.2GB & 3.8GB regardless of the allocator.
Interesting observation:
During libhoard runs, system CPU usage would occasionally spike dramatically (seen in both gkrellm2 and htop).
Timing:
default glibc allocator
real 1m19.477s
user 9m10.210s
sys 0m8.993s
hoard:
real 1m9.135s
user 7m23.508s
sys 0m37.102s
tcmalloc:
real 1m2.222s
user 6m58.014s
sys 0m5.032s
nedmalloc numbers:
real 0m58.323s
user 6m11.111s
sys 0m11.933s
Looks like the nedmalloc guy has the most realistic claims.
Yet another test case:
Normal-equation system accumulator.
Equations are objects which are assembled into a normal-equation system and then solved via SVD.
Tons of 3×3 and 4×4 matrices.
Single threaded application.
Base case:
real 0m3.596s
user 0m3.560s
sys 0m0.036s
hoard:
real 0m4.112s
user 0m4.076s
sys 0m0.036s
nedmalloc:
real 0m3.692s
user 0m3.664s
sys 0m0.028s
tcmalloc: CORE DUMP (hehe)
Hi –
Could you please send me your benchmarks? There is some sort of performance issue with 64-bit that I am trying to resolve.
Thanks,
— Emery
http://goog-perftools.sourceforge.net/doc/tcmalloc.html
https://labs.omniti.com/trac/portableumem/
(Emery Berger seems pretty good at marketing for a university professor.)
Hi,
Thanks for the pointers.
I ran the “larson” benchmark with both of the allocators you pointed out.
libumem: shockingly poor performance, 76,110 ops per second; Hoard is 10X faster. That makes me unsure whether this port is true to the original libumem algorithm (from Bonwick).
tcmalloc: for this benchmark, it outperforms Hoard by about 20%, but also consumes 40% more memory (172M versus 122M). In addition, tcmalloc does some things that Hoard deliberately avoids because they can seriously harm performance. I tried another benchmark (“cache-thrash”) which tests whether an allocator can contribute to false sharing of cache lines (really bad for performance). On this benchmark, Hoard performed around 6.5X faster.
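Roughly, a cache-thrash style test does something like this (a simplified sketch, not the benchmark's actual code; the thread counts and sizes are made up): each thread repeatedly allocates a small object, writes to it many times, and frees it. If the allocator hands different threads objects that share a cache line, the writes bounce that line between CPUs and throughput collapses; an allocator that keeps each thread's allocations apart avoids the problem.

/* thrash.c -- build: gcc -std=c99 -O2 -pthread thrash.c */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

enum { NTHREADS = 8, ITERATIONS = 100000, OBJSIZE = 8, WRITES = 1000 };

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        char *p = malloc(OBJSIZE);          /* much smaller than a cache line */
        if (!p)
            return NULL;
        for (int j = 0; j < WRITES; j++)    /* hammer the freshly allocated object */
            p[j % OBJSIZE] = (char)j;
        free(p);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    puts("done");
    return 0;
}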
Best,
— Emery
P.S. I'm going to consider ‘good at marketing’ a compliment…