When running tests, installing operating systems, and compiling software for my Ultra 5, I came to the stunning realization that hey, this system is 64-bit, and all of the operating systems I installed on this Ultra 5 (can) run in 64-bit mode.
I wondered if it would be best to compile my
applications in 32-bit mode or 64-bit mode. The modern dogma is that
32-bit applications are faster, and that 64-bit imposes a performance
penalty. Time and time again I found people making the assertion
that 64-bit binaries were slower, but I found no benchmarks to back
that up. It seemed it could be another case of rumor taken as fact.
So I decided to run a few of my own tests to see
if indeed 64-bit binaries ran slower than 32-bit binaries, and what
the actual performance disparity would ultimately be.
Why 64-bit?
In the process of this evaluation, I came to ask
myself: Why go through all the trouble to make it 64-bit anyway?
Other than sex appeal, what other reasons are there for 64-bit?
Checking out Sun’s Docs site, I ran across this
article: Solaris 64-bit Developer’s Guide
(http://docs.sun.com/db/doc/806-0477).
It gives some good detail on when 64-bit might make sense, and is
relevant for Solaris as well as other 64-bit capable operating
systems.
The benefits for 64-bit seem to be primarily
mathematical (being able to natively deal with much larger integers)
and for memory usage, as a 64-bit application can grow well beyond 2
GB.
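To make the data-model difference concrete, here is a minimal C sketch (the file name sizes.c is mine; Solaris uses the ILP32 model for 32-bit binaries and LP64 for 64-bit ones). Compiled once normally and once with the -m64 flag demonstrated later in this article, it reports long and pointer sizes doubling:

#include <stdio.h>

int main(void)
{
    /* ILP32: int, long, and pointers are all 4 bytes.
       LP64: int stays 4 bytes; long and pointers grow to 8. */
    printf("int:   %u bytes\n", (unsigned)sizeof(int));
    printf("long:  %u bytes\n", (unsigned)sizeof(long));
    printf("void*: %u bytes\n", (unsigned)sizeof(void *));
    return 0;
}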
Operating systems seem to be benefiting from 64-bit first, allowing them to natively handle larger amounts of RAM (some operating systems, such as Windows and Linux, have ways around the 2/4 GB limit on 32-bit systems, but there is funkiness involved). There aren't many applications that use 2 GB of memory or more yet, but more are on the horizon.
Given that this Ultra 5 cannot hold more than 512
MB of RAM (it’s got 256 MB in it now), there’s not much benefit in
the memory area, for either the OS or applications. Still, I want to
see what issues and performance penalties there may be with 64-bit
binaries, so I’ll give it a shot anyway.
Here are some links for further reading on 64-bit
computing.
http://developers.sun.com/solaris/articles/64_bit_booting.html#Q1
http://arstechnica.com/cpu/03q1/x86-64/x86-64-1.html
Applications
The first step was to select some applications to
run these tests against. They would have to be open source so I
could compile them in 32-bit and 64-bit versions, and there would
need to be some way to benchmark those applications. I would also
need to be able to get them to compile in 64-bit mode, which can be
tricky. I tried a few different applications, and ended up settling
on GNU gzip, OpenSSL, and MySQL.
Test System
My test system is my old Sun Ultra 5 workstation; for the specs, please refer to the intro article. The operating system install is Solaris 9, SPARC edition, 12/03. The 9_Recommended public patch cluster was also installed.
For the compiler, I used GCC 3.3.2 from http://www.sunfreeware.com, which is built to produce both 32-bit and 64-bit binaries. To see that it can successfully produce both, I'll run a very simple test of the compiler.
I create a very simple C file, which I call
hello.c:
#include <stdio.h>

int main(void)
{
    printf("Hello!\n");
    return 0;
}
I’ll just run a quick
test to see if it compiles in regular 32-bit mode:
gcc hello.c -o hello32
The compiler gives no errors, and we see that a binary has been created:
-rwxr-xr-x 1 tony tony 6604 Jan
6 13:24 hello32*
Just to make sure,
we’ll use the file utility to see what type of binary it is:
# file hello32
hello32: ELF 32-bit MSB executable
SPARC Version 1, dynamically linked, not stripped
So the file is 32-bit and SPARC Version 1, which means it should run on any SPARC system. And of course, we'll run it to see that it works correctly:
# ./hello32
Hello!
All is well, so now
let’s try compiling hello.c
as a 64-bit binary. For GCC, to create a 64-bit binary, the CFLAG
is -m64.
gcc hello.c -o hello64 -m64
No errors were given, and we see that a binary has been created:
-rwxr-xr-x 1 tony tony 9080 Jan
6 13:24 hello64*
But is it a 64-bit
binary? The file
utility will know.
hello64: ELF 64-bit MSB executable
SPARCV9 Version 1, dynamically linked, not stripped
The binary is 64-bit,
as well as SPARCV9, which means it will only run on SPARC Version 9
64-bit CPUs (UltraSPARCs). So now we’ve got a 64-bit binary, but
does it run?
# ./hello64
Hello!
OK, so now we’re set.
Now we can test to see if 64-bit binaries are really slower, but
we’ll have to use something a little more intensive than “Hello!”.
OpenSSL 0.9.7c
I’ll start with OpenSSL
and its openssl
utility. I used OpenSSL 0.9.7c, the latest version at the time of
this writing from http://www.openssl.org.
Running the ./config utility in the openssl-0.9.7c root directory detects that the Ultra 5 I'm running this on is an UltraSPARC system, capable of 64-bit operation, and gives instructions on how to specify 64-bit compilation:
# ./config
Operating system: sun4u-whatever-solaris2

NOTICE! If you *know* that your GNU C supports 64-bit/V9 ABI
and wish to build 64-bit library, then you have to
invoke ‘./Configure solaris64-sparcv9-gcc’ *manually*.
You have about 5 seconds to press Ctrl-C to abort.
The first compilation I'm going to do is the 32-bit one, so I'll ignore this for now. The config utility runs and prepares the build for solaris-sparcv9-gcc.
Configured for solaris-sparcv9-gcc.
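For the 64-bit OpenSSL builds later on, the manual invocation is the one the NOTICE above describes:

./Configure solaris64-sparcv9-gcc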
Here are the CFLAGS
from the main Makefile:
CFLAG= -DOPENSSL_SYSNAME_ULTRASPARC
-DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H
-DOPENSSL_NO_KRB5 -m32 -mcpu=ultrasparc -O3 -fomit-frame-pointer
-Wall -DB_ENDIAN -DBN_DIV2W -DMD5_ASM
Two important flags here are -m32, which explicitly selects 32-bit output (GCC's default anyway), and -mcpu=ultrasparc, which tells the compiler to use optimizations for the UltraSPARC CPU (versus a SuperSPARC or older SPARC processor).
If you've done OpenSSL compilation on the x86 platform, this optimization is akin to x86's -march=i686, which produces faster code for Pentium Pro processors and above (there's no benefit that I could measure by optimizing for newer processors, like the P3). Most of the time, OpenSSL and a few other applications, as well as the kernel, are released with i686 optimizations. These CPU-specific optimizations make a big difference in OpenSSL performance on both the SPARC and x86 platforms.
The only thing left to do is run make, which worked flawlessly. The openssl binary sits in the apps/ directory, and we can check to ensure it's a 32-bit binary:
# file openssl
openssl: ELF 32-bit MSB executable
SPARC32PLUS Version 1, V8+ Required, UltraSPARC1 Extensions Required,
dynamically linked, not stripped
I went on and built OpenSSL in 4 variations: 32-bit and 64-bit versions with shared libraries (where libssl.so and libcrypto.so are separate), and 32-bit and 64-bit versions without external libcrypto and libssl libraries.
In general, if you’re
using OpenSSL, you’re probably using it with at least OpenSSH and
possibly other SSL or crypto applications. Thus, building shared
libraries is probably your best bet.
The test I ran was
openssl
speed rsa dsa, which runs through various RSA
and DSA operations. I ran the tests 3 times, averaged the results,
and rounded. There was little disparity between the three runs.
Here are the results:
OpenSSL 0.9.7c: Verify operations per second (longer bars are better)
OpenSSL 0.9.7c: Sign operations per second (longer bars are better)
In this first test, we
can see that 32-bit binaries were usually faster than 64-bit
binaries, although in some cases the results were nearly identical.
However, the speed difference wasn’t all that great, topping out at
about 12%.
GNU gzip 1.2.4a
GNU's gzip is also a useful benchmark; it's one of the tools used in SPEC's CPU2000 ratings, so I grabbed gzip's source from the main GNU FTP site. I picked the latest version available on the site, 1.2.4a.
To test gzip, I needed something to zip and unzip. I ended up using a tar of my /usr/local/ directory, as it had a nice mix of text files, binaries, tarballs, and even already-gzipped files. Also, it created a 624 MB file, which is big enough to negate disk or system caching.
I then created a 32-bit binary and a 64-bit binary using GCC 3.3.2. I used “-O3 -mcpu=ultrasparc” as the compiler CFLAG for both (with “-m64” added for the 64-bit version). I used the time utility to measure how long it took to run gzip and gunzip on the 624 MB tar file. I ran each operation for each binary three times and averaged the results (rounding to the nearest whole number). The three runs were very consistent.
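For reference, each timed run looked something like this (the tarball and binary names here are illustrative, not the actual paths I used):

time ./gzip32 usrlocal.tar
time ./gzip32 -d usrlocal.tar.gz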
GNU gzip 1.2.4a: gzip and gunzip
For the gzip operation, the 32-bit binary was about 20% faster than the 64-bit binary. For the gunzip operation, the 32-bit binary was nearly identical to the 64-bit binary (91 seconds versus 92 seconds for completion).
MySQL 4.0.17
MySQL was the most challenging, as compilation is quite a bit more involved than for either gzip or OpenSSL. I ran into several problems getting it compiled for 64-bit, but I was able to sort them out and ended up with a 64-bit binary. I added “-mcpu=ultrasparc” to the compile flags (for both the C compiler and the C++ compiler; MySQL uses both), and of course “-m64” for the 64-bit version. The MySQL configure script added “-O3” as a compiler option.
To test MySQL, I used the included sql-bench, which is a benchmarking toolkit for MySQL and a few other RDBMSs. It consists of a Perl script that runs through a set of operations using the DBI and DBD::mysql Perl modules. On this system, the full tests take about 4 hours to run, so I only ran them twice and took the best of the two. There was very little disparity between the runs.
It should be noted that for both the 64-bit and 32-bit tests, I used the 32-bit builds of sql-bench and the MySQL client libraries, as the Perl I used (5.8.2) is 32-bit. This was done to keep the client end consistent.
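For anyone reproducing this, a typical sql-bench invocation looks roughly like the following (the options shown are a sketch; check the run-all-tests documentation for your version):

cd sql-bench
perl run-all-tests --server=mysql --user=USER --password=PASSWORD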
MySQL: 32-bit versus 64-bit (shorter is better)
MySQL: 32-bit versus 64-bit (shorter is better)
The MySQL results were
a bit surprising, as two of the operations, insert and select, showed
faster results for the 64-bit binary of the MySQL server than the
32-bit version.
The Size Factor
Another argument against 64-bit binaries that I
see frequently is their larger size. And indeed, all of the 64-bit
binaries and libraries created for this test were larger:
Binary | 32-bit size (bytes) | 64-bit size (bytes) | % Larger |
---|---|---|---|
mysqld | 3,993,792 | 4,864,832 | 22% |
gzip | 75,472 | 115,976 | 54% |
openssl (shared) | 465,256 | 539,992 | 16% |
openssl (static) | 1,568,640 | 1,935,736 | 23% |
Library | 32-bit size (bytes) | 64-bit size(bytes) | % Larger |
---|---|---|---|
libcrypto.so.0.9.7 | 1,404,964 | 1,733,864 | 24% |
libssl.so.0.9.7 | 244,280 | 294,056 | 20% |
However, the difference wasn't prohibitive: the 64-bit binaries ranged from about 16% to 54% larger than their 32-bit counterparts. Unless the system is an embedded one with very limited storage space, I can't see this being all that much of a negative factor.
The Compile Factor
Getting applications to compile as 64-bit binaries can be tricky. The build process for some applications, such as OpenSSL, has 64-bit specifically in mind and requires nothing fancy. Others, like MySQL and especially PostgreSQL (I was originally going to include PostgreSQL benchmarks), took quite a bit of tweaking. There are compiler flags and linker flags to sort out, and you'll likely end up in a position where you need to know your way around a Makefile.
Also, building a 64-bit capable compiler can be an
experience to behold. That is to say, it can suck royally. After
spending quite a bit of time getting a 64-bit compiler built for (one
of my many) Linux installs, I ended up just going with the
pre-compiled version from http://www.sunfreeware.com.
Both building a 64-bit capable compiler and getting an application to compile as 64-bit can be time intensive, as a compile will often start out fine and die somewhere down the line. Then you fix a problem, start the compile again, and it dies maybe 10 minutes later. Then you fix another issue, and repeat until (hopefully) the compile finishes cleanly.
The Library Factor
One important factor in considering whether to
compile/use 64-bit binaries is the problem of shared libraries.
Initially, I hadn’t even thought of the library issue, but when
building 64-bit applications, I came to find it was significant.
The issue is that when a 64-bit application requires external libraries, those libraries need to be 64-bit; 32-bit libraries won't work.
Take OpenSSH for example. It’s typically
dynamically linked, and requires libcrypto.so.
If your libcrypto.so
is compiled 32-bit and SSH is compiled for 64-bit, you’ll get an
error:
> /usr/local/bin/ssh
ld.so.1: /usr/local/bin/ssh: fatal:
/usr/local/ssl/lib/libcrypto.so.0.9.7: wrong ELF class: ELFCLASS32
Killed
This means you may very well need to keep two copies of your shared libraries: one set for 32-bit binaries, and another for 64-bit binaries. Solaris keeps 64-bit libraries in directories like /usr/lib/sparcv9. You'll also need to adjust your LD_LIBRARY_PATH environment variable to point to where these 64-bit libraries are located when running a 64-bit binary.
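For example, something like this before launching a 64-bit ssh (the paths are illustrative for an OpenSSL installed under /usr/local/ssl; Solaris also supports LD_LIBRARY_PATH_64, which applies only to 64-bit programs):

LD_LIBRARY_PATH=/usr/local/ssl/lib/sparcv9:/usr/lib/sparcv9
export LD_LIBRARY_PATH
/usr/local/bin/ssh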
I ran into this time
and time again throughout these tests and throughout the entire
evaluation as a whole, and it was a huge pain in the ass.
Conclusion
While these tests are limited in scope, and there
are far more sophisticated tests that could be performed (such as raw
integer and floating point), this is a start, as I haven’t seen any
32-bit versus 64-bit tests out there. The lack of other benchmarks
seems strange to me; perhaps I didn’t look in the right places. (If
you know of any published benchmarks comparing 32-bit binary
performance versus 64-bit, please let me know).
Keep in mind these tests were performed on the
UltraSPARC platform for Solaris 9, and while they probably would have
relevance to other operating systems and platforms (such as Linux on
x86-64, or FreeBSD on UltraSPARC), specific tests on those platforms
would be far more revealing.
So while the tests I ran covered only a few applications in limited ways, the results seem to show that 64-bit binaries do indeed generally run slower. However, there are a few issues to consider.
One issue is that the difference in performance varies not only from application to application, but also with the specific operation a given application is performing. Also, the largest disparity I was able to see was around 20%, and not the many-times-slower disparity that I've seen some claim.
Since this was a limited number of applications in
limited scenarios, the best way to know for yourself is to give the
applications you’re concerned about a try in both modes.
In the end it’s the library and compiling issues
that are the most compelling reasons to stay with 32-bit binaries,
and not so much performance or size. I think it’s safe to say that
you’re not missing out by going with the simpler-to-manage 32-bit
binaries, unless your application can specifically benefit from
64-bit.
Related reading: My Sun Ultra 5 And Me: A Geek Odyssey
Benchmark Player Haters
Benchmarks are always such a contentious and embattled topic. No matter what benchmark you run, invariably someone has a negative comment to make (often rude and obnoxious). If you think otherwise, then write an article
and include benchmarks of some sort and get it published. Then sit back and enjoy the belligerent emails and rude comments.
So why is this? Well, I think it’s a combination of a couple of factors.
There are two pervasive emotional factors in benchmark-bashing. One is when a beloved and heroic operating system or application ends up on the losing end against some vile, contemptible waste-of-time operating system or application. Choices of operating system, application, hardware platform, database, etc. are very personal, so it's easy for some people to take the results of a benchmark as an affront to their manhood.
The problem is that benchmarks, by their very nature, are narrow in scope and fail to encompass the complexity of an operating system, application, or hardware platform. As a result, someone with even a mediocre knowledge of the technology can easily poke holes, and make themselves seem smart in the process.
But that's not what benchmarks are. Benchmarks are varying depths of exploration into unknown territory. Sometimes they can be very comprehensive, and other times they can be very simple. They answer only the questions they are asked, and can provide a basis for asking other questions.
For instance, in my review of UnixWare 7.1.3, I ran some OpenSSL tests to see if there’d be any performance hit from the Linux emulation layer, known as the LKP. Why did I do this? I had no idea if there would be any performance penalty. No one had tested it before to my knowledge, so predicting the outcome was impossible.
Even someone intimately familiar with the inner workings of UnixWare, Linux, and the methodology that enables the LKP to work could not know the outcome without running the test. Any supposition as to the results would be just that: a supposition. Anyone who claims they knew doesn't know what they're talking about; it's easy to say so after the fact. Now that those benchmarks have been run, we know. It was an easy test, and one that I could run.
But still, there were a few "he doesn't know what he's talking about" comments, including one particularly obnoxious guy who posted the same ignorant chest-beating comment on Slashdot and OSNews, like some sort of cyber-geek-stalker-player-hater. He relies on the virtues of hindsight, looking back and saying "of course the results would be such!"
If I ever do a review and you want to make a point, drop me a line. If you’re polite about it, I’m happy to discuss it and I’ll even take suggestions for other benchmarks. If you enjoy making obnoxious remarks
about benchmarks done by myself or others, then do your own benchmarks, write it up in an article, and get it published. No one’s stopping you, and it’s not all that difficult.
So keep that in mind as you read my reviews and benchmarks, and as you read benchmarks from others.
There's a huge difference between simply compiling 32-bit software with a 64-bit compiler and actually writing it to make the best use of the enhanced capabilities.
There is also no need at all to use 64-bit ints in cases where a 16-bit int would more than suffice…
Of course, I use quantum chemistry programs. Believe me, you can do a lot more interesting and bigger things with 64-bit compared to 32-bit. And the programs are much faster as the blokes who code these babies use every iota of register if they can.
And, I suppose, I’m colored by my main 64-bit exposure being on my Alpha XP1000 and Digital/Compaq/HP’s F90 compiler. Oooh…so nice, so well made. But, I’m guessing Opteron will be my next…then it’ll be gfortran (someday) or pgf90 for me.
This set of benchmarks is actually very nice, since it compares pure 32bit code on a system without too many differences between 64bit and 32bit implementations.
On x86-64, however, recompiling to 64bit also gives the compiler access to additional registers, so it doesn’t have to waste cycles re-arranging contents in the limited number of x86 registers. This offers a performance boost beyond simple recompilation and excluding the benefits of 64bits on operations which need it.
Therefore, readers considering Opterons and Athlon64s should be aware that the figures here do not represent the results they would discover from 64bit recompilation.
A synthetic test suite I've used for benchmarking Unix systems is nbench (http://www.tux.org/~mayer/linux/bmark.html). It's available as source and has a small variety of individual tests which, though each is detailed individually, aren't reproduced here.
Ultra-10, USIIi 333MHz/2MB:
——–
gcc -s -static -Wall -O2 -mcpu=v9 -m32
nbench: ELF 32-bit MSB executable, SPARC32PLUS, V8+ Required, version 1 (SYSV), statically linked, stripped
==============================LINUX DATA BELOW===============================
CPU :
L2 Cache :
OS : Linux 2.4.21
C compiler : gcc-3.2
libc : ld-2.3.2.so
MEMORY INDEX : 1.076
INTEGER INDEX : 1.257
FLOATING-POINT INDEX: 1.760
——–
gcc -s -static -Wall -O2 -mcpu=v9 -m64
nbench: ELF 64-bit MSB executable, SPARC V9, version 1 (SYSV), statically linked, stripped
==============================LINUX DATA BELOW===============================
CPU :
L2 Cache :
OS : Linux 2.4.21
C compiler : gcc-3.2
libc : ld-2.3.2.so
MEMORY INDEX : 1.102
INTEGER INDEX : 1.203
FLOATING-POINT INDEX: 1.661
——–
Running tests multiple times produces some variation, but in general they are all fairly close.
Since 64-bit support is rather young in GCC, it might have been fairer to use Sun's compiler for the tests. Unless things have changed recently, I believe that you can demo Sun's C compilers without buying them.
It is nice to see the different binary formats executed on the same system; this gives a good idea of how much overhead there is in 64-bit compiled binaries (vs. 32-bit) and how it affects performance given the same number of registers.
Another factor of interest, and one much more relevant to system performance, is that Debian on SPARC is actually compiled for SPARC v7, which is broadly equivalent to compiling an x86 distribution for a 386.
The most visible inefficiencies are in floating point code, which is done under FPU emulation regardless of whether or not an actual FPU is present (as it effectively always is nowadays, even on the oldest generally available SPARC systems second hand).
Typically it's not a major issue, as integer arithmetic dominates the programs supplied with the packaging system, but those running Debian on SPARC v8 or above (SPARCstation 4-20, Ultras) can benefit from recompiling OpenSSL, which will speed up anything that depends on it fourfold.
I'm pretty sure though that there will be a bigger difference when compiling x86 binaries as x86-64 binaries, because there will be more registers available to the compiler. I'm guessing the reason the 32-vs-64-bit difference is so small in nitrile's case is that SPARC was already RISC, so the UltraSPARC just added 32 bits… x86-64, IIRC, adds new registers. So as interesting as this is, I'd like to see the same tests performed on Solaris x86 and x86-64 when it comes out.
DISCLAIMER: I didn't RTFA but scanned it. I've never looked at the UltraSPARC ISA, but I have programmed in both x86 and SPARC assembly.
Wonder why OpenSSL is faster in 64-bit than 32-bit? And why it's the opposite for the other applications?
One hint: floating point registers. They are usually 64 bits, and therefore applications that make heavy use of floating point variables/registers directly benefit from a 64-bit binary.
Well, of course you'll benefit from 64 bits if you are dealing with more than 4GB of RAM, or memory-mapped space (e.g. dealing with files larger than 4GB).
On the other hand, using 64-bit registers/pointers takes up more memory. I.e. if you are dealing with 32-bit integer variables, then you are essentially wasting 32 bits of memory for each variable, and loading/storing from/to memory also takes more time. Therefore your applications slow down if you use 64-bit builds that only deal with 32-bit (integer) data.
OpenSSL uses lots of floating point calculations, therefore performance is better with a 64-bit build: the overhead of the 64-bit pointers is not as bad as the benefit from using 64-bit floating point registers directly.
MySQL/gzip, on the other hand, don't use much floating point, so the overhead of the 64-bit pointers slows them down.
I’m sure this is mostly concerning the recent Athlon64, as it is the first 64 bit x86 chip that I have heard of.
The benefit with it isn’t 64 bit integers, as stated above it’s that it has 16 registers per unit where Athlon has 8. Pentium 4 has 8, and I believe I read somewhere it has 112 other invisible registers. Doubling the registers is definitely a good thing as one can lose clock cycles even when requesting information from cache; and with an 18/20 stage pipeline losing a cycle is a big deal.
Also, we have to move eventually as 32bits can only hold the epoch till 2038 if I remember correctly. There are of course other reasons why a 64bit machine is cool. And a less than 20% performance loss in those tests is a small price to pay, as die size decreases we will make up for those speed issues in increased clock speeds within a year or two.
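A quick sketch of that epoch limit, using a plain int as a stand-in for a signed 32-bit time_t:

#include <stdio.h>

int main(void)
{
    /* 2^31 - 1 seconds after 1970-01-01 00:00:00 UTC falls in
       January 2038; one more second wraps a signed 32-bit counter. */
    int t = 0x7FFFFFFF;
    printf("max 32-bit time_t: %d\n", t);
    printf("one second later:  %d\n", (int)((unsigned)t + 1u));
    return 0;
}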
The benchmarks seem pretty predictable though, and it’s true that for most people’s needs there is no point in having a 64bit integer unit. But in 5 years there most likely will be, so we might as well start upgrading now.
If possible, I'd love to see a correct benchmark of an Athlon64, since all I have seen so far have been 32-bit comparisons. I want to see its 64-bit mode compared to its 32-bit compatibility mode.
You might want to check the article again. I believe it shows 32bit compiles as being faster on SSH.
Strictly speaking, almost all 32-bit processors support at least 64-bit floats – in fact, the P6 and above support 80-bit floats internally to reduce rounding errors. What you do see with a “64-bit” system is a 64-bit *integer* (long long) type in hardware. I wouldn't be surprised if OpenSSL made use of that fact.
Apart from tricks like PAE and whatever is in Panther that does the same thing, your memory addressing comment is quite correct. An Opteron running Linux x86-64 is a 64-bit system. A Xeon with 8 GB of RAM – and using it – is not.
It’s irrelevant.
@Uruloki A 64-bit machine doesn’t necessarily have “enhanced capabilities” over a 32-bit machine. It can use 64-bit integers, and address > 4GB of memory. Unless you are doing either of those, there is no real way to write your program to be faster on a 64-bit machine. Precisely what sort of code changes are you thinking about? Besides, in this day and age, you should almost never micro-optimize your code for a specific CPU architecture. Unless you are comfortable holding all the details of a 20-stage pipeline, different latencies for dozens of instructions, complex branch prediction, and the state of 128 rename registers in your head at once, the compiler will do a better job than you. And all your micro-optimizations will be useless when the next Intel chip comes out with different performance characteristics.
PS> Using a 16-bit integer is often a bad idea. CPUs like word-aligned data. 16-bit integers are quite often slower than 32-bit integers unless you are working with them in a way where the CPU can load two of them at a time. Often, if you put a 16-bit integer in a struct, the compiler will ignore you and pad it out to 32-bits, to maintain alignment for the fields after it.
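A small sketch of that padding behavior (exact layout is compiler- and ABI-dependent, but this is typical):

#include <stdio.h>

struct mixed {
    short a;  /* 2 bytes */
    int   b;  /* 4 bytes, usually requires 4-byte alignment */
};

int main(void)
{
    /* Typically prints 8, not 6: the compiler inserts 2 bytes of
       padding after 'a' so that 'b' stays aligned. */
    printf("sizeof(struct mixed) = %u\n", (unsigned)sizeof(struct mixed));
    return 0;
}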
@Gandalf
What you said is not true. On everything >= SuperSPARC and everything >= i387, the FPU is at least 64-bits. So whether or not you use a 64-bit build, double-precision (64-bit) floating-point math will run at the same speed as single-precision (32-bit) floating-point math.
Praytell, how do you “optimize” for 64-bit processors?
Thank you for a nice article; it gives noobs like myself good insight into the 32-vs-64 debate.
> You might want to check the article again. I believe it shows 32bit compiles as being faster on SSH.
Yeah, sorry, I was wrong in that point.
But as you can see, the difference in performance is smaller than for gzip. As for MySQL, there are obviously some operations that benefit from 64-bit as well (maybe he has a large database with more than 4GB), but things like ‘connect’ don't benefit at all (as you can see in the graphs).
Very interesting article on 64bit computing http://arstechnica.com/cpu/03q1/x86-64/x86-64-1.html
gives some answers to the questions…
You are wrong.
Almost all processors that support floating point support native 64-bit even in 32-bit implementations (also 80-bit for IA32/AMD64).
The reason that OpenSSL (and much other software that encrypts/decrypts data) can be faster is that it can use 64-bit integer registers to implement the algorithms in fewer steps than with 32-bit registers. There are very few algorithms besides encryption that need 64-bit operations, therefore very few applications gain as much.
Example (Add two 64bit numbers):
32bit IA32:
add eax, ebx ; add lower half
adc ecx,edx ; add upper half
64bit AMD64:
add rax,rbx ; add whole number
Not only does the 32bit version take two instructions instead of one, the second instruction is dependent on the first so they can not execute in parallel. Another problem is that the 32bit version uses 4 registers to represent the numbers while the 64bit uses only two.
If we do the same thing on a RISC processor (MIPS-like in this example) the 32bit version would be even slower as we don’t have hardware support for carry:
32bit MIPS-like:
addu r10,r10,r05 ; add lower half
subu r09,r10,r05 ; r09 is negative if carry was generated
srl r09,r09,31 ; shift MSB to LSB position (r09=1 if carry else 0)
addu r11,r11,r06 ; add upper half
addu r11,r11,r09 ; add “carry”
64bit MIPS-like:
addu r10,r10,r05 ; add whole number
The performance difference is even bigger if we compare 64×64->64bit multiplications that many encryption algorithms need.
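The C-level equivalent of the sequences above is just a 64-bit addition; compiling something like this with and without -m64 and reading the emitted assembly shows the difference (a sketch, not taken from the article):

#include <stdio.h>

int main(void)
{
    /* On a 32-bit target the compiler emits an add/add-with-carry
       pair (or the longer carry-free RISC sequence); on a 64-bit
       target this is a single add instruction. */
    unsigned long long a = 0xFFFFFFFFULL;
    unsigned long long b = 1ULL;
    printf("%llu\n", a + b);  /* 4294967296: the carry crosses bit 31 */
    return 0;
}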
“Also, building a 64-bit capable compiler can be an experience to behold. That is to say, it can suck royally. After spending quite a bit of time getting a 64-bit compiler built for (one of my many) Linux installs, I ended up just going with the pre-compiled version from http://www.sunfreeware.com“
Pre-compiled version of what? GCC?
“Besides, in this day and age, you should almost never micro-optimize your code for a specific CPU architecture. Unless you are comfortable holding all the details of a 20-stage pipeline, different latencies for dozens of instructions, complex branch prediction, and the state of 128 rename registers in your head at once, the compiler will do a better job than you. And all your micro-optimizations will be useless when the next Intel chip comes out with different performance characteristics.”
Do you really believe that compilers model processors to that detail? They don't. There is no need to do that (and it is in reality impossible) to get good performance out of the processor, and that applies to human coders also.
You might want to read the article again: OpenSSL 32-bit actually beat OpenSSL 64-bit. The only place 64-bit beat 32-bit was in a couple of the MySQL operations.
And while none of these applications seemed to be specially written for 64-bit (I couldn't find any indication of it), in many cases they do use 64-bit integers, depending on how the header files worked out. For PostgreSQL (I didn't end up doing benchmarks for it) I actually had problems compiling 32-bit because it kept trying to use 64-bit long long integers. I had to go in and manually edit the header files to get it to compile.
I also went out of my way I think to explain these tests were limited.
Yeah, the compiler (explained a bit earlier in the article) was GCC 3.3.2 from Sunfreeware.com.
To begin witnessing the effect of 64-bit processing, you need a large CPU cache (>=1MB), a large amount of very fast memory (>=2GB), wider and faster buses, and most definitely SCSI subsystems. When we talk 64 bits, the amount of data we are manipulating is very large. The speed of manipulating/processing such data is not the problem. The speed of moving and storing such data is where the problem lies.
You'd better have a pretty souped-up workstation to even begin thinking about a 64-bit desktop. I'm talking gigabytes of dual DDR2 RAM, SCSI RAID, the fastest processor on the market, and chipsets that support >800MHz FSB. Then we can begin talking 64-bit computing. I agree with Intel when they say 64-bit desktop computing is still way off, unless of course you are a power user or gamer.
Sorry, I didn’t mean to make my comments sound rude. I hate rude comments, so I apologize if they came off as such.
> What you said is not true. On everything >= SuperSPARC and everything >= i387, the FPU is at least 64-bits.
I didn’t deny that. Actually quite the opposite.
Often it’s 80bit or even 128bit.
FPU datatypes (differ from platform to platform):
float: most often 32-bit
double: >=64-bit (e.g. on ix86 it's 80-bit internally)
long double: 80-bit or 128-bit
But … the load/store of the 64-bit/128-bit values is slower for 32-bit builds than for 64-bit builds, amongst other things (e.g. conversion etc.)
The performance of >=64-bit floating point arithmetic (at least double precision) is *slower* on 32-bit systems than on 64-bit systems. It's not a huge difference, but it's noticeable.
> You are wrong.
Ummm, I disagree. Read on.
I posted earlier:
> One hint: floating point registers.
> They are usually 64bits and therefore applications that make
> heavy use of floating point variables/registers directly
> benefit from a 64bit binary.
Note “that make heavy use of floating point variables/registers directly”
Maybe not the best wording; maybe I should’ve said
“that often directly use floating point registers”
or even
“that often load/store floating point registers”
Your examples are brilliant, as they show exactly what I wanted to express.
RE: Chris (IP: —.student.iastate.edu) – Posted on 2004-01-22 22:34:05
I’m sure this is mostly concerning the recent Athlon64, as it is the first 64 bit x86 chip that I have heard of.
The benefit with it isn’t 64 bit integers, as stated above it’s that it has 16 registers per unit where Athlon has 8. Pentium 4 has 8, and I believe I read somewhere it has 112 other invisible registers. Doubling the registers is definitely a good thing as one can lose clock cycles even when requesting information from cache; and with an 18/20 stage pipeline losing a cycle is a big deal.
True; however, RISC chips have historically always had a huge number of registers. Also, regarding the “112 other invisible registers”, I assume you're referring to the rename registers, which are a nice hack for providing more registers without actually architecturally exposing more.
The thing with RISC is that, in theory, if an application is tweaked to the max, that is, using every possible feature of the chip (the huge number of registers, ISA enhancements like VIS) and a top-notch compiler, the RISC chip should perform better; however, in reality one cannot spend the amount of time required to actually get software performing that well.
Also, we have to move eventually as 32bits can only hold the epoch till 2038 if I remember correctly. There are of course other reasons why a 64bit machine is cool. And a less than 20% performance loss in those tests is a small price to pay, as die size decreases we will make up for those speed issues in increased clock speeds within a year or two.
One also has to take into account that we've reached a point of diminishing returns. Instead of the Intels of the world working on making their processors more efficient, that is, doing more work per clock cycle, they push the pipeline out to the moon and back just so they have the ability to boast in the clock speed hype-a-thons that occur in the local computer rags like ZDNet.
The benchmarks seem pretty predictable though, and it’s true that for most people’s needs there is no point in having a 64bit integer unit. But in 5 years there most likely will be, so we might as well start upgrading now.
If possible, I'd love to see a correct benchmark of an Athlon64, since all I have seen so far have been 32-bit comparisons. I want to see its 64-bit mode compared to its 32-bit compatibility mode.
IIRC, x86-64 has compatibility mode and long mode. Long mode is native x86-64, and it covers both 32-bit and 64-bit, meaning one can have a long-mode 32-bit application featuring all the benefits of the x86-64 ISA without the need to re-tune the application.
RE: JCS (IP: —.schaferhsv.com) – Posted on 2004-01-22 22:39:30
Apart from tricks like PAE and whatever is in Panther that does the same thing, your memory addressing comment is quite correct. An Opteron running Linux x86-64 is a 64-bit system. A Xeon with 8 GB of RAM – and using it – is not.
What is not spoken about is the fact that there are limitations and issues with PAE. Firstly, applications must be able to understand PAE; without that, applications will still only see the old maximum. There is a performance penalty for it too. BTW, a Xeon with 8GB of memory uses PAE with 36-bit addressing, giving a max of 64GB of addressable memory. x86-64, on the other hand, IIRC, addresses around 42-44 bits.
But … the load/store of the 64-bit/128-bit values is slower for 32-bit builds than for 64-bit builds
————-
All CPUs >= SuperSPARC have 64-bit internal datapaths, even running in 32-bit mode. That means that loads/stores for the FPU take exactly the same amount of time. Conversions between integers and floating-point values are extremely rare, so the performance of conversions isn't a big factor.
There is a very good primer on 32-bit vs 64-bit SPARC performance at:
http://www.sun.com/sun-on-net/itworld/UIR951101perf.html
You can get an even better feel for these things by sitting down and designing a CPU core. That's not very practical if you chase the big guys for 1-3GHz cycle rates, but if you pull down to 200MHz it's doable in an FPGA now. I will confine my comments to the Spartan-3 family being introduced by Xilinx. I am using the free-download WebPack tool to synthesize my design in Verilog. I am limited to the smaller parts, up to about 400K logic gates equivalent, which is more than enough to hold several small 16-64b CPU cores.
The core I am developing is parameterized so that I can choose the CPU width, register file height, and how deep the ALU pipeline is, as well as the size of internal SRAM memory (which might be cache or fixed).
The main limiting factor of most CPUs, and one that is very hard to work around, is the speed of 1st-level memory or cache. In the FPGA case it can cycle at about 200MHz on random access and is also dual ported. Each independent block RAM can be 512 by 32 or 16K by 1 or somewhere in between. Using the wider widths allows more bandwidth, so I use the 32b form. Further, the smallest FPGA may have 4 of these (sp3-50, about $3) and the largest about 104 (sp3-5000, about $100); prices are in high volume. Block RAMs can be ganged into super blocks to make bigger RAMs with little speed impact, but for N CPUs I would want to limit it to 1 RAM per CPU core. A more parallel uber-core might use these block RAMs for super-deep register files, TLBs, MMUs, etc.
Next is the width of the ALU or adder. In ASIC/VLSI design the width delay reduces to a mostly log cost, so the 64 v 32 delay difference is similar to 32 v 16, due to propagate-generate schemes forming N-way trees. This starts to break down a bit after 64b, as the doubling width becomes wire limited.
In FPGAs, general logic is about 5x slower than that found in 1GHz-level ASICs, BUT built-in logic that is simply instanced can be just as fast, since it is still circuit-level designed by the FPGA company. In FPGAs, carries are almost always ripple-carry type, as propagate-generate circuits are irregular and are not in the FPGA fabric.
So a 16b adder can cycle at 200MHz, 32b at 150MHz, and 64b at 100MHz; the cycle time follows the width. In all these cases the latency through the datapath is the same in cycle count. One way to speed up the wider CPU is to break the carry every 16b and add another pipeline stage. You can see now why CPUs start to get very deep. A CPU can then perform at 200MHz for any width if an extra N/16 latency cycles are accepted. But that introduces hazard headaches, which in turn have to be compensated for by various schemes.
Hazards can be reduced by making the compiler work harder, by adding hazard detection and register forwarding logic, and finally by adding multithreading to make nearby opcodes independent. There are more costs associated with all of these as well; I will be using some of each.
The height of the register file is likely to be 16 or 32. The penalty here is not speed but the sheer number of logic cells that form the dual-ported 16x1 teeny RAMs. These can be ganged into 4-way-ported regfiles with 1 write and 3 read paths. A 64b CPU with 64 registers sucks up 64x64x2x3/16 LUTs, or 1536 cells, out of a few thousand. If a CPU is going to have a really large regfile width & height, then it should use the block RAMs, but there are few of those also.
The piece de resistance is that since the CPU is a message-passing transputer core in spirit, it can be scaled up for N-way supercomputers with far less overhead than shared-memory designs. Now if each CPU node, even a 64b CPU, is close to $1 per instance, it begs the question: would I rather have N Opterons that do not scale in cost, albeit 10x faster per node, or 100x cheaper but 10x slower nodes? There's far more to the story than that, i.e. FPUs are going to be fairly weak and scarce in FPGA, and there is a compiler to support C, Occam & HDL, etc.
Hope this tidbit is of interest.
johnjaksonATusaDOTcom
Do you really believe that compilers model processors to that detail? They don't. There is no need to do that (and it is in reality impossible) to get good performance out of the processor, and that applies to human coders also.
———-
Compilers do model the processor to that level of detail. GCC's code generator uses a DFA (deterministic finite automaton) instruction scheduler. Each processor has a DFA description called an MD file. These MD files describe the details of the processor's pipelines. The MD file for the i386 architecture is 23,000 lines, plus another 1,000 lines for each specific CPU model. Several thousand additional lines of code are dedicated to GCC's register allocator. And the i386 is a relatively lenient architecture!
Precise modeling of the processor is even more important for processors like the IA-64 and PPC-970 that have complex rules for instruction grouping and instruction dispatch. It's even more important on IA-64, which doesn't do any internal reordering or optimization.
See http://kegel.com/crosstool if you need to build
an x86 -> x86_64 cross-compiler for Linux. Might
work for other 64 bit chips for Linux, I haven’t checked.
Bits obviously make no difference – let's all switch back to 16-bit. J/k
Oh yeah, and Linux doesn't run well on a lot of 64-bit systems, and a lot of the 64-bit systems like Solaris run slow on 32-bit systems… example:
x86 – Linux vs. FreeBSD vs. Solaris
Winners are 1) FreeBSD 2) Linux 3) Solaris
It’s expected that Linux will overtake FreeBSD due to its rapid development.
64-bit – Linux vs. FreeBSD vs. Solaris
Winners are 1) Solaris 2) FreeBSD 3) Linux
well..well, well
I appreciate the author’s attempt to provide an objective and unbiased review of the performance aspects concerning 32-bit vs. 64-bit computing on his Ultra 5. However, there are a number of details that he did not explore, and a few critical mistakes concerning his testing methodology. I’m going to start with the larger issues and work down from there:
First, his comparison between 32 and 64 bit applications is not correct. From the article:
# file openssl
openssl: ELF 32-bit MSB executable SPARC32PLUS Version 1, V8+ Required, UltraSPARC1 Extensions Required, dynamically linked, not stripped
While the binary format of this executable is ELF-32, the application in question is not a true 32-bit application. The SPARC32PLUS, V8+ required indicates that this application is compiled to use the SPARC v8plus architecture. V8plus uses 32-bit addresses but allows an application to registerize its data in 64-bit quantities, so realistically these comparisons are between programs that use 32 vs 64 bit addresses but all have 64-bit registers. This distinction isn’t explored in the article, but it is important. To get a true characterization of 32-bit addresses and registers, the benchmarks ought to also be compiled to the v7 architecture. I think this may make differences more observable between pure 32-bit and pure 64-bit applications.
The v8plus benchmarks show the obvious benefit of 64-bit registers to compute intensive applications while not suffering from the drawbacks of having a 64-bit address space. My suspicion is that if these tests are re-run for the v7 architecture, the results will find that the 32-bit applications perform better on workloads characterized by lots of load/store behavior, while the v9 applications trump the v7s at computations. This is because there’s more register space on v9, allowing more data to be computed at once.
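For anyone who wants to run that comparison, GCC flags along these lines should produce the three variants, pure v7, v8plus, and v9 respectively (my assumption based on GCC's SPARC options and the file output shown earlier; confirm each binary with the file utility):

gcc -mcpu=v7 -m32 app.c -o app.v7
gcc -mcpu=ultrasparc -m32 app.c -o app.v8plus
gcc -mcpu=ultrasparc -m64 app.c -o app.v9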
The reasons for 64-bit apps to slightly lag in performance are various, but there are some important things to keep in mind when examining these kinds of problems. With 64-bit addresses, you've doubled the size of your pointers, so this is one reason why the size of the compiled binaries increases; those addresses have to go somewhere. Also, since you have larger addresses, your cache footprint increases, which means you get fewer lines in the cache. More cache misses == poorer performance, as you have to go further down the memory hierarchy to satisfy your requests. As a point of fact, the SPARC v9 architecture only allows you 22 bits for an immediate operand, so to construct a 64-bit constant you have to issue more instructions. SPARC uses register windows, and when you take a register spill/fill trap in a 64-bit address space, you're going to have more information in a 64-bit trap than in a 32-bit one. These are just a number of the factors that characterize the behavior difference between 32 and 64 bit address spaces.
I also have some concerns about the author’s static vs. dynamic linking. In two cases the author compares v8plus vs. v9 using completely dynamically linked binaries, and in the other cases, he compares v8plus to v9 using mostly dynamically linked applications only statically linking to libcrypto and libssl. The problem here, is that there is still dynamic linker overhead both as the application is started up, and as it runs. While the “statically” linked binaries obviously benefit from having to take fewer detours through the PLT, these apps are still dynamically linked to libc, libthread, and probably others. So, the full benefit of statically linking them is lost. The 64-bit dynamically linked apps take longer than their 32-bit counterparts for reasons which include more instructions in the PLT to generate the function address to which to jump.
I’m sure there are plenty of other performance aspects that I forgot to touch upon, but my biggest frustration with this article is that it fails to tease out the details about which applications perform better on 32-bits and which perform better on 64-bits and why. I hope my comments were able to fill in some of those gaps. By running his benchmarks on a v8plus architecture, the author has successfully demonstrated what an effective compromise 32-bit addresses and 64-bit registers can be, but he hasn’t characterized actual 32-bit application performance. That said, I do appreciate his fair, factual, and un-evangelical approach to the benchmarking. It certainly provided a good starting point for discussions on 32-bit vs. 64-bit performance.
Praytell, how do you “optimize” for 64-bit processors?
By designing your huge-dataset algorithms around being able to address, say, each cell in a volume reconstruction, instead of using a technique similar to overlays from the old 16-bit days.
Even in “common” tasks like video editing, 32-bit addressing has reached its limits. It's not for nothing that serious file systems have 64-bit off_t's; but it would be cool to be able to mmap() such a file…
I don't think that's the sort of “optimization” the original poster was talking about, especially since the technique is entirely irrelevant to the benchmarks presented in the article.
Plus, what you are talking about is not so much optimization as it is getting rid of hacks that are no longer necessary because of a more permissive CPU architecture. Optimization takes general code and tunes it for a specific situation. What you are describing is actually the reverse — taking code tuned for a specific situation (low-memory) and generalizing it.
Also, since you have larger addresses, your cache footprint increases, which means you get fewer lines in the cache. More cache misses == poorer performance, as you have to go further down the memory hierarchy to satisfy your requests. As a point of fact, the SPARC v9 architecture only allows you 22 bits for an immediate operand, so to construct a 64-bit constant you have to issue more instructions.
Let's think for a second. The bigger the pointers, the fewer cache lines; does that make any sense? The number of cache lines is the same regardless of the number of bits in an address. A cache line is identified by a tag, and 32-bit and 64-bit addresses will eventually hash down to similar tags, thus occupying all the cache lines in the cache. Line size and the number of cache lines are always constant for caches.
SPARC uses a 22-bit immediate field only for the sethi instruction. There are more ways to construct a 64-bit constant; at most you will need 3 instructions to build one.
“Oh yeah, and Linux doesn't run well on a lot of 64-bit systems, and a lot of the 64-bit systems like Solaris run slow on 32-bit systems… example:
64-bit – Linux vs. FreeBSD vs. Solaris
Winners are 1) Solaris 2) FreeBSD 3) Linux
well..well, well”
I have data that disputes your rankings. See here:
64bit: Linux vs. FreeBSD vs. Solaris
1. Linux
2. FreeBSD
3. Solaris
As you can clearly see, I can make up rankings as well! Come talk to me when you have some REAL comparisons to show as proof, OK? If you don’t I’ll just assume that you pulled your rankings from thin air.
Actually, compiling with v7 is pretty worthless. The v7 instruction set doesn't include integer divide or integer multiply, nor does it use quad-precision floats (nor does v8 by itself for quad floats), which can mean a significant slowdown in several apps, including OpenSSL:
compiler: gcc -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DOPENSSL_NO_KRB5 -O3 -fomit-frame-pointer -Wall -mcpu=v7 -DB_ENDIAN -DBN_DIV2W
available timing options: TIMES TIMEB HZ=100 [sysconf value]
timing function used: times
sign verify sign/s verify/s
rsa 512 bits 0.0163s 0.0016s 61.5 623.8
rsa 1024 bits 0.0947s 0.0052s 10.6 190.9
rsa 2048 bits 0.6176s 0.0187s 1.6 53.4
rsa 4096 bits 4.3433s 0.0694s 0.2 14.4
sign verify sign/s verify/s
dsa 512 bits 0.0149s 0.0183s 67.3 54.6
dsa 1024 bits 0.0503s 0.0614s 19.9 16.3
#
The result reduces performance of OpenSSL by about half.
Using -mcpu=ultrasparc (or -xarch=native) as a compiler flag represents what most people would (and many build scripts automatically) use for UltraSPARC-based systems. It represents the highest level of optimization for an UltraSPARC system, and that’s what should be used if you don’t have to worry about running your binaries on older systems. And if you do, it’s probably worth your time to make two sets of binaries: UltraSPARC optimized, and binaries for the older processors you’re using (hopefully v8 and not v7).
The test was done to see if 32-bit binaries run faster on 64-bit capable systems than 64-bit binaries, so using the maximum available 32-bit optimizations for the UltraSPARC is entirely appropriate.
Granted, the tests were limited in scope. I hope this brings about a new round of performance tests done by others, representing other scenarios and methodologies.
I’m tired of seeing conjecture, and conjecture-taken-as-fact in regards to OS and platform performance. Even if it’s backed up by sound computer science theory (64-bit data paths, cache misses, etc), it’s still pure conjecture until it’s tested.
@Gandalf
“On the other hand, using 64-bit registers/pointers takes up more memory. I.e. if you are dealing with 32-bit integer variables, then you are essentially wasting 32 bits of memory for each variable, and loading/storing from/to memory also takes more time. Therefore your applications slow down if you use 64-bit builds that only deal with 32-bit (integer) data.”
Generally not true. There are multiple levels of the memory hierarchy, and traffic between main memory and caches generally happens in blocks (i.e. 32 or 64 bytes) at a time, which can be read and written in burst modes. The on-die L1 cache is what the processor does actual reads and writes from, and generally this connection is as wide as a register (i.e. 64 bits). Whether you read 8 bits or 32 bits or 64 bits to/from the L1 cache doesn't make a damn bit of difference.
What does contribute to the slowdown of a 64-bit application is that its data (both static and heap) takes up more space, reducing the overall effectiveness of the cache.
“OpenSSL uses lots of floating point calculations, therefore performance is better with a 64-bit build: the overhead of the 64-bit pointers is not as bad as the benefit from using 64-bit floating point registers directly.”
Generally floating point registers on modern processors are 64 or 80 bits wide (sometimes 128), regardless of the size of an integer register. There is a benefit to having a larger bus between the register file and the L1 cache, however.
@Megol
“Do you really believe that compilers model processors to that detail? They don't. There is no need to do that (and it is in reality impossible) to get good performance out of the processor, and that applies to human coders also.”
As Rayiner said, yes they do. GCC has a fairly general and configurable processor description language. Other compilers for specific architectures can go into even greater detail and perform optimizations that would be impossible to express in this language or mean nothing on a different architecture. There is a need to do that in some (very limited) situations. Generally a person will profile the code to identify hot parts of the code and begin removing bottlenecks in those areas. When two or three extremely hot paths of code are identified and the codebase is fairly stable and mature, then it may be worthwhile to optimize those few dozen or couple hundred lines to death for the best performance.
Let's think for a second. The bigger the pointers, the fewer cache lines; does that make any sense? The number of cache lines is the same regardless of the number of bits in an address. A cache line is identified by a tag, and 32-bit and 64-bit addresses will eventually hash down to similar tags, thus occupying all the cache lines in the cache. Line size and the number of cache lines are always constant for caches.
Eh, kind of. You're right that there will always be the same number of lines in the cache, but frequently caches are designed so that it is possible to put a number of blocks in one cache line. These blocks are selected by a tag, as you pointed out, and also an index. I should have been more specific: my comments were mostly in regard to the TLB, where you will certainly be able to hold fewer blocks if you're using a 64-bit VA instead of a 32-bit VA.
SPARC uses a 22-bit immediate field only for the sethi instruction. There are more ways to construct a 64-bit constant; at most you will need 3 instructions to build one.
Indeed. However, 22 bits is the biggest immediate value you get in the SPARC instruction set. This was more to point out the obvious benefit of being able to do the load directly in x86-64.
The test was done to see if 32-bit binaries run faster on 64-bit capable systems than 64-bit binaries, so using the maximum available 32-bit optimizations for the UltraSPARC is entirely appropriate.
You've missed my point entirely. The goals you stated at the beginning of the article were that you wanted to compare the performance of 32-bit applications versus 64-bit applications. However, I point out that there are some caveats, since your testing methodology, in a number of cases, mixes 64-bit things with 32-bit things. You don't even address the issues or the drawbacks, and instead insist that every possible optimization should be valid. If this is true, you should state that the purpose of your article is not to divine whether 32-bit is faster than 64-bit, but how users should go about getting their apps to run fastest on an UltraSPARC chip. Or that your goal was to determine the difference in speed between applications that use 32-bit pointers and 64-bit pointers. If those were your claims I would have no issue. However, you've set out with a general goal and only tested a few cases. I argue that the cases you've tested are not sufficiently general that you can offer a correct conclusion on the entirety of 32 vs. 64 bit performance.
I’m tired of seeing conjecture, and conjecture-taken-as-fact in regards to OS and platform performance. Even if it’s backed up by sound computer science theory (64-bit data paths, cache misses, etc), it’s still pure conjecture until it’s tested.
I’m not sure what you’re trying to say here. However, going into a laboratory and running experiments is never, by itself, going to cure cancer, AIDS, SARS, whatever. It is experimental results coupled with a body of knowledge that allow us to make conclusions that advance the sciences. If you don’t know anything about viral pathology, anatomy, physiology, etc. and you go run a medical experiment, you’re unlikely to learn much from it. Do you insist that your waiters prove that real numbers exist before you pay your bill at a restaurant? Do you make people prove that gravity exists when they want to talk to you about it? This statement sounds awfully ridiculous. There are plenty of things that are theoretical that we accept as fact because it facilitates the ease with which we communicate about other things. Like any other field, there is an accepted body of knowledge relating to computer systems which is important to understand if you hope to reach meaningful conclusions. Simply dismissing everyone else’s points as conjecture because they have not proven them to _you_ is silly. However, if you think you can prove that application performance increases if you increase your cache miss rate, please feel free to test it out. I think you’ll find that this is much more solid than conjecture.
“Eh, kind of. You’re right that there will always be the same number of lines in the cache… my comments were mostly in regard to the TLB, where you will certainly be able to hold fewer blocks if you’re using a 64-bit VA instead of a 32-bit VA.”
The blocks you mention are always the same size regardless of the address size. An n-way associative cache on a 64-bit processor will have the same line size and block size whether 32-bit or 64-bit addresses are being used. The index is the line: say two VAs hash down to line 0, then their index is 0.
TLBs are nothing special; they are fully associative caches for the MMU, if you will. TLBs cache address translations, not data or instructions. The tag portion of the TTE is always the same size regardless of 32/64-bit VAs. The data section, which holds the physical address, is also the same size no matter the size of the address.
Just for fun I tried building our grid analysis software with gcc 3.x instead of Sun’s C compiler from the Forte Compiler Collection. The software makes extensive use of 64-bit integer math (and consequently executes much faster on a 900MHz UltraSPARC III+ than it does on a 2GHz Athlon running Linux), as all values in the grid files are either 64-bit integers (representing fixed-point values) or 80-bit IEEE floats.
Unfortunately I don’t have specific numbers on hand, and I’m sure gcc has matured quite a bit since I tried this (about a year and a half ago), but the binary built with gcc performed at around 60% of the speed of the binary built with Forte 7.
Compared to the Sun compiler, gcc on sparcv9 is a joke for performance-critical applications.
“The blocks you mention are always the same size regardless of the address size…”
Yes, of course. I think we’ve misunderstood each other. All I’m saying is that if you’ve got blocks in your cache that are a fixed size and you’re using a 64-bit quantity as opposed to a 32-bit quantity, you’ll use up more blocks holding the 64-bit data. Assume that your cache line holds a number of 64-byte blocks. If you’re using 64-bit quantities, a block can only hold 8 words (64-bit words), but if you’re using 32-bit quantities, a block can hold 16 words (32-bit words). If you could previously store 16 32-bit items and now you can only store 8 64-bit items, your per-item ability to use the cache has decreased. You’re right that the size itself has not decreased, but you can now hold fewer items. Obviously the 8 64-bit quantities are size-equivalent to the 16 32-bit ones, but you’ve halved your ability to access objects in the cache.
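This is easy to demonstrate from C. A minimal sketch, assuming a 64-byte cache line purely for illustration (line sizes vary by processor):

#include <stdio.h>

int main(void)
{
    /* How many pointers fit in one 64-byte cache line? */
    printf("sizeof(void *) = %u bytes\n", (unsigned) sizeof(void *));
    printf("pointers per 64-byte line = %u\n",
           (unsigned) (64 / sizeof(void *)));
    return 0;
}

Built with gcc -m32 this reports 16 pointers per line; built with gcc -m64, only 8.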
The next article (I believe it will go up on Monday) is on GCC (both 2.95 and 3.3.2) versus Sun’s compiler. There are some surprises in the results.
“All I’m saying is that if you’ve got blocks in your cache that are a fixed size and you’re using a 64-bit quantity as opposed to a 32-bit quantity, you’ll use up more blocks holding the 64-bit data…”
Yes, what you just described is true to a certain extent. However, in practice not every data type in a 64-bit binary is 64 bits wide. 64-bit binaries might have 8-, 16-, and 32-bit data objects in them, and caches do allow you to address a byte within a cache line. All I am getting at is that it is not very accurate to say that 64-bit addressing automatically yields poorer performance due to higher cache misses than a 32-bit binary. It is possible in the scenario you describe above.
In reality, not everyone who codes a 64-bit program makes all the data 64-bit quantities.
“…it is not very accurate to say that 64-bit addressing automatically yields poorer performance due to higher cache misses than a 32-bit binary… In reality, not everyone who codes a 64-bit program makes all the data 64-bit quantities.”
I don’t dispute those statements at all. I didn’t think that I said that this was automatically the main reason that performance degrades, but that it is one possible aspect to consider when keeping track of issues affecting the performance of 64-bit applications. Certainly this is only going to apply to a subset of objects in an application. I didn’t mean to give the impression that this was the primary cause of a performance difference between 32 and 64 bit apps. It sounds like we’re in agreement, though…?
Well, I reread your original post. You did make some really great points overall, and I certainly think those are issues to consider for newcomers to the 64-bit world. However, I still got the impression that you meant that 64-bit addresses implied more cache misses. But I am glad you clarified your statement.
No harm done, I am glad we had a good discussion here. It is seldom that a good discussion happens here at OSNews. Many seem to turn into a flamewar at the blink of an eye.
“However, I still got the impression that you meant that 64-bit addresses implied more cache misses. But I am glad you clarified your statement.”
Yeah, that original statement was not particularly clear. But you also made a lot of good points, where I was required to go, “oh crap, I was wrong…how do I salvage something correct from this…” but in the end I think we both managed to come to some consensus about that.
“I am glad we had a good discussion here. It is seldom that a good discussion happens here at OSNews. Many seem to turn into a flamewar at the blink of an eye.”
Hehe…me too. Some people take these sorts of things awfully personally, but I’m glad neither one of us seemed to. Hopefully the rest of Tony’s articles will provide equally engaging material to discuss.
Does this nbench thing run on Mac OS X? There seems not to be a “configure” script, and when I run make as it says to do in the README I get: “ld: can’t locate file for: -lcrt0.o”
The source is here:
http://www.tux.org/~mayer/linux/bmark.html
-DMD5_ASM
It looks like the 32-bit version of OpenSSL has some assembly code in it, whereas the 64-bit version was compiled from C. So the performance gap isn’t as large as your graphs show in those tests.
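If anyone wants to rerun that comparison on an even footing, OpenSSL’s Configure accepts a no-asm option that forces the portable C implementations in both builds. A sketch (the target names below are the ones from the 0.9.x tree; running ./Configure with no arguments lists the targets your version knows about):

./Configure solaris-sparcv9-gcc no-asm      (32-bit build, C code only)
./Configure solaris64-sparcv9-gcc no-asm    (64-bit build, C code only)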
“As you can clearly see, I can make up rankings as well! Come talk to me when you have some REAL comparisons to show as proof, OK? If you don’t I’ll just assume that you pulled your rankings from thin air.”
I’m talking about SPARC, and I did not make up benchmarks; go research it yourself. FreeBSD performs better than Linux in scalability. It’s widely known, but it’s expected that Linux will exceed that due to its rapid development.
Go look for yourself. It’s better that you search the internet for this information yourself, because 1) I don’t have the time, and 2) it will give you a better understanding of the performance from a variety of benchmarks and tests rather than from a biased one.
I was just browsing this article and saw this and decided to reply. Now it’s time for me to get back to work. Because, you know, customers are #1, not sitting around trying to pull up thousands of documents proving what you said is wrong. Oddly enough, it appears your information came from less than thin air… That’s all I have to say.
The author’s methodology is jive.
(just kidding, I couldn’t help myself, as everyone here’s tried to remain civil…)
interesting results
I think it can be used like you said: to ask further questions.
I think you can never have too much analysis, really.
Every sampling method you can devise can tell a story if you know how to read it.
A hearty WELL-DONE, and keep up the good work.
PS – I recently installed Gentoo Linux from stage 1, which I expect is analogous to your setup: starting with the compiler, the kernel and every executable on the system get compiled specifically for your chip, using all your chip’s optimizations or CISC tricks.
It really runs super-smooth and quick on my 2.4 GHz AMD.
Had to scrap it because I eventually messed it up totally trying to get the sound card to work, but I’ll be giving it another shot soon!
The author makes the point that the Ultra 5 can only have up to 512MB of RAM, so that addressing beyond 2GB is never going to be an issue. What the author seems to be completely missing is that the Ultra 5 has a *hard drive* that is larger than 2GB. Anyone who has really tried to performance-tune a production Usenet news server or *production* MySQL database knows that when buffers/databases grow beyond 2GB, it is nice when fseek() is being called from a truly 64-bit binary. How large were Tony’s MySQL database files? He didn’t say. I keep track of customer trouble tickets for at least a year, and the database very quickly exceeds 2GB. I would make a bet that Tony was using something more along the lines of 20MB.
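To make the fseek() point concrete, a minimal sketch (bigfile.dat is a placeholder):

/* seek3gb.c: seek past the 2 GB mark in a large file.
   A 64-bit binary does this natively; a 32-bit binary needs the
   large-file interfaces, e.g.:
   gcc -D_FILE_OFFSET_BITS=64 seek3gb.c -o seek3gb */
#include <stdio.h>
#include <sys/types.h>

int main(void)
{
    FILE *f = fopen("bigfile.dat", "r");
    if (f == NULL)
        return 1;
    /* A 3 GB offset overflows a 32-bit off_t but is trivial
       for a 64-bit one. */
    if (fseeko(f, (off_t) 3 * 1024 * 1024 * 1024, SEEK_SET) != 0)
        perror("fseeko");
    fclose(f);
    return 0;
}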
Is it just me, or does Slashdot do everything in their power to not advertise OSNews? This article is linked as “Tony Bourke decided” with no mention of OSNews in the post. I seem to recall other recent OSNews articles going unnamed as well.
Aces Hardware has some examples e.g.:
http://www.aceshardware.com/read.jsp?id=60000279
and
http://www.aceshardware.com/read.jsp?id=60000256 (unfortunately only Windows for this page)
The problem here is that there is still dynamic linker overhead, both as the application starts up and as it runs. While the “statically” linked binaries obviously benefit from having to take fewer detours through the PLT, these apps are still dynamically linked to libc, libthread, and probably others. So the full benefit of static linking is lost. The 64-bit dynamically linked apps take longer than their 32-bit counterparts for reasons which include more instructions in the PLT to generate the function address to which to jump.
True, but Solaris has never shipped 64-bit static libraries for libc or libthread. And in Solaris Next, it will not have any 32-bit static libraries. So the comparisons he did are really the only reasonable ones to do.
A few remarks/corrections:
SPARC v7 _has_ floating point instructions, but it lacks integer multiplication/division. (hint: ‘man gcc’)
The reason why there is focus on the executable size when discussing 32/64 bit is not disk space but CPU cache size. When a larger portion of the executable fits within the CPU cache, the executable will operate faster.
It would have been interesting to see a comparison against Sun CC as well. In our experience, since GCC 2.95.3 Sun has not been able to match the efficiency and executable footprint of GCC.
Assuming you don’t have additional instructions or registers in 64-bit mode that you don’t have in 32-bit mode, and that you don’t heavily use 64-bit integer data types, the main effect of running in 64-bit mode is that your cache is less effective. This is because the cache is a fixed size, and you are now putting larger variables into it (namely larger ints and pointers). You would see this effect most if you were just on the edge of thrashing the cache in 32-bit mode; 64-bit mode would thrash completely.
And another note:
The Sun Ultra 5 will accept just as much memory as the Ultra 10 (2GB?) – the only constraint is that you have to remove the floppy drive from your Ultra 5 to make physical space for the DIMMs.
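Returning to the cache-effectiveness remark above: pointer-heavy data structures show the effect most clearly. A minimal sketch (the sizes in the comments are typical, not guaranteed, since padding is up to the ABI):

#include <stdio.h>

struct node {
    struct node *next;   /* 4 bytes in a 32-bit binary, 8 in 64-bit */
    int value;           /* 4 bytes either way */
};

int main(void)
{
    /* Typically 8 bytes with gcc -m32 and 16 with gcc -m64, once
       alignment padding is included, so the same linked list
       occupies twice the cache space in 64-bit mode. */
    printf("sizeof(struct node) = %u\n", (unsigned) sizeof(struct node));
    return 0;
}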
I have an Opteron 246 and ran the FFTW benchmark in both 32- and 64-bit modes. The 64-bit mode was about 20% faster. If I’m not mistaken, that’s because the processor has twice as many general-purpose registers available when running in 64-bit mode.
Just a thought regarding the unzipping performance. The results are likely to be even (32 vs. 64) because unzipping places a very light load on the CPU. The algorithm is designed to make decompression much easier than compression, and therefore you are limited by I/O. You didn’t mention whether you wrote the decompressed output to disk or to /dev/null. Of course, making that change would only remove one part of the I/O; reading the compressed file would still be done.
Thanks for the interesting article.
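On the decompression point above, an easy way to factor out the write side when timing (testfile.gz is a placeholder):

time gzip -dc testfile.gz > /dev/null    (decompress, discard the output)
time gzip -dc testfile.gz > testfile     (decompress, write to disk)

If the two times are close, the write side wasn’t the bottleneck; if the second is much slower, disk writes were coloring the results.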
“Does this nbench thing run on Mac OS X? There seems not to be a ‘configure’ script, and when I run make as it says to do in the README I get: ‘ld: can’t locate file for: -lcrt0.o’”
vi the Makefile and uncomment lines 67 and 68. It should then compile with a simple ‘make’.
The German magazine c’t did a comparison back in October 2003, also with GCC 3.3.
The 64-bit versions did worse on only 3 of the 12 tests and were 6% faster on average. Here is exactly how the 64-bit versions compared to the 32-bit ones:
164.gzip + 6%
175.vpr – 6%
176.gcc +16%
181.mcf – 1%
186.crafty + 4%
197.parser +10%
252.eon +25%
253.perlbmk – 1%
254.gap + 5%
255.vortex +10%
256.bzip2 + 4%
300.twolf + 1%
But the reason is probably the 16 general-purpose registers instead of 8.
They say the PPC970 should be better qualified, if only there were a true 64-bit OS for it.
I wish this guy would get to the point and stop being so fluffy.
FYI, the Solaris runtime linker (ld.so.1) supports two entirely separate LD_LIBRARY_PATH variables for 32- and 64-bit executables. If you are going to mix executing 64- and 32-bit binaries and need their respective dynamic libraries to be loaded automatically, I suggest reading the manpage for ld.so.1. It was very helpful when I once wanted to do this.
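For reference, a sketch of those two variables (the paths are just examples; a sparcv9 subdirectory is the usual Solaris convention for 64-bit libraries):

LD_LIBRARY_PATH_32=/usr/local/lib; export LD_LIBRARY_PATH_32
LD_LIBRARY_PATH_64=/usr/local/lib/sparcv9; export LD_LIBRARY_PATH_64

The _32 variable is consulted only by 32-bit binaries and the _64 variable only by 64-bit ones, so the two sets of libraries never collide.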
When I do the same tests using Forte Developer 7 on Solaris 8, the 64-bit builds are faster, and by quite a bit. Try building the 64-bit versions with, say, -xarch=v9a and see how you go. From years of experience, gcc on Solaris is crap.
Is this dude sponsored by Intel? Why would someone go through these kinds of efforts to find out a 64-bit app is a tiny bit slower than the 32-bit version? Maybe some people were shocked to find the Opteron in 64-bit mode was a lot faster than when running in 32-bit mode. Why would someone create FUD about 64-bit being slower than 32-bit when the Opteron is currently pulling _all_ the bricks out of Intel’s backyard??
remember this quote? :
“Windows [n.]
A thirty-two bit extension and GUI shell to a sixteen bit patch to an eight bit operating system originally coded for a four bit microprocessor and sold by a two-bit company that can’t stand one bit of competition.”
(Anonymous Usenet post)
Here’s another one :
“Itanium [n.]
a.k.a. Itanic. An incompatible sixty-four bit extension to a thirty-two bit Pentium 4 CPU, created by a company whose previous CPU was called Pentium 5 and which presumably also cannot count upwards in performance.”
Robert