SPARC Optimizations With GCC

Continuing my series of articles exploring my SPARC-based Sun Ultra 5, I’m going to cover compiler optimizations on the SPARC platform. While many are familiar with GCC compiler optimizations for the x86 platform, there are naturally differences for GCC on SPARC, and some platform-specific issues to keep in mind.

These are tips that are easy to add when compiling software where performance is important. And as it turns out, the SPARC platform has characteristics that further benefit from optimization, often more dramatically than x86.

Since compilers work based on the hardware architecture, these tips apply to GCC on all operating systems that run on SPARC, including all of the operating systems reviewed in this series of articles (FreeBSD, Linux, Solaris, NetBSD, and OpenBSD). These tips cover GCC 2.95 through the current GCC (3.3.2 as of writing).

This article is written from the perspective of a sys admin, and not a developer. System administrators are usually concerned with performance, and these are tips to help when compiling source code.


Basics On The SPARC Platform

For the SPARC platform, there are 3 basic classes of processors: V7, V8, and V9. The SPARC V7 is the lowest common denominator for the SPARC platform; anything compiled with the SPARC V7 instruction set will run on any SPARC-based system, just like i386 is the lowest common denominator for the x86 platform.

V7-based systems include Sun’s sun4 and sun4c systems, such as the SPARCStation 1, 2, and IPX (sun4c) and the Sun 4/300 (sun4).

The V8 architecture includes sun4m and sun4d systems. The V8 architecture adds some instructions that really help out with performance, including integer divide and multiply. These benefits will become apparent in later tests.

Sun4m-based systems include the SPARCStation 5, 10, 20, and Classic, and sun4d-based systems include the SPARCServer 1000 and SPARCCenter 2000.

The V9 architecture covers 64-bit processors (as opposed to the 32-bit V7/V8 processors), which are fully backwards compatible with the previous architectures. The V9 processors include the UltraSPARC, UltraSPARC II, UltraSPARC III, and the new UltraSPARC IV processors. V9 systems are known as sun4u, which is what my Sun Ultra 5 is classified as. Sun currently only makes systems based on SPARC V9/sun4u.

Architecture    SPARC version
sun4/sun4c      V7
sun4d/sun4m     V8
sun4u           V9 (64-bit capable)
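The architecture classes above can be turned into a small lookup. This is only a sketch: the function name mcpu_for_arch is made up for illustration, and it assumes the machine reports Solaris-style names (sun4c, sun4m, sun4u) from uname -m. Other operating systems may report different strings, in which case the mapping simply declines to guess.

```shell
# Map a Sun architecture name to a GCC -mcpu value, following the
# sun4/sun4c -> V7, sun4m/sun4d -> V8, sun4u -> V9 classes described above.
# mcpu_for_arch is a hypothetical helper, not part of GCC or Solaris.
mcpu_for_arch() {
    case "$1" in
        sun4|sun4c)  echo v7 ;;
        sun4m|sun4d) echo v8 ;;
        sun4u)       echo ultrasparc ;;
        *)           return 1 ;;   # unknown name: do not guess
    esac
}

# Assumption: uname -m prints a Solaris-style architecture name.
arch=$(uname -m)
if cpu=$(mcpu_for_arch "$arch"); then
    echo "suggested flag: -mcpu=$cpu"
else
    echo "no SPARC mapping for '$arch'; leaving -mcpu unset"
fi
```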



Processor-specific Optimizations

On SPARC systems, GCC produces V7 binaries by default, just as GCC on the x86 platform produces binaries based on the i386 instruction set by default.

One way to possibly increase performance in a dramatic way is to set the -mcpu option to your specific processor. Here is a portion of the entry from the GCC docs regarding this option:

-mcpu=cpu_type

Set the instruction set, register set, and instruction
scheduling parameters for machine type cpu_type.

Since the processor in my Sun Ultra 5 is a V9-based UltraSPARC IIi, I’ll use -mcpu=ultrasparc. Since the only V9 systems are UltraSPARC, there’s no real reason to use -mcpu=v9; -mcpu=ultrasparc works for all UltraSPARC processors and is (theoretically) the highest optimization.

It should be noted that specifying -mcpu=ultrasparc or even -mcpu=v9 for the V9/64-bit class of processors will not create 64-bit code. The code will still be tuned for the UltraSPARC processors, but the binaries will remain 32-bit. The creation of 64-bit code requires using the -m64 flag (-m32 for 32-bit code is implied by default).
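One way to confirm which kind of binary a given set of flags produced is to look at the class byte in the ELF header. The sketch below assumes the binary is an ELF file (as on Solaris and Linux) and uses a made-up helper name, elf_bits. It does not check the ELF magic bytes, so it is only meaningful on files already known to be ELF. The demonstration runs on two fake five-byte headers so it works on any machine.

```shell
# Report whether an ELF file is 32-bit or 64-bit by reading the EI_CLASS
# byte at offset 4 of the ELF header: 1 means 32-bit, 2 means 64-bit.
# elf_bits is a hypothetical helper; it does not verify the ELF magic.
elf_bits() {
    class=$(od -An -tu1 -j4 -N1 "$1" | tr -dc '0-9')
    case "$class" in
        1) echo 32 ;;
        2) echo 64 ;;
        *) echo unknown ;;
    esac
}

# Self-contained demonstration on two fake ELF headers.
demo_dir=$(mktemp -d)
printf '\177ELF\001' > "$demo_dir/fake32"
printf '\177ELF\002' > "$demo_dir/fake64"
elf_bits "$demo_dir/fake32"   # prints 32
elf_bits "$demo_dir/fake64"   # prints 64
rm -rf "$demo_dir"
```

Following the paragraph above, a binary built with only -mcpu=ultrasparc would be expected to report 32, and one built with -m64 added would be expected to report 64. The file command gives the same information where it is installed.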

To show how dramatically -mcpu can affect performance on the SPARC architecture, I ran some comparison tests with OpenSSL 0.9.7c compiled with three -mcpu optimizations: v7, v8, and ultrasparc.

These tests were run under Solaris 9 (12/03), with OpenSSL compiled using GCC 3.3.2 at -O3. Each test was run three times, and the results averaged. The individual results varied very little.
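The run-three-times-and-average procedure can be scripted. The helpers below are a sketch, not the exact method used for these tests. The names rsa_signs_per_sec and average are invented here, and the column positions assume the openssl speed output layout quoted later in this article (for example "rsa 1024 bits 0.1279s 0.0076s 7.8 131.7").

```shell
# Pull the sign/s column for one RSA key size out of `openssl speed` output.
# Assumes the layout: rsa <bits> bits <sign time> <verify time> <sign/s> <verify/s>
rsa_signs_per_sec() {
    awk -v bits="$1" '$1 == "rsa" && $2 == bits && $3 == "bits" { print $6 }'
}

# Average a column of numbers read from standard input.
average() {
    awk '{ sum += $1; n++ } END { if (n > 0) printf "%.2f\n", sum / n }'
}

# Demonstration on the NetBSD result line quoted in this article.
echo 'rsa 1024 bits 0.1279s 0.0076s 7.8 131.7' | rsa_signs_per_sec 1024   # prints 7.8

# A real measurement might look like this (left commented out because each
# run takes a while and needs the openssl binary):
# for run in 1 2 3; do openssl speed rsa1024 2>/dev/null; done |
#     rsa_signs_per_sec 1024 | average
```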




For the computationally-intensive OpenSSL, the -mcpu=ultrasparc optimization doubled the performance when compared with V7.

Of course not all applications will benefit to this extent; presumably there are applications that would benefit very little. But for those CPU-intensive operations, this optimization can make a big difference.

The difference is much more dramatic than what we would see with similar optimizations on the x86 platform.

To show you the contrast in performance in intra-platform optimizations, I ran the same test on a Pentium III 1 GHz x86 system. I compiled OpenSSL with -march=i386 and -march=i686 (the highest effective optimization for my Pentium III system).

The x86 test system is running Linux 2.4, and OpenSSL 0.9.7c was again compiled with GCC 3.3.2. They were compiled with -O3, and each run was done 3 times with the results averaged. Again, there was very little delta between the individual runs.

Since I’m running a Pentium III, I could have used -march=pentium3. I actually did, and found there to be no difference in results between -march=i686 and -march=pentium3. Also, OpenSSL on Linux x86 is often distributed in both i386 and i686 iterations.

Remember, we’re not comparing the performance of a 1 GHz Pentium III processor with a 333 MHz UltraSPARC IIi processor; rather, we’re comparing the gap between the lowest common denominator and the highest (effective) optimization on x86 with the same gap on SPARC.


As you can see, the i686 flag does indeed give a performance boost as expected, but it’s not nearly as dramatic as the difference between V7 and V9 (or even V8) on SPARC. This highlights the importance of optimizations for SPARC.


Contrasting With x86
You may have noticed that I used -march for x86, yet -mcpu for SPARC. For x86 GCC users this may seem confusing, since -mcpu under x86 only tunes for a specific CPU, but doesn’t take advantage of any additional instructions or additional functionality.

For SPARC, there is no -march flag; instead, it uses -mcpu to specify platform-specific optimizations. The -mtune flag works as -mcpu has typically been used on the x86 platform, by tuning code for a particular processor but not taking advantage of additional instructions. (It should be noted that the -mcpu flag has actually been deprecated on x86 GCC in favor of -mtune.)

So while -mtune is the same on both x86 and SPARC (it creates backward-compatible tuned binaries), -mcpu creates CPU-specific binaries on SPARC that are not backward compatible, and -march does the same for x86.
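To make the distinction concrete, here is a sketch of how the two SPARC flags might be combined when one binary has to run on older machines but will mostly run on newer ones. The helper name sparc_compat_flags is hypothetical, and the V8-plus-UltraSPARC combination is just one example.

```shell
# Build a flag string from a baseline instruction set and a tuning target.
#   $1: oldest CPU class the binary must still run on (passed to -mcpu)
#   $2: CPU to schedule for, without using its extra instructions (-mtune)
# sparc_compat_flags is a hypothetical helper used only for illustration.
sparc_compat_flags() {
    printf '%s\n' "-mcpu=$1 -mtune=$2"
}

# Runs on V8 and newer, scheduled for UltraSPARC.
sparc_compat_flags v8 ultrasparc   # prints: -mcpu=v8 -mtune=ultrasparc
```

This trades away the V9-only instructions that -mcpu=ultrasparc would enable in exchange for a binary that still runs on sun4m and sun4d systems.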

For great resources on GCC for x86, check out GCC Facts and Myths by Joao Seabra and the GCC x86 optimization docs from GCC.

The -On Flag
Another optimization option for GCC (universal to all platforms) is the -On flag, which controls many more specific optimization flags.

Further reading on these optimizations can be found on the GCC document site.

To see the effect of the -On flag, I compiled OpenSSL 0.9.7c with -mcpu=ultrasparc and -On, where n ranged from 0 through 3, which is the range for GCC. (There’s also -Os, which optimizes for size by leaving out optimizations that tend to increase code size, but I didn’t test that.)

As before, the tests were run 3 times for each variant, and the results averaged. There was very little delta between the runs. OpenSSL 0.9.7c was used on Solaris 9 (12/03), compiled with GCC 3.3.2.
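A comparison like this is easier to repeat when the flag sets are generated rather than typed by hand. The sketch below only prints the four flag combinations; the rebuild and benchmark step is left as a comment because it depends on the package being built. The function name opt_level_flags is made up for this example.

```shell
# Print one flag set per optimization level, -O0 through -O3.
#   $1: value for -mcpu (defaults to ultrasparc, as used in this test)
# opt_level_flags is a hypothetical helper used only for illustration.
opt_level_flags() {
    cpu=${1:-ultrasparc}
    for n in 0 1 2 3; do
        printf '%s\n' "-O$n -mcpu=$cpu"
    done
}

opt_level_flags | while read -r flags; do
    echo "would rebuild and benchmark with: $flags"
    # Rebuild the package with these flags and run the benchmark here.
done
```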



The results were quite surprising, as I had thought going in that there would be a greater delta between the various levels of optimization. As the results show, there wasn’t much difference until dropping to -O0.

This was only a single application, and the effectiveness of these optimizations will vary of course depending on your application, so keep that in mind.

SSH is Slow, but Why?
On many of the operating systems I evaluated, such as NetBSD 1.6.1, I noticed that logging in via SSH was inordinately slow. It could take 3 or more seconds to get a password prompt. I knew it wasn’t the hardware, as logging into Solaris via SSH would return a password prompt in less than a second. So what was the culprit? Google came up with these two items of particular note:

  • http://sparclinux.net/faq/cache/22.html
  • http://marc.theaimsgroup.com/?l=linux-sparc&m=101663078400634&w=2

    When I ran the OpenSSL speed test on NetBSD, I got extremely poor performance:

    OpenSSL 0.9.6g 9 Aug 2002
    built on: NetBSD 1.6.1
    options:bn(32,32) md2(int) rc4(ptr,int) des(ptr,risc1,16,int) blowfish(idx)
    compiler: gcc version 2.95.3 20010315 (release) (NetBSD nb3)
    sign verify sign/s verify/s
    rsa 512 bits 0.0248s 0.0022s 40.2 449.9
    rsa 1024 bits 0.1279s 0.0076s 7.8 131.7
    rsa 2048 bits 0.9217s 0.0276s 1.1 36.2
    rsa 4096 bits 6.4647s 0.0928s 0.2 10.8
    sign verify sign/s verify/s
    dsa 512 bits 0.0224s 0.0281s 44.7 35.5
    dsa 1024 bits 0.0750s 0.0927s 13.3 10.8

    This was even slower than my OpenSSL 0.9.7c tests with the V7 instruction set on Solaris 9. Performing a “/usr/bin/true” through SSH on NetBSD showed the lengthy delay:

    > time ssh 192.168.0.19 "/usr/bin/true"
    0:02.79
    Almost 3 seconds! And it wasn’t just NetBSD, either. A few others suffered the same problem.

    To fix this, I compiled OpenSSL 0.9.6l from NetBSD’s pkgsrc and compiled OpenSSH 3.7.1p2. I made sure to include “-mcpu=ultrasparc”, and ran the “/usr/bin/true” test again.

    > time ssh [email protected] "/usr/bin/true"
    0:01.35

    I was able to cut the time almost in half with that optimization.

    I ran the same test on Solaris 9, using OpenSSH 3.7.1p2 and OpenSSL 0.9.7c libs compiled for -mcpu=v7 and for -mcpu=ultrasparc.

    For -mcpu=v7, the login took over a second and a half.

    > time ssh [email protected] "/bin/true"
    0:01.56

    With -mcpu=ultrasparc, however, it took less than a second.

    > time ssh [email protected] "/bin/true"
    0:00.95

    Where To Add The Optimizations
    There are a few ways to add optimizations at compile time. For many applications, you can go into the Makefile and look for the CFLAG entry, such as this for OpenSSL 0.9.6l on NetBSD 1.6.1:

    CFLAG= -fPIC -DDSO_DLFCN -DHAVE_DLFCN_H -DTERMIOS -O2 -Wall

    Here is where I would add -mcpu=ultrasparc, probably at the end.

    CFLAG= -fPIC -DDSO_DLFCN -DHAVE_DLFCN_H -DTERMIOS -O2 -Wall -mcpu=ultrasparc
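For repeated builds, the same edit can be scripted instead of done by hand. This is a minimal sketch using sed. The helper name append_cflag is invented here, it assumes the variable is spelled CFLAG= at the start of a line as in the OpenSSL Makefile above, and it assumes the flag contains no '/', '&', or backslash characters.

```shell
# Append a flag to the CFLAG= line of a Makefile, keeping a .orig backup.
# Assumes the flag contains no '/', '&' or '\' (true for -mcpu=ultrasparc).
# append_cflag is a hypothetical helper used only for illustration.
append_cflag() {
    makefile=$1
    flag=$2
    cp "$makefile" "$makefile.orig" &&
    sed "s/^CFLAG=.*/& $flag/" "$makefile.orig" > "$makefile"
}

# Demonstration on a temporary file holding the CFLAG line shown above.
demo=$(mktemp)
echo 'CFLAG= -fPIC -DDSO_DLFCN -DHAVE_DLFCN_H -DTERMIOS -O2 -Wall' > "$demo"
append_cflag "$demo" -mcpu=ultrasparc
cat "$demo"
rm -f "$demo" "$demo.orig"
```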

    For applications like MySQL, there are several subdirectories with their own Makefiles, all generated/configured when the configure script is run. Editing just the top-level Makefile probably will not affect the subdirectories, so there needs to be another way.

    Often, these applications will accept the environment variables CFLAGS (for the C compiler flags) and CXXFLAGS (for the C++ compiler flags).

    export CFLAGS="-O3 -mcpu=ultrasparc"

    Running that before you run the configure script will add those flags. You can see this in the following excerpt from the configure --help output from MySQL 4.0.17, showing the various compiler-related environment variables it accepts:

    CC          C compiler command
    CFLAGS      C compiler flags
    LDFLAGS     linker flags, e.g. -L[lib dir] if you have libraries in a
                nonstandard directory [lib dir]
    CPPFLAGS    C/C++ preprocessor flags, e.g. -I[include dir] if you have
                headers in a nonstandard directory [include dir]
    CXX         C++ compiler command
    CXXFLAGS    C++ compiler flags
    CPP         C preprocessor

    This is common for the more complex open source applications.
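Since a package like MySQL contains both C and C++ code, it usually makes sense to set both variables to the same value. A minimal sketch, where OPT_FLAGS is just a convenience variable introduced for this example and is not read by configure itself:

```shell
# Export the same optimization flags for the C and C++ compilers.
# OPT_FLAGS is a local convenience name, not something configure looks at.
OPT_FLAGS="-O3 -mcpu=ultrasparc"
export CFLAGS="$OPT_FLAGS"
export CXXFLAGS="$OPT_FLAGS"

echo "CFLAGS=$CFLAGS"
echo "CXXFLAGS=$CXXFLAGS"

# Then, from the top of the source tree:
# ./configure
```

Because the variables are exported, they stay set for later configure runs in the same shell session, so it is worth unsetting them before building something for a different class of machine.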


    To Optimize or Not To Optimize
    Optimization really depends on what you’re compiling. If you’re creating a “hello world” application, or compiling ls from GNU’s fileutils, you probably don’t need to squeeze every ounce of possible performance. Characteristics such as mathematical operations versus I/O would all be factors in the potential benefit.

    Still, the performance optimizations discussed can have a potentially huge impact on performance on SPARC systems, much more dramatically than comparable optimizations on x86 systems.

    As such, adding -mcpu options for compilation is a good idea for systems that support V8 or higher. Even if you’ve got a mix of systems, it can very well be worth your time to keep multiple sets of binaries, one for each platform you run.
