Why glxgears Is Slower with Kernel Mode Setting, Why It Doesn’t Matter

Guest post by Rahul 2009-03-14 Linux 25 Comments

Fedora Project has been on the forefont of development and adoption of kernel mode setting to enhance the desktop linux experience by making fairly invasive infrastructure improvements that affect the interaction between Xorg and the Linux kernel. In the past, one of the common way to test Xorg performance has been to use glxgears. While that hasn’t been a particular good way to do it ever, the switch to kernel mode setting for Intel drivers ahead of the Fedora 11 Beta release to be available shortly has exposed the fallacy of this. In short, don’t use glxgears. There are better methods to assess performance.

25 Comments

2009-03-14 11:55 pm
sbergman27
So, let me get this straight. glxgears is too “insanely simple” a rendering task for DRI2 to perform well, and so performance tanks by about 60% compared to DRI1. So don’t look at glxgears; it makes us look bad. Use these other benchmarks that we have cherry-picked that make us look better.
I’m not going to argue that glxgears was ever a great benchmark. But this just sounds too much like “Ignore the man behind the curtain!” for my taste.
Edited 2009-03-14 23:56 UTC

2009-03-15 12:21 am
ebassi
in short: having more frames per second than your display is able to paint is just plain stupid.
sure, it might make you feel that you haven’t wasted money on the latest and greatest GPU, but it’s *never* a meaningful benchmark of *anything*.
for this reason the only sane value for glxgears should be the equal to the redraw frequency of your display — 50, 60, 85, whatever. having 1000s of FPS is an insane value, and it does not measure anything except marginal stuff — like the speed of a GL buffer swap in this case.
all applications should *always* sync to the vblanking of your display; it avoids tearing, it simplifies redrawing cycles and avoids wasting power.

2009-03-15 12:29 am
sbergman27
I’m not sure I’d be so quick to dismiss the cost of a glXSwapBuffers(). 440 swaps per second corresponds to 22% of a refresh at 100Hz.
I’ll bet you that in some future release we’ll be hearing that “Optimizations have sped up glswapbuffer() by factor of 2.5!” which will be our cue to cheer.
Edited 2009-03-15 00:31 UTC
2009-03-15 4:10 am
cb_osn
in short: having more frames per second than your display is able to paint is just plain stupid.
sure, it might make you feel that you haven’t wasted money on the latest and greatest GPU, but it’s *never* a meaningful benchmark of *anything*.
for this reason the only sane value for glxgears should be the equal to the redraw frequency of your display — 50, 60, 85, whatever. having 1000s of FPS is an insane value, and it does not measure anything except marginal stuff — like the speed of a GL buffer swap in this case.
all applications should *always* sync to the vblanking of your display; it avoids tearing, it simplifies redrawing cycles and avoids wasting power.
For many types of applications, this is true, but it makes less sense when the processing behind each frame is intensive. Games are the obvious example where lots of processing time is spent preparing the next frame for rendering– logic, AI, physics, animation, culling are all necessary prior to submitting commands to the graphics card. Locking the graphics driver to the refresh rate effectively causes buffer swap function call to block the entire thread until the next vblank. This means that you can’t even begin preparing the next frame until the current one has been presented to the display.
The optimal solution is to run your rendering operations in a separate thread. Unfortunately, the issues involved in synchronizing the render/logic threads are nontrivial. The massively parallel PS3 and the trend of increasing the number of cores rather than the clock speed in PCs have accelerated research in this area, but we’re not quite there yet.

2009-03-15 12:24 am
JMcCarthy
No, they’re not saying don’t look at it they’re saying it’s meaningless, it’s a good benchmark for a particular task, which they admit is indeed slower, otherwise it’s a poor general benchmark.
If what you’re suggesting is true they aren’t really doing any favors by suggesting applications which are much more intensive.
Edited 2009-03-15 00:26 UTC
2009-03-15 12:25 am
Rahul
Simple tools for benchmarking are not a problem. Simplistic tools, not reflective of real world performance are. Even worse are tools that measure the wrong thing. That glxgears isn’t reflective of real world performance is not new knowledge and has been known for a long time. Ask a Xorg or OpenGL developer or even a avid gamer. We have people using top and blaming Xorg instead of the actual application (hint: use xrestop to find the actual culprit). This is similar.
With DRI2, what glxgears tests for can indeed get slower but it doesn’t affect real world performance and there are other equally simple tools that are more accurate.
2009-03-15 12:26 pm
renox
The only reason users used glxgerars is that it’s ubiquitous as the only thing glxgears is good for is to tell whether the 3D is HW accelerated or not.
2009-03-15 8:07 pm
vermaden
I’m not going to argue that glxgears was ever a great benchmark. But this just sounds too much like “Ignore the man behind the curtain!” for my taste.
Glxgears never should been used as a benchmark, it only shows that 3D is up and running.
Other thing is that my buddy noticed a slow down after all that new things went up (DRI2/KMS/…) in the 3D stack, not only glxgears.
I do not bother more about all this “3D developpment” generally, xorg becomes bigger and bigger shit, things that used to work now does not work, it appears that hald is needed when it never was, sad really …

2009-03-15 9:20 pm
Rahul
Without HAL, Xorg used to poke the hardware directly bypassing the kernel which would occasionally result in weird and hard to track bugs. Using the same infrastructure as rest of the desktop is a very good thing. We can finally get some integration.
The reason that Xorg did it’s own thing and duplicate parts of the operating system was because it was build on proprietary Unix systems with no deep access to OS internals. We cannot continue to carry on historical baggage and mistakes.

2009-03-15 10:23 pm
dmantione
No, the reason is that Linus didn’t want video drivers in his kernel. “Just do what mama said: Use VGA text mode and X” is a popular quote from him. Also remember the big kernel flame war about GGI/KGI.
It’s ironic that more than a decade later, people realiser the GGI/KGI folks have been right: Video-configuration should be in-kernel.
I still haven’t had the answer why swapping buffers needs to be slower because of kernel-mode-setting, though.

2009-03-16 1:34 am
Rahul
Linus objections came in much later and only relates to Linux. Xorg has a longer history than that. I am talking about original days of the Xorg group before XFree86 people forked.

2009-03-15 12:53 am
BSDfan
If glxgears is slow, there is a bug in the driver.. simple.
Edited 2009-03-15 00:53 UTC

2009-03-17 7:55 pm
atriq
If Quake Wars is slow, then there’s a bug in the driver. If glxgears is slow, I really don’t care.

2009-03-15 1:10 am
Morph
“glXSwapBuffers() – basically, how fast we can push render buffers into the card… is slower with DRI2”
Does anyone know exactly why it is slower? Because there is an extra copy involved when copying data from userspace to kernel space? Or because of the system call overhead?

2009-03-15 4:21 am
cb_osn
Does anyone know exactly why it is slower? Because there is an extra copy involved when copying data from userspace to kernel space?
Without digging through the code myself, this would be my assumption. As far as I know, X has always dealt with accelerated graphics by using overlays– which causes its own set of problems. If DRI2 is handling composition of 3D graphics by copying the data out of the GPU buffers and back into system memory, this would account for the decrease in speed. If this is the case, I can only hope that it is a stopgap measure and that they eventually intend to handle all compositing in the graphics hardware. Of course, given the way network transparency works in X and how deeply embedded it is in the graphics model, I’m not sure how they can reconcile those features.

2009-03-15 2:47 pm
siride
Please stop blaming network transparency. It’s only relevant for the wire protocol, and even then, on a local machine, pretty fast alternative mechanisms are used instead of, e.g., TCP sockets.

2009-03-16 6:14 am
cb_osn
Please stop blaming network transparency. It’s only relevant for the wire protocol, and even then, on a local machine, pretty fast alternative mechanisms are used instead of, e.g., TCP sockets.
This is not the same as being able to use shared memory to copy pixmaps across process boundaries. When OpenGL is involved, we’re talking about copying data from GPU memory to system memory which requires a synchronized copy across the PCI bus– and this incurs a severe performance penalty.
I have a fairly limited knowledge of X, but a great deal of knowledge about 3D graphics. I’m making an educated guess that DRI2 must perform this expensive copy operation due to some technical limitations of X, possibly related to network transparency (I assume each rendered frame must be present in system memory for transmission across the network, and I also assume this is the case even if the X server is running locally). I’ll try to find the time to do some research on it in the next few days, and if anyone else has more detailed information, I’m more than willing to be corrected if I am wrong.

2009-03-16 1:16 pm
siride
3d over the network is to be handled differently from 3d locally. And, as far as I have seen, they really haven’t cared much about making 3d over the network work that well. You are pretty much limited to sending GLX commands over the wire and having the X server do all the rendering work (aka, indirect rendering). This is completely different from DRI and DRI2 which have clients communicate directly with the hardware, bypassing even the X server. There may be a need to move stuff from GPU to system memory, but that would not be related to network transparency.
2009-03-16 4:20 pm
ba1l
I assume each rendered frame must be present in system memory for transmission across the network, and I also assume this is the case even if the X server is running locally.
You’d be wrong on both counts.
In network mode, rendering is assumed to be done on the remote X server. OpenGL commands are serialized (that’s what the GLX spec mainly deals with), and sent across the network connection to the remote X server. The remote X server then renders the scene using any graphics hardware it has. The actual rendering buffer is kept in VRAM by the remote X server.
In local mode, OpenGL uses direct rendering. The OpenGL implementation sends rendering commands directly to the graphics hardware, bypassing even the X server. Again, the buffer stays on the video card.
DRI2 makes a lot of changes, mostly relating to memory management. It’s likely that it’s having to copy the render buffer somewhere else for display, but that’s likely to be somewhere else in VRAM. Video cards have very fast copy operations that should be able to do that.
If it were copying the entire scene into system RAM, and then back again, that’d cause a much greater performance loss than the 40% or so we’re seeing here.
Then again, this is on Intel hardware, which is on-board, and doesn’t have it’s own VRAM. A copy to or from an area of RAM the video hardware can use should be fairly fast, since it doesn’t have to go through the AGP or PCI-X bus. So I have no idea.
2009-03-16 4:53 pm
cb_osn
In network mode, rendering is assumed to be done on the remote X server. OpenGL commands are serialized (that’s what the GLX spec mainly deals with), and sent across the network connection to the remote X server. The remote X server then renders the scene using any graphics hardware it has. The actual rendering buffer is kept in VRAM by the remote X server.
Are GL resources (textures, VBOs, display lists) serialized as well? It seems likely that these resources would not be present on the machine running the remote X server.
In local mode, OpenGL uses direct rendering. The OpenGL implementation sends rendering commands directly to the graphics hardware, bypassing even the X server. Again, the buffer stays on the video card.
Good to know, thanks.
DRI2 makes a lot of changes, mostly relating to memory management. It’s likely that it’s having to copy the render buffer somewhere else for display, but that’s likely to be somewhere else in VRAM. Video cards have very fast copy operations that should be able to do that.
Yes, VRAM to VRAM copies are very fast and would not be an issue.
If it were copying the entire scene into system RAM, and then back again, that’d cause a much greater performance loss than the 40% or so we’re seeing here.
We used to do exactly this to enable render to texture support before p-buffers (and later, FBOs) where standardized. The performance penalty that we saw was actually very close to 40%, which is why I assumed this may be the case.
Then again, this is on Intel hardware, which is on-board, and doesn’t have it’s own VRAM. A copy to or from an area of RAM the video hardware can use should be fairly fast, since it doesn’t have to go through the AGP or PCI-X bus. So I have no idea.
[/q]
This is a great point, and makes me even more curious as to why we’d see such a performance loss during a buffer swap.
2009-03-16 4:37 pm
cb_osn
Thank you, siride and ba1l, for your responses.
Edited 2009-03-16 16:38 UTC

2009-03-15 2:53 pm
sbergman27
Of course, given the way network transparency works in X and how deeply embedded it is in the graphics model, I’m not sure how they can reconcile those features.
Uhhh… what do you think D, R, and I stand for?

2009-03-15 3:24 am
3rdalbum
Oh no! With the current situation, I get 4700 FPS on glxgears. With kernel mode setting, I only get a mere 1890 FPS! Now it’s only 23x as many frames per second than I can actually see!
Didn’t glxgears once have an option to turn on frames-per-second reporting, which was enabled by using the flag “-iunderstandthisisnotabenchmark”?
If kernel mode setting doesn’t introduce a slowdown in the Phoronix Test Suite benchmarks, then I’m happy.
2009-03-15 6:09 am
dmantione
One of the worst things you can do is to hide graphics hardware behind the kernel. Like the original IBM PC had a BIOS call to draw a pixel… This is not going to work, drawing operations need to be performed from userspace. If DRI2 requires a system call to draw actual data, it has a braindead architecture.
That does not mean kernel mode setting is a bad idea, it is good. Proper achitecture leaves graphics card configuration, including mode setting to the kernel. This is because the kernel must know the current configuration to ensure system stability. The actual drawing is a task that belongs to userspace.
2009-03-18 4:59 pm
axilmar
…then it should specify an interface which the usermode X driver fills with appropriate code.