“Maybe I’m just naive, but designing a graphics API such that all image data had to be sent over a socket to another process every time the image needed to be drawn seems like complete idiocy. Unfortunately, that is precisely what the X Window System forces a program to do, and exactly what Cairo does when drawing images in Linux – a full copy of the image data, sent to another process, no less, every time it is drawn. One would think there would be some room for improvement. Unsurprisingly, others felt the same way about X, and decided to write an extension, Xlib Shm or XShm for short, that allows images to be placed in a shared memory segment from which the X server reads, which allows the program to avoid the memory copy. GTK already makes use of the XShm extension, and it seems like a good idea to see if Gecko couldn’t do the same.”
Maybe this is why FF always feels laggier on *nix for me.
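For readers who haven't used the extension, here is a minimal, hedged sketch of what the MIT-SHM (XShm) path looks like in plain Xlib. Window setup is simplified, most error handling is omitted, and the 500x500 size is arbitrary; it should compile against -lX11 -lXext on a typical Linux box.

/* Minimal MIT-SHM sketch: the pixel data lives in a SysV shared memory
 * segment that both the client and the X server map, so XShmPutImage
 * does not have to copy the image over the socket. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <X11/Xlib.h>
#include <X11/extensions/XShm.h>

int main(void)
{
    Display *dpy = XOpenDisplay(NULL);
    if (!dpy || !XShmQueryExtension(dpy)) {
        fprintf(stderr, "no display or no MIT-SHM support\n");
        return 1;
    }

    int scr = DefaultScreen(dpy);
    unsigned int w = 500, h = 500;
    Window win = XCreateSimpleWindow(dpy, RootWindow(dpy, scr), 0, 0, w, h, 0,
                                     BlackPixel(dpy, scr), WhitePixel(dpy, scr));
    GC gc = DefaultGC(dpy, scr);
    XSelectInput(dpy, win, ExposureMask);
    XMapWindow(dpy, win);

    XEvent ev;                                   /* wait until the window is up */
    do { XNextEvent(dpy, &ev); } while (ev.type != Expose);

    /* Create an XImage whose data pointer is backed by shared memory. */
    XShmSegmentInfo shminfo;
    XImage *img = XShmCreateImage(dpy, DefaultVisual(dpy, scr),
                                  DefaultDepth(dpy, scr), ZPixmap, NULL,
                                  &shminfo, w, h);
    shminfo.shmid = shmget(IPC_PRIVATE, img->bytes_per_line * img->height,
                           IPC_CREAT | 0600);
    shminfo.shmaddr = img->data = shmat(shminfo.shmid, NULL, 0);
    shminfo.readOnly = False;
    XShmAttach(dpy, &shminfo);                   /* server maps the same segment */

    memset(img->data, 0x80, img->bytes_per_line * img->height);  /* "render" */
    XShmPutImage(dpy, win, gc, img, 0, 0, 0, 0, w, h, False);    /* no socket copy */
    XSync(dpy, False);
    sleep(2);

    XShmDetach(dpy, &shminfo);
    img->data = NULL;                            /* the data is shm, not malloc'd */
    XDestroyImage(img);
    shmdt(shminfo.shmaddr);
    shmctl(shminfo.shmid, IPC_RMID, NULL);
    XCloseDisplay(dpy);
    return 0;
}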
Honestly… Xshm has been around for like ~10-15 years (I’ve used it since at least 2001). And none of the Cairo folks thought that it might be useful for fast drawing of images… That explains a lot…
Why not cache a few larger XShm segments and parcel out pieces of them for small images? That would save a lot of the setup costs.
I wondered the same. (I have zero experience with xlib; I’m just guessing)
In general you can do one of two things:
A. Create slab-cache management for SHM segments (within Xlib; e.g. 1KB, 2KB, 8KB, 16KB classes). Use lazy release for unused blocks. (A rough sketch of this idea follows after this comment.)
B. Allocate one huge shared buffer and cut it into small, variably sized pieces.
(B) has the obvious downside of concurrency issues that may negate any performance advantage you get by reducing the number of shared memory blocks to one.
(A) will increase the memory footprint of your application and fragment your memory. (Especially in browsers, where image sizes tend to vary radically.)
– Gilboa
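To make option (A) above concrete, here is a rough, purely hypothetical sketch of such a slab-style pool of SysV shared memory segments with lazy release. None of these names exist in Xlib; the class sizes and counts are made up for illustration.

/* Hypothetical slab pool for MIT-SHM segments (approach A above). */
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define SLAB_CLASSES    4
#define SLABS_PER_CLASS 8

static const size_t slab_sizes[SLAB_CLASSES] = { 1024, 2048, 8192, 16384 };

struct slab {
    int    shmid;    /* SysV segment id, also what XShmAttach() would be given */
    void  *addr;     /* client-side mapping from shmat(); NULL until first use */
    size_t size;
    int    in_use;
};

static struct slab pool[SLAB_CLASSES][SLABS_PER_CLASS];

/* Hand out a cached segment of at least `need` bytes, creating one lazily. */
static struct slab *slab_acquire(size_t need)
{
    for (int c = 0; c < SLAB_CLASSES; c++) {
        if (slab_sizes[c] < need)
            continue;
        for (int i = 0; i < SLABS_PER_CLASS; i++) {
            struct slab *s = &pool[c][i];
            if (s->in_use)
                continue;
            if (!s->addr) {                          /* create on first use */
                s->shmid = shmget(IPC_PRIVATE, slab_sizes[c], IPC_CREAT | 0600);
                if (s->shmid < 0)
                    return NULL;
                s->addr = shmat(s->shmid, NULL, 0);
                s->size = slab_sizes[c];
            }
            s->in_use = 1;
            return s;
        }
    }
    return NULL;   /* too big or pool exhausted: caller falls back to XPutImage */
}

/* "Lazy release": mark the block free but keep the segment mapped for reuse,
 * which is where the fragmentation/footprint cost mentioned above comes from. */
static void slab_release(struct slab *s)
{
    s->in_use = 0;
}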
1. I agree that using XShm when the client is local is good.
2. I’m surprised that Cairo did not already use it.
However, it looks to me like XShm reduced the total time by only about 20% and the author concluded that the original way was “idiocy”, which seems a bit over the top, especially considering the extra flexibility of the socket approach.
Despite all the moaning about X I hear on this site, not one of the 40 or so users I have running complete desktops via remote X over 100mbit lan has ever once complained to me about their screen update speeds.
100BaseT is the baseline at which no one complains. Move back to 10baseT or 10base2 and compare its performance in a network environment next to OS X and its WindowServer approach.
I’ve used such setups and I haven’t really found X to be slower at all. In fact, because two machines are involved, it created a sort of dual-core environment, where one CPU did X and the other (on the remote machine) ran all the programs.
Of course, GTK+ was a little laggy, but it’s that way even with a local connection.
I’ve also done remote X across a campus network (actually, between two networks on the campus) and it was tolerable, with a little bit of lag. Part of the problem was the shitty X server running on a slow Windows machine on my end.
If the application is well written (properly uploads all its images to the X server and reuses them), X still rules over a 10 Mbit network compared to RDP, because you have an X server with all extensions activated. RDP can only do 2D RGB screen updates (which it does very well), but cannot do YUV, 2D rendering (XRender) or 3D.
Watching a DVD with Xine on a remote Unix server, for example, works rather well because a PAL frame is just 720×288 pixels 50 times a second. YUV data is only 16 bits per pixel. Xvideo on the Xserver takes care of RGB conversion, deinterlacing and scaling.
Doing this with RDP works very badly, because if you watch full-screen at 1280×960 the RDP server needs to send 1280x960x32bit@50Hz, which is doubtful even on a 100 Mbit network.
Further, note that X is not so much bandwidth-limited as latency-limited. Therefore it performs poorly over the internet, while RDP shines when running over the internet.
It sounds like you’re thinking of VNC. I’m pretty sure that RDP only sends the *instructions* and raw image data (if needed) to draw the window on the client side.
So no, watching a DVD wouldn’t be a full-screen update @ refresh rate.
Try it and watch the difference between Windows & Linux. Highly recommended.
The difference between VNC and RDP is that VNC captures the desktop and sends the modified parts, while RDP sits at the GDI level: it sends GDI drawing commands over the network. As GDI has no “overlay” facilities, watching video over RDP falls back to sending RGB images over the network.
Work is being done to alleviate at least one of the limitations you mentioned in RDP. The video problem is interesting too, and the general solution adopted by some folks (but not RDP afaik) is to send the compressed video data through a side channel and decompress it on the client side. This seems like the only workable design to me.
True, but we’re in 2008 now, not 1990, and no one is using 10baseT or 10base2 anymore.
If you do, well, get with the program.
Actually, I did not directly address this issue in my previous response to tyrione. X works about as well under 10baseT as 100baseT. People have a tendency to think in terms of bandwidth. X actually does quite well over limited bandwidth. It is latency which kills the standard (non-NX) X protocol. 10baseT has essentially the same latency as 100baseT. The most dramatic way to demonstrate standard X’s sensitivity to latency is to set up a PPP connection over an external modem. Start an application like Firefox and watch the lights. According to, I believe, Keith Packard some years ago (who had worked on LBX previously), approximately 90% of the round trips were actually unnecessary and an artifact of the then-current implementation, not inherent in the protocol itself. He was working on eliminating those unnecessary round trips. Not sure where we stand today.
However, it does not really matter, because for any sort of WAN connection we have NX.
I thought his point was that 10baseT/10base2 saturates quicker, given that X is more bandwidth-hungry, thus causing higher latency.
It takes a lot to saturate 10 Mbit. Watching a movie or playing software-rendered Quake will do it. But browsing the web with Firefox/Cairo? Not even close. Many of my users started out on 10baseT some years back and the difference is not even detectable when doing normal business desktop things.
Yes, if what you are doing is inherently high bandwidth the extra bandwidth of 100baseT will help. Otherwise, latency is the determining factor.
You are comparing apples to oranges. Even more of my users run from remote offices using NX (which is still really X). And believe me… NX blows RDP out of the water for both speed and quality over WAN connections. They are not even in the same class. Proxying RDP (or VNC) through an NX server helps some, but not that much. The OSX approach is nothing but VNC, which is noticeably poorer in quality and speed than even RDP.
X has XShm, which is optimal for local clients, straight X, which is optimal for LAN environments, and NX, which is optimal for WANs. No other windowing system can touch that combination for performance and flexibility.
Edit: I should mention that as amazing as FreeNX is… it has the rudest and most unhelpful support mailing list in all of Open Source, to the point that if I had another option that was even remotely as good I would migrate to it.
Let’s not compare OS X’s WindowServer to Xorg. Apple didn’t design WindowServer around the notion of a VPN/RDP approach.
I will be interested to see how Apple reimplements NXHost and beyond for 10.6 and the enterprise.
I think the article is exaggerating the problem.
True, transferring an image from the application to the X server takes quite some time and can be sped up using the shared memory extension.
However, an image does not have to be transferred every time it needs to be drawn.
As long as the application does not explicitly free the resources associated with the X server side image, it can be drawn or bitblitted with basically no delay.
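For illustration, a small sketch of that pattern in Xlib (the helper names and parameters are just examples, not from any real toolkit):

#include <X11/Xlib.h>

/* Upload an XImage into a server-side Pixmap once; afterwards it can be
 * blitted as often as needed without resending any image data. */
Pixmap upload_once(Display *dpy, Window win, GC gc, XImage *img,
                   unsigned int width, unsigned int height, unsigned int depth)
{
    Pixmap pm = XCreatePixmap(dpy, win, width, height, depth);
    XPutImage(dpy, pm, gc, img, 0, 0, 0, 0, width, height);  /* one socket copy */
    return pm;
}

/* Every later redraw is a cheap server-side blit. */
void redraw(Display *dpy, Pixmap pm, Window win, GC gc,
            unsigned int width, unsigned int height, int dst_x, int dst_y)
{
    XCopyArea(dpy, pm, win, gc, 0, 0, width, height, dst_x, dst_y);
}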
This is a key point!
The “idiocy” comment in the original posting was way off base. The XImage system actually makes a ton of sense. In most apps, you’re drawing the same images over and over again (sprites, chrome). It’s not a big deal to transfer them to the X server the first time they’re drawn. In a web-browser context you’re drawing different images all the time, but what do you think is the bottleneck: copying the image to the X server, or downloading it from the internet?
Where XShm helps is when you dynamically generate images for each frame, and as such are constantly uploading new images to the Xserver to draw. However, even in that case, it is unlikely to provide a huge speedup, since the overall time is probably dominated by creating the image, not copying it to the X server.
There’s an additional big problem with sockets: Once you send the image to the server, the client can’t “share” it, because it’s living in the server’s address space. So you have a copy of the image on the server AND exactly the same copy of the image on the client. The image is effectively using 2x its size in the memory.
The “right fix” for this stupid waste of resources is to free() the image on the client. With shared memory, you can SHARE the image’s memory between the client and the server.
That’s how things should work in a sane local graphics setup, but X is far from being “sane”. Yes, X.org IS efficient and comparable to other systems, but only because people have spent a lot of time working around the stupid things it does, so that in the end, on local systems, it works like other sane systems.
That’s the funny thing: X is supposed to be “network transparent”, but then the server and the client need to DETECT whether they’re in “local mode”, or whether a given extension is enabled or not, and make SPECIAL MODIFICATIONS to the code paths so that it works right. There’s no “transparency” anymore, as applications (toolkits) need to be aware of such low-level graphics details and code different paths according to the results. Many X extensions have had to be created, and applications have had to be MODIFIED to use those extensions… oh my god, it’s not even fun to think about it. And to think that backwards compatibility is all that was stopping people from doing it right…
How is any of this “stupid”? Isn’t that exactly how most software is supposed to be engineered? What do you expect from them? That shared memory somehow magically works over the network? What kind of answer would satisfy you?
No they don’t. “Transparent” means that you don’t *have* to do that. Even if you don’t use the special modifications, things will still work. It’s just that if you do use the special modifications, then things will be more efficient.
Really, what more do you want? It seems that no matter what solution is given, you will never be satisfied until X doesn’t work over the network at all. While that may be less “stupid” in your eyes, it’ll actually be a step backwards, because then all your apps are still working on the same speed, *and* they don’t work on the network anymore.
How is any of this “stupid”? Isn’t that exactly how most software is supposed to be engineered?
Certainly not GOOD software. Good software optimizes the common case and works around the corner cases. X optimizes for the corner case and works around the common one.
You call it a workaround, I call it the only way things can possibly be implemented given the feature set.
Clearly there’s nothing X can ever do to satisfy you, you already have your opinion ready regardless of what they do.
Note that sharing the image between the client and the server is not always the right thing to do, as setting up shared memory is costly.
So the wise thing to do is to copy small images and use shared memory only for big ones (a rough sketch of that heuristic follows below).
As for your rant against X, I’d say that ‘standard X’ is network transparent, but when you want to use local optimisations, then it isn’t network transparent anymore, which makes sense.
That’s why we have toolkits which hide this.
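A hedged sketch of that heuristic; the cutoff value is invented and shm_put_image() stands in for the MIT-SHM path sketched near the top of the thread:

#include <stddef.h>
#include <X11/Xlib.h>

/* Hypothetical helper wrapping the XShm setup shown earlier in the thread. */
void shm_put_image(Display *dpy, Drawable d, GC gc, XImage *img,
                   unsigned int w, unsigned int h);

/* Illustrative cutoff only; a sensible value would have to be measured. */
#define SHM_THRESHOLD_BYTES (64 * 1024)

/* Small images go over the socket, big ones through shared memory. */
void put_image_smart(Display *dpy, Drawable d, GC gc, XImage *img,
                     unsigned int w, unsigned int h)
{
    size_t bytes = (size_t)img->bytes_per_line * h;

    if (bytes < SHM_THRESHOLD_BYTES)
        XPutImage(dpy, d, gc, img, 0, 0, 0, 0, w, h);  /* copy is cheap enough */
    else
        shm_put_image(dpy, d, gc, img, w, h);          /* avoid the big copy */
}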
The client can’t share it? Wtf? Once you send it, it’s gone? Have you EVER dealt with sockets at all?
It’s not a big deal having it in memory twice. That certainly is not a bottleneck for 99% of all X apps out there nowadays.
The “right fix”… OMG…
Well, we are speaking about Cairo here right?
Cairo is basically a software renderer. That implies that you in fact do all your rendering in memory, then transfer the resulting image to X11.
That in turn implies that you are rarely able to “reuse” stored images. Each screen update means a new image and a new transfer.
IME, the difference in image transfer is about 2x. OTOH, the rendering itself can perhaps take quite a bit of time too, so in practice, maybe it really is not that critical.
BTW, you can try it:
x11perf -putimage500
x11perf -shmput500
Fedora 9 with all updates, radeon driver, AMD 3870, Phenom 9750
[storm@fast ~]$ x11perf -putimage500
x11perf – X11 performance program, version 1.5
The X.Org Foundation server version 10499902 on :0.0
from fast
Sun Jun 29 14:11:16 2008
Sync time adjustment is 0.0334 msecs.
2800 reps @ 2.0214 msec ( 495.0/sec): PutImage 500×500 square
2800 reps @ 1.9846 msec ( 504.0/sec): PutImage 500×500 square
2800 reps @ 1.9471 msec ( 514.0/sec): PutImage 500×500 square
2800 reps @ 1.9314 msec ( 518.0/sec): PutImage 500×500 square
2800 reps @ 1.9462 msec ( 514.0/sec): PutImage 500×500 square
14000 trep @ 1.9661 msec ( 509.0/sec): PutImage 500×500 square
[storm@fast ~]$ x11perf -shmput500
x11perf – X11 performance program, version 1.5
The X.Org Foundation server version 10499902 on :0.0
from fast
Sun Jun 29 14:12:17 2008
Sync time adjustment is 0.0305 msecs.
16000 reps @ 0.3851 msec ( 2600.0/sec): ShmPutImage 500×500 square
16000 reps @ 0.3684 msec ( 2710.0/sec): ShmPutImage 500×500 square
16000 reps @ 0.3661 msec ( 2730.0/sec): ShmPutImage 500×500 square
16000 reps @ 0.3684 msec ( 2710.0/sec): ShmPutImage 500×500 square
16000 reps @ 0.3680 msec ( 2720.0/sec): ShmPutImage 500×500 square
80000 trep @ 0.3712 msec ( 2690.0/sec): ShmPutImage 500×500 square
~5 times faster, not bad.
openSUSE 10.3, Intel Q6600, Nvidia8600 GTS nvidia binary drivers, 2560×1600 resolution, 32 bpp
x11perf -putimage500
x11perf – X11 performance program, version 1.5
The X.Org Foundation server version 70200000 on :0.0
from monsta
Sun Jun 29 14:38:21 2008
Sync time adjustment is 0.0413 msecs.
8000 reps @ 0.7415 msec ( 1350.0/sec): PutImage 500×500 square
8000 reps @ 0.8194 msec ( 1220.0/sec): PutImage 500×500 square
8000 reps @ 0.9253 msec ( 1080.0/sec): PutImage 500×500 square
8000 reps @ 0.8722 msec ( 1150.0/sec): PutImage 500×500 square
8000 reps @ 0.8803 msec ( 1140.0/sec): PutImage 500×500 square
40000 trep @ 0.8477 msec ( 1180.0/sec): PutImage 500×500 square
x11perf -shmput500
x11perf – X11 performance program, version 1.5
The X.Org Foundation server version 70200000 on :0.0
from monsta
Sun Jun 29 14:39:16 2008
Sync time adjustment is 0.0385 msecs.
12000 reps @ 0.4442 msec ( 2250.0/sec): ShmPutImage 500×500 square
12000 reps @ 0.4435 msec ( 2250.0/sec): ShmPutImage 500×500 square
12000 reps @ 0.4433 msec ( 2260.0/sec): ShmPutImage 500×500 square
12000 reps @ 0.4429 msec ( 2260.0/sec): ShmPutImage 500×500 square
12000 reps @ 0.4424 msec ( 2260.0/sec): ShmPutImage 500×500 square
60000 trep @ 0.4433 msec ( 2260.0/sec): ShmPutImage 500×500 square
Only two times faster. In other words, core X on a local server is already very fast and does not use TCP/IP when local.
Don’t get me wrong, I have used MIT-SHM in my programs for 14 years. But in optimization, you need to look at the total cost, not just optimize because you can. The actual time spent copying the image buffer is most likely negligible compared to generating the image data. So you will get a better overall speedup by optimizing other parts of the application.
In the old days, this data transfer was very slow, so optimizing it with MIT-SHM made a lot of sense. Nowadays, graphics cards are very fast (even at high resolutions like the ones I use), so optimizing with MIT-SHM does not buy you much (in total application speed).
My Pentium 4 system running OpenBSD 4.3 (r128 driver):
$ x11perf -putimage500
x11perf – X11 performance program, version 1.5
The X.Org Foundation server version 10400090 on :0.0
from xxx.xxx
Sun Jun 29 10:04:07 2008
Sync time adjustment is 0.0857 msecs.
320 reps @ 16.6892 msec ( 59.9/sec): PutImage 500×500 square
320 reps @ 16.6960 msec ( 59.9/sec): PutImage 500×500 square
320 reps @ 16.7091 msec ( 59.8/sec): PutImage 500×500 square
320 reps @ 16.7655 msec ( 59.6/sec): PutImage 500×500 square
320 reps @ 16.6939 msec ( 59.9/sec): PutImage 500×500 square
1600 trep @ 16.7108 msec ( 59.8/sec): PutImage 500×500 square
$ x11perf -shmput500
x11perf – X11 performance program, version 1.5
The X.Org Foundation server version 10400090 on :0.0
from xxx.xxx
Sun Jun 29 10:04:44 2008
Sync time adjustment is 0.0856 msecs.
1600 reps @ 3.4037 msec ( 294.0/sec): ShmPutImage 500×500 square
1600 reps @ 3.4092 msec ( 293.0/sec): ShmPutImage 500×500 square
1600 reps @ 3.4257 msec ( 292.0/sec): ShmPutImage 500×500 square
1600 reps @ 3.4064 msec ( 294.0/sec): ShmPutImage 500×500 square
1600 reps @ 3.4086 msec ( 293.0/sec): ShmPutImage 500×500 square
8000 trep @ 3.4107 msec ( 293.0/sec): ShmPutImage 500×500 square
I definitely see a speed increase. Stop thinking only about how much it’ll improve performance on modern systems and realize people DO use older systems. (Not that the Pentium 4 is ancient, but I don’t have my Alpha system nearby.)
EDIT: This does not in any way mean I’m against the networking principles of X, but for a local workstation that won’t be listening on TCP (i.e. -nolisten tcp), local optimizations are a good idea.
This is interesting, running Debian Lenny on a Core 2 Duo 2.4 GHz with integrated Intel GMA965:
$ time x11perf -putimage500
x11perf – X11 performance program, version 1.2
The X.Org Foundation server version 10400090 on :0.0
from yggdrasil
Sun Jun 29 17:02:43 2008
Sync time adjustment is 0.0228 msecs.
8000 reps @ 0.7075 msec ( 1410.0/sec): PutImage 500×500 square
8000 reps @ 0.6736 msec ( 1480.0/sec): PutImage 500×500 square
8000 reps @ 0.6596 msec ( 1520.0/sec): PutImage 500×500 square
8000 reps @ 0.6818 msec ( 1470.0/sec): PutImage 500×500 square
8000 reps @ 0.6738 msec ( 1480.0/sec): PutImage 500×500 square
40000 trep @ 0.6793 msec ( 1470.0/sec): PutImage 500×500 square
real 0m33.953s
user 0m9.421s
sys 0m9.337s
$ time x11perf -shmput500
x11perf – X11 performance program, version 1.2
The X.Org Foundation server version 10400090 on :0.0
from yggdrasil
Sun Jun 29 17:03:29 2008
Sync time adjustment is 0.0230 msecs.
8000 reps @ 0.9115 msec ( 1100.0/sec): ShmPutImage 500×500 square
8000 reps @ 0.9155 msec ( 1090.0/sec): ShmPutImage 500×500 square
8000 reps @ 0.9105 msec ( 1100.0/sec): ShmPutImage 500×500 square
8000 reps @ 0.9106 msec ( 1100.0/sec): ShmPutImage 500×500 square
8000 reps @ 0.9121 msec ( 1100.0/sec): ShmPutImage 500×500 square
40000 trep @ 0.9120 msec ( 1100.0/sec): ShmPutImage 500×500 square
real 0m43.488s
user 0m0.356s
sys 0m0.728s
Edit: Added ‘time’ to the commands.
I’ve been impressed with how far Xorg has moved their client/server system forward, but X11 definitely has room to grow and improve.
Still, it’s night and day from when X11 was around on DECstations.
Remember also that sending data over a Unix socket is not really as slow as people expect it to be. The cost of doing that, versus setting up a shared memory region and all of the synchronization that involves, is probably not very significant. And that’s why SHM isn’t used that much: it just doesn’t buy you a whole lot. It would be better to find ways to send the right data, and the right amount of it, to the server, instead of trying to figure out how to send huge amounts of redundant data.
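To put a rough number on that, here is a small, hedged benchmark sketch using a local AF_UNIX datagram socket pair; the message size and count are arbitrary and the result will vary a lot by machine:

/* Rough local-socket throughput sketch: how many small messages per second? */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, fds) < 0) { perror("socketpair"); return 1; }

    const int msgs = 100000;
    char buf[512];
    memset(buf, 'x', sizeof buf);

    if (fork() == 0) {                       /* child: drain every message */
        char rbuf[512];
        for (int i = 0; i < msgs; i++)
            if (read(fds[1], rbuf, sizeof rbuf) <= 0)
                break;
        _exit(0);
    }

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < msgs; i++)
        write(fds[0], buf, sizeof buf);      /* blocks when the buffer fills */
    wait(NULL);                              /* include the reader's drain time */
    gettimeofday(&t1, NULL);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%d messages of %zu bytes in %.3f s (%.0f msgs/s)\n",
           msgs, sizeof buf, secs, msgs / secs);
    return 0;
}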
The OP has the wrong end of the stick.
We have home-grown, socket-based inter-application middleware that manages over 30,000 messages per second between processes on the same machine. I hardly call that slow…
We use Cairo for cross-platform graphing. Win32 GDI using Cairo on XP is quite a bit slower in rendering and screen updates than X. While I realise that might be due to the Win32 backend, it is also possible that it is due to GDI.
Bottom line is that Shm, which has been around since 1991, is not used because it is not needed. Using it instead of sockets would break the ability for your app to work across the network.
The Cairo toolkit is extremely efficient. If you see FF taking some time to render pages, point the finger at your internet speed, or the speed of the site being rendered, instead.
It’s the Win32 backend that is poorly implemented, because it works in software-only mode, while the X11 backend uses XRender and is quite a bit faster.
Then obviously you have ignored the many posts on here which clearly state that if it is run across the network, it falls back to sockets. GTK+ makes use of SHM, and yet, can be used across a network. All the original author is saying is that like GTK+, if Cairo can use SHM, it should use it – otherwise, fall back to sockets when it can’t.
Is this the cause of the perceived slowness? No. Xorg has already provided fixes; XCB, for example, addresses a lot of them. The problem is that the projects currently using libX11 haven’t gotten their act together and moved to XCB yet.
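To illustrate the XCB point: XCB splits sending a request from waiting for its reply, so independent round trips can overlap instead of being serialized the way naive Xlib calls are. A minimal sketch (the atom names are just examples):

/* Two InternAtom requests in flight at once: roughly one round trip of
 * waiting instead of two. */
#include <stdio.h>
#include <stdlib.h>
#include <xcb/xcb.h>

int main(void)
{
    xcb_connection_t *c = xcb_connect(NULL, NULL);
    if (xcb_connection_has_error(c)) { fprintf(stderr, "cannot connect\n"); return 1; }

    /* Fire both requests first... */
    xcb_intern_atom_cookie_t ck1 = xcb_intern_atom(c, 0, 12, "WM_PROTOCOLS");
    xcb_intern_atom_cookie_t ck2 = xcb_intern_atom(c, 0, 16, "WM_DELETE_WINDOW");

    /* ...then collect the replies; the server handled them back to back. */
    xcb_intern_atom_reply_t *r1 = xcb_intern_atom_reply(c, ck1, NULL);
    xcb_intern_atom_reply_t *r2 = xcb_intern_atom_reply(c, ck2, NULL);

    printf("atoms: %u %u\n", r1 ? r1->atom : 0, r2 ? r2->atom : 0);
    free(r1);
    free(r2);
    xcb_disconnect(c);
    return 0;
}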
I admit it, I didn’t read the article. And I probably won’t, because I’m sick and tired of reading articles bashing X11, usually from people that have no idea what they’re talking about.
X11 is decades old, it was developed on *very* slow machines for today’s standards and it managed to run reasonably fast. Yes, X11 has all sorts of layers for interprocess communication to allow processes to run on one machine while seamlessly displaying their UI on another machine (even if that UI involves 3D), but over the years the platform has evolved numerous optimizations for the local case.
Performance problems in graphical applications are seldom caused by performance problems in the X Server itself. The origin is most likely the UI toolkit.
And I’m not saying the UI toolkits are bad, because X-based toolkits have been doing for years what Microsoft has only implemented in Vista. For example, dynamically resizable windows (i.e. container based dynamic layouts), whose increased complexity eats CPU cycles and makes optimization more difficult.
X11 shows its quality just by being used after all these years, while competing display servers have come and gone. It is a bit like unix really.
The article wasn’t bashing X, it was (politely) criticizing Cairo for not using a feature of X (SHM) which could result in an overall performance increase of about 20%, as proved by tests done by the author when he went and patched Cairo to actually USE said feature.
The article was about “Cairo doesn’t use X as well as it could, here’s how to fix it and what effect fixing it has” NOT “ZOMG X Suxx!”
Reading some of the comments, with Xorg now at 1.5 RC2, leads me to point out that I feel X was way, way, WAY ahead of its time, in both conceptual design and execution.
Having worked in software engineering and networking for over 20 years now, I believe the move toward smaller portable computing devices, like my eeePC for example, is the future.
For example, from the arguments put forth about the local and remote cases for display processing with regard to how X does things, it is quickly becoming clear that a remote facility for data, both video and audio, is going to be required. I mean, I am not going to lug my 12-pound, 8-core laptop along with me when all I need is an NX session to it from my eeePC 900.
Which, is what I do right now.
Performance is grand, anywhere I go.
I can pull up mplayer on my laptop, and start watching my favorite lizard blow the hell out of Tokyo, while using linphone to take a call from work.
All over NX.
I think the dual approach is best. Most of the arguments would seem to be how can we do either one BETTER.
This design though, network centric to be sure, is only now demonstrating its full potential, and it has taken some time for networks to catch up.
I think the discussions are good for the continued improvements that X has undergone.
I mean, lets not forget:
X broke free from the bonds of XFree86’s stifling influence after many people, including me and many others on this list, recognized that X was moving way too slowly to assault the Microsoft desktop.
X is rapidly changing to meet GNOME and KDE requirements through input from freedesktop.org.
Corporations like ATI are recognizing the above advantages of X and are providing open source drivers, because their customer bases also recognize these facts and demand that the hardware they buy works 100% feature-wise on GPL software.
We are making progress on this stuff you know. In the past year I would say the progress has moved from fast to BLAZING WARP SPEED.
Now, all we need is to get Blizzard to make a native version of Diablo 3 for Linux.
I bet that would help the cause a bit… especially since I am almost sure that, neck and neck, Linux will kick Windows’ arse in display quality and speed if we had the drivers to the hardware opened up.
Now that we do, well….perhaps now is the time?
-Hack
Sockets are essentially a memcpy() in the kernel. They’re quite cheap when amortized over blitting the same image to the screen repeatedly.
Shm leads to hard synchronization issues, which could lead to messy pixmap stomping if the client/X server isn’t careful.
But more importantly, the SHM data *must* be kept in the shared buffer. This doesn’t sound so important until you realize that the image *doesn’t* want to live in the shared buffer when it’s being accelerated. What does this mean?
Well, if you’re using the SHM extension, you’re essentially forcing the X server to draw in software (not that that’s a change from the poorly accelerated proprietary drivers he was using, anyway).
If you upload images and paint them repeatedly, the good old upload-over-a-socket allows X to put the pixmaps wherever it wants to get best performance. This is really why XSHM is discouraged these days.
I don’t understand your point about shared memory preventing hardware optimisation: doesn’t the extension just say that, in the end, whoever creates the data must put it at this ‘shared memory’ location?
If the GPU can read or write at this memory location (depending on the operation, of course), why couldn’t there be hardware acceleration?
Because if you can’t move the pixmap to fast video RAM for accelerated hardware operations, you’ve already lost. Although just using XShmPutImage can sometimes be a win, it isn’t always, because chances are you’ll be copying or moving the memory anyway.
The X SHM extension – the way it’s normally used – makes it very hard to do this efficiently, if not entirely impossible. So XShm beats traditional XPutImage if you only care about a single blit, but if you’re reusing the pixmap it can be a net loss, depending on your drivers (the better the drivers, the bigger the loss).
In fact, it’s so much worse that EXA (the new Xorg acceleration architecture) actually disables parts of the extension (shared pixmaps) so that it can properly accelerate 2D operations.
Fedora 8, all patches, KDE3 (so kwin and *NO* composite/XGL/AIGLX), nvidia GeForce 8800 GTS 512 (GPU 0), proprietary drivers (version 173.14.09), X server vendor string: 1.3.0
$ uname -a
Linux panzi 2.6.25.6-27.fc8 #1 SMP Fri Jun 13 16:38:52 EDT 2008 i686 i686 i386 GNU/Linux
$ x11perf -putimage500
x11perf – X11 performance program, version 1.5
The X.Org Foundation server version 10300000 on :0.0
from panzi
Mon Jun 30 21:41:07 2008
Sync time adjustment is 0.0360 msecs.
8000 reps @ 1.0904 msec ( 917.0/sec): PutImage 500×500 square
8000 reps @ 1.0871 msec ( 920.0/sec): PutImage 500×500 square
8000 reps @ 1.0928 msec ( 915.0/sec): PutImage 500×500 square
8000 reps @ 1.0950 msec ( 913.0/sec): PutImage 500×500 square
8000 reps @ 1.0959 msec ( 913.0/sec): PutImage 500×500 square
40000 trep @ 1.0922 msec ( 916.0/sec): PutImage 500×500 square
$ x11perf -shmput500
x11perf – X11 performance program, version 1.5
The X.Org Foundation server version 10300000 on :0.0
from panzi
Mon Jun 30 21:42:12 2008
Sync time adjustment is 0.0356 msecs.
12000 reps @ 0.6102 msec ( 1640.0/sec): ShmPutImage 500×500 square
12000 reps @ 0.6031 msec ( 1660.0/sec): ShmPutImage 500×500 square
12000 reps @ 0.6065 msec ( 1650.0/sec): ShmPutImage 500×500 square
12000 reps @ 0.6037 msec ( 1660.0/sec): ShmPutImage 500×500 square
12000 reps @ 0.6065 msec ( 1650.0/sec): ShmPutImage 500×500 square
60000 trep @ 0.6060 msec ( 1650.0/sec): ShmPutImage 500×500 square
so about twice as fast