Speaking of Steam, the Linux version of Valve’s gaming platform has just received a pretty substantial set of fixes for crashes, and Timothee “TTimo” Besset, who works for Valve on Linux support, has published a blog post with more details about what kind of crashes they’ve been fixing.
The Steam client update on November 5th mentions “Fixed some miscellaneous common crashes.” in the Linux notes, which I wanted to give a bit of background on. There’s more than one fix that made it in under the somewhat generic header, but the one change that made the most significant impact to Steam client stability on Linux has been a revamping of how we are approaching the setenv and getenv functions.
One of my colleagues rightly dubbed setenv “the worst Linux API”. It’s such a simple, common API, available on all platforms that it was a little difficult to convince ourselves just how bad it is. I highly encourage anyone who writes software that will run on Linux at some point to read through “RachelByTheBay”’s very engaging post on the subject.
↫ Timothee “TTimo” Besset
This indeed seems to be a specific Linux problem, and due to the variability in Linux systems – different distributions, extensive user customisation, and so on – debugging information was more difficult to parse than on Windows and macOS. After a lot of work grouping the debug information to try and make sense of it all, it turned out that the two functions in question were causing issues in threads other than those that used them.
They had to resort to several solutions, from reducing the reliance on setenv and refactoring it with execvpe, to reducing the reliance on getenv through caching, to introducing “an ‘environment manager’ that pre-allocates large enough value buffers at startup for fixed environment variable names, before any threading has started”. It was especially this last one that had a major impact on reducing the number of crashes with Steam on Linux.
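To make the "environment manager" idea more concrete, here is a minimal sketch in C of how such pre-allocation could work. It is not Valve's code: the variable names, the 256-byte capacity, and the functions env_manager_init and env_manager_set are all hypothetical. The sketch relies on the fact that putenv() stores the pointer it is given, so later updates can overwrite the value in place without libc reallocating anything.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ENV_NAME_CAPACITY  64   /* assumed upper bound on name length  */
#define ENV_VALUE_CAPACITY 256  /* assumed upper bound on value length */

/* Hypothetical fixed set of variables the application will modify later. */
static const char *const managed_names[] = { "APP_OVERLAY", "APP_LIBRARY_PATH" };
enum { MANAGED_COUNT = sizeof managed_names / sizeof managed_names[0] };

/* One "NAME=value" buffer per variable; never freed, never moved. */
static char env_slots[MANAGED_COUNT][ENV_NAME_CAPACITY + 1 + ENV_VALUE_CAPACITY + 1];

/* Call once from main(), before any thread is created.
 * Note: a variable that was unset becomes set to the empty string. */
int env_manager_init(void)
{
    for (int i = 0; i < MANAGED_COUNT; i++) {
        const char *current = getenv(managed_names[i]);
        if (current == NULL)
            current = "";
        if (strlen(managed_names[i]) > ENV_NAME_CAPACITY ||
            strlen(current) > ENV_VALUE_CAPACITY)
            return -1;
        snprintf(env_slots[i], sizeof env_slots[i], "%s=%s",
                 managed_names[i], current);
        /* putenv() keeps this pointer in environ instead of copying it. */
        if (putenv(env_slots[i]) != 0)
            return -1;
    }
    return 0;
}

/* Overwrites the value inside the pre-allocated slot. The environ array and
 * the slot address stay untouched, so no memory is freed under a reader. */
int env_manager_set(const char *name, const char *value)
{
    assert(name != NULL && value != NULL);
    size_t len = strlen(value);
    if (len > ENV_VALUE_CAPACITY)
        return -1;
    for (int i = 0; i < MANAGED_COUNT; i++) {
        if (strcmp(name, managed_names[i]) == 0) {
            char *dst = strchr(env_slots[i], '=') + 1;
            memcpy(dst, value, len + 1);  /* includes the terminating NUL */
            return 0;
        }
    }
    return -1;  /* not a managed variable */
}
```

A concurrent getenv() caller may still observe a half-written value with this approach, so it trades crashes for a much milder race rather than providing full thread safety.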
Besset does note that these functions are still used far too often, but that at this point it’s out of their control because that usage comes from the libraries of the operating system, like x11, xcb, dbus, and so on. Besset also mentions that it would be much better if this issue could be addressed in glibc, and in the comments, a user by the name of Adhemerval reports that this is indeed something the glibc team is working on.
I tend not to use environment variables much in my own software, but I appreciate the fact that many libraries do and this is a valid concern there.
They also allude to the behavior of multithreading and forking, which again is a Unix-specific issue. Those of us coming from Windows usually look for some kind of spawn syscall, but Linux doesn’t have one. Instead, the spawning functions defined by POSIX are implemented as wrappers on top of the fork or clone syscalls in Linux. Furthermore, while fork works fine for academic examples, in large production processes (think of something like a database or web browser), forking can be quite inefficient. Not only are there performance concerns, but forking a multithreaded program can cause undesirable side effects.
Another problem I only encounter with Linux software is file handles inadvertently passed to children. Linux does not offer a direct way to pass specific file handles to child processes, and all handles get passed by default. This makes it a bad API IMHO. File handles can be trivially leaked without anyone noticing. This might have security implications. Personally, I try to remember to set O_CLOEXEC consistently on every single file handle and socket that I open, but it’s easy to miss in code using default flags, and most software developers rely on 3rd-party libraries that might not set O_CLOEXEC or expose handles at all. There is no good solution to this, and as a result I’ve seen some code using hacks like this before calling exec: “for(int i=0; i<1024; i++) close(i)”, just in case something is holding on to file handles without our knowledge and we don’t want to pass them along.
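For reference, marking a descriptor close-on-exec looks roughly like this in C. This is a minimal sketch: open_private and set_cloexec are hypothetical helper names, while O_CLOEXEC, FD_CLOEXEC, and fcntl() are the standard POSIX pieces.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Opens an existing file so that the descriptor is closed automatically on
 * exec. Not meant for O_CREAT, which would also need a mode argument. */
int open_private(const char *path, int flags)
{
    assert(path != NULL);
    return open(path, flags | O_CLOEXEC);
}

/* Marks an already-open descriptor (for example one handed back by a
 * library) as close-on-exec. Returns 0 on success, -1 on error. */
int set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1 ? -1 : 0;
}
```

Setting the flag at open time is preferable where possible, because the separate fcntl() call leaves a window in which another thread can fork and exec with the descriptor still inheritable.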
> look for some kind of spawn syscall, but linux doesn’t have one
You seem to have missed out on this for 23 years: https://man7.org/linux/man-pages/man3/posix_spawn.3.html
It also solves your fd problem.
js,
I’m not sure if you realize that I specifically mentioned the POSIX spawn functions in my post. It’s not a syscall, but a wrapper for fork or clone on linux. It provides a different API, but doesn’t fundamentally do anything differently than the fork & exec sequence and uses the same syscalls under the hood.
From your link…
A dedicated syscall for spawning new processes could be more efficient than fork/clone, but linux only has wrappers built on top of fork & clone.
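For readers who haven't used the API being discussed, here is a minimal C sketch of posix_spawnp with a file action that redirects the child's stdout. The helper name spawn_with_stdout is hypothetical. Note that the file actions only describe what to dup2 or close; descriptors that are not mentioned and not marked close-on-exec are still inherited.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Runs argv[0] (looked up via PATH) with out_fd as its stdout and waits for
 * it. Returns the child's exit status, or -1 on any failure. */
int spawn_with_stdout(char *const argv[], int out_fd)
{
    assert(argv != NULL && argv[0] != NULL);

    posix_spawn_file_actions_t actions;
    pid_t pid;
    int status;

    if (posix_spawn_file_actions_init(&actions) != 0)
        return -1;
    /* In the child: dup2(out_fd, STDOUT_FILENO) before the exec step. */
    if (posix_spawn_file_actions_adddup2(&actions, out_fd, STDOUT_FILENO) != 0) {
        posix_spawn_file_actions_destroy(&actions);
        return -1;
    }

    int rc = posix_spawnp(&pid, argv[0], &actions, NULL, argv, environ);
    posix_spawn_file_actions_destroy(&actions);
    if (rc != 0)
        return -1;

    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```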
js,
For fun, I did look at how glibc’s own code tackles the FD problem.
https://codebrowser.dev/glibc/glibc/sysdeps/unix/sysv/linux/closefrom_fallback.c.html
Interestingly glibc provides two implementations: one that opens and scans the proc file system to detect open files in order to close them before calling exec, and another that uses the for loop approach I alluded to before…
So glibc requires the same sort of workarounds that other fork users need to do in order to overcome all file handles being passed to child processes by default. Even though they went through the effort to create a new API, I find it regrettable that the POSIX API’s FILE_ACTIONS is a dirty abstraction that dances around fork’s behavior. After all, programmers and users are naturally thinking in terms of which FDs need to be passed into a child process, everything else should be closed. APIs that force us to think in terms of which FDs need to be closed have got it completely backwards. Alas, this backwards way of thinking is now the standard 🙁
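The two strategies can be sketched in C roughly as follows. This is an illustration of the idea only, not glibc's actual code, and close_fds_from is a hypothetical name. It scans /proc/self/fd when that is available and otherwise falls back to the brute-force loop. Newer kernels also offer the close_range(2) syscall, which this sketch does not use.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <stdlib.h>
#include <unistd.h>

/* Closes every open descriptor numbered lowfd or higher.
 * Caveat: opendir() allocates memory, so this sketch is not safe to call
 * between fork() and exec() in a multithreaded program. */
void close_fds_from(int lowfd)
{
    assert(lowfd >= 0);

    DIR *dir = opendir("/proc/self/fd");
    if (dir != NULL) {
        int dir_fd = dirfd(dir);  /* the scan's own descriptor; closed last */
        struct dirent *entry;
        while ((entry = readdir(dir)) != NULL) {
            char *end;
            long fd = strtol(entry->d_name, &end, 10);
            if (end == entry->d_name || *end != '\0')
                continue;  /* skips "." and ".." */
            if (fd >= lowfd && fd != dir_fd)
                close((int)fd);
        }
        closedir(dir);
        return;
    }

    /* Fallback when /proc is unavailable: try every possible descriptor. */
    long max_fd = sysconf(_SC_OPEN_MAX);
    if (max_fd < 0)
        max_fd = 1024;  /* assumed default when the limit is indeterminate */
    for (long fd = lowfd; fd < max_fd; fd++)
        close((int)fd);
}
```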
Just don’t fork() around with getenv/setenv!
Nico57,
Most programs can just getenv when setting up before multitasking in the main loop. They will probably never experience corruption. But I agree with the author that there shouldn’t be a fault and the use of these functions should be well defined even for MT software.
Today I went back to Linux again after another exodus to OS/2 and BeOS. This time full out: 8x7900XTX and quad EPYC CPUs (not the fastest in single core, but the fastest so far in multithreaded), 128 cores x4, with RAM amounts I could not even have had as storage before.
Yeah, I still have to use my ThinkPad to work on the system, so it is not for “me” per se, but now I am forced to finally move off ArcaOS, sadly… since this is the end. The end of “stuff working” on old systems, and especially 32-bit ones, unless something happens soon.
To be fair, Linux already supports more Windows games than Windows. But this is great.
Source? Also, when it comes to supported games that explicitly list Windows 11 in the “system requirements” box , it’s fair to say Windows has the advantage.
In plain English, sure, you can play abandonware Windows 98 games from 25 years ago with Linux, but when most people say “supports”, they typically don’t include abandonware stuff from 25 years ago, so your comment is misleading. Try to play a Denuvo-encrusted game on Linux and report back. And yes, people do buy Denuvo games and want to run them.
How old is your copy of Call of Duty 2? How old is your DVD copy of Warpath? How old is your DVD copy of Turning Point? How old is your DVD with “The Civil War” 1+2? Sure, they are all just about the best games ever made, and no, none of them work on Windows 10 or 11. CoD has a pirate workaround; good luck finding that for the others.
Yup, all of them work in basic Wine, not even needing Proton for those. And yeah, try the thousands of games made for the 16-bit version of Windows… yup, almost all work just fine. Source? Pathetic. Just try it. You might like it.
And so far as I know (very little, it appears), no Windows version can run 8-bit software without emulation. Granted, Wine cannot do that either, but we are comparing oranges to elephants at this point.
If you mean DOS that is also 16-bit, there was never an 8-bit IBM PC or OS.
I’ve fixed this problem in Firefox some time back by using function interposition:
https://bugzilla.mozilla.org/show_bug.cgi?id=1752703
https://hg.mozilla.org/mozilla-central/rev/1760c9f902bf
https://hg.mozilla.org/mozilla-central/rev/79dc5e93cef4
Since we can’t change all our dependencies (they are too many) I’ve slotted some thread-safe environment manipulation functions between the code calling them and the libc implementations. This way the calling code is transparently forced to acquire a lock before manipulating the environment and requires no changes (and no adjustments for forked off processes). This is achieved by linking our own library (mozglue) before libc on Linux, and by hooking dlsym() on Android where libc might have been loaded before we get a chance to load our code. I’ve eliminated all crashes related to environment manipulation with this trick, with the only exception of cases where glibc calls those functions internally, but that’s a genuine glibc bug that needs to be fixed upstream.
Nice! I was wondering if something like that would be a viable solution. I’m kind of tempted to go back and look at some older unexplained crashes in my code bases to see if they were related at all to setenv.
crystall,
Yeah, obviously something had to be done to fix it in FF. I kind of feel that it should be fixed in one spot: glibc. Not fixing it in glibc means that thousands of projects may eventually encounter the same bug at some point and every one of them will have to implement their own workaround.