Killing a process and all of its descendants

Thom Holwerda 2019-08-03 Unix 16 Comments

Killing processes in a Unix-like system can be trickier than expected. Last week I was debugging an odd issue related to job stopping on Semaphore. More specifically, an issue related to the killing of a running process in a job. Here are the highlights of what I learned.

Interesting technical dive.

About The Author

Thom Holwerda

Follow me on Mastodon @thomholwerda@exquisite.social

16 Comments

2019-08-04 1:00 am
Alfman verbose=1
This topic leads to some of my gripes with unix under the hood: terminal level semantics being hard coded in the kernel. Linux inherited a mess of unnecessary complexity. I wish we could gut all this ugliness out and allow the terminal shells to manage and define job control primitives.
Most developers can safely ignore this stuff, but having worked on terminal shells and init systems years ago for my linux distro, I was thoroughly disappointed with how badly job control is implemented.

2019-08-05 6:11 am
Brendan
LOL.
A few weeks ago (on Linux) I tried doing “worker thread gets job from job queue; adjusts its thread priority to suit that job; does the job; then restores its thread priority to default”. Sounds like something that should be relatively common and almost trivial, doesn’t it?

2019-08-04 2:26 am
sukru
cgroups and containers (mostly) fix the issue:
https://linuxaria.com/article/how-to-manage-processes-with-cgroup-on-systemd
For local processes, cgroups can give sufficient isolation
For more complex cases, the process tree can be encapsulated in a container.
Linux tends to use processes more like threads, and process groups behaves similar to processes in other operating system.

2019-08-04 3:31 am
oiaohm
Its a lot more of a pain in tail than cgroups can fix.
https://lwn.net/Articles/794710/
The PID and pgid as a number idea is borken to start of with. pidfd under Linux will start fixing this.
Fun issue you do as igor suggest of using kill on a pgid the result can be horrible you can have application crash and another start in it place in the middle. So now you killed some bystander application that just happen to start at the wrong time. Same applies to PID and SID(session id as well)
PID, PGID, SID and TTY can all be changed when a program executes another program. Cgroups provide tracking in cases where the PID/GGID/SID/TTY have changed. Cgroups fixs the tracking problem not so much on the killing problem. Yes systemd cgroup system it still possible to have a signal delivered to wrong process so they need to implement pidfd at some point..
Pidfd cannot be used for pgid, sid or tty killing at this stage.
ideal world we would have pgidfd, sidfd and ttyfd for delivering signals as well as pidfd.
The important workflow thing is without pidfd.
You find the process you want to kill.
You pray it not died and been replaced.
You send kill signal with fingers crossed.
With pidfd
You find process you think you want to kill.
You open pidfd on that process id.
You check that process by pidfd that it is the right process if it changed you can abort.
You finally send kill signal and it will go to the right process.
Unix always said everything was a file. That was everything bar PID, PGID, SID, TTY basically everything you need for process management was not a file and does not work right.
cgroups fixes half of the problem that half being able to track what processes X service/user started. KIlling this correctly is something that is not implemented yet commonly yet pidfd is a start to fixing this properly..
Yes the Posix standard is busted here. So when someone say they have make a init/service management system using only posix functions you know it broken.
2019-08-04 4:14 am
kwan_e
sukru,
Linux tends to use processes more like threads
Linux uses the same clone system call to implement fork and pthreads, I think.

2019-08-04 8:33 am
Alfman verbose=1
kwan_e,
Linux uses the same clone system call to implement fork and pthreads, I think.
Though ambiguous, I think that might be what sukru meant when saying processes are like threads.
All of us have brought up areas where linux could be better engineered. With forking (or cloning), the unix way seems cool when you learn how it works and is used. Passing file handles and program state to children has a naturally intuitive flow to it. However it can become grating the more we encounter some of the complexities it causes over time. Forking processes becomes inefficient particularly when parent processes are large. Also it is difficult to control the file handles a child process will receive because kernel handles are passed implicitly rather than explicitly. Consider how unreasonable this design is in multithreaded code where one thread has no control over the handles being open & closed by other threads.
Most (all?) linux functions that return handles have been rewritten to accept a CLO_EXEC flag and one can use that in one’s own code to explicitly tell the kernel to close handles after ‘exec’ calls. (Have a look at “accept” for example: https://linux.die.net/man/2/accept ) This helps, however unfortunately it defaults to off to be compatible with POSIX semantics. So sometimes it’s very hard to explicitly pass only the handles you want.
oiaohm brings up another design flaw where fork and friends returns a PID instead of a kernel handle. This leads to implicit race conditions whereby a process can technically hold onto PIDs that no longer refers to the process instance it is supposed to. The exact risk of the problem depends on several conditions, however the limited accuracy of sending signals and killing processes in unix is concerning.
People are critical of win32 CreateProcess for being less elegant, which is true, but one redeeming quality is that it has none of these underlying design problems. For example, it returns process & thread handles to the parent rather than PIDs.
https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/ns-processthreadsapi-process_information
For better or worse, the fixes we get in linux leave the legacy baggage in place and the workarounds that we engineer to address them tends to add complexity and causing confusion especially for inexperienced developers who aren’t intimately aware of the subtle issues that existed.
Sometimes I wish we could set out with a clean slate and re-engineer things to avoid earlier mistakes. This was something plan9. engineers and other projects set out to do, but as with plan9 it’s extremely unlikely any such effort could reach critical mass and replace the defacto programming paradigms for FOSS software development which are based on linux. For me and probably many others, linux is “good enough”.

2019-08-04 8:58 am
kwan_e
Alfman,
People are critical of win32 CreateProcess for being less elegant, which is true, but one redeeming quality is that it has none of these underlying design problems. For example, it returns process & thread handles to the parent rather than PIDs.
https://blogs.msdn.microsoft.com/wsl/2016/05/23/pico-process-overview/
It’s funny, though, that they left an implementation of fork in the NT kernel, and just brushed off the cobwebs when they implemented WSL.
However, having handles has its own problems. It’s been a while since I programmed Win32, but aren’t all handles reference counted? Dropping a handle without closing it, which happens more often than PIDs getting recycled, leads to resource exhaustion in the kernel, which is worse.

2019-08-04 12:30 pm
oiaohm
>> However, having handles has its own problems. It’s been a while since I programmed Win32, but aren’t all handles reference counted? Dropping a handle without closing it.
Linux with file handles has it own systems for dealing with these problems already. People opening files and failing to close them the Linux kernel has a evil solution its called give your application score value to make it more likely for OOM Killer will out take your process if resource exhaustion happens. The problem you are referring to with windows is really a lack of good handling of out of resources events. Systems like Linux have this well handled. This means the biggest problem is the fact PID does the wrong things.
>> which happens more often than PIDs getting recycled
Very bad presume here. This is totally missing something about PID.
PID recycling happens a lot on Linux systems. If you do a “cat /proc/sys/kernel/pid_max” on most Linux systems you get the value of 32768 as the total number of PID that exist. So 1 to 32768 and that it. Some systems you have started to recycle PID numbers before the boot is even complete and before you have in fact logged in.
Its more often luck than good management that a kill to a process that has exited does not hit another process that has started under Linux.
pid max does put a upper bounds on how badly the PID table can grow it also means the space until stuff starts getting recycled is also quite small. PID on Linux is a very small pool to the point it that recycling can have started inside the first 2 min from userspace being started up. If the Linux kernel was like first Unix where when you run out of PID as soon as you reached even lightly used systems would hit PID max inside 8 hours because the value is so small.
Windows management of processes has a problem of no proper OOM killer to deal with research starvation or proper like ulimits in places to prevent it.
Remember ulimit under Linux as well you take out a filehandle you don’t close it the thing keeps on cutting into the max number of files the process is allowed. So application not being clean with usage of files normally ends up in trouble before it can harm the system. Where is the windows equal of ulimit on the number of process handles an application can in fact have open?
>> aren’t all handles reference counted?
Here is a sneaky underhanded stunt for Linux/posix file handles. You can avoid reference counting to a point.. Each process has its own file table. This has a link to the system handle. System handle goes away update each processes file table to point to null. Yes /dev/null it has a use. You attempt to send a signal to /dev/null it not going to work. There is no process 0.
Yes file handle is reference counted system wide. But system-wide can also close the handle completely and disconnect all processes connection to it even if they still have a entry in that process file table that now points to /dev/null instead of what ever resource went by by. Of course /dev/null does not bother being referenced counted as it always exists. Dropping a file handle without sending the system a close signal does not mean the Linux kernel cannot close that file handle system wide and basically dead end all users of it.
See windows handling of handles has a issue. If windows include a means to leave a dead end applications on handle they would not have the kernel level resource starvation problem as simply. Linux/Posix world did handling of handles a little better than windows.

2019-08-05 9:21 am
tut
How is this any different than fork/exec on Linux? If you don’t call wait/waitpid() on your child process id, you’re going to leak. It’s actually worse, because you have a process id that’s ref-counted, whereas with handles everyone knows that you have to close them.

2019-08-04 12:02 pm
acobar
I really find it very convenient that finishing a process do not automatically kill its descendants as we can let remote process running on a server after we finish the session. Perhaps, a good example is how we can “detach” a tmux or screen sessions and “attach” it later?
Aside the fact that the author really should put more attention to the manuals, because all that surprised him is documented, what would you do differently? I find fork, exec*, the various pthread_*, mutexes and semaphores to be very convenient, well thought and complementing set. Usually, I tend to assign the complexity of them to the non-trivial nature of concurrency. Also, we should remember that the passing all file handles were a convenient way to deal with the traditional unix way of concatenating (pipping) commands and be able to control the processes doing it.
I understand that bit-rot is a problem that must be addressed but how to handle it is a very complex problem that spans “generations” for such kind of subsystems and clean sheet proposals is a kind of luxury that killed many efforts of re-hauling what was seen as “lacking” on computing.

2019-08-04 3:56 pm
Alfman verbose=1
acobar,
I really find it very convenient that finishing a process do not automatically kill its descendants as we can let remote process running on a server after we finish the session. Perhaps, a good example is how we can “detach” a tmux or screen sessions and “attach” it later?
Sure, I don’t think kill should terminate descendants by default. The problem is using PIDs as handles when there’s no guaranty that the PID you obtain/lookup will still refer to the correct process when you send it a signal.
I find fork, exec*, the various pthread_*, mutexes and semaphores to be very convenient, well thought and complementing set. Usually, I tend to assign the complexity of them to the non-trivial nature of concurrency. Also, we should remember that the passing all file handles were a convenient way to deal with the traditional unix way of concatenating (pipping) commands and be able to control the processes doing it.
Fork is clever. It works great in academic examples, and sometimes even works ok in practice. In many cases you can use multithreaded paradigm instead of forking, which avoids some of fork’s problems….but not all of them. I alluded to this before, but spawning a child process becomes less and less efficient on linux the larger the parent process becomes. Fork does not scale well.
Take a look at posix_spawn:
https://linux.die.net/man/3/posix_spawn
An optimal implementation does not have to clone to parent’s paging tables to start a completely new process. The parent’s consumption is 100% irrelevant to the child. However the linux implimentation under the hood is implemented by cloning the parent’s memory (and overcommitting in the process), making it impossible to spawn a child off a large parent efficiently.
Let’s pretend you’ve got a large multithreaded process on the order of gigabytes, let’s say 8GB. Let’s say it’s a database engine and that it needs to periodically run small cronlike jobs now and then for maintenance tasks. When it’s time for the database to spawn a maintenance task, in addition to creating the new child process, all of the parent’s memory has to be shared with the child by setting up MMU COW. This involves read/writing several megabytes worth of the parent’s page tables and copying them into child’s page tables. that’s going to get thrown out the moment the new process calls “exec”. In all likelihood, the child probably doesn’t need more than 1k of data between the calls to fork and exec, the rest is a complete waste. We are quite spoiled with modern hardware processing tons of data in the blink of an eye., but still the 99.99998% inefficiency makes me wince..
The inefficiency is well understood and workarounds like vfork have been proposed in the past to cope with fork’s poor scalability, however vfork itself is widely considered a hack and has become obsolete. I’m not terribly optimistic fork’s problems will be permanently addressed in linux because despite it’s complications, it’s considered good enough and replacing all the legacy code is probably unfeasible.
http://man7.org/linux/man-pages/man2/vfork.2.html
Some consider the semantics of vfork() to be an architectural
blemish, and the 4.2BSD man page stated: “This system call will be
eliminated when proper system sharing mechanisms are implemented.
Users should not depend on the memory sharing semantics of vfork() as
it will, in that case, be made synonymous to fork(2).” However, even
though modern memory management hardware has decreased the
performance difference between fork(2) and vfork(), there are various
reasons why Linux and other systems have retained vfork():
* Some performance-critical applications require the small
performance advantage conferred by vfork().
* vfork() can be implemented on systems that lack a memory-
management unit (MMU), but fork(2) can’t be implemented on such
systems. (POSIX.1-2008 removed vfork() from the standard; the
POSIX rationale for the posix_spawn(3) function notes that that
function, which provides functionality equivalent to
fork(2)+exec(3), is designed to be implementable on systems that
lack an MMU.)
* On systems where memory is constrained, vfork() avoids the need to
temporarily commit memory (see the description of
/proc/sys/vm/overcommit_memory in proc(5)) in order to execute a
new program. (This can be especially beneficial where a large
parent process wishes to execute a small helper program in a child
process.) By contrast, using fork(2) in this scenario requires
either committing an amount of memory equal to the size of the
parent process (if strict overcommitting is in force) or
overcommitting memory with the risk that a process is terminated
by the out-of-memory (OOM) killer.

2019-08-04 8:23 pm
acobar
I know that you are exemplifying the kind of limitations fork() may suffer but, as I see it, your example is one case where using threads make more sense. To be true, on most problems people complain about fork() syscall it, regularly, is a case of not picking the right tech for the job.
fork() has all the problems you said but since COW was implemented (a long time ago) what is recommended, if you really have a need for a whole new process, is to do the minimum async operations you can if you want to spawn a new process, usually by execve() to avoid the penalties you described and some other annoyances. But then again, people usually ignore recommendations, many times by not being aware of them, granted.
One of my pet peeves when talking about the problems developers face is how frequent we ask for simple solutions for complex problems, and from my POV async operations and concurrency happen to be the most complex cases, somehow “longing” for complete (i.e. solutions contempling all cases) and optimum solutions. This is just not reasonable when we remember who we are and the limited time we have.

2019-08-04 7:11 pm
fabrica64
>> For example, when I SSH into a server, start a process, and exit, the started process is killed.
In Linux if the process is started in background (with &) this is not always true.
Exiting bash using the “exit” internal command or CTRL-D there’s no SIGHUP sent to the process, SIGHUP is sent only if the terminal is closed forcing bash to quit
2019-08-04 10:31 pm
Alfman verbose=1
acobar,
I know that you are exemplifying the kind of limitations fork() may suffer but, as I see it, your example is one case where using threads make more sense. To be true, on most problems people complain about fork() syscall it, regularly, is a case of not picking the right tech for the job.
In general, there’s no inherent reason a large process shouldn’t be able to spawn child processes efficiently though. That’s a byproduct of the linux implementation. It’s not unusual for the UI application spawn off background child processes, they might not even be written in the same programming language. There are times when it makes sense to use threads, there are other times where it makes sense to spawn child processes (say to execute under a different security context), but to suggest one should use threads and avoid spawning child processes because the implementation on linux is inefficient is kind of disappointing to me.
fork() has all the problems you said but since COW was implemented (a long time ago) what is recommended, if you really have a need for a whole new process, is to do the minimum async operations you can if you want to spawn a new process, usually by execve() to avoid the penalties you described and some other annoyances. But then again, people usually ignore recommendations, many times by not being aware of them, granted.
Hmm, execve() doesn’t spawn a new process. Unlike “posix_spawn” calling “exec*” does NOT return, it wipes out the caller with a new program keeping the same PID and open files handles etc. Execve() is one of many variants used for the second half of the “fork/exec” usage pattern.
If you need to have both a parent and child exist simultaneously, then you need to call fork (which calls clone in linux) to generate the new PID before you can use exec* to load a program.
The posix_spawn call was created to eliminate the need to clone the parent at all. The problem is that linux doesn’t support it natively and so code using posix_spawn gets converted to the less efficient fork/exec under the hood.
One of my pet peeves when talking about the problems developers face is how frequent we ask for simple solutions for complex problems, and from my POV async operations and concurrency happen to be the most complex cases, somehow “longing” for complete (i.e. solutions contempling all cases) and optimum solutions. This is just not reasonable when we remember who we are and the limited time we have.
Well, the thing is that a lot of these problems are solvable, but due to legacy API conventions and backwards compatibility, we get stuck. This is the reason why I expressed a desire to reboot the industry and engineer things better this time with the benefit of hindsight. It’s not just software APIs, but also things like security, cpu architecture, networking protocols, IP stacks, DNS, etc. We are constantly held back by really obtuse legacy designs that we inherited from many years back..I concede that it can be very difficult to change things today now that we’re so vested in the “good enough” technology that we already have, but I’m just saying that if somehow we got the opportunity to clear the slate, we could avoid many of the mistakes that we’ve made.

2019-08-05 5:54 am
acobar
Hmm, execve() doesn’t spawn a new process. Unlike “posix_spawn” calling “exec*” does NOT return, it wipes out the caller with a new program keeping the same PID and open files handles etc. Execve() is one of many variants used for the second half of the “fork/exec” usage pattern.
fork()/execve() was what I was talking about, together with minimum changes to just setup the environment you need to the new process, so that you can establish a proper communication between both processes. MMU COW will guarantee (almost), that most things will not need to be copied on most cases, and try to follow the “do one thing well” mantra and separate “large” programs on specialized sub-process that can live “independently”; they share the absolute minimum amount of code and data.
For cases where a close image of the parent process is needed, and separating our resources in TEXT, HEAP and DATA, there are cases where it does not help a lot, i.e., when we are modifying lots of things in HEAP and DATA. For the HEAP part I don’t know of any solution (specially if we are using recursion), for DATA, though, keeping only what is really needed, instead of modifying most of them, should be used together with mmap() for large structures/data that must be shared before fork() to mitigate the need to copy large number of blocks. Granted, it will introduce the need of coordination to handle access/modification of them but, then again, there is a price to pay for concurrency.
For me, as I said before, it looks like a case where we are asking for things that are only theoretically a problem, because mitigation techniques are already in place.
Now, you got me curious, what solutions you devise for HEAP and DATA for process that should keep the TEXT blocks unaltered (i.e., where we are using just fork())?

2019-08-05 9:56 am
Alfman verbose=1
acobar,
fork()/execve() was what I was talking about, together with minimum changes to just setup the environment you need to the new process, so that you can establish a proper communication between both processes.
Oh, then I misunderstood your previous post where you seem to suggest execve as an alternative to fork that avoids the fork penalties I was talking about, but it does not. I don’t want to repeat myself here, but I don’t think you’ve acknowledged the additional overhead needed by this kind of implementation, on top of any OS accounting tables, you still need to alter and copy many megabytes worth of page tables to enable the page faults used by the COW mechanism. You have to map the entire address space even though this is almost 100% wasteful. After exec, you either need to restore page tables or let the CPU fault them, either way more overhead. All the memory activity is likely to evict all the recently used data from caches, forcing the system to repopulate the caches with useful data after the fork/exec sequence is done. In short when you look at fork/exec holistically, it’s not very streamlined and is quite inefficient under the hood versus a spawn primitive that avoids forking overhead all together as well as the need to overcommit, which brings it’s own bag of troubles for robust software. Like I said before, I know it’s often considered “good enough”, but as someone who has an affinity towards optimality, the unnecessary overhead needed by this design pattern makes me wince.
For cases where a close image of the parent process is needed, and separating our resources in TEXT, HEAP and DATA, there are cases where it does not help a lot, i.e., when we are modifying lots of things in HEAP and DATA. For the HEAP part I don’t know of any solution (specially if we are using recursion), for DATA, though, keeping only what is really needed, instead of modifying most of them, should be used together with mmap() for large structures/data that must be shared before fork() to mitigate the need to copy large number of blocks. Granted, it will introduce the need of coordination to handle access/modification of them but, then again, there is a price to pay for concurrency.
We’re talking about a different design pattern here, fork without exec. I admit it’s a really cool design pattern for concurrency that unix innovated on and many of our unix applications like apache have built on this pattern. I get the appeal, however fork being fork, the costs are proportional to the size of the parent process, which limits it’s practical scalability.
For me, as I said before, it looks like a case where we are asking for things that are only theoretically a problem, because mitigation techniques are already in place.
If the mitigation is to avoid fork, then I would agree, haha. It should not be controversial to suggest that fork should be avoided in favor of alternatives wherever scalability is crucial.
Now, you got me curious, what solutions you devise for HEAP and DATA for process that should keep the TEXT blocks unaltered (i.e., where we are using just fork())?
The obvious answer is threads, but you already know that, so I don’t think I’ve understood your question.
Ultimately though, even threads reach their own bottlenecks and it can be very difficult to saturate a high bandwidth link because of this. The next step is a thread abstraction that does not involve crossing the kernel barrier and eliminates a lot of synchronization problems. Look at goroutines…
https://gobyexample.com/goroutines
I think most developers would find that to be an elegant approach.
But the really high efficiency daemons like nginx and lighttp turn to event oriented asynchronous IO models that don’t require the overhead of a thread/synchronization/stack for every single connection or request. This is as minimalist as you can get. Not only can you handle millions of connections without much trouble (don’t try that with threads or processes, haha), but kernel mechanisms like epoll are extremely efficient for passing events into userspace without a syscall per event, which is often used by other approaches.