Async Rust: Why is it so Fast?

Everyone knows that a web server, for example, will be able to serve up a lot more traffic if it is written using Rust async code rather than straight OS threads. Lots of people have the numbers to prove it.

But I have wondered why, and none of the explanations I have read have explained it to me. They all seem to boil down to a vague “because”.

Well, today, listening to the podcast Rustacean Station, the episode “Asynchronous Programming in Rust with Carl Fredrik Samson”, I think I figured it out. Yes, they gave more of an explanation, not quite a satisfying one, but it helped, so I hit pause and started talking to myself like a crazy person (handy to be driving alone in cases like this) and I think I get it. Here it is.

Disclaimer

I am not experienced with Rust async code. I am still learning, I am certain to get things at least a little wrong, possibly a lot. I’m largely writing this for myself, but I’m putting it in public on the theory that others might find it illuminating even in its errors.

Multitasking Techniques

The problem is how to do multitasking efficiently, so as to get the most work done. Let’s look at the different approaches, in sequence, to land with a better understanding of the virtues of Rust’s async abilities.

Processes

This is the oldest of these approaches. Here the operating system uses its god-like abilities to set up individual processes, each with an apparently complete computer at its disposal. Each process runs its own program, can do what it wants, can (attempt to) access any memory location, and can happily crash when it messes up too badly, likely with no other process noticing. Individual processes are so well isolated that unless they go to special effort to communicate with each other, or really load the machine with heavy work, they can be completely ignorant of any other process even existing.

Processes are very general purpose. They are also relatively expensive: for the OS to switch context from one process to another, everything the process can access has to be swapped out, and that is work. Doing that work dozens or even hundreds of times a second is no problem, but don’t try to do it hundreds of thousands of times a second. Being general purpose has a cost.

Threads

Threads are more lightweight than processes, but at least on Linux, they are kind of the same thing. The way to start a new process on Linux is to fork a copy of oneself. There will now be two of you, clones of each other, both running at once, which seems a bit wrong and pointless. But you wrote the program they are running, and there is a way for each copy to check which copy it is. One can discover it is the original and the other that it is the copy, and they can do different things in the two cases. A common thing for a new process to do is to start running an entirely new program.
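
Here is roughly what that check looks like from Rust, as a minimal sketch using the libc crate (Linux-specific, and simplified):

    // fork() returns twice: 0 in the child, the child's pid in the parent.
    use libc::{fork, getpid};

    fn main() {
        match unsafe { fork() } {
            -1 => eprintln!("fork failed"),
            0 => println!("I am the copy, pid {}", unsafe { getpid() }),
            child => println!("I am the original; my copy is pid {child}"),
        }
    }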

So what is a thread? Well, when forking a Linux process one can choose how much is shared between the old and the new. In the case of a thread, they still share the same memory. Sharing memory means they can cooperate in their work, provided the cooperation is done carefully.

Threads do not share everything. Each thread has its own copy of the CPU registers, and each has its own stack. So threads are still pretty general purpose: the different threads could be working on different problems independently, but more likely they cooperate, though that is a lot harder to program correctly.

The primary value of threads isn’t that their context is faster to switch (though it is a little faster); the motivation for threads is to share memory very efficiently, so programs can take advantage of multiple CPUs for some single purpose.
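
For example, here are two threads chewing on the same data with no copying, a minimal sketch using std::thread::scope:

    use std::thread;

    fn main() {
        let data = vec![1, 2, 3, 4, 5, 6, 7, 8];
        let (left, right) = data.split_at(4);

        // Scoped threads may borrow `data` directly, because the scope
        // guarantees both threads finish before `data` goes away.
        thread::scope(|s| {
            let a = s.spawn(|| left.iter().sum::<i32>());
            let b = s.spawn(|| right.iter().sum::<i32>());
            println!("total = {}", a.join().unwrap() + b.join().unwrap());
        });
    }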

Green Threads

Green threads (or stackful co-routines) are very much like threads, except they live entirely in userland. Instead of the OS using its god-like powers to switch execution from one thread to another, the switching happens within the process. But the process doesn’t have any god-like powers over itself other than self-discipline. So green threads need to be written in a slightly special way to make the context switching possible at all. When context switches do happen there is still a similar amount of work needed, but there is a little win in that the OS doesn’t have to do it, so no context switch into kernel mode. Remember, any system call into the Linux kernel takes longer than a local function call.

So green threads do switch faster than OS threads can, and early Rust, before 1.0, did have some sort of green threads. But they are long gone.

Preemptive vs. Cooperative Multitasking

Time for a little detour. Everything up to this point has been preemptive multitasking. At one moment one program (or thread) is running then whoosh the next moment suddenly another program or thread will start running instead. The code that is being run and then not run doesn’t have any control, nor even any easy knowledge, of this change.

In cooperative multitasking these unanticipated changes do not happen; rather, cooperative multitasking requires that code regularly yield the CPU and let other code run. If it doesn’t yield, other code can’t run on that CPU. This is both good and bad. The good is that if code is doing valuable and important work, it won’t be interrupted. The bad is that if some code is churning away on something unimportant it can keep other code from running. Cooperative multitasking assumes responsible, cooperative behavior.

Fun fact: The original Macintosh, way back when, before you were born (not all of you, but an awful lot of you), had cooperative multitasking. It worked amazingly well, but a lot of people scoffed, they wanted “real” multitasking. Meaning preemptive multitasking. But cooperative multitasking is also real.

Rust Knows More

Async Rust benefits from being Rust.

I like to say that writing a multithreaded program is easy, but maintaining a multithreaded program is mostly impossible. (And writing a multithreaded program of any size takes long enough that some of the effort is effectively maintenance before you are done…which means you are screwed.)

Rust makes multithreaded programming possible (if not easy) by insisting that the program explicitly state a bunch of stuff that in other languages would live in the comments. “Be sure to always use this mutex before accessing that data structure”, for example.
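
In Rust that comment becomes part of the type. The data lives inside the mutex, so there is no way to touch it without locking first. A minimal sketch:

    use std::sync::{Arc, Mutex};
    use std::thread;

    fn main() {
        // The comment "always lock before touching the count" is now
        // enforced by the compiler: the count lives *inside* the Mutex.
        let count = Arc::new(Mutex::new(0u64));

        let handles: Vec<_> = (0..4)
            .map(|_| {
                let count = Arc::clone(&count);
                thread::spawn(move || {
                    *count.lock().unwrap() += 1; // .lock() is the only door in
                })
            })
            .collect();

        for h in handles {
            h.join().unwrap();
        }
        println!("count = {}", count.lock().unwrap());
    }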

The result is Rust knows a lot about your program and how all data is shared and not shared. This is key to async Rust.

Rust Async: Cooperative

Rust async is cooperative multitasking. (Mostly: you can also spawn threads, and then it is no longer just cooperative.)

When you write async Rust code you can be assured that once your code is executing, the Rust async machinery will not halt you to start running some other async code, except at clear points where you know this might happen.

These context switches can happen whenever you call code that would block; normally this is any code that might wait for IO, such as waiting for network activity, waiting for a disk, or waiting for a user to type something. Or you can explicitly yield with yield_now(), or maybe you have a reason to sleep() for a specific amount of time. But the key point is that execution can only switch from one hunk of code to another at specific and well defined points.
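
Those points are the .await marks in the code. A minimal sketch, assuming the tokio runtime (the helper functions are made up for illustration):

    use tokio::time::{sleep, Duration};

    // Hypothetical helpers, just so the sketch is self-contained.
    async fn read_from_network() -> Vec<u8> { vec![1, 2, 3] }
    fn crunch(data: &[u8]) -> usize { data.len() }

    async fn fetch_and_process() {
        let data = read_from_network().await;   // possible switch point
        let result = crunch(&data);             // runs uninterrupted
        sleep(Duration::from_millis(10)).await; // another switch point
        tokio::task::yield_now().await;         // an explicit yield
        println!("{result}");
    }

    #[tokio::main]
    async fn main() {
        fetch_and_process().await;
    }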

Here async draws on Rust’s requirement that we be precise about what data is where and who has access to it for what purposes. I don’t know the details, but given all that information about data, plus knowledge of all the locations where execution might change, the Rust compiler works out a state machine for how to do all of the possible execution transitions. (I’m a bit amazed that this is possible, but I’ll believe it works; people use it for real work.)
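
A caricature of the state machine for the function above, not real compiler output: each await point becomes a state, and any local that must live across an await becomes a field instead of living on a stack:

    #[allow(dead_code)]
    enum FetchAndProcessState {
        Start,
        WaitingOnNetwork,                 // parked at the first .await
        WaitingOnSleep { result: usize }, // `result` survives this await
        Done,
    }

    fn main() {
        // The runtime's whole job is to poll: try to advance the state
        // machine, and if it can't progress, remember to poll it later.
        let _parked = FetchAndProcessState::WaitingOnNetwork;
    }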

And here is where the efficiency seems to lie: to switch context from one async execution path to another, nothing really has to happen. No stacks need to be swapped out, no CPU register files need to be swapped, no MMU configurations need to be touched, nothing has to happen inside the kernel. Rather, we are still just running a Rust program, and the Rust program is just doing something else, not that different from branches you explicitly put in your own code.

This also means that the multitasking that happens in a Rust async program is not general purpose the same way that preemptive multitasking is. The Rust compiler looked at your specific code and output the code necessary for your program to run. This isn’t a practical problem, for the compiler is still completely general purpose, but its output is not.

Put another way, Rust’s passion for zero cost abstractions applies here, too. Rust only has to mess with the minimum to start working on something different, and that means it will be fast.

At least I think so. Note I used the weaselly word “just” twice in the paragraph about switching context. Always be suspicious of the word “just” (along with the words “exactly” and “simply”, and …).

Concluding

The same way that Rust doesn’t have a garbage collector, because it largely manages to analyze away the whole problem at compile time, Rust async seems to make context switching costs go away by mostly analyzing away the code that implements context switches, at compile time. This is what makes it fast.

In async Rust there is still a “runtime” (and more than one mutually incompatible runtime option to choose from, annoyingly), and it still needs rather general purpose code that talks to the OS to find out about IO that has unblocked and connect that up with pending awaits in your Rust code, plus a pretty general purpose scheduler that decides what to run next when blocked execution is unblocked. But this runtime is still far smaller than nearly any OS. It comes as source code in a Rust crate, and it is part of what the compiler will be optimizing along with your code. Something to be aware of: when you see an “async” or an “await” in code, a lot of magic is happening in there, more than in other parts of Rust’s syntax.
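
For a feel of how ordinary the runtime is, here is tokio (one of those runtime options) being set up by hand; the #[tokio::main] macro used earlier is roughly sugar for this:

    fn main() {
        // Build a runtime, then hand it a future to drive to completion.
        tokio::runtime::Builder::new_multi_thread()
            .enable_all() // turn on the IO and timer drivers
            .build()
            .unwrap()
            .block_on(async {
                println!("hello from inside the runtime");
            });
    }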

Stuff I will learn more about as I dig deeper and use async more.

-kb

Epilogue

There is a fairness issue I need to bring up: asynchronous, cooperative multitasking has inherent advantages, unrelated to Rust. Other languages can and do pull these tricks, and it works for them, too. Rust maybe can do it better…but async is not entirely a Rust trick.

And another epilogical thought: The scheduler has a lot of information about the program being run. It will have choices in what to schedule (if more than one task becomes unblocked, which should run?). It can maybe put the two together and be smart about what it schedules.

For example, if there happens to be a crossbeam channel that is filled by only one task but is drained by many, and if several are waiting to pull work out, it might be really smart to give more priority to the one that puts work in, so as to unblock those ready to drain it. The scheduler has information about every blocked task and can know the dependencies between them. Clever programming in the scheduler might make a really big difference in total performance.
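
That topology, for concreteness, as a minimal sketch using OS threads and the crossbeam-channel crate (an async scheduler would see the same shape):

    use crossbeam_channel::bounded;
    use std::thread;

    fn main() {
        let (tx, rx) = bounded::<u32>(8);

        // Many drainers, all blocked on the same channel.
        let drainers: Vec<_> = (0..4)
            .map(|i| {
                let rx = rx.clone();
                thread::spawn(move || {
                    for job in rx.iter() {
                        println!("drainer {i} got job {job}");
                    }
                })
            })
            .collect();

        // One filler. When the channel is full it blocks; a scheduler that
        // knew the dependency graph might prioritize it to unblock the rest.
        for job in 0..32 {
            tx.send(job).unwrap();
        }
        drop(tx); // close the channel so the drainers exit

        for d in drainers {
            d.join().unwrap();
        }
    }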

Also, the scheduler is in a position to keep statistics about system performance. To the extent workloads follow patterns it might be able to dynamically tune how it schedules unblocked work. For example, if the code that fills the hypothetical crossbeam channel runs very quickly but the code that drains it takes a lot of time and can be processed in parallel, maybe statistics can reveal that, and be used to put more priority on filling the channel.

Some of these sorts of tricks can be, and are, done by OS schedulers too, but an async scheduler runs inside the program and can have a lot more information with which to make such choices. How much of an advantage Rust has here compared to other async systems, I don’t know. But I bet there is some.

I suspect this is going to be an area of ongoing work, and I suppose it makes it a little less annoying that we have multiple Rust async runtimes, if that multitude allows more innovation to happen.

I’ll mention a final benefit to an async approach: It scales better. Having many thousands of active async tasks in existence is more practical than having many thousands of OS processes or threads running at one time.
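
To make that concrete, a minimal sketch, again assuming tokio:

    #[tokio::main]
    async fn main() {
        // Ten thousand tasks is routine; ten thousand OS threads is not.
        let handles: Vec<_> = (0..10_000u64)
            .map(|i| tokio::spawn(async move { i * 2 }))
            .collect();

        let mut sum = 0;
        for h in handles {
            sum += h.await.unwrap();
        }
        println!("sum = {sum}");
    }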

Much of that scalability is because the context information for each task is inherently smaller, but there is another fairness issue here: the new async systems were specifically written to scale that much. A general purpose OS could also scale up to enormous numbers of processes. Arguably this is true of Erlang now, but okay, Erlang is weird. Had Linux anticipated current RAM capacity and been built from the start to handle orders of magnitude more processes, I bet it could have scaled too. (But it didn’t.)

©2024 Kent Borg

