=== sched_getcpu(2), membarrier(2), and rseq(2) syscalls

Contact: Konstantin Belousov

Links: +
link:https://kib.kiev.ua/kib/membarrier.pdf[Linux manpage for membarrier(2)] URL: link:https://kib.kiev.ua/kib/membarrier.pdf[https://kib.kiev.ua/kib/membarrier.pdf] +
link:https://reviews.freebsd.org/D32360[membarrier(2) implementation] URL: link:https://reviews.freebsd.org/D32360[https://reviews.freebsd.org/D32360] +
link:https://kib.kiev.ua/kib/rseq.pdf[Linux manpage for rseq(2)] URL: link:https://kib.kiev.ua/kib/rseq.pdf[https://kib.kiev.ua/kib/rseq.pdf] +
link:https://reviews.freebsd.org/D32505[rseq(2) and userspace bindings implementation] URL: link:https://reviews.freebsd.org/D32505[https://reviews.freebsd.org/D32505]

Linux provides a set of syscalls that allow developing mostly syscall-less scalable algorithms in userspace.
The mechanisms are based on optimistic execution using CPU-local data, with the assumption that rare events like context switches or signal delivery do not occur during the given calculation; if they do occur, a rollback and restart is performed.
This very high-level approach is used, as I understand, in the implementation of tools like URCU, fast malloc allocators (tcmalloc), and other userspace infrastructure projects aimed at large partitioned machines.

For instance, the sched_getcpu(2) syscall returns the id of the CPU on which the current thread is executing.
On amd64, if available, we use the RDTSCP or RDPID instruction to query the CPU id without changing CPU mode; otherwise this is a light-weight syscall.
Of course, the answer is obsolete the moment it is created, even before it is returned to userspace.
But it allows seeding values in some structures that are valid for a long time (at the CPU speed scale) and are automatically corrected on exceptional control flow events like context switches, and userspace can either detect the exception and roll back, or synchronize with it and roll back.

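As an illustration of using the CPU id purely as a hint, here is a minimal sketch of a sharded counter.
It assumes that libc exposes sched_getcpu() through `<sched.h>`, as glibc does; the names `NSHARDS`, `shards`, `counter_inc` and `counter_read` are hypothetical and exist only for this example.
Because the returned id may already be stale, the update stays atomic; the hint only reduces contention between CPUs.

```c
#define _GNU_SOURCE     /* makes glibc declare sched_getcpu(); harmless elsewhere */
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>

#define NSHARDS 64      /* hypothetical shard count, not tied to the real CPU count */

/* One counter per shard, aligned to 64 bytes to limit false sharing. */
struct shard {
	_Alignas(64) atomic_ulong count;
};

static struct shard shards[NSHARDS];

/*
 * Increment the shard selected by the CPU hint.  The hint can be obsolete
 * by the time it is used, so correctness relies on the atomic update and
 * not on the thread staying on the same CPU.
 */
static void
counter_inc(void)
{
	int cpu = sched_getcpu();

	if (cpu < 0)
		cpu = 0;        /* the query failed: fall back to shard 0 */
	atomic_fetch_add_explicit(&shards[cpu % NSHARDS].count, 1,
	    memory_order_relaxed);
}

/* Sum all shards; the result is exact only when no increments are in flight. */
static unsigned long
counter_read(void)
{
	unsigned long sum = 0;

	for (int i = 0; i < NSHARDS; i++)
		sum += atomic_load_explicit(&shards[i].count,
		    memory_order_relaxed);
	return (sum);
}
```

A caller would invoke counter_inc() on the hot path and counter_read() only occasionally, since the read walks every shard.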
There are two cornerstone syscalls that allow userspace to implement these efficient algorithms: membarrier(2) and rseq(2).

Membarrier is a facility that helps implement fast CPU ordering barriers, typically used for asymmetric/biased locking.
In these lock implementation schemes, the owner of the object often assumes that there are no contenders/parallel threads that it needs to coordinate with.
If some other thread starts accessing the same resource, then it is that thread's duty to ensure correctness.
One example of a 'trap' that the fast code path can utilize is a read from a dedicated page that is unmapped by contenders to switch the fast path to the slow one.
Or we could send a signal to all threads that potentially have access to that object, to insert a barrier.
Or we can use the membarrier(2) facility, which incurs significantly less overhead than signalling all threads.
Membarrier(2) inserts a barrier, which is the typical underlying hardware operation to ensure ordering, on the specified set of CPUs, if these CPUs are executing the targeted threads.
If these CPUs are not executing the targeted threads, it is assumed that the sequential consistency guarantees from the context switch are enough to fulfill the requirement of membarrier(2).
Overall, the fast path can be implemented without slow instructions, and the slow path injects the required fences into the fast path at the cost of an IPI.

The facility to detect exceptional conditions in the userspace thread execution was developed in Linux and called rseq(2).
It is a feature often called Restartable Atomic Sequences, which explains the acronym.
The ability to cheaply detect such conditions allows code longer than a single instruction to execute atomically, without the need to propose and implement unsafe operations like disabling preemption, which is not feasible for userspace.
For instance, code might use CPU-local resources, which otherwise do not cope well with context switches.
There cannot be an analog of critical_enter(9) in userspace.

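To make the asymmetric barrier scheme built on membarrier(2) more concrete, the following is a minimal sketch of a Dekker-style handshake between an owner and a contender.
It is written against the Linux interface from the manpage linked above, invoked through syscall(2); the native FreeBSD interface from the review may differ and is not shown.
All names (`asym_init`, `fast_barrier`, `slow_barrier`, `owner_try_enter`, `owner_exit`, `contender_revoke`) are hypothetical, and when the facility is unavailable the sketch falls back to ordinary fences on both sides.

```c
#define _GNU_SOURCE             /* makes glibc declare syscall() */
#include <assert.h>
#include <stdatomic.h>
#if defined(__linux__)
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>
#endif

static int have_membarrier;             /* set once, before threads start */
static atomic_int owner_active;         /* owner is inside the fast path */
static atomic_int revoke_requested;     /* a contender wants the object */

/* Register for private expedited barriers; returns 1 on success. */
static int
asym_init(void)
{
#if defined(__linux__) && defined(SYS_membarrier)
	have_membarrier = syscall(SYS_membarrier,
	    MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0, 0) == 0;
#endif
	return (have_membarrier);
}

/* Fast side: compiler-only barrier when the slow side can inject fences. */
static inline void
fast_barrier(void)
{
	if (have_membarrier)
		atomic_signal_fence(memory_order_seq_cst);
	else
		atomic_thread_fence(memory_order_seq_cst);
}

/* Slow side: force a barrier on every CPU running a thread of this process. */
static void
slow_barrier(void)
{
#if defined(__linux__) && defined(SYS_membarrier)
	if (have_membarrier) {
		long r = syscall(SYS_membarrier,
		    MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0, 0);
		assert(r == 0);
		(void)r;
	}
#endif
	atomic_thread_fence(memory_order_seq_cst);  /* order the caller too */
}

/* Owner fast path: returns 1 if the owner may proceed, 0 if revoked. */
static int
owner_try_enter(void)
{
	atomic_store_explicit(&owner_active, 1, memory_order_relaxed);
	fast_barrier();
	if (atomic_load_explicit(&revoke_requested, memory_order_relaxed)) {
		atomic_store_explicit(&owner_active, 0, memory_order_release);
		return (0);
	}
	return (1);
}

static void
owner_exit(void)
{
	atomic_store_explicit(&owner_active, 0, memory_order_release);
}

/* Contender slow path: returns 1 if the owner was observed outside. */
static int
contender_revoke(void)
{
	atomic_store_explicit(&revoke_requested, 1, memory_order_relaxed);
	slow_barrier();
	return (!atomic_load_explicit(&owner_active, memory_order_acquire));
}
```

The design choice is that only the contender pays for the expensive barrier; the owner's path contains no fence instruction as long as registration succeeded.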
A facility to cheaply block signal delivery does exist in FreeBSD, see sigfastblock(2), but using it correctly proved too hard in general-purpose code, especially because it requires version-dependent coordination with rtld and libthr.

rseq(2) takes a per-thread block of memory, where the kernel stores the current CPU id (see sched_getcpu(2)), and where the thread specifies the block of critical code that must be unwound if an exceptional situation like a context switch occurs while the block is executing.
The fast code path uses per-CPU data and typically does not need any corrections, but should a context switch occur, the transfer of control to the abort handler informs userspace about the event.
So instead of disabling context switches, code can cheaply check for one after the calculation and retry if needed.

An interesting rseq(2) implementation detail is that it is impossible (and not needed) to access/update rseq structures from the kernel during the actual context switch, because we cannot access userspace from under a spinlock.
In other words, threads using rseq do not incur any performance cost from system-global context switches.
Instead, if the process registered for rseq(2), on any return to user mode we check whether any exceptional events happened while the thread was in the kernel (context switches may happen only while the thread is in kernel mode), and if a context switch indeed occurred, we fire an AST to check whether the program counter is inside the critical section and jump to the abort handler if it is.

The implementations of membarrier(2) and rseq(2) are clean-room: I used the Linux manual pages as the reference and public discussions of the features for clarifying corner cases.

On Linux/glibc, there was no stable glibc interface to the rseq facility.
One proposed integration was committed and then reverted from glibc.
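
Returning to the check performed on return to user mode, the decision can be sketched in a few lines of C.
The descriptor layout follows the `struct rseq_cs` description in the Linux rseq(2) manpage; the type name `rseq_cs_sketch`, the helper `rseq_fixup_pc` and its `preempted` argument are hypothetical and are not taken from the FreeBSD implementation in the review.
The sketch only models the comparison of the saved program counter against the registered critical section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Critical-section descriptor, laid out as in the Linux rseq(2) manpage. */
struct rseq_cs_sketch {
	uint32_t version;
	uint32_t flags;
	uint64_t start_ip;              /* first instruction of the sequence */
	uint64_t post_commit_offset;    /* length of the sequence in bytes */
	uint64_t abort_ip;              /* where to resume after an interruption */
};

/*
 * Hypothetical helper modelling the return-to-user-mode check: given the
 * saved user program counter, return the address at which the thread
 * should resume.  'preempted' says whether a context switch (or another
 * exceptional event) happened while the thread was in the kernel.
 */
static uint64_t
rseq_fixup_pc(const struct rseq_cs_sketch *cs, uint64_t pc, int preempted)
{
	if (cs == NULL || !preempted)
		return (pc);            /* nothing registered or nothing happened */
	/*
	 * Unsigned subtraction wraps around when pc < start_ip, so a single
	 * comparison covers both ends of the range.
	 */
	if (pc - cs->start_ip < cs->post_commit_offset)
		return (cs->abort_ip);  /* inside the sequence: restart it */
	return (pc);                    /* outside the sequence: resume as is */
}
```

With this shape, a thread that is never interrupted inside the sequence pays nothing, and the abort handler runs only when the sequence has to be retried.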
It might be prudent to wait some more for the rseq(2) interface to stabilize in glibc before providing it in our libc, or to rely on the tight integration between kernel and userspace in our base system and use ABI tricks like symbol versioning to evolve the interface.
There is no goal to be 100% compatible with Linux anyway.

Sponsor: The FreeBSD Foundation