sched: Introduce per-memory-map concurrency ID

This feature allows the scheduler to expose a per-memory-map concurrency ID to user-space. This concurrency ID is within the possible cpus range, and is temporarily (and uniquely) assigned while threads are actively running within a memory map. If a memory map has fewer threads than cores, or is limited to run on few cores concurrently through sched affinity or cgroup cpusets, the concurrency IDs will be values close to 0, thus allowing efficient use of user-space memory for per-cpu data structures.

This feature is meant to be exposed by a new rseq thread area field.

The primary purpose of this feature is to do the heavy lifting needed by memory allocators to allow them to use per-cpu data structures efficiently in the following situations:

- Single-threaded applications,
- Multi-threaded applications on large systems (many cores) with limited cpu affinity mask,
- Multi-threaded applications on large systems (many cores) with restricted cgroup cpuset per container.

One of the key concerns from scheduler maintainers is the overhead associated with additional spin locks or atomic operations in the scheduler fast-path. This is why the following optimization is implemented.

On context switch between threads belonging to the same memory map, transfer the mm_cid from prev to next without any atomic ops. This takes care of use-cases involving frequent context switch between threads belonging to the same memory map.

Additional optimizations can be done if the spin locks added when context switching between threads belonging to different memory maps end up being a performance bottleneck. Those are left out of this patch though. A performance impact would have to be clearly demonstrated to justify the added complexity.

The credit goes to Paul Turner (Google) for the original virtual cpu id idea. This feature is implemented based on the discussions with Paul Turner and Peter Oskolkov (Google), but I took the liberty to implement scheduler fast-path optimizations and my own NUMA-awareness scheme. Rumor has it that Google has been running a rseq vcpu_id extension internally in production for a year. The tcmalloc source code indeed has comments hinting at a vcpu_id prototype extension to the rseq system call [1].

The following benchmarks do not show any significant overhead added to the scheduler context switch by this feature:

* perf bench sched messaging (process)

  Baseline:        86.5±0.3 ms
  With mm_cid:     86.7±2.6 ms

* perf bench sched messaging (threaded)

  Baseline:        84.3±3.0 ms
  With mm_cid:     84.7±2.6 ms

* hackbench (process)

  Baseline:        82.9±2.7 ms
  With mm_cid:     82.9±2.9 ms

* hackbench (threaded)

  Baseline:        85.2±2.6 ms
  With mm_cid:     84.4±2.9 ms

[1] https://github.com/google/tcmalloc/blob/master/tcmalloc/internal/linux_syscall_support.h#L26

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221122203932.231377-8-mathieu.desnoyers@efficios.com
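
The allocation behaviour described in the commit message can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel implementation: `struct mm_sim`, `struct task_sim`, `cid_get()`, `cid_put()`, `switch_cid()` and the `NR_CPUS` value are all hypothetical names chosen for illustration, and the model is single-threaded, so it has none of the locking the real scheduler needs. It shows two properties from the text: IDs are taken from the lowest free slot (so they stay close to 0), and a switch between threads of the same memory map hands the ID over without touching the shared mask.

```c
#include <assert.h>

#define NR_CPUS 8 /* hypothetical size of the possible-cpus range */

/* Hypothetical stand-in for a memory map: one bit per concurrency ID in use. */
struct mm_sim {
	unsigned int cid_mask;
};

/* Hypothetical stand-in for a thread; cid is -1 while it is not running. */
struct task_sim {
	struct mm_sim *mm;
	int cid;
};

/* Take the lowest free ID so that assigned values stay close to 0. */
static int cid_get(struct mm_sim *mm)
{
	for (int cid = 0; cid < NR_CPUS; cid++) {
		if (!(mm->cid_mask & (1u << cid))) {
			mm->cid_mask |= 1u << cid;
			return cid;
		}
	}
	return -1; /* no free ID left in this model */
}

/* Return an ID to the memory map's pool. */
static void cid_put(struct mm_sim *mm, int cid)
{
	if (cid < 0)
		return;
	assert(mm->cid_mask & (1u << cid)); /* the ID must have been in use */
	mm->cid_mask &= ~(1u << cid);
}

/* Model of one context switch from prev to next on a single CPU. */
static void switch_cid(struct task_sim *prev, struct task_sim *next)
{
	if (prev->mm == next->mm) {
		/* Same memory map: transfer the ID, shared mask untouched. */
		next->cid = prev->cid;
		prev->cid = -1;
		return;
	}
	/* Different memory maps: release from one pool, allocate from the other. */
	cid_put(prev->mm, prev->cid);
	prev->cid = -1;
	next->cid = cid_get(next->mm);
}
```

In this model, a thread that starts running first obtains its ID with `cid_get()`, and every later hand-off on that CPU goes through `switch_cid()`. Only the second branch of `switch_cid()` modifies a shared mask, which corresponds to the case where the commit message says spin locks are still taken.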
author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> 2022-11-22 15:39:09 -0500
committer: Peter Zijlstra <peterz@infradead.org> 2022-12-27 12:52:11 +0100
commit: af7f588d8f7355bc4298dd1962d7826358fc95f0 (patch)
tree: 6515179cd9f89aad62e7ed5a1b1999969834d994 /init
parent: 99babd04b25054717d21840298b0b46046b42cd9 (diff)
1 files changed, 4 insertions, 0 deletions
diff --git a/init/Kconfig b/init/Kconfig
index 7e5c3ddc341d..1ce960aa453e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1041,6 +1041,10 @@ config RT_GROUP_SCHED
 
 endif #CGROUP_SCHED
 
+config SCHED_MM_CID
+	def_bool y
+	depends on SMP && RSEQ
+
 config UCLAMP_TASK_GROUP
 	bool "Utilization clamping per group of tasks"
 	depends on CGROUP_SCHED
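
Since `SCHED_MM_CID` is a `def_bool y` symbol with no prompt, it is enabled automatically whenever both `SMP` and `RSEQ` are set, and C code can test for it through the generated `CONFIG_SCHED_MM_CID` macro. The following is a minimal sketch of the usual compile-time gating pattern, not code from this patch: `struct task_stub` and `task_cid()` are hypothetical names used only for illustration.

```c
/*
 * CONFIG_SCHED_MM_CID normally comes from the generated kernel
 * configuration. To try the enabled branch of this sketch, define it by
 * hand, for example with -DCONFIG_SCHED_MM_CID on the compiler command line.
 */

/* Hypothetical stand-in for the per-task state. */
struct task_stub {
	int mm_cid;
};

#ifdef CONFIG_SCHED_MM_CID
/* Feature built in: report the task's current concurrency ID. */
static inline int task_cid(const struct task_stub *t)
{
	return t->mm_cid;
}
#else
/* Feature compiled out: a stub that always reports 0, so callers need no #ifdef. */
static inline int task_cid(const struct task_stub *t)
{
	(void)t;
	return 0;
}
#endif
```

Keeping the stub next to the real helper lets call sites stay free of preprocessor conditionals, which is why this pattern is common for options that depend on other symbols.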