> The evolution of scheduling in Linux By and large, by the year 2000, operating systems designers considered scheduling to be a solved problem…
If I recall correctly, 2000 was the 2.3.x era, which had a braindead-trivial scheduler: it simply looped through the entire process list, ran goodness() on each task, and picked the 'best' one. This was obscenely slow when there were a lot of dormant processes. The process-table links all mapped to the same cache set, which caused cache evictions during the scheduling loop, even TLB evictions. Et cetera. It was just bad, really really bad. Compared to its BSD and Solaris contemporaries, it was garbage.
The O(1) scheduler arrived in the 2.5 development series in early 2002 and shipped with 2.6, followed by the Completely Fair Scheduler in 2.6.23 in 2007, and the Linux scheduler has kept improving since. But in 2000, it sucked. Reeked.
Somewhat relevant: Threadripper CPUs, which are aimed at the high-end consumer market, are NUMA with two memory domains.
This makes the overall scheduling problem much harder, to the point that they were built with a special "Disable half the cores" mode and supporting hardware to give both memory banks same-speed access to the remaining ones.
Honestly, I believe the only reason they did this is so that when review sites run benchmarks the Threadripper won't look abnormally slow compared to the other chips out there.
I own one, and in both work and play I have had zero issues. If I drop a frame here and there in a game due to some memory latency? Eh, I couldn't care less. If you can afford a Threadripper, you can afford a 1080 Ti and a G-Sync monitor to smooth out any issues you might run into.
You also can probably afford enough memory that the kernel can schedule your game entirely on one half of the CPU, but I don't know if that sort of scheduling (and defragmentation) is commonly used yet.
It's going to be on the application to be NUMA-aware, regardless of how much memory you have. Games have never really had to deal with this, given the absolutely minuscule number of people who played games on server-grade dual-socket Xeons. It'll be interesting to see whether any of the big names (Unity, Unreal, Crytek/Lumberyard) ever care enough to ship proper NUMA support.
It's entirely possible to do this at the OS level. It makes the scheduling problem much harder, yes, but a user can, for example, force their game to run in only one domain using CPU affinity, then trigger the kernel to migrate all of its memory to that domain. I know how to do the former; I haven't tried the latter.
It would be more difficult to do it automatically, but if NUMA systems become more common then I see no reason why it shouldn't be tried.
I've commented several times -- it shouldn't be too hard to find -- and I'm trying not to waste any more time on this.
The better question is: why has the Linux Foundation been silent about it?
If this were true -- "a decade of wasted cores", with losses of "13-24% for typical Linux workloads" (for a decade, as the title suggests) -- then companies like Netflix would have lost many millions due to our choice of Linux, and due to the Linux scheduling maintainers and community failing to identify such egregious problems. The industry as a whole -- including every device and server that runs Linux -- would have lost many BILLIONS. It would be one of the most costly failures in technology EVER.
And the Linux Foundation is silent about this? Seriously?
I've already answered it elsewhere -- more than once -- and I'm trying not to waste any more time on it. I think the Linux Foundation is best positioned to coordinate a detailed written response.
There's some extra research I did that I haven't shared yet, including, for example, when the bugs were introduced (still needs double checking):
- Bug 1: Mar 2011, for Linux 2.6.38
- Bug 2: Apr 2012, for Linux 3.4
- Bug 3: Dec 2009, for Linux 2.6.32
- Bug 4: Feb 2015, for Linux 3.19
This paper was published in early 2016, with the title "A decade of wasted cores".
Is NUMA that rare? Back in 2007-2008 or so, my company bought some rack servers fitted with 48-core AMD Opteron (Magny-Cours), which weren't particularly expensive, and had a NUMA architecture.
We didn't have HPC workloads, just Postgres, which uses one OS process per connection, and performance was terrible as a result.
You may find this a fascinating read. It's about how SQL Server uses its own thread scheduler to bypass the Windows scheduler. When run on a machine with Hyper-Threading enabled, it will only use 50% of the CPU; my guess is they're optimising for CPU-cache utilisation.
> We didn't have HPC workloads, just Postgres, which uses one OS process per connection, and performance was terrible as a result.
I'd bet, though not too much, that that was due more to a) Postgres's internal locking implementation scaling horribly at the time, and b) zone_reclaim_mode leading to bad behaviour around IO.
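On the zone_reclaim_mode point: it's easy to check, and disabling it is a common recommendation for database hosts (the sysctl name is real; whether it helps is, of course, workload-dependent):

```shell
# Non-zero means the kernel prefers reclaiming node-local pages over
# allocating on a remote NUMA node, which can wreck page-cache-heavy
# workloads like Postgres.
cat /proc/sys/vm/zone_reclaim_mode 2>/dev/null \
    || echo "kernel built without NUMA support"

# To disable it (as root):
#   sysctl -w vm.zone_reclaim_mode=0
```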
It's possible, but we weren't testing 48 clients at that time, just our normal workload, which had much less parallelism than that. The person in charge of setting up the systems explained the performance issues as being due to NUMA.
The paper has a clickbaity title, but the actual performance impact of the scheduler behaviors discussed therein is negligible on many workloads. YMMV.
The scheduling policies listed on the man page I linked share some generic kernel code, but I wouldn't classify them as the same scheduler. If you look inside the kernel/sched/ directory in the source, you'll find that an instance of `struct sched_class` is defined for each scheduler class. There are dl_sched_class, rt_sched_class, fair_sched_class, and idle_sched_class. You can see in `pick_next_task` in core.c that these class structs are iterated over, calling into each scheduler's own `pick_next_task`: http://elixir.free-electrons.com/linux/v4.13.9/source/kernel...