The Linux Scheduler: A Decade of Wasted Cores (2016)

CalChris · on Oct 23, 2017

> The evolution of scheduling in Linux By and large, by the year 2000, operating systems designers considered scheduling to be a solved problem…

If I recall correctly, 2000 is 2.3.x which had a braindead trivial scheduler. Basically it just looped through the process list and executed goodness() and found the 'best'. This was obscenely slow when there were a lot of dormant processes. The process table links all mapped to the same cache set which led to cache evictions during the scheduling loop, even TLB evictions. Et cetera. It was just bad, really really bad. Compared to its BSD, Solaris, ... contemporaries, it was garbage.

The O(1) scheduler starts in 2.4 at the tail end of 2000 followed by the Completely Fair Scheduler in 2007, etc. And the Linux scheduler continues to get better. But in 2000, it sucked. Reeked.

ksk · on Oct 23, 2017

From my experience, Linux CPU and I/O scheduling got good from 2.6.x (also around 2007) around when AIO became robust.

dullgiulio · on Oct 23, 2017

Isn't AIO still implemented with threads in libc? In that case, it's just the CPU scheduler that counts...

MichaelMoser123 · on Oct 23, 2017

He means io_submit, that one is not done as user threads

Filligree · on Oct 23, 2017

Somewhat relevant: Threadripper CPUs, which are aimed at the high-end consumer market, are NUMA with two memory domains.

This makes the overall scheduling problem much harder, to the point that they were built with a special "Disable half the cores" mode and supporting hardware to give both memory banks same-speed access to the remaining ones.

hesdeadjim · on Oct 23, 2017

Honestly, I believe the only reason they did this is so that when review sites run benchmarks the Threadripper won't look abnormally slow compared to the other chips out there.

I own one and in both work and play I have had zero issues. If I drop a frame here and there in a game due to some memory latency? Eh, could care less. If you can afford a Threadripper you can afford a 1080 Ti and a Gsync monitor to smooth out any issues you might run into.

Filligree · on Oct 24, 2017

You also can probably afford enough memory that the kernel can schedule your game entirely on one half of the CPU, but I don't know if that sort of scheduling (and defragmentation) is commonly used yet.

hesdeadjim · on Oct 24, 2017

It’s going to be on the application to be NUMA aware, regardless of how much memory you have. Games have never really had to deal with this due to the absolutely minuscule number of people who played games on server-grade dual socket Xeons. It’ll be interesting to see if any of the big names (Unity, Unreal, Crytek/Lunberyard) ever care enough to make a patch for proper NUMA support.

Filligree · on Oct 24, 2017

It's entirely possible to do this at the OS level. It makes the scheduling problem much harder, yes, but a user can—for example—force their game to run only in one domain using CPU affinity, then somehow trigger the kernel to migrate all its memory to that domain. I know how to do the former, I haven't tried the latter.

It would be more difficult to do it automatically, but if NUMA systems become more common then I see no reason why it shouldn't be tried.

brendangregg · on Oct 23, 2017

Still 0% performance wins for Netflix. Not a "decade of wasted cores" for us.

slashdev · on Oct 23, 2017

Brendan is probably referring to these prior comments of his:

https://news.ycombinator.com/item?id=11573375 https://news.ycombinator.com/item?id=11502221

diptanu · on Oct 23, 2017

Can you please explain that more?

brendangregg · on Oct 23, 2017

I've commented several times -- it shouldn't be too hard to find -- and I'm trying not to waste any more time on this.

The better question is: why has the Linux Foundation been silent about it?

If this were true -- "a decade of wasted cores", with losses of "13-24% for typical Linux workloads" (for a decade, as the title suggests) -- then companies like Netflix would have lost many millions due to our choice of Linux, and due to the Linux scheduling maintainers and community failing to identify such egregious problems. The industry as a whole -- including every device and server that runs Linux -- would have lost many BILLIONS. It would be one of the most costly failures in technology EVER.

And the Linux Foundation is silent about this? Seriously?

bogomipz · on Oct 23, 2017

>"I've commented several times -- it shouldn't be too hard to find -- and I'm trying not to waste any more time on this."

Then why make a cryptic comment like you did? Cryptic comments kind of invite questions.

brendangregg · on Oct 23, 2017

I've already answered it elsewhere -- more than once -- and I'm trying not to waste any more time on it. I think the Linux Foundation is best positioned to coordinate a detailed written response.

There's some extra research I did that I haven't shared yet, including, for example, when the bugs were introduced (still needs double checking):

  - Bug 1: Mar 2011, for Linux 2.6.38
  - Bug 2: Apr 2012, for Linux 3.4
  - Bug 3: Dec 2009, for Linux 2.6.32
  - Bug 4: Feb 2015, for Linux 3.19

This paper was published in early 2016, with the title "A decade of wasted cores".

thechao · on Oct 24, 2017

Ah, but they were very intense years, so they counted more than one!

awalton · on Oct 23, 2017

The tl;dr is that unless you had an HPC workload with a NUMA box, you couldn't observe this bad behavior.

Which means in reality, you could name approximately everyone that ran into this issue on a single list: top500.org.

wumpus · on Oct 23, 2017

Many HPC workloads run 1 process per core, and pin them. So they can't observe this bad behavior.

The Linpack benchmark is an example of an HPC code that should run one process per core and be pinned.

atombender · on Oct 24, 2017

Is NUMA that rare? Back in 2007-2008 or so, my company bought some rack servers fitted with 48-core AMD Opteron (Magny-Cours), which weren't particularly expensive, and had a NUMA architecture.

We didn't have HPC workloads, just Postgres, which uses one OS process per connection, and performance was terrible as a result.

velox_io · on Oct 24, 2017

You may find this a fascinating read. It's regarding how SQL Server has it's own threading scheduler to bypass the Windows scheduler. When ran on a machine with HyperThreading enabled it will only use 50% of the CPU, I guess they're going after CPU Cache utilisation.

It's really hard to find a decent article on the subject, here's two of the best I've found. https://www.mssqltips.com/sqlservertip/4403/understanding-sq... https://docs.microsoft.com/en-us/sql/relational-databases/th...

anarazel · on Oct 24, 2017

> We didn't have HPC workloads, just Postgres, which uses one OS process per connection, and performance was terrible as a result.

I'd bet, but not too much, that that was more due to a) postgres' internal locking implementation scaling horribly at that time b) zone_reclaim_mode leading to bad behaviour around IO.

atombender · on Oct 24, 2017

It's possible, but we weren't testing 48 clients at that time, just our normal workload, which had much less parallelism than that. The person in charge of setting up the systems explained the performance issues as being due to NUMA.

the_why_of_y · on Oct 23, 2017

The paper has a clickbaity title, but the actual performance impact of the scheduler behaviors discussed therein is negligible on many workloads. YMMV.

ralphm · on Oct 23, 2017

Article is from 2016. Previous discussion: https://news.ycombinator.com/item?id=11570606

pdw · on Oct 23, 2017

Some brief notes from the Linux developers POV: https://lwn.net/Articles/734039/ (section "Multi-core scheduling")

andai · on Oct 23, 2017

Previous previous discussion:

https://news.ycombinator.com/item?id=11501493

tkyjonathan · on Oct 23, 2017

DBAs knew about this for many years now. We would simply change the schedular on the DB servers.

pdw · on Oct 23, 2017

I think you might be confusing CPU schedulers and IO schedulers. Linux never had switchable CPU schedulers.

0xfeba · on Oct 23, 2017

Yes it does:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

fair.c deadline.c rt.c

dragontamer · on Oct 23, 2017

Alternatively, it might be a confusion between Linux and Windows.

https://technet.microsoft.com/en-us/library/aa175393(v=sql.8...

The Windows Database team has been using the "User Mode Scheduling" feature of Windows to implement custom scheduling for databases for some time.

amyjess · on Oct 23, 2017

Mainline, sure, but there are other kernel forks that do, such as the ck kernel.

noughth · on Oct 23, 2017

Sure it does: http://man7.org/linux/man-pages/man2/sched_setscheduler.2.ht...

ksk · on Oct 23, 2017

If you read the page, it simply states that you can tweak the parameters of the existing scheduler, not replace it entirely.

noughth · on Oct 23, 2017

The scheduling policies listed on the man page I linked share some generic kernel code, but I wouldn't classify them as the same scheduler. If you look inside the kernel/sched/ directory in the source, you'll find that an instance of `struct sched_class` is defined for each scheduler class. There are dl_sched_class, rt_sched_class, fair_sched_class, and idle_sched_class. You can see in `pick_next_task` in core.c that these class structs are iterated over, calling into each scheduler's own `pick_next_task`: http://elixir.free-electrons.com/linux/v4.13.9/source/kernel...

cat199 · on Oct 23, 2017

not to mention the fact that kernel devs knew about it as well, so the whole title of the article is a bit bunk