
Given how much these types of processors are used for virtualization, wouldn't a lower core count at a higher clock speed be just as useful? 32 cores at 1.4 GHz only seems useful if you need a lot of processor affinity, but assigning faster vCPUs to your VMs doesn't seem to have a downside. I'm just not sure what advantage this would have over, say, a 16-core 2.8 GHz chip.


Switching tasks is expensive [1]. Twice as many cores running at half the speed can be considerably faster in the real world because you're not constantly stopping to flush the cache, save the kilobytes of register state a modern CPU has, etc. Honestly, I'm surprised x86 has stuck with just two virtual threads for this long. Architectures like SPARC and POWER have 4+ threads per core because so many modern jobs are built around hurrying up and waiting.

[1] http://www.cs.rochester.edu/u/cli/research/switch.pdf
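The cost the paper measures can be approximated crudely from userspace: two processes ping-pong a byte over a pair of pipes, so each round trip forces at least two context switches. A rough Unix-only sketch in Python (the per-switch figure also includes pipe syscall overhead, so treat it as an upper bound):

```python
import os
import time

def estimate_ctx_switch(iterations=10000):
    """Ping-pong one byte between two processes over pipes.

    Each round trip forces at least two context switches, so the
    per-switch number is a rough upper bound (it includes the pipe
    syscall overhead too).
    """
    r1, w1 = os.pipe()
    r2, w2 = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: echo each byte back
        for _ in range(iterations):
            os.read(r1, 1)
            os.write(w2, b"x")
        os._exit(0)
    start = time.perf_counter()
    for _ in range(iterations):
        os.write(w1, b"x")
        os.read(r2, 1)
    elapsed = time.perf_counter() - start
    os.waitpid(pid, 0)
    return elapsed / (iterations * 2)  # seconds per switch, roughly
```

On a typical Linux box this lands in the low microseconds per switch, which is thousands of cycles of work a dedicated core never has to pay.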


One core at a given frequency beats two cores at half that frequency every time. The problem is that in practice the trade-off is rarely that clean: either the slower cores consume significantly less power, or they run at more than half the speed of the fast core.

Regarding HT, 2 threads really is the sweet spot for a 4-wide CPU. Beyond that, the competition for cache resources, execution units, and the register file becomes significant.

POWER8 is special: one 2x factor comes from each POWER 'core' being pretty much two distinct smaller cores that can gang together to speed up a single thread (which also helps with per-core software licensing), while the other 2x factor is for very specialized loads (this is also true, or used to be, of SPARC).

IIRC Xeon Phi, which is also a specialized CPU, has 4-way HT.


My understanding is that lower-clocked cores run cooler and consume less power. CPUs designed for high clock speeds will have "hotspots" where certain units are running very hot compared to the rest of the chip. Slower cores have more even thermal profiles.

So, if you don't need the high peak clock speeds, 32 half-speed cores would be preferable for datacenters to save money on electricity and cooling design.
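A back-of-the-envelope version of this: dynamic CMOS power scales roughly as C·V²·f, and since supply voltage has to rise with frequency, power grows superlinearly with clock. A toy comparison (the capacitance and voltage numbers here are made up purely for illustration):

```python
def dynamic_power(freq_ghz, volts, cap=1.0):
    # Classic CMOS dynamic power model: P ~ C * V^2 * f (arbitrary units)
    return cap * volts ** 2 * freq_ghz

# Same aggregate GHz, but the wide/slow chip can run at lower voltage
p_slow = 32 * dynamic_power(1.4, 0.8)  # 32 cores, 1.4 GHz at 0.8 V
p_fast = 16 * dynamic_power(2.8, 1.1)  # 16 cores, 2.8 GHz at 1.1 V
# p_slow comes out at roughly half of p_fast
```

Same total clock throughput, roughly half the dynamic power for the wide/slow configuration, which is exactly the datacenter argument above.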


Higher core count means more total cache and fewer context switches. If your workload actually has 32 threads, then the 32-core processor will probably be faster overall than a processor with sixteen of the same cores running at twice the clock speed.


It also means memory latency costs fewer clock cycles: a core clocked at half the speed stalls for half as many cycles on the same DRAM access.


1.4 GHz is just the base clock, designed for power efficiency when the CPU isn't fully loaded. The turbo clock is 2.8 GHz.

Intel does something similar: the E5-2650 v4 runs at 2.2 GHz base and up to 2.9 GHz turbo.


On at least some Intel chips, the turbo clock only applies to a single core. Do you see any indication that all 32 of these cores can run at 2.8 GHz? And if so, sustain it for any length of time?


Intel's Turbo Boost isn't binary. CPUs typically have a base clock, a maximum turbo frequency that may only be attainable when using a single core, and numerous intermediate states depending on how many cores are active, including an all-cores turbo frequency that may be usable only for short bursts or may be indefinitely sustainable given a sufficiently lightweight instruction stream (i.e. not much AVX).

As an example, the 22-core Intel Xeon E5-2696 v4 has a base clock of 2.2 GHz. With one or two cores active, it can turbo up to 3.7 GHz. With three cores active, the maximum is 3.5 GHz, and it decreases by 100 MHz per active core until ten cores are active. With 10 or more active cores, the limit is 2.8 GHz, provided that the chip is still within its power and thermal limits.
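Those bins can be written down as a small lookup; the numbers below are just the ones quoted above for the E5-2696 v4 (the actual achievable clock also depends on power and thermal headroom):

```python
def max_turbo_ghz(active_cores):
    """Max turbo bin for the 22-core Xeon E5-2696 v4, per the figures above."""
    if not 1 <= active_cores <= 22:
        raise ValueError("E5-2696 v4 has 22 cores")
    if active_cores <= 2:
        return 3.7
    if active_cores >= 10:
        return 2.8  # all-core turbo, power/thermals permitting
    # 3.5 GHz at 3 active cores, minus 100 MHz per additional active core
    return round(3.5 - 0.1 * (active_cores - 3), 1)
```

So even fully loaded, this part runs well above its 2.2 GHz base, which is why base clock alone is a poor comparison point.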


I wonder, is there any way to control this behaviour from software? Or via microcode/BIOS update?
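For what it's worth, on Linux the cpufreq sysfs interface exposes some of this: a scaling governor and min/max frequency caps per core, plus intel_pstate's global turbo toggle. A read-only sketch (these are the standard sysfs paths; they simply won't exist on other platforms, and changing them requires root):

```python
from pathlib import Path

def read_cpufreq(cpu=0):
    """Read the Linux cpufreq knobs for one core (empty dict elsewhere)."""
    base = Path(f"/sys/devices/system/cpu/cpu{cpu}/cpufreq")
    out = {}
    for name in ("scaling_governor", "scaling_min_freq", "scaling_max_freq"):
        f = base / name
        if f.exists():
            out[name] = f.read_text().strip()
    # intel_pstate exposes a global turbo toggle ("1" means turbo disabled);
    # writing to these files as root is how tools like cpupower adjust them.
    no_turbo = Path("/sys/devices/system/cpu/intel_pstate/no_turbo")
    if no_turbo.exists():
        out["no_turbo"] = no_turbo.read_text().strip()
    return out
```

The per-core turbo bin table itself lives in model-specific registers, so changing the bins (rather than just capping frequency) is BIOS/microcode territory.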


Many virtualised workloads have many VMs with light CPU utilisation on each. This kind of core configuration lets lots of VMs run low-intensity background tasks at the same time.

That makes it ideal for things like VDI and web/cloud hosting, where the number of VMs is very high but the load from each typically is not.


> Many virtualised workloads have many VMs with light CPU utilisation on each

Which is why every virtualization platform out there lets you oversubscribe CPUs. That's a solved problem. What's the benefit of having 100 VMs run on 32 slow cores vs. 16 fast ones?


Only for a weak definition of "solved".

As mentioned above, context switching is expensive and extra L1 cache is valuable. Time-sharing can also have a huge effect on latency (because requests must wait until their server is scheduled), even when the throughput is still good.

Even if time-sharing performs well most of the time, when it goes wrong the performance problems can be opaque and hard to debug. In general, a solution that "really" does something will save engineer-days compared to one that achieves the same price/performance trade-off only virtually.


Someone else mentioned the costs of context switching.


I haven't seen anywhere but wccftech give the 1.4 GHz number, though.



