That's part of it, but doesn't fully answer the question. Decoding and issuing 6...

Symmetry · on Oct 5, 2018

Lower maximum clock speeds mean you have more FO4s to play with, which potentially makes the fan-out issues in wide designs a bit more manageable. I expect decode to be pretty easy: as long as your branch predictor is on target, the power costs just grow linearly with decode width when all your instructions start at nice 32-bit boundaries.

Mostly I'm curious about how complete the bypass network is on their functional units and if execution is clustered like the POWER8. The width doubling in the A series does remind me of the POWER7 to POWER8 transition.

Renaming is also apparently a major constraint on design width in many cases but I'm not so familiar with that.

aw1621107 · on Oct 5, 2018

> more FO4s to play with

What does "FO4" stand for here? Googling it yields "Fallout 4", which definitely isn't right, and I'm not sure what other keywords to tack on to get the right result.

Symmetry · on Oct 5, 2018

Sorry, that's "fan-out of 4". Traditionally you look at circuit timings in terms of how long it takes one transistor to switch 4 other transistors of equal size. Wire capacitance is a lot more important these days, so it's not necessarily the best metric anymore, but it's still used. The fewer FO4s of delay in your longest pipeline stage, the faster you can clock your chip, though there's also a non-linear dependence on voltage. Because of that non-linearity I'd still expect a lower-clocked chip to have more complex pipeline stages. And you can only increase your speed by slimming down stages so much: the overhead of latches and accounting for clock jitter generally adds 4 FO4s beyond the useful logic you accomplish in a pipeline stage.
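
To put rough numbers on the comment above, here is a minimal sketch of the arithmetic: cycle time is (useful logic FO4s + overhead FO4s) times the delay of one FO4. The function name and the 10 ps FO4 delay are assumptions made up for illustration, not figures for any real process, and the sketch ignores the voltage dependence and wire delay mentioned above.

```python
def max_clock_ghz(logic_fo4, fo4_delay_ps, overhead_fo4=4.0):
    """Rough upper bound on clock frequency, in GHz.

    logic_fo4:    FO4s of useful logic in the longest pipeline stage
    fo4_delay_ps: delay of one FO4 in picoseconds (hypothetical value)
    overhead_fo4: latch and clock-jitter overhead per stage (4 FO4s, per the comment)
    """
    cycle_ps = (logic_fo4 + overhead_fo4) * fo4_delay_ps
    return 1000.0 / cycle_ps  # a 1000 ps cycle corresponds to 1 GHz


FO4_PS = 10.0  # assumed FO4 delay, for illustration only

for logic in (36, 16, 6):
    useful = logic / (logic + 4.0)  # share of each cycle spent on real work
    print(logic, max_clock_ghz(logic, FO4_PS), round(useful, 2))
```

With these assumed numbers, cutting the logic per stage from 36 to 16 FO4s doubles the clock, but the fixed 4 FO4s of overhead take up a growing share of every cycle, which is why slimming down stages only helps so much.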

aw1621107 · on Oct 5, 2018

Excellent explanation! Quick follow-up question: Why FO4 specifically, and not some other number/metric? Was (is?) that a particularly common structure in CPUs?

Symmetry · on Oct 5, 2018

I'm afraid you'll have to ask someone with a much grayer beard than mine to get a good answer to that.

Cyph0n · on Oct 5, 2018

As someone who is familiar with the term, great explanation!

mikeyouse · on Oct 5, 2018

From this rando PDF, it looks like "Fan-out of 4".

> Fan-out of 4 is a process-independent delay metric in CMOS tech.

The wrap-up slide at the very end has a bit of a rundown.

http://people.duke.edu/~bcl15/teachdir/ece590_fall14/Present...

amluto · on Oct 5, 2018

One respect in which I can imagine the x86 ISA being a real problem is in decode bandwidth. To issue 6 x86 instructions per cycle, either the front end needs to decode 6 per cycle, or it needs to cache decoded instructions. And x86 can’t be decoded in parallel without massive complexity because the instructions are variable length, and even determining the length requires mostly decoding the instruction.
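
A toy sketch of why that matters. With fixed-width instructions every start address is known up front, so any number of decoders can work independently. With variable-length instructions each start depends on the length of the one before it. The encoding below (first byte of each instruction holds its total length) is a made-up stand-in, far simpler than real x86, and both function names are hypothetical.

```python
def fixed_width_starts(code, width=4):
    # Every start is just i * width: no instruction depends on another,
    # so all of them could be handed to parallel decoders at once.
    return list(range(0, len(code), width))


def variable_length_starts(code):
    # Toy encoding (an assumption, not x86): the first byte of each
    # instruction is its total length in bytes.
    starts = []
    pc = 0
    while pc < len(code):
        starts.append(pc)
        length = code[pc]
        if length == 0:
            raise ValueError("zero-length instruction in toy encoding")
        pc += length  # the next start is unknown until this length is read
    return starts


fixed = bytes(12)  # three 4-byte instructions
variable = bytes([2, 0xAA, 3, 0xBB, 0xCC, 1, 4, 0, 0, 0])

print(fixed_width_starts(fixed))
print(variable_length_starts(variable))
```

The loop in `variable_length_starts` is a serial chain. Real hardware can break it by speculatively decoding at every byte offset and discarding the wrong guesses, or by caching boundaries or decoded uops, but that extra work is where the complexity comes from.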

rayiner · on Oct 5, 2018

It's true that decoding x86 is harder, but Sandy Bridge and later get most instructions from a uop cache, which delivers 4 fixed-length uops per cycle. You could make that 6 wide, but Intel doesn't because they wouldn't be able to fill that.

pcwalton · on Oct 5, 2018

AArch64 has a larger register file and fewer dependencies in general than x86-64 does. For example, most instructions don't set flags. I don't know for sure, but that might be enough to raise the ILP sufficiently.

AceJohnny2 · on Oct 5, 2018

> and optimized the compiler to actually use that IPC is the really interesting question.

You can check that part at least. Isn't it LLVM?

wintercharm · on Oct 5, 2018

They also have a 192-instruction reorder buffer, and a really well-optimized OS.