
That's part of it, but doesn't fully answer the question. Decoding and issuing 6 instructions per cycle ordinarily is extremely costly in terms of power. And it's usually very hard to keep those execution units busy--it's hard to find six independent instructions to issue every clock cycle. How Apple built a 6-wide CPU within that power envelope, and optimized the compiler to actually use that IPC is the really interesting question.

On a Xeon v3 core, SPECint averages below 2 instructions per cycle: https://www.researchgate.net/publication/322745869_A_Workloa.... How does Apple beat Intel on branchy integer benchmarks like 403.gcc by a factor of two per clock?



Lower maximum clock speeds mean you have more FO4s to play with, which potentially makes the fan-out issues in wide designs a bit more manageable. I expect decode to be pretty easy: as long as your branch predictor is on target, the power cost just grows linearly with decode width when all your instructions start at nice 32-bit boundaries.

Mostly I'm curious about how complete the bypass network is on their functional units and whether execution is clustered like the POWER8. The width doubling in the A series does remind me of the POWER7 to POWER8 transition.

Renaming is also apparently a major constraint on design width in many cases but I'm not so familiar with that.


> more FO4s to play with

What does "FO4" stand for here? Googling it yields "Fallout 4", which definitely isn't right, and I'm not sure what other keywords to tack on to get the right result.


Sorry, that's "fan-out of 4". Traditionally you look at circuit timings in terms of how long it takes one inverter to switch 4 other inverters of equal size. Wire capacitance is a lot more important these days, so it's not necessarily the best metric anymore, but it's still used. The fewer FO4s of delay in your longest pipeline stage, the faster you can clock your chip, though there's also a non-linear dependence on voltage. Because of that non-linearity I'd still expect a lower-clocked chip to have more complex pipeline stages. And you can only increase your speed by slimming down stages so much: the overhead of latches and accounting for clock jitter generally adds 4 FO4s beyond the useful logic you accomplish in a pipeline stage.
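To put rough numbers on that, here's a back-of-the-envelope sketch in Python. The 4 FO4s of overhead is the figure mentioned above; the 10 ps per FO4 is purely an assumed placeholder (the real value depends on the process and voltage), and the function name is made up for illustration. It also ignores the non-linear voltage dependence entirely.

```python
def max_clock_ghz(logic_fo4, overhead_fo4=4.0, fo4_delay_ps=10.0):
    """Rough clock estimate for a pipeline whose longest stage has
    `logic_fo4` FO4s of useful logic plus `overhead_fo4` FO4s of
    latch/jitter overhead. fo4_delay_ps is an assumed placeholder."""
    cycle_ps = (logic_fo4 + overhead_fo4) * fo4_delay_ps
    return 1000.0 / cycle_ps  # a 1000 ps cycle corresponds to 1 GHz


# Halving the useful logic per stage does not double the clock,
# because the fixed overhead stays the same.
for logic in (8, 16, 32):
    print(logic, "FO4 of logic ->", round(max_clock_ghz(logic), 2), "GHz")
```

With these made-up numbers, going from 16 to 8 FO4s of logic per stage only takes the cycle from 200 ps to 120 ps, which is the diminishing return from slimming stages down.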


Excellent explanation! Quick follow-up question: Why FO4 specifically, and not some other number/metric? Was (is?) that a particularly common structure in CPUs?


I'm afraid you'll have to ask someone with a much grayer beard than mine to get a good answer to that.


As someone who is familiar with the term, great explanation!


From this rando PDF, it looks like "Fan-out of 4".

> Fan-out of 4 is a process-independent delay metric in CMOS tech.

The wrap-up slide at the very end has a bit of a rundown.

http://people.duke.edu/~bcl15/teachdir/ece590_fall14/Present...


One respect in which I can imagine the x86 ISA being a real problem is decode bandwidth. To issue 6 x86 instructions per cycle, either the front end needs to decode 6 per cycle, or it needs to cache decoded instructions. And x86 can’t be decoded in parallel without massive complexity because the instructions are variable length, and even determining the length requires mostly decoding the instruction.
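A toy sketch of why that matters, in Python. The variable-length encoding here is invented for illustration (low two bits of the first byte give the length, 1 to 4 bytes); it is not x86, and real decoders use tricks like speculative length decoding that this ignores. The point is only that with fixed-width instructions every start offset is known up front, while with variable lengths each start depends on the previous instruction's length.

```python
def fixed_width_starts(code, width=4):
    """Start offsets for fixed-width instructions: known without
    looking at the bytes, so decode slots are independent."""
    return list(range(0, len(code), width))


def toy_length(code, pos):
    """Hypothetical toy encoding (NOT x86): the low two bits of the
    first byte give the instruction length, 1 to 4 bytes."""
    return (code[pos] & 0b11) + 1


def variable_length_starts(code, length_of=toy_length):
    """Start offsets for variable-length instructions: each one
    depends on the length of the instruction before it."""
    starts = []
    pos = 0
    while pos < len(code):
        starts.append(pos)
        pos += length_of(code, pos)  # serial dependency on prior length
    return starts


code = bytes([0x02, 0xAA, 0xBB, 0x00, 0x01, 0xCC, 0x03, 0x01, 0x02, 0x03])
print(fixed_width_starts(bytes(12)))   # offsets computed from width alone
print(variable_length_starts(code))    # offsets found by walking the bytes
```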


It's true that decoding x86 is harder, but Sandy Bridge+ get most instructions from a uop cache, which delivers 4 fixed-length uops per cycle. You could make that 6 wide, but Intel doesn't because they wouldn't be able to fill that.


AArch64 has a larger register file and fewer dependencies in general than x86-64 does. For example, most instructions don't set flags. I don't know for sure, but that might be enough to raise the ILP sufficiently.


> and optimized the compiler to actually use that IPC is the really interesting question.

You can check that part at least. Isn't it LLVM?


They also have a 192 instruction reorder buffer, and a really well optimized OS.



