Wsrtjtyk

Question

I'm wondering how loops of various sizes perform on recent x86 processors, as a function of number of uops.

Here's a quote from Peter Cordes who raised the issue of non-multiple-of-4 counts in another question:

I also found that the uop bandwidth out of the loop buffer isn't a
constant 4 per cycle, if the loop isn't a multiple of 4 uops. (i.e.
it's abc, abc, ...; not abca, bcab, ...). Agner Fog's microarch doc
unfortunately wasn't clear on this limitation of the loop buffer.

The issue is about whether loops need to be a multiple of N uops to execute at maximum uop throughput, where N is the width of the processor. (i.e., 4 for recent Intel processors). There are a lot of complicating factors when talking about "width" and count uops, but I mostly want to ignore those. In particular, assume no micro or macro-fusion.

Peter gives the following example of a loop with 7 uops in its body:

A 7-uop loop will issue groups of 4|3|4|3|... I haven't tested larger
loops (that don't fit in the loop buffer) to see if it's possible for
the first instruction from the next iteration to issue in the same
group as the taken-branch to it, but I assume not.

More generally, the claim is that each iteration of a loop with x uops in its body will take at least ceil(x / 4) iterations, rather than simply x / 4.

Is this true for some or all recent x86-compatible processors?

There's also the question of whether a taken branch within a loop body can also produce an issue group of less than 4 uops. — Sep 3 '16 at 22:39
@dwelch: To microbenchmark this, you simply write a loop with 2 NOPs vs. a loop with 3 NOPs (plus a non-macro-fused dec/jnz). The total cycles should double when you go from 4 uops in the loop to 5. Or just independent reg-reg ALU ops like ADD or OR, instead of NOP. Or were you talking about instruction fetch? The whole point of this experiment is to test the loop buffer in modern Intel CPUs, which, for tiny loops, recycles the contents of the queue between the rest of the frontend and the issue stage, using it as a loop buffer. So L1I and L0uop caches untouched. — Sep 4 '16 at 4:56
@dwelch Your comments here are completely unhelpful. This is a real question from someone who understands the complexity. Go read the Skylake section in Agner Fog's microarch pdf before making any more wrong guesses about why this effect might be hard to measure or alignment dependent. It's known more or less exactly how SnB-family microarchitectures shortcut NOPs, issuing them but not needing to dispatch them to an execution unit. (Still, it is something to double-check, and best avoided when possible). — Sep 4 '16 at 18:59
@dwelch: I don't have a SKL, IDK why BeeOnRope doesn't just test it. BTW you're totally wrong about L1I being involved here. The whole point of the loop buffer is that it caches already-decoded instructions, and can just feed them to issue stage 4 uops at a time, without touching L1I$ or even the main L0 uop-cache. Overhead from OS interrupts is fractions of a %, and this effect produces easily-measurable differences of 25% to 100% in the total cycle count to run 100M iterations. I've done this on my SnB hardware, but it's broken ATM so I can't re-run the experiment myself. — Sep 5 '16 at 13:01
There are a lot of complicating factors I'm not sure you can ignore these complicating factors if you're optimizing on such a low level. And when you have it right for one CPU another comes out with different alignment of optimal factors. — Sep 5 '16 at 15:35

score 22 · Accepted Answer · 2018-11-05 22:31:03Z

I did some investigation with Linux perf to help answer this on my Skylake i7-6700HQ box, and Haswell results have been kindly provided by another user. The analysis below applies to Skylake, but it is followed by a comparison versus Haswell.

Other architectures may vary⁰ -and I welcome additional results - the source is available).

This question mostly deals with the front end, since on recent architectures it is the front end which imposes the hard limit of four fused-domain uops per cycle.

Summary of Rules for Loop Performance

First, I'll summarize the results in terms of a few "performance rules" to keep in mind when dealing with small loops. There are plenty of other performance rules as well - these are complementary to them (i.e., you probably don't break another rule to just to satisfy these ones). These rules apply most directly to Haswell and later architectures - see the other answer for an overview of the differences on earlier architectures.

First, count the number of macro-fused uops in your loop. You can use Agner's instruction tables to look this up directly for every instruction, except that an ALU uop and immediately follow branch will usually fuse together into a single uop. Then based on this count:

If the count is a multiple of 4, you're good: these loops execute optimally.

If the count is even and less than 32, you're good, except if it's 10 in which case you should unroll to another even number if you can.

For odd numbers you should try to unroll to an even number less than 32 or a multiple of 4, if you can.

For loops larger than 32 uops but less than 64, you might want to unroll if it isn't already a multiple of 4: with more than 64 uops you'll get efficient performance at any value on Sklyake and almost all values on Haswell (with a few deviations, possibly alignment related). The inefficiencies for these loops are still relatively small: the values to avoid most are 4N + 1 counts, followed by 4N + 2 counts.

Summary of Findings

For code served out of the uop cache, there are no apparent multiple-of-4 effects. Loops of any number of uops can be executed at a throughput of 4 fused-domain uops per cycle.

For code processed by the legacy decoders, the opposite is true: loop execution time is limited to integral number of cycles, and hence loops that are not a multiple of 4 uops cannot achieve 4 uops/cycle, as they waste some issue/execution slots.

For code issued from the loop stream detector (LSD), the situation is a mix of the two and is explained in more detail below. In general, loops less than 32 uops and with an even number of uops execute optimally, while odd-sized loops do not, and larger loops require a multiple-of-4 uop count to execute optimally.

What Intel Says

Intel actually has a note on this in their optimization manual, details in the other answer.

Details

As anyone well-versed recent x86-64 architectures knows, at any point the fetch and decode portion of the front end may be working in one several different modes, depending on the code size and other factors. As it turns out, these different modes all have different behaviors with respect to loop sizing. I'll cover them separately follow.

Legacy Decoder

The legacy decoder¹ is the full machine-code-to-uops decoder that is used² when the code doesn't fit in the uop caching mechanisms (LSD or DSB). The primary reason this would occur is if the code working set is larger than the uop cache (approximately ~1500 uops in the ideal case, less in practice). For this test though, we'll take advantage of the fact that the legacy decoder will also be used if an aligned 32-byte chunk contains more than 18 instructions³.

To test the legacy decoder behavior, we use a loop that looks like this:

short_nop:

    mov rax, 100_000_000

ALIGN 32

.top:

    dec rax

    nop

    ...

    jnz .top

    ret

Basically, a trivial loop that counts down until rax is zero. All instructions are a single uop⁴ and the number of nop instructions is varied (at the location shown as ...) to test different sizes of loops (so a 4-uop loop will have 2 nops, plus the two loop control instructions). There is no macro-fusion as we always separate the dec and jnz with at least one nop, and also no micro-fusion. Finally, there is no memory access at (outside of the implied icache access).

Note that this loop is very dense - about 1 byte per instruction (since the nop instructions are 1 byte each) - so we'll trigger the > 18 instructions in a 32B chunk condition as soon as hit 19 instructions in the loop. Based on examining the perf performance counters lsd.uops and idq.mite_uops that's exactly what we see: essentially 100% of the instructions come out of the LSD⁵ up until and including the 18 uop loop, but at 19 uops and up, 100% come from the legacy decoder.

In any case, here are the cycles/iteration for all loop sizes from 3 to 99 uops⁶:

Cyles/iteration for loops with given size

The blue points are the loops that fit in the LSD, and show somewhat complex behavior. We'll look at these later.

The red points (starting at 19 uops/iteration), are handled by the legacy decoder, and show a very predictable pattern:

All loops with N uops take exactly ceiling(N/4) iterations

So, for the legacy decoder at least, Peter's observation holds exactly on Skylake: loops with a multiple of 4 uops may execute at an IPC of 4, but any other number of uops will waste 1, 2 or 3 execution slots (for loops with 4N+3, 4N+2, 4N+1 instructions, respectively).

It is not clear to me why this happens. Although it may seem obvious if you consider that decoding happens in contiguous 16B chunks, and so at a decoding rate of 4 uops/cycle loops not a multiple of 4 would always have some trailing (wasted) slots in the cycle the jnz instruction is encountered. However, the actual fetch & decode unit is composed of predecode and decode phases, with a queue in-between. The predecode phase actually has a throughput of 6 instructions, but only decodes to the end of the 16-byte boundary on each cycle. This seems to imply that the bubble that occurs at the end of the loop could be absorbed by the predecoder -> decode queue since the predecoder has an average throughput higher than 4.

So I can't fully explain this based on my understanding of how the predecoder works. It may be that there is some additional limitation in decoding or pre-decoding that prevents non-integral cycle counts. For example, perhaps the legacy decoders cannot decode instructions on both sides of a jump even if the instructions after the jump are available in the predecoded queue. Perhaps it is related to the need to handle macro-fusion.

The test above shows the behavior where the top of the loop is aligned on a 32-byte boundary. Below is the same graph, but with an added series that shows the effect when the top of loop is moved 2 bytes up (i.e, now misaligned at a 32N + 30 boundary):

Legacy decoder cycles/iteration when misaligned

Most loop sizes now suffer a 1 or 2 cycle penalty. The 1 penalty case makes sense when you consider decoding 16B boundaries and 4-instructions per cycle decoding, and the 2 cycle penalty cases occurs for loops where for some reason the DSB is used for 1 instruction in the loop (probably the dec instruction which appears in its own 32-byte chunk), and some DSB<->MITE switching penalties are incurred.

In some cases, the misalignment doesn't hurt when it ends up better aligning the end of the loop. I tested the misalignment and it persists in the same way up to 200 uop loops. If you take the description of the predecoders at face value, it would seem that, as above, they should be able to hide a fetch bubble for misalignment, but it doesn't happen (perhaps the queue is not big enough).

DSB (Uop Cache)

The uop cache (Intel likes to call it the DSB) is able to cache most loops of moderate amount of instructions. In a typical program, you'd hope that most of your instructions are served out of this cache⁷.

We can repeat the test above, but now serving uops out of the uop cache. This is a simple matter of increasing the size of our nops to 2 bytes, so we no longer hit the 18-instruction limit. We use the 2-byte nop xchg ax, ax in our loop:

long_nop_test:

    mov rax, iters

ALIGN 32

.top:

    dec eax

    xchg ax, ax  ; this is a 2-byte nop

    ...

    xchg ax, ax

    jnz .top

    ret

Here, there results are very straightforward. For all tested loop sizes delivered out of the DSB, the number of cycles required was N/4 - i.e., the loops executed at the maximum theoretical throughput, even if they didn't have a multiple of 4 uops. So in general, on Skylake, moderately sized loops served out of the DSB shouldn't need to worry about ensuring the uop count meets some particular multiple.

Here's a graph out to 1,000 uop loops. If you squint, you can see the sub-optimal behavior before 64-uops (when the loop is in the LSD). After that, it's a straight shot, 4 IPC the whole way to 1,000 uops (with a blip around 900 that was probably due to load on my box):

Cycle counts for loops served out of the DSB

Next we look at performance for loops that are small enough to fit in the uop cache.

LSD (Loop steam detector)

Important note: Intel has apparently disabled the LSD on Skylake (SKL150 erratum) and Kaby Lake (KBL095, KBW095 erratum) chips via a microcode update and on Skylake-X out of the box, due to a bug related to the interaction between hyperthreading and the LSD. For those chips, the graph below will likely not have the interesting region up to 64 uops; rather, it will just look the same as the region after 64 uops.

The loop stream detector can cache small loops of up to 64 uops (on Skylake). In Intel's recent documentation it is positioned more as a power-saving mechanism than a performance feature - although there are certainly no performance downsides mentioned to using the LSD.

Running this for the loop sizes that should fit in the LSD, we get the following cycles/iteration behavior:

Cycles per Iteration for LSD-resident loops

The red line here is the % of uops which are delivered from the LSD. It flatlines at 100% for all loop sizes from 5 to 56 uops.

For the 3 and 4 uop loops, we have the unusual behavior that 16% and 25% of the uops, respectively, are delivered from the legacy decoder. Huh? Luckily, it doesn't seem to affect the loop throughput as both cases achieve the maximum throughput of 1 loop/cycle - despite the fact that one could expect some MITE<->LSD transition penalties.

Between loop sizes of 57 and 62 uops, the number of uops delivered from LSD exhibits some weird behavior - approximately 70% of the uops are delivered from the LSD, and the rest from the DSB. Skylake nominally has a 64-uop LSD, so this is some kind of transition right before the LSD size is exceeded - perhaps there is some kind of internal alignment within the IDQ (on which the LSD is implemented) that causes only partial hits to the LSD in this phase. This phase is short and, performance-wise, seems mostly to be a linear combination of the full-in-LSD performance which precedes it, and the fully-in-DSB performance which follows it.

Let's look at the main body of results between 5 and 56 uops. We see three distinct regions:

Loops from 3 to 10 uops: Here, the behavior is complex. It is the only region where we see cycle counts that can't be explained by static behavior over a single loop iteration⁸. The range is short enough that it's hard to say if there is a pattern. Loops of 4, 6 and 8 uops all execute optimally, in N/4 cycles (that's the same pattern as the next region).

A loop of 10 uops, on the other hand, executes in 2.66 cycles per iteration, making it the only even loop size that doesn't execute optimally until you get to loop sizes of 34 uops or above (other than the outlier at 26). That corresponds to something like a repeated uop/cycle execution rate of 4, 4, 4, 3. For a loop of 5 uops, you get 1.33 cycles per iteration, very close but not the same as the ideal of 1.25. That corresponds to an execution rate of 4, 4, 4, 4, 3.

These results are hard to explain. The results are repeatable from run to run, and robust to changes such as swapping out the nop for an instruction that actually does something like mov ecx, 123. It might be something to do with the limit of 1 taken branch every 2 cycles, which applies to all loops except those that are "very small". It might be that the uops occasionally line up such that this limitation kicks in, leading to an extra cycle. Once you get to 12 uops or above, this never occurs since you are always taking at least three cycles per iteration.

Loops from 11 to 32-uops: We see a stair-step pattern, but with a period of two. Basically all loops with an even number of uops perform optimally - i.e., taking exactly N/4 cycles. Loops with odd number of uops waste one "issue slot", and take the same number of cycles as a loop with one more uops (i.e., a 17 uop loop takes the same 4.5 cycles as an 18 uop loop). So here we have behavior better than ceiling(N/4) for many uop counts, and we have the first evidence that Skylake at least can execute loops in a non-integral number of cycles.

The only outliers are N=25 and N=26, which both take about 1.5% longer than expected. It's small but reproducible, and robust to moving the function around in the file. That's too small to be explained by a per-iteration effect, unless it has a giant period, so it's probably something else.

The overall behavior here is exactly consistent (outside of the 25/26 anomaly) with the hardware unrolling the loop by a factor of 2.

Loops from 33 to ~64 uops: We see a stair-step pattern again, but with a period of 4, and worse average performance than the up-to 32 uop case. The behavior is exactly ceiling(N/4) - that is, the same as the legacy decoder case. So for loops of 32 to 64 uops, the LSD provides no apparent benefit over the legacy decoders, in terms of front end throughput for this particular limitation. Of course, there are many other ways the LSD is better - it avoids many of the potential decoding bottlenecks that occur for more complex or longer instructions, and it saves power, etc.

All of this is quite surprising, because it means that loops delivered from the uop cache generally perform better in the front end than loops delivered from the LSD, despite the LSD usually being positioned as a strictly better source of uops than the DSB (e.g., as part of advice to try to keep loops small enough to fit in the LSD).

Here's another way to look at the same data - in terms of the efficiency loss for a given uop count, versus the theoretical maximum throughput of 4 uops per cycle. A 10% efficiency hit means you only have 90% of the throughput that you'd calculate from the simple N/4 formula.

The overall behavior here is consistent with the hardware not doing any unrolling, which makes sense since a loop of more than 32 uops cannot be unrolled at all in a buffer of 64 uops.

Efficiency Loss by Loop Size

The three regions discussed above are colored differently, and at least competing effects are visible:

Everything else being equal, the larger the number of uops involved, the lower the efficiency hit. The hit is a fixed cost only once per iteration, so larger loops pay a smaller relative cost.

There is a large jump in inefficiency when you cross to into the 33+ uop region: both the size of the throughput loss increases, and the number of affected uop counts doubles.

The first region is somewhat chaotic, and 7 uops is the worst overall uop count.

Alignment

The DSB and LSD analysis above is for loop entries aligned to a 32-byte boundary, but the unaligned case doesn't seem to suffer in either case: there isn't a material difference from the aligned case (other than perhaps some small variation for less than 10 uops that I didn't investigate further).

Here's the unaligned results for 32N-2 and 32N+2 (i.e., the loop top 2 bytes before and after the 32B boundary):

Misaligned Cycles per Iteration

The ideal N/4 line is also shown for reference.

Haswell

Next next take a look at the prior microarchitecture: Haswell. The numbers here have been graciously provided by user Iwillnotexist Idonotexist.

LSD + Legacy Decode Pipeline

First, the results from the "dense code" test which tests the LSD (for small uop counts) and the legacy pipeline (for larger uop counts, since the loop "busts out" of the DSB due to instruction density.

Immediately we see a difference already in terms of when each architecture delivers uops from the LSD for a dense loop. Below we compare Skylake and Haswell for short loops of dense code (1 byte per instruction).

Haswell vs Skylake LSD Delivery %

As described above, the Skylake loop stops being delivered from the LSD at exactly 19 uops, as expected from the 18-uop per 32-byte region of code limit. Haswell, on the other hand, seems to stop delivering reliably from the LSD for the 16-uop and 17-uop loops as well. I don't have any explanation for this. There is also a difference in the 3-uop case: oddly both processors only deliver some of the their uops out of the LSD in the 3 and 4 uop cases, but the exact amount is the same for 4 uops, and different from 3.

We mostly care about the actual performance though, right? So let's look at the cycles/iteration for the 32-byte aligned dense code case:

Haswell vs Skylake LSD + Legacy Pipeline

This is the same data as show above for Skylake (the misaligned series has been removed), with Haswell plotted alongside. Immediately you notice that the pattern is similar for Haswell, but not the same. As above, there are two regions here:

Legacy Decode

The loops larger than ~16-18 uops (the uncertainty is described above) are delivered from the legacy decoders. The pattern for Haswell is somewhat different from Skylake.

For the range from 19-30 uops they are identical, but after that Haswell breaks the pattern. Skylake took ceil(N/4) cycles for loops delivered from the legacy decoders. Haswell, on the other hand, seems to take something like ceil((N+1)/4) + ceil((N+2)/12) - ceil((N+1)/12). OK, that's messy (shorter form, anyone?) - but basically it means that while Skylake executes loops with 4*N cycles optimally (i.e,. at 4-uops/cycle), such loops are (locally) usually the least optimal count (at least locally) - it takes one more cycle to execute such loops than Skylake. So you are actually best off with loops of 4N-1 uops on Haswell, except that the 25% of such loops that are also of the form 16-1N (31, 47, 63, etc) take one additional cycle. It's starting to sound like a leap year calculation - but the pattern is probably best understood visually above.

I don't think this pattern is intrinsic to uop dispatch on Haswell, so we shouldn't read to much into it. It seems to be explained by

0000000000455a80 <short_nop_aligned35.top>:

16B cycle

  1     1 455a80:       ff c8   dec    eax

  1     1 455a82:       90      nop

  1     1 455a83:       90      nop

  1     1 455a84:       90      nop

  1     2 455a85:       90      nop

  1     2 455a86:       90      nop

  1     2 455a87:       90      nop

  1     2 455a88:       90      nop

  1     3 455a89:       90      nop

  1     3 455a8a:       90      nop

  1     3 455a8b:       90      nop

  1     3 455a8c:       90      nop

  1     4 455a8d:       90      nop

  1     4 455a8e:       90      nop

  1     4 455a8f:       90      nop

  2     5 455a90:       90      nop

  2     5 455a91:       90      nop

  2     5 455a92:       90      nop

  2     5 455a93:       90      nop

  2     6 455a94:       90      nop

  2     6 455a95:       90      nop

  2     6 455a96:       90      nop

  2     6 455a97:       90      nop

  2     7 455a98:       90      nop

  2     7 455a99:       90      nop

  2     7 455a9a:       90      nop

  2     7 455a9b:       90      nop

  2     8 455a9c:       90      nop

  2     8 455a9d:       90      nop

  2     8 455a9e:       90      nop

  2     8 455a9f:       90      nop

  3     9 455aa0:       90      nop

  3     9 455aa1:       90      nop

  3     9 455aa2:       90      nop

  3     9 455aa3:       75 db   jne    455a80 <short_nop_aligned35.top>

Here I've noted the 16B decode chunk (1-3) each instruction appears in, and the cycle in which it will be decoded. The rule is basically that up to the next 4 instructions are decoded, as long as they fall in the current 16B chunk. Otherwise they have to wait until the next cycle. For N=35, we see that there is a loss of 1 decode slot in cycle 4 (only 3 instruction are left in the 16B chunk), but that otherwise the loop lines up very well with the 16B boundaries and even the last cycle (9) can decode 4 instructions.

Here's a truncated look at N=36, which is identical except for the end of the loop:

0000000000455b20 <short_nop_aligned36.top>:

16B cycle

  1     1 455a80:       ff c8   dec    eax

  1     1 455b20:       ff c8   dec    eax

  1     1 455b22:       90      nop

  ... [29 lines omitted] ...

  2     8 455b3f:       90      nop

  3     9 455b40:       90      nop

  3     9 455b41:       90      nop

  3     9 455b42:       90      nop

  3     9 455b43:       90      nop

  3    10 455b44:       75 da   jne    455b20 <short_nop_aligned36.top>

There are now 5 instructions to decode in the 3rd and final 16B chunk, so one additional cycle is needed. Basically 35 instructions, for this particular pattern of instructions happens to line up better with the 16B bit boundaries and saves one cycle when decoding. This doesn't mean that N=35 is better than N=36 in general! Different instructions will have different numbers of bytes and will line up differently. A similar alignment issue explains also the additional cycle that is required every 16 bytes:

16B cycle

...

  2     7 45581b:       90      nop

  2     8 45581c:       90      nop

  2     8 45581d:       90      nop

  2     8 45581e:       90      nop

  3     8 45581f:       75 df   jne    455800 <short_nop_aligned31.top>

Here the final jne has slipped into the next 16B chunk (if an instruction spans a 16B boundary it is effectively in the latter chunk), causing an extra cycle loss. This occurs only every 16 bytes.

So the Haswell legacy decoder results are explained perfectly by a legacy decoder that behaves as described, for example, in Agner Fog's microarchitecture doc. In fact, it also seems to explain Skylake results if you assume Skylake can decode 5 instructions per cycle (delivering up to 5 uops)⁹. Assuming it can, the asymptotic legacy decode throughput on this code for Skylake is still 4-uops, since a block of 16 nops decodes 5-5-5-1, versus 4-4-4-4 on Haswell, so you only get benefits at the edges: in the N=36 case above, for example, Skylake can decode all remaining 5 instructions, versus 4-1 for Haswell, saving a cycle.

The upshot is that it seems to be that the legacy decoder behavior can be understood in a fairly straightforward manner, and the main optimization advice is to continue to massage code so that it falls "smartly" into the 16B aligned chunks (perhaps that's NP-hard like bin packing?).

DSB (and LSD again)

Next let's take a look at the scenario where the code is served out of the LSD or DSB - by using the "long nop" test which avoids breaking the 18-uop per 32B chunk limit, and so stays in the DSB.

Haswell vs Skylake:

Haswell vs Skylake LSD and DSB

Note the LSD behavior - here Haswell stops serving out of the LSD at exactly 57 uops, which is completely consistent with the published size of the LSD of 57 uops. There is no weird "transition period" like we see on Skylake. Haswell also has the weird behavior for 3 and 4 uops where only ~0% and ~40% of the uops, respectively, come from the LSD.

Performance-wise, Haswell is normally in-line with Skylake with a few deviations, e.g., around 65, 77 and 97 uops where it rounds up to the next cycle, whereas Skylake is always able to sustain 4 uops/cycle even when that's results in a non-integer number of cycles. The slight deviation from expected at 25 and 26 uops has disappeared. Perhaps the 6-uop delivery rate of Skylake helps it avoid uop-cache alignment issues that Haswell suffers with its 4-uop delivery rate.

Other Architectures

Results for the following additional architectures were kindly provided by user Andreas Abel, but we'll have to use another answer for further analysis as we are at the character limit here.

Help Needed

Although results for many platforms have been kindly offered by the community, I'm still interested in results on chips older than Nehalem, and newer than Coffee Lake (in particular, Cannon Lake, which is a new uarch). The code to generate these results is public. Also, the results above are available in .ods format in GitHub as well.

⁰ In particular, the legacy decoder maximum throughput apparently increased from 4 to 5 uops in Skylake, and the maximum throughput for the uop cache increased from 4 to 6. Both of those could impact the results described here.

¹ Intel actually like to call the legacy decoder the MITE (Micro-instruction Translation Engine), perhaps because it's a faux-pas to actually tag any part of your architecture with the legacy connotation.

² Technically there is another, even slower, source of uops - the MS (microcode sequencing engine), which is used to implement any instruction with more than 4 uops, but we ignore this here since none of our loops contain microcoded instructions.

³ This works because any aligned 32-byte chunk can use at most 3-ways in its uop cache slot, and each slot holds up to 6 uops. So if you use more than 3 * 6 = 18 uops in a 32B chunk, the code can't be stored in the uop cache at all. It's probably rare to encounter this condition in practice, since the code needs to be very dense (less than 2 bytes per instruction) to trigger this.

⁴ The nop instructions decode to one uop, but don't are eliminated prior to execution (i.e., they don't use an execution port) - but still take up space in the front end and so count against the various limits that we are interested in.

⁵ The LSD is the loop stream detector, which caches small loops of up to 64 (Skylake) uops directly in the IDQ. On earlier architectures it can hold 28 uops (both logical cores active) or 56 uops (one logical core active).

⁶ We can't easily fit a 2 uop loop in this pattern, since that would mean zero nop instructions, meaning the dec and jnz instructions would macro-fuse, with a corresponding change in the uop count. Just take my word that all loops with 4 or less uops execute at best at 1 cycle/iteration.

⁷ For fun, I just ran perf stat against a short run of Firefox where I opened a tab and clicked around on a few Stack Overflow questions. For instructions delivered, I got 46% from DSB, 50% from legacy decoder and 4% for LSD. This shows that at least for big, branchy code like a browser the DSB still can't capture the large majority of the code (lucky the legacy decoders aren't too bad).

⁸ By this, I mean that all the other cycle counts can be explained by simply by taking an "effective" integral loop cost in uops (which might be higher than the actual size is uops) and dividing by 4. For these very short loops, this doesn't work - you can't get to 1.333 cycles per iteration by dividing any integer by 4. Said another way, in all other regions the costs have the form N/4 for some integer N.

⁹ In fact we know that Skylake can deliver 5 uops per cycle from the legacy decoder, but we don't know if those 5 uops can come from 5 different instructions, or only 4 or less. That is, we expect that Skylake can decode in the pattern 2-1-1-1, but I'm not sure if it can decode in the pattern 1-1-1-1-1. The above results give some evidence that it can indeed decode 1-1-1-1-1.

I see SKL can't manage non-integer cycles per iteration for larger loops, > half LSD size. Perhaps SKL unrolls loops in the LSD until they're a multiple of 4 uops, size permitting? That would be plausible from a design perspective I think, and sounds like a good theory if it's consistent with the data. — Oct 9 '16 at 7:20
@IwillnotexistIdonotexist: perfect, both those links work for me. Hopefully BeeOnRope can get them, too, and turn them into the same kind of graphs. — Oct 28 '16 at 5:40
@IwillnotexistIdonotexist - thanks very much for the Haswell numbers. I uploaded the first chunk of analysis above, covering mostly the legacy decode pipeline. It shed some light on the Skylake behavior actually - the legacy pipeline now seems like a simple case that can be explained (mostly?) by just looking at how the code falls on 16B boundaries, with the additional proviso that Skylake can decode 5 uops/cycle from 5 instructions, versus 4 from 4 in Haswell. — Oct 28 '16 at 21:58
Also, I'll avow myself surprised that <100% of the uops in a 3-uop loop come from the LSD. In my quickie Haswell experiments with libpfc I get ~100%. I suspect that this is because you put the nop(s) between the dec rax and the jne. In the loop nop dec jne, 3 insns/i issue but just 2 uops/i, all served out of LSD, in a pattern 0-4-0-4. In the loop dec nop jne, 3 insns/i issue, 3 uops/i, all served out of LSD, in a pattern 0-4-4-4-0-4-4-4. — Oct 28 '16 at 22:25
I created a pull request on GitHub with results for several microarchitectures, including Sandy Bridge. — Oct 27 at 13:49

score 3 · Answer 2 · 2018-11-05 22:39:03Z

This is a follow-on to the original answer, to analyze the behavior for five additional architectures, based on test results provided by Andreas Abel:

Nehalem

Sandy Bridge

Ivy Bridge

Broadwell

Coffee Lake

We take a quick look at the results on these architectures in addition to Skylake and Haswell. It only needs to be a "quick" look since all the architectures except Nehalem follow one of the existing patterns discussed above.

First, the short nop case which exercises the legacy decoder (for loops that don't fit in the LSD) and the LSD. Here is the cycles/iteration for this scenario, for all 7 architectures.

Figure 2.1: All architectures dense nop performance:

All Architectures Dense Nop Performance

This graph is really busy (click for a larger view) and a bit hard to read since the results for many architectures lie on top of each other, but I tried to ensure that a dedicated reader can track the line for any architecture.

First, let's discuss the big outlier: Nehalem. All of the other architectures have a slope that roughly follows the 4 uops/cycle line, but Nehalem is at almost exactly 3 uops per cycle, so quickly falls behind all of the other architectures. Outside of the initial LSD region, the line is also totally smooth, without the "stair step" appearance seen in the other architectures.

This is entirely consistent with Nehalem having a uop retirement limit of 3 uops/cycle. This is the bottleneck for uops outside of the LSD: they all execute at about exactly 3 uops per cycle, bottlenecked on retirement. The front-end isn't the bottleneck, so the exact uop count and decoding arrangement doens't matter and so the stair-step is absent.

Other than Nehalem, the other architectures, except Broadwell split fairly cleanly into groups: Haswell-like or Skylake-like. That is, all of Sandy Bridge, Ivy Bridge and Haswell behave like Haswell, for loops greater than about 15 uops (Haswell behavior is discussed in the other answer). Even though they are different micro-architectures, they behave largely the same since their legacy decoding capabilities are the same. Below about 15 uops we see Haswell as somewhat faster for any uop count not a multiple of 4. Perhaps it gets an additional unrolling in the LSD due to a larger LSD, or there are other "small loop" optimizations. For Sandy Bridge and Ivy Bridge, this means that small loops should definitely target a uop count which is a multiple of 4.

Coffee Lake behaves similarly to Skylake¹. This makes sense, since the micro-architecture is the same. Coffee Lake appears better than Skylake below about 16 uops, but this is just an effect of Coffee Lake's disabled LSD by default. Skylake was tested with an enabled LSD, before Intel disabled it via microcode update due to a security issue. Coffee Lake was released after this issue was known, so had the LSD disabled out-of-the-box. So for this test, Coffee Lake is using either the DSB (for loops below about 18 uops, which can still fit in the DSB) or the legacy decoder (for the remainder of the loops), which leads to better results for small uop count loops where the LSD imposes an overhead (interesting, for larger loops, the LSD and the legacy decoder happen to impose exactly the same overhead, for very different reasons).

Finally, we take a look at 2-byte NOPs, which aren't dense enough to prevent the use of the DSB (so this case is more reflective of typical code).

Figure 2.1: 2-byte nop performance:

2-byte nop performance

Again, the result is along the same lines as the earlier chart. Nehalem is still the outlier bottlenecked at 3 uops per cycle. For the range up to about 60ish uops, all architectures other than Coffee Lake are using the LSD, and we see that Sandy Bridge and Ivy Bridge perform a bit worse here, rounding up to the next cycle and so only achieving the maximum throughput of 4 uops/cycle if the number of uops in the loop is a multiple of 4. Above 32 uops the "unrolling" feature of Haswell and new uarchs dosn't have any effect, so everything is roughly tied.

Sandy Bridge actually has a few uop ranges (e.g., from 36 through 44 uops) where it performs better than the newer architectures. This seems to occur because not all loops are detected by the LSD and in these ranges the loops are served from the DSB instead. Since the DSB is generally faster, so is Sandy Bridge in these cases.

What Intel Says

You can actually find a section specifically dealing with this topic in the Intel Optimization Manual, section 3.4.2.5, as pointed out by Andreas Abel in the comments. There, Intel says:

The LSD holds micro-ops that construct small “infinite” loops.
Micro-ops from the LSD are allocated in the out-of-order engine. The
loop in the LSD ends with a taken branch to the beginning of the loop.
The taken branch at the end of the loop is always the last micro-op
allocated in the cycle. The instruction at the beginning of the loop
is always allocated at the next cycle. If the code performance is
bound by front end bandwidth, unused allocation slots result in a
bubble in allocation, and can cause performance degrada- tion.
Allocation bandwidth in Intel microarchitecture code name Sandy Bridge
is four micro-ops per cycle. Performance is best, when the number of
micro-ops in the LSD result in the least number of unused allo- cation
slots. You can use loop unrolling to control the number of micro-ops
that are in the LSD.

They go on to show an example where unrolling a loop by a factor of two doesn't help performance due to LSD "rounding", but unrolling by three works. The example is a big confusing since it actually mixes two effects since unrolling more also reduces the loop overhead and hence the number of uops per iteration. A more interesting example would have been where unrolling the loop fewer times led to an increase in performance due to LSD rounding effects.

This section seems to accurately describe the behavior in Sandy Bridge and Ivy Bridge. The results above show that both of these architectures do as described, and you lose 1, 2 or 3 uop execution slots for loops with 4N+3, 4N+2, or 4N+1 uops respectively.

It hasn't been updated with the new performance for Haswell and later however. As described in the other answer, performance has improved from the simple model described above and the behavior is more complex.

¹ There is a weird outlier at 16 uops where Coffee Lake performs worse than all the other architectures, even Nehalem (a regression of about 50%), but maybe this measurement noise?

@Peter this is the dense nop case, so in general the legacy decoder gets used since there are too many instructions per uop cache line. However for the small loops like under 18 hope one might imagine that the uop cache could still be used since there aren't enough nops to "break out" - which is what I saw on Sklyake with the LSD enabled. However for the coffee lake results it seems the DSB is not being used even for those small loops based on the perf counter results. — Nov 5 at 5:36
I will run the test on Coffee lake again later to see if the outlier was a measurement error. — Nov 5 at 16:27
I just came across section 3.4.2.5 of Intel's optimization manual. It hasn't been mentioned in the answers so far, but it seems relevant to the issue discussed here. — Nov 5 at 16:29
@PeterCordes - a correction to the above: Coffee Lake indeed uses the DSB for small loops less than about 18 uops, even in the "dense" case, so all is as expected (I observed this also on Skylake pre-microcode patch except replace DSB with LSD). I just read the data wrong or mis-remembered it. Yeah, it seems the LSD strategy was perhaps improved in Haswell: maybe the whole "unrolling" thing was added then, so before that small loops especially suffered when they weren't of the form 4N. This makes unrolling somewhat more important for those architectures. — Nov 5 at 20:55
I added created a new pull request with additional results for Coffee Lake. The outlier at 16 uops was a measurement error, probably caused by hyperthreading. — Nov 6 at 0:31

score 22 · Accepted Answer · 2018-11-05 22:31:03Z

I did some investigation with Linux perf to help answer this on my Skylake i7-6700HQ box, and Haswell results have been kindly provided by another user. The analysis below applies to Skylake, but it is followed by a comparison versus Haswell.

Other architectures may vary⁰ -and I welcome additional results - the source is available).

This question mostly deals with the front end, since on recent architectures it is the front end which imposes the hard limit of four fused-domain uops per cycle.

Summary of Rules for Loop Performance

First, I'll summarize the results in terms of a few "performance rules" to keep in mind when dealing with small loops. There are plenty of other performance rules as well - these are complementary to them (i.e., you probably don't break another rule to just to satisfy these ones). These rules apply most directly to Haswell and later architectures - see the other answer for an overview of the differences on earlier architectures.

First, count the number of macro-fused uops in your loop. You can use Agner's instruction tables to look this up directly for every instruction, except that an ALU uop and immediately follow branch will usually fuse together into a single uop. Then based on this count:

If the count is a multiple of 4, you're good: these loops execute optimally.

If the count is even and less than 32, you're good, except if it's 10 in which case you should unroll to another even number if you can.

For odd numbers you should try to unroll to an even number less than 32 or a multiple of 4, if you can.

For loops larger than 32 uops but less than 64, you might want to unroll if it isn't already a multiple of 4: with more than 64 uops you'll get efficient performance at any value on Sklyake and almost all values on Haswell (with a few deviations, possibly alignment related). The inefficiencies for these loops are still relatively small: the values to avoid most are 4N + 1 counts, followed by 4N + 2 counts.

Summary of Findings

For code served out of the uop cache, there are no apparent multiple-of-4 effects. Loops of any number of uops can be executed at a throughput of 4 fused-domain uops per cycle.

For code processed by the legacy decoders, the opposite is true: loop execution time is limited to integral number of cycles, and hence loops that are not a multiple of 4 uops cannot achieve 4 uops/cycle, as they waste some issue/execution slots.

For code issued from the loop stream detector (LSD), the situation is a mix of the two and is explained in more detail below. In general, loops less than 32 uops and with an even number of uops execute optimally, while odd-sized loops do not, and larger loops require a multiple-of-4 uop count to execute optimally.

What Intel Says

Intel actually has a note on this in their optimization manual, details in the other answer.

Details

As anyone well-versed recent x86-64 architectures knows, at any point the fetch and decode portion of the front end may be working in one several different modes, depending on the code size and other factors. As it turns out, these different modes all have different behaviors with respect to loop sizing. I'll cover them separately follow.

Legacy Decoder

The legacy decoder¹ is the full machine-code-to-uops decoder that is used² when the code doesn't fit in the uop caching mechanisms (LSD or DSB). The primary reason this would occur is if the code working set is larger than the uop cache (approximately ~1500 uops in the ideal case, less in practice). For this test though, we'll take advantage of the fact that the legacy decoder will also be used if an aligned 32-byte chunk contains more than 18 instructions³.

To test the legacy decoder behavior, we use a loop that looks like this:

short_nop:

    mov rax, 100_000_000

ALIGN 32

.top:

    dec rax

    nop

    ...

    jnz .top

    ret

Basically, a trivial loop that counts down until rax is zero. All instructions are a single uop⁴ and the number of nop instructions is varied (at the location shown as ...) to test different sizes of loops (so a 4-uop loop will have 2 nops, plus the two loop control instructions). There is no macro-fusion as we always separate the dec and jnz with at least one nop, and also no micro-fusion. Finally, there is no memory access at (outside of the implied icache access).

Note that this loop is very dense - about 1 byte per instruction (since the nop instructions are 1 byte each) - so we'll trigger the > 18 instructions in a 32B chunk condition as soon as hit 19 instructions in the loop. Based on examining the perf performance counters lsd.uops and idq.mite_uops that's exactly what we see: essentially 100% of the instructions come out of the LSD⁵ up until and including the 18 uop loop, but at 19 uops and up, 100% come from the legacy decoder.

In any case, here are the cycles/iteration for all loop sizes from 3 to 99 uops⁶:

Cyles/iteration for loops with given size

The blue points are the loops that fit in the LSD, and show somewhat complex behavior. We'll look at these later.

The red points (starting at 19 uops/iteration), are handled by the legacy decoder, and show a very predictable pattern:

All loops with N uops take exactly ceiling(N/4) iterations

So, for the legacy decoder at least, Peter's observation holds exactly on Skylake: loops with a multiple of 4 uops may execute at an IPC of 4, but any other number of uops will waste 1, 2 or 3 execution slots (for loops with 4N+3, 4N+2, 4N+1 instructions, respectively).

It is not clear to me why this happens. Although it may seem obvious if you consider that decoding happens in contiguous 16B chunks, and so at a decoding rate of 4 uops/cycle loops not a multiple of 4 would always have some trailing (wasted) slots in the cycle the jnz instruction is encountered. However, the actual fetch & decode unit is composed of predecode and decode phases, with a queue in-between. The predecode phase actually has a throughput of 6 instructions, but only decodes to the end of the 16-byte boundary on each cycle. This seems to imply that the bubble that occurs at the end of the loop could be absorbed by the predecoder -> decode queue since the predecoder has an average throughput higher than 4.

So I can't fully explain this based on my understanding of how the predecoder works. It may be that there is some additional limitation in decoding or pre-decoding that prevents non-integral cycle counts. For example, perhaps the legacy decoders cannot decode instructions on both sides of a jump even if the instructions after the jump are available in the predecoded queue. Perhaps it is related to the need to handle macro-fusion.

The test above shows the behavior where the top of the loop is aligned on a 32-byte boundary. Below is the same graph, but with an added series that shows the effect when the top of loop is moved 2 bytes up (i.e, now misaligned at a 32N + 30 boundary):

Legacy decoder cycles/iteration when misaligned

Most loop sizes now suffer a 1 or 2 cycle penalty. The 1 penalty case makes sense when you consider decoding 16B boundaries and 4-instructions per cycle decoding, and the 2 cycle penalty cases occurs for loops where for some reason the DSB is used for 1 instruction in the loop (probably the dec instruction which appears in its own 32-byte chunk), and some DSB<->MITE switching penalties are incurred.

In some cases, the misalignment doesn't hurt when it ends up better aligning the end of the loop. I tested the misalignment and it persists in the same way up to 200 uop loops. If you take the description of the predecoders at face value, it would seem that, as above, they should be able to hide a fetch bubble for misalignment, but it doesn't happen (perhaps the queue is not big enough).

DSB (Uop Cache)

The uop cache (Intel likes to call it the DSB) is able to cache most loops of moderate amount of instructions. In a typical program, you'd hope that most of your instructions are served out of this cache⁷.

We can repeat the test above, but now serving uops out of the uop cache. This is a simple matter of increasing the size of our nops to 2 bytes, so we no longer hit the 18-instruction limit. We use the 2-byte nop xchg ax, ax in our loop:

long_nop_test:

    mov rax, iters

ALIGN 32

.top:

    dec eax

    xchg ax, ax  ; this is a 2-byte nop

    ...

    xchg ax, ax

    jnz .top

    ret