I said I was going to try and SIMD-ify my work. I'm going to do that, but I'm also going to break the cardinal rule of optimization and not optimize the thing that is most important. Instead, I will invest unnecessary time in the min/max determination loop:
1. Scalar Version
The 'normal' code that will produce scalar ASM is as simple as can be...
int min = data[0];
int max = data[0];
for (int i = 0; i < sz; i++) {
min = data[i] < min ? data[i] : min;
max = data[i] > max ? data[i] : max;
}
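For completeness, the loop's result is easy to sanity-check against the standard library. A quick harness of my own (not part of the original test setup):

```cpp
#include <algorithm>
#include <cassert>

// Scalar reference min/max, identical in logic to the loop above.
void minmax_scalar(const int* data, int sz, int& min, int& max) {
    min = data[0];
    max = data[0];
    for (int i = 0; i < sz; i++) {
        min = data[i] < min ? data[i] : min;
        max = data[i] > max ? data[i] : max;
    }
}
```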
Let's look at the (scalar) ASM for this loop. This is the Intel Compiler (ICC) ASM, though the G++ output is almost the same.
..B1.28:
lea (%rdx,%rdx), %esi # compute offset for 'data[i]'
# (cute way to do %rdx*2)
incl %edx # incr loop counter
movslq %esi, %rsi # 'promote' offset
movl (%rbx,%rsi,4), %edi # load data[i]
cmpl %r15d, %edi # compare data[i] to min...
movl 4(%rbx,%rsi,4), %r8d # load data[i+1]
cmovl %edi, %r15d # ...store if data[i] < min
cmpl %r9d, %edi # compare data[i] to max
cmovge %edi, %r9d # ...store if data[i] >= max
cmpl %r15d, %r8d # compare data[i+1] to min
cmovl %r8d, %r15d # ...store if data[i+1] < min
cmpl %r9d, %r8d # compare data[i+1] to max
cmovge %r8d, %r9d # ...store if data[i+1] >= max
cmpl %eax, %edx # done?
jb ..B1.28 # ...if not, repeat
We see ICC has unrolled this loop for us; it is checking i and i+1 each loop iteration. It has hoisted the loads up above most of the comparisons in an effort to hide the load latency via out-of-order magic. I set up a test with more data than the Google puzzle; here's the time for 1,000,000 elements:
| Test | Time (clocks) | Speedup |
|---|---|---|
| ICC Scalar (unrolled 2x) | 2.9M (2.9 cycles per element) | 1x |
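The 2x-unrolled pattern ICC emits corresponds roughly to the following C++ (my reconstruction of the shape of the ASM, not the compiler's actual transformation):

```cpp
#include <cassert>

// Reconstruction of the compiler's 2x-unrolled min/max pattern.
// Two elements are loaded per iteration, like the ASM above; an odd
// trailing element is handled after the loop.
void minmax_unrolled2(const int* data, int sz, int& min, int& max) {
    min = data[0];
    max = data[0];
    int i = 0;
    for (; i + 1 < sz; i += 2) {
        int a = data[i], b = data[i + 1]; // loads hoisted, as in the ASM
        min = a < min ? a : min;
        max = a > max ? a : max;
        min = b < min ? b : min;
        max = b > max ? b : max;
    }
    if (i < sz) { // odd tail
        min = data[i] < min ? data[i] : min;
        max = data[i] > max ? data[i] : max;
    }
}
```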
2. "Forced to not unroll" version
For the heck of it I hacked up the ASM to not unroll. The new ASM is:
sall $1, %eax # somewhere, eax was set to sz / 2; I need to
# double it again
..B1.28:
movslq %esi, %rsi # 'promote' offset
movl (%rbx,%rdx,4), %edi # load data[i]
incl %edx # incr counter
cmpl %r15d, %edi # compare data[i] to min
cmovl %edi, %r15d # ...store if data[i] < min
cmpl %r9d, %edi # compare data[i] to max
cmovge %edi, %r9d # ...store if data[i] >= max
cmpl %eax, %edx # done?
jb ..B1.28 # ...if not, repeat
| Test | Time (clocks) | Speedup |
|---|---|---|
| ICC Scalar (unrolled 2x) | 2.9M | 1x |
| ICC Scalar (not) | 2.8M | 1.03x |
Nope. Practically speaking, the unrolling isn't helping or hurting much. On the other hand, if the data were not in a nice linear layout, we can imagine how the hoisted load would be well worth this tiny cost.
3. Intrinsic (SSE) Version
This of course should be a perfect SIMD opportunity. It turns out ICC and G++ do vectorize this code, but I wrote this intrinsic code as an experiment anyway. Here it is:
int min = data[0]; // cannot init to 0!
int max = data[0];
// how many 'bundles' (full 4-wide chunks) do I have?
int iters_to_do = sz/4;
// if I have at least one...
if (iters_to_do > 0) {
// load a bundle into min & max
__m128i _max = _mm_load_si128((__m128i*)&data[0]);
__m128i _min = _max;
// compare additional bundles to this one, one at a time
// adjusting min / max for each. Note however this is a
// *per lane* comparison
for (int j = 1; j < iters_to_do; j++) {
__m128i these_vals =
_mm_load_si128((__m128i*)&data[j*4]);
_min = _mm_min_epu32(_min, these_vals); // NB: epu32 = *unsigned* compare (SSE4.1);
_max = _mm_max_epu32(_max, these_vals); // only correct while the data is non-negative
// (use the _mm_min_epi32/_mm_max_epi32 forms for signed ints)
}
// Now each lane has the min/max for all values that
// occupied that lane. 'reduce' those values down to
// 1 scalar value
int min1 = _mm_extract_epi32(_min, 0);
int min2 = _mm_extract_epi32(_min, 1);
int min3 = _mm_extract_epi32(_min, 2);
int min4 = _mm_extract_epi32(_min, 3);
min = min1 < min ? min1 : min;
min = min2 < min ? min2 : min;
min = min3 < min ? min3 : min;
min = min4 < min ? min4 : min;
int max1 = _mm_extract_epi32(_max, 0);
int max2 = _mm_extract_epi32(_max, 1);
int max3 = _mm_extract_epi32(_max, 2);
int max4 = _mm_extract_epi32(_max, 3);
max = max1 > max ? max1 : max;
max = max2 > max ? max2 : max;
max = max3 > max ? max3 : max;
max = max4 > max ? max4 : max;
}
// epilogue: if the amount of data is not a multiple of 4,
// do the cleanup case scalar-style
for (int i = iters_to_do*4; i < sz; i++) {
min = data[i] < min ? data[i] : min;
max = data[i] > max ? data[i] : max;
}
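The structure of the vector version can be seen without any intrinsics at all: keep four independent running mins and maxes, one per 'lane', and only combine them at the end. A portable scalar sketch of that same shape (illustration only; the function name is mine):

```cpp
#include <cassert>

// Scalar emulation of the 4-lane min/max structure: lane l only ever
// sees elements whose index is congruent to l mod 4, exactly like the
// SSE version; the cross-lane reduction happens once at the end.
void minmax_lanes4(const int* data, int sz, int& min, int& max) {
    min = data[0];
    max = data[0];
    int bundles = sz / 4;
    if (bundles > 0) {
        int lmin[4], lmax[4];
        for (int l = 0; l < 4; l++) lmin[l] = lmax[l] = data[l];
        for (int j = 1; j < bundles; j++) {
            for (int l = 0; l < 4; l++) { // per-lane comparison
                int v = data[j * 4 + l];
                lmin[l] = v < lmin[l] ? v : lmin[l];
                lmax[l] = v > lmax[l] ? v : lmax[l];
            }
        }
        for (int l = 0; l < 4; l++) { // reduce lanes to scalars
            min = lmin[l] < min ? lmin[l] : min;
            max = lmax[l] > max ? lmax[l] : max;
        }
    }
    for (int i = bundles * 4; i < sz; i++) { // scalar epilogue
        min = data[i] < min ? data[i] : min;
        max = data[i] > max ? data[i] : max;
    }
}
```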
How did we do? It seems to have paid off. We were spending 2.9 cycles per element; now we need just over half a cycle (0.55) per element for the same check.
| Test | Time (clocks) | Speedup |
|---|---|---|
| ICC Scalar (unrolled 2x) | 2,900,000 | 1x |
| ICC Scalar (not) | 2,800,000 | 1.03x |
| Intrinsics | 550,000 | 5.2x |
So what kind of ASM did this generate? The main loop is pretty straightforward:
..B5.2:
movdqu (%rsi), %xmm1 # load initial bundle
movl $1, %r10d # initialize counter to 1
movdqa %xmm1, %xmm0 # copy data to xmm0
movl $4, %edx # init offset into data
cmpl $1, %eax # is there > 1 bundle?
jle ..B5.6 # ...no? bail
..B5.4:
incl %r10d # incr counter
movdqu (%rsi,%rdx,4), %xmm2 # load next bundle
addq $4, %rdx # add to offset for next load
pminud %xmm2, %xmm0 # compute lane-wise min
pmaxud %xmm2, %xmm1 # compute lane-wise max
cmpl %eax, %r10d # am I done?
jl ..B5.4 # ...no, do next bundle
(I do notice the loads are unaligned (movdqu) rather than aligned (movdqa). I hand-modified them to be aligned and saw virtually no difference, which surprised me a bit since this is still SSE; with AVX there is no penalty for using the unaligned-load instruction on data that happens to be aligned.)
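As an aside, movdqa is only legal if the buffer really is 16-byte aligned, and plain malloc/new makes no such promise for SSE purposes. A minimal sketch using C++17 std::aligned_alloc (the helper name is mine; aligned_alloc requires the requested size to be a multiple of the alignment, hence the round-up):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: allocate n ints on a 16-byte boundary so SSE
// aligned loads (movdqa) are legal on the returned pointer.
int* alloc_aligned_ints(std::size_t n) {
    // round the byte count up to a multiple of 16, as aligned_alloc requires
    std::size_t bytes = (n * sizeof(int) + 15) & ~std::size_t(15);
    return static_cast<int*>(std::aligned_alloc(16, bytes));
}
```

Free the result with std::free as usual.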
If you read the comments, you know that pminud and pmaxud are lane-wise (unsigned) min / max. So if you have
xmm2 = | 10 | 14 | 10 | 10 |
xmm0 = | 11 | 12 | 13 | 14 |
Then pmaxud %xmm2, %xmm0 will give you:
xmm0 = | 11 | 14 | 13 | 14 |
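That lane-wise behavior is easy to sanity-check in plain C++; here is a scalar emulation of pmaxud (illustration only, not the intrinsic itself):

```cpp
#include <cassert>

// Scalar emulation of pmaxud %src, %dst: an independent unsigned max
// per 32-bit lane, with the result written back to dst.
void pmaxud_emul(const unsigned (&src)[4], unsigned (&dst)[4]) {
    for (int l = 0; l < 4; l++)
        dst[l] = src[l] > dst[l] ? src[l] : dst[l];
}
```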
Lastly there is the epilogue, which gives you the same result as the scalar code.
4. Conclusions / Next time
We achieved some nice gains, but boy, was that a lot of work, relatively speaking, to get the intrinsic code written and debugged.
And yes, the O(N^2) nature of the original code still eats this O(N) preamble for breakfast. But ASM tweaking is a hobby of mine.
But next time I will SIMD-ify the main loop.