Friday, January 24, 2014

SIMD-ifying Google Code Jam "Shop Credit" puzzle - Part 2

Continuing where I left off last time...

I said I was going to try to SIMD-ify my work.  I'm going to do that, but I'm also going to break the cardinal rule of optimization and not optimize the thing that matters most.  Instead, I will invest unnecessary time in the min/max determination loop:

1. Scalar Version

The 'normal' code that will produce scalar ASM is as simple as can be...

    int min = data[0];
    int max = data[0];

    for (int i = 0; i < sz; i++) {
        min = data[i] < min ? data[i] : min;
        max = data[i] > max ? data[i] : max;
    }  

Let's look at the (scalar) ASM for this loop.  This is the Intel Compiler (ICC) ASM, though the G++ output is almost the same.

..B1.28:                  
  lea    (%rdx,%rdx), %esi    # compute offset for 'data[i]' 
                              #  (cute way to do %rdx*2)
  incl   %edx                 # incr loop counter
  movslq %esi, %rsi           # 'promote' offset
  movl   (%rbx,%rsi,4), %edi  # load data[i]
  cmpl   %r15d, %edi          # compare data[i] to min...
  movl   4(%rbx,%rsi,4), %r8d # load data[i+1]
  cmovl  %edi, %r15d          # ...store if data[i] < min
  cmpl   %r9d, %edi           # compare data[i] to max
  cmovge %edi, %r9d           # ...store if data[i] >= max
  cmpl   %r15d, %r8d          # compare data[i+1] to min
  cmovl  %r8d, %r15d          # ...store if data[i+1] < min
  cmpl   %r9d, %r8d           # compare data[i+1] to max
  cmovge %r8d, %r9d           # ...store if data[i+1] >= max
  cmpl   %eax, %edx           # done? 
  jb     ..B1.28              # ...if not, repeat

We see ICC has unrolled this loop for us; it is checking i and i+1 each loop iteration.  It has hoisted the loads up above most of the comparisons in an effort to hide the load latency via out-of-order magic.  I set up a test with more data than the Google puzzle; here's the time for 1,000,000 elements:

Test                        Time (clocks)                      Speedup
ICC Scalar (unrolled 2x)    2.9M (2.9 cycles per iteration)    1x

2. "Forced to not unroll" version

For the heck of it I hacked up the ASM to not unroll.  The new ASM is:

  sall $1, %eax  # somewhere, eax was set to sz / 2; I need to
                 #  double it again
..B1.28:
  movslq %esi, %rsi            # 'promote' offset
  movl   (%rbx,%rdx,4), %edi   # load data[i]
  incl   %edx                  # incr counter
  cmpl   %r15d, %edi           # compare data[i] to min
  cmovl  %edi, %r15d           # ...store if data[i] < min
  cmpl   %r9d, %edi            # compare data[i] to max
  cmovge %edi, %r9d            # ...store if data[i] >= max
  cmpl   %eax, %edx            # done?
  jb     ..B1.28               # ...if not, repeat

The ASM gets a bit simpler; I no longer need to do an LEA to compute an offset that is 2x my counter; I can just use the counter.  My shift-left-to-double-eax is hacky and would probably break in some cases, but there it is.  So, did performance worsen?

Test                        Time (clocks)    Speedup
ICC Scalar (unrolled 2x)    2.9M             1x
ICC Scalar (not unrolled)   2.8M             1.03x

Nope.  Practically speaking, the unrolling isn't helping or hurting by much.  On the other hand, if the data were not in a nice linear layout, we can imagine how the hoisted load would be well worth this tiny cost.

3. Intrinsic (SSE) Version

This should of course be a perfect SIMD opportunity.  It turns out ICC and G++ will vectorize this code on their own, but I wrote the intrinsic version as an experiment anyway.  Here it is:

    int min = data[0];  // cannot init to 0!
    int max = data[0];

    // how many 'bundles' (full 4-wide chunks) do I have?
    int iters_to_do = sz/4;

    // if I have at least one...
    if (iters_to_do > 0) {

    // load a bundle into min & max
        __m128i _max = _mm_load_si128((__m128i*)&data[0]);
        __m128i _min = _max;

    // compare additional bundles to this one, one at a time
    //  adjusting min / max for each.  Note however this is a 
    //  *per lane* comparison
        for (int j = 1; j < iters_to_do; j++) {
            __m128i these_vals =
               _mm_load_si128((__m128i*)&data[j*4]);

            _min = _mm_min_epu32(_min, these_vals);
            _max = _mm_max_epu32(_max, these_vals);
        }   

     // Now each lane has the min/max for all values that
     //  occupied that lane.  'reduce' those values down to 
     //  1 scalar value    
        unsigned int min1 = _mm_extract_epi32(_min, 0); 
        unsigned int min2 = _mm_extract_epi32(_min, 1); 
        unsigned int min3 = _mm_extract_epi32(_min, 2); 
        unsigned int min4 = _mm_extract_epi32(_min, 3); 
        min = min1 < min ? min1 : min;
        min = min2 < min ? min2 : min;
        min = min3 < min ? min3 : min;
        min = min4 < min ? min4 : min;

        int max1 = _mm_extract_epi32(_max, 0); 
        int max2 = _mm_extract_epi32(_max, 1); 
        int max3 = _mm_extract_epi32(_max, 2); 
        int max4 = _mm_extract_epi32(_max, 3); 
        max = max1 > max ? max1 : max;
        max = max2 > max ? max2 : max;
        max = max3 > max ? max3 : max;
        max = max4 > max ? max4 : max;
    }    

    // epilogue: if the amount of data is not a multiple of 4,
    //  do the cleanup case scalar-style
    for (int i = iters_to_do*4; 
           i < iters_to_do*4 + sz % 4; i++) {
        min = data[i] < min ? data[i] : min;
        max = data[i] > max ? data[i] : max;
    }   


How did we do?  It seems to have paid off.  We were spending 2.9 cycles per element; now we need just over half a clock for the same check.


Test                        Time (clocks)    Speedup
ICC Scalar (unrolled 2x)    2,900,000        1x
ICC Scalar (not unrolled)   2,800,000        1.03x
Intrinsics                  550,000          5.2x


So what kind of ASM did this generate?  The main loop is pretty straightforward:

..B5.2:                         
  movdqu    (%rsi), %xmm1  # load initial bundle
  movl      $1, %r10d      # initialize counter to 1
  movdqa    %xmm1, %xmm0   # copy data to xmm0
  movl      $4, %edx       # init offset into data
  cmpl      $1, %eax       # is there > 1 bundle?
  jle       ..B5.6         # ...no?  bail
                                
..B5.4:                      
  incl      %r10d                # incr counter    
  movdqu    (%rsi,%rdx,4), %xmm2 # load next bundle
  addq      $4, %rdx             # add to offset for next load
  pminud    %xmm2, %xmm0         # compute lane-wise min
  pmaxud    %xmm2, %xmm1         # compute lane-wise max
  cmpl      %eax, %r10d          # am I done?
  jl        ..B5.4               # ...no, do next bundle


(I do notice the loads are unaligned (movdqu) rather than aligned (movdqa).  I hand-modified them to be aligned and saw virtually no difference, which surprised me somewhat, as this is still SSE; in AVX there is no penalty for an unaligned load on aligned data.)

If you read the comments you know that the pminud and pmaxud are lane-wise min / max.  So if you have

xmm2 = | 10 | 14 | 10 | 10 |
xmm0 = | 11 | 12 | 13 | 14 |

Then pmaxud %xmm2, %xmm0 will give you:
xmm0 = | 11 | 14 | 13 | 14 |

Thus we need to then peel out the lanes and reduce.  Horizontal operations are not the strong suit of SSE (nor any SIMD arch I've seen for that matter).  On the other hand, this code executes just once vs. (size / 4).

Lastly there is the epilogue, which gives you the same result as the scalar code.

4. Conclusions / Next time
We achieved some nice gains, but, relatively speaking, it was a lot of work to get the intrinsic code written and debugged.

And yes, the O(N^2) nature of the original code still eats this O(N) preamble for breakfast.  But ASM tweaking is a hobby of mine.

Next time, I will SIMD-ify the main loop.

