I said I was going to try and SIMD-ify my work. I'm going to do that, but I'm also going to break the cardinal rule of optimization and not optimize the thing that is most important. Instead, I will invest unnecessary time in the min/max determination loop:
1. Scalar Version
The 'normal' code that will produce scalar ASM is as simple as can be...
int min = data[0];
int max = data[0];
for (int i = 0; i < sz; i++) {
min = data[i] < min ? data[i] : min;
max = data[i] > max ? data[i] : max;
}
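For completeness, the loop's result is easy to sanity-check against the standard library. A quick harness of my own (not part of the original test setup):

```cpp
#include <algorithm>
#include <cassert>

// Scalar reference min/max, identical in logic to the loop above.
void minmax_scalar(const int* data, int sz, int& min, int& max) {
    min = data[0];
    max = data[0];
    for (int i = 0; i < sz; i++) {
        min = data[i] < min ? data[i] : min;
        max = data[i] > max ? data[i] : max;
    }
}
```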
Let's look at the (scalar) ASM for this loop. This is the Intel Compiler (ICC) ASM, though the G++ output is almost the same.
..B1.28:
lea (%rdx,%rdx), %esi # compute offset for 'data[i]'
# (cute way to do %rdx*2)
incl %edx # incr loop counter
movslq %esi, %rsi # 'promote' offset
movl (%rbx,%rsi,4), %edi # load data[i]
cmpl %r15d, %edi # compare data[i] to min...
movl 4(%rbx,%rsi,4), %r8d # load data[i+1]
cmovl %edi, %r15d # ...store if data[i] < min
cmpl %r9d, %edi # compare data[i] to max
cmovge %edi, %r9d # ...store if data[i] >= max
cmpl %r15d, %r8d # compare data[i+1] to min
cmovl %r8d, %r15d # ...store if data[i+1] < min
cmpl %r9d, %r8d # compare data[i+1] to max
cmovge %r8d, %r9d # ...store if data[i+1] >= max
cmpl %eax, %edx # done?
jb ..B1.28 # ...if not, repeat
We see ICC has unrolled this loop for us; it is checking i and i+1 each loop iteration. It has hoisted the loads up above most of the comparisons in an effort to hide the load latency via out-of-order magic. I set up a test with more data than the Google puzzle; here's the time for 1,000,000 elements:
| Test | Time (clocks) | Speedup |
|---|---|---|
| ICC Scalar (unrolled 2x) | 2.9M (2.9 cycles per element) | 1x |
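The 2x-unrolled pattern ICC emits corresponds roughly to the following C++ (my reconstruction of the shape of the ASM, not the compiler's actual transformation):

```cpp
#include <cassert>

// Reconstruction of the compiler's 2x-unrolled min/max pattern.
// Two elements are loaded per iteration, like the ASM above; an odd
// trailing element is handled after the loop.
void minmax_unrolled2(const int* data, int sz, int& min, int& max) {
    min = data[0];
    max = data[0];
    int i = 0;
    for (; i + 1 < sz; i += 2) {
        int a = data[i], b = data[i + 1]; // loads hoisted, as in the ASM
        min = a < min ? a : min;
        max = a > max ? a : max;
        min = b < min ? b : min;
        max = b > max ? b : max;
    }
    if (i < sz) { // odd tail
        min = data[i] < min ? data[i] : min;
        max = data[i] > max ? data[i] : max;
    }
}
```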
2. "Forced to not unroll" version
For the heck of it I hacked up the ASM to not unroll. The new ASM is:
sall $1, %eax # somewhere, eax was set to sz / 2; I need to
# double it again
..B1.28:
movslq %esi, %rsi # 'promote' offset
movl (%rbx,%rdx,4), %edi # load data[i]
incl %edx # incr counter
cmpl %r15d, %edi # compare data[i] to min
cmovl %edi, %r15d # ...store if data[i] < min
cmpl %r9d, %edi # compare data[i] to max
cmovge %edi, %r9d # ...store if data[i] >= max
cmpl %eax, %edx # done?
jb ..B1.28 # ...if not, repeat
| Test | Time (clocks) | Speedup |
|---|---|---|
| ICC Scalar (unrolled 2x) | 2.9M | 1x |
| ICC Scalar (not) | 2.8M | 1.03x |
Nope. Practically speaking, the unrolling isn't helping or hurting much. On the other hand, if the data were not in a nice linear layout, we can imagine how the hoisted load would be well worth this tiny cost.
3. Intrinsic (SSE) Version
This of course should be a perfect SIMD opportunity. It turns out ICC and G++ do vectorize this code, but I wrote this intrinsic code as an experiment anyway. Here it is:
int min = data[0]; // cannot init to 0!
int max = data[0];
// how many 'bundles' (full 4-wide chunks) do I have?
int iters_to_do = sz/4;
// if I have at least one...
if (iters_to_do > 0) {
// load a bundle into min & max
__m128i _max = _mm_load_si128((__m128i*)&data[0]);
__m128i _min = _max;
// compare additional bundles to this one, one at a time
// adjusting min / max for each. Note however this is a
// *per lane* comparison
for (int j = 1; j < iters_to_do; j++) {
__m128i these_vals =
_mm_load_si128((__m128i*)&data[j*4]);
_min = _mm_min_epu32(_min, these_vals); // NB: epu32 = *unsigned* compare (SSE4.1);
_max = _mm_max_epu32(_max, these_vals); // only correct while the data is non-negative
// (use the _mm_min_epi32/_mm_max_epi32 forms for signed ints)
}
// Now each lane has the min/max for all values that
// occupied that lane. 'reduce' those values down to
// 1 scalar value
int min1 = _mm_extract_epi32(_min, 0);
int min2 = _mm_extract_epi32(_min, 1);
int min3 = _mm_extract_epi32(_min, 2);
int min4 = _mm_extract_epi32(_min, 3);
min = min1 < min ? min1 : min;
min = min2 < min ? min2 : min;
min = min3 < min ? min3 : min;
min = min4 < min ? min4 : min;
int max1 = _mm_extract_epi32(_max, 0);
int max2 = _mm_extract_epi32(_max, 1);
int max3 = _mm_extract_epi32(_max, 2);
int max4 = _mm_extract_epi32(_max, 3);
max = max1 > max ? max1 : max;
max = max2 > max ? max2 : max;
max = max3 > max ? max3 : max;
max = max4 > max ? max4 : max;
}
// epilogue: if the amount of data is not a multiple of 4,
// do the cleanup case scalar-style
for (int i = iters_to_do*4; i < sz; i++) {
min = data[i] < min ? data[i] : min;
max = data[i] > max ? data[i] : max;
}
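The structure of the vector version can be seen without any intrinsics at all: keep four independent running mins and maxes, one per 'lane', and only combine them at the end. A portable scalar sketch of that same shape (illustration only; the function name is mine):

```cpp
#include <cassert>

// Scalar emulation of the 4-lane min/max structure: lane l only ever
// sees elements whose index is congruent to l mod 4, exactly like the
// SSE version; the cross-lane reduction happens once at the end.
void minmax_lanes4(const int* data, int sz, int& min, int& max) {
    min = data[0];
    max = data[0];
    int bundles = sz / 4;
    if (bundles > 0) {
        int lmin[4], lmax[4];
        for (int l = 0; l < 4; l++) lmin[l] = lmax[l] = data[l];
        for (int j = 1; j < bundles; j++) {
            for (int l = 0; l < 4; l++) { // per-lane comparison
                int v = data[j * 4 + l];
                lmin[l] = v < lmin[l] ? v : lmin[l];
                lmax[l] = v > lmax[l] ? v : lmax[l];
            }
        }
        for (int l = 0; l < 4; l++) { // reduce lanes to scalars
            min = lmin[l] < min ? lmin[l] : min;
            max = lmax[l] > max ? lmax[l] : max;
        }
    }
    for (int i = bundles * 4; i < sz; i++) { // scalar epilogue
        min = data[i] < min ? data[i] : min;
        max = data[i] > max ? data[i] : max;
    }
}
```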
How did we do? It seems to have paid off. We were spending 2.9 cycles per element; now we need just over half a cycle (0.55) per element for the same check.
| Test | Time (clocks) | Speedup |
|---|---|---|
| ICC Scalar (unrolled 2x) | 2,900,000 | 1x |
| ICC Scalar (not) | 2,800,000 | 1.03x |
| Intrinsics | 550,000 | 5.2x |
So what kind of ASM did this generate? The main loop is pretty straightforward:
..B5.2:
movdqu (%rsi), %xmm1 # load initial bundle
movl $1, %r10d # initialize counter to 1
movdqa %xmm1, %xmm0 # copy data to xmm0
movl $4, %edx # init offset into data
cmpl $1, %eax # is there > 1 bundle?
jle ..B5.6 # ...no? bail
..B5.4:
incl %r10d # incr counter
movdqu (%rsi,%rdx,4), %xmm2 # load next bundle
addq $4, %rdx # add to offset for next load
pminud %xmm2, %xmm0 # compute lane-wise min
pmaxud %xmm2, %xmm1 # compute lane-wise max
cmpl %eax, %r10d # am I done?
jl ..B5.4 # ...no, do next bundle
(I do notice the loads are unaligned (movdqu) rather than aligned (movdqa). I hand-modified them to be aligned and saw virtually no difference, which surprised me a bit since this is still SSE; with AVX there is no penalty for using the unaligned-load instruction on data that happens to be aligned.)
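As an aside, movdqa is only legal if the buffer really is 16-byte aligned, and plain malloc/new makes no such promise for SSE purposes. A minimal sketch using C++17 std::aligned_alloc (the helper name is mine; aligned_alloc requires the requested size to be a multiple of the alignment, hence the round-up):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: allocate n ints on a 16-byte boundary so SSE
// aligned loads (movdqa) are legal on the returned pointer.
int* alloc_aligned_ints(std::size_t n) {
    // round the byte count up to a multiple of 16, as aligned_alloc requires
    std::size_t bytes = (n * sizeof(int) + 15) & ~std::size_t(15);
    return static_cast<int*>(std::aligned_alloc(16, bytes));
}
```

Free the result with std::free as usual.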
If you read the comments, you know that pminud and pmaxud are lane-wise (unsigned) min / max. So if you have
xmm2 = | 10 | 14 | 10 | 10 |
xmm0 = | 11 | 12 | 13 | 14 |
Then pmaxud %xmm2, %xmm0 will give you:
xmm0 = | 11 | 14 | 13 | 14 |
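That lane-wise behavior is easy to sanity-check in plain C++; here is a scalar emulation of pmaxud (illustration only, not the intrinsic itself):

```cpp
#include <cassert>

// Scalar emulation of pmaxud %src, %dst: an independent unsigned max
// per 32-bit lane, with the result written back to dst.
void pmaxud_emul(const unsigned (&src)[4], unsigned (&dst)[4]) {
    for (int l = 0; l < 4; l++)
        dst[l] = src[l] > dst[l] ? src[l] : dst[l];
}
```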
Lastly there is the epilogue, which gives you the same result as the scalar code.
4. Conclusions / Next time
We achieved some nice gains, but boy, was that a lot of work, relatively speaking, to get the intrinsic code written and debugged.
And yes, the O(N^2) nature of the original code still eats this O(N) preamble for breakfast. But ASM tweaking is a hobby of mine.
But next time I will SIMD-ify the main loop.