Thursday, January 23, 2014

SIMD-ifying Google Code Jam "Shop Credit" puzzle - Part 1

The other day I hit upon this Google Code Jam puzzle:

You receive a credit C at a local store and would like to buy two items. You first walk through the store and create a list L of all available items. From this list you would like to buy two items that add up to the entire value of the credit. The solution you provide will consist of the two integers indicating the positions of the items in your list (smaller number first).

I wrote a little harness that could call different implementations.  As usual, I started with a naive version: scan the list of item prices one at a time with an 'outer' loop, then scan all the items again with an 'inner' loop to see whether the two values add up to the goal.
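The harness isn't the interesting part, but for context, a minimal sketch of its shape might look like this (the names are my own, not from the original, and I've used a brute-force stand-in solver so the sketch compiles; the real harness also timed each call):

```cpp
#include <cstdio>

// All implementations share this signature: return true and write the
// 1-based positions (smaller first) of the two items into ans1/ans2.
typedef bool (*solver_fn)(int sz, const int* data, int credit,
                          int& ans1, int& ans2);

// A brute-force stand-in so the sketch is complete; the real table held
// the naive / less-naive / culled versions developed below.
static bool reference(int sz, const int* data, int credit,
                      int& ans1, int& ans2) {
    for (int i = 0; i < sz; i++)
        for (int j = i + 1; j < sz; j++)
            if (data[i] + data[j] == credit) {
                ans1 = i + 1;
                ans2 = j + 1;
                return true;
            }
    return false;
}

// Run every registered implementation on one case and report the answers.
// (Timing code omitted here.)
static bool check_case(int sz, const int* data, int credit) {
    solver_fn solvers[] = { reference };
    for (solver_fn fn : solvers) {
        int a = 0, b = 0;
        if (!fn(sz, data, credit, a, b))
            return false;
        printf("found items %d and %d\n", a, b);
    }
    return true;
}
```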

1. Naive Version

bool naive(int sz, const int* data, int credit, int& ans1, int& ans2) {
    for (int i = 0; i < sz; i++) {
        for (int j = 0; j < sz; j++) {
            if (i == j) continue; // can't buy the same item twice
            if (data[i] + data[j] == credit) {
                ans1 = i < j ? i + 1 : j + 1;
                ans2 = i < j ? j + 1 : i + 1;
                return true;
            }
        }
    }
    return false;
}

Results so far:
Version        | Performance (abs) | Speedup | Time to Implement
Naive          | 32M clocks        | 1x      | 15 mins

If there's one tricky thing here, it's the need to make sure an item is not paired with itself.  That is, if we have $100 to spend, we cannot purchase the same $50 item twice.  Our 'continue' line guards against this.

This was fast to implement, but slow to execute (as we will see).
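To sanity-check that duplicate-item edge case, a quick standalone test works; here the naive version is reproduced so the snippet compiles on its own:

```cpp
// Same algorithm as the naive version above, repeated for a
// self-contained snippet.
bool naive(int sz, const int* data, int credit, int& ans1, int& ans2) {
    for (int i = 0; i < sz; i++)
        for (int j = 0; j < sz; j++) {
            if (i == j) continue; // can't buy the same item twice
            if (data[i] + data[j] == credit) {
                ans1 = i < j ? i + 1 : j + 1;
                ans2 = i < j ? j + 1 : i + 1;
                return true;
            }
        }
    return false;
}
```

With credit $100 and a single $50 item, the function must report no solution; with two distinct $50 items, it must report their positions.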

2. Slightly Less Naive Version

Every CS 101 student will see an obvious ~2x trick here.  The inner 'j' loop need not start at 0; it can start at i + 1.  This eliminates the need for the continue check and cuts our execution time in half.  Now we simply have:

bool less_naive(int sz, const int* data, int credit, int& ans1, int& ans2) {
    for (int i = 0; i < sz; i++) {
        for (int j = i + 1; j < sz; j++) {
            if (data[i] + data[j] == credit) {
                ans1 = i + 1; // j > i always, so no swap needed
                ans2 = j + 1;
                return true;
            }
        }
    }
    return false;
}

Results so far:
Version        | Performance (abs) | Speedup | Time to Implement
Naive          | 32M clocks        | 1x      | 15 mins
Less Naive     | 16M clocks        | 2x      | 10 mins

And yes, I'll admit I spent 10 minutes on that one.  Then I spent a little more time thinking...

3. Min / Max Culling

So then it occurred to me that if I first find the min and max values of the items, I can skip checking an item against the others whenever I know no combination can work.  For instance, if I must hit a total of $100 and the max item is $70, then I can skip the outer loop for all items under $30.  Similarly, if the minimum item value is $10, I can skip outer-loop values over $90.  Here's the code:

bool min_max_cull(int sz, const int* data, int credit, int& ans1, int& ans2) {
    int min = data[0];
    int max = data[0];

    for (int i = 0; i < sz; i++) {
        min = data[i] < min ? data[i] : min;
        max = data[i] > max ? data[i] : max;
    }

    for (int i = 0; i < sz; i++) {
        if (data[i] + max < credit) continue; // hardly helps
        if (data[i] + min > credit) continue; // helps; 2x
        for (int j = i + 1; j < sz; j++) {
            if (data[i] + data[j] == credit) {
                ans1 = i + 1; // j > i always
                ans2 = j + 1;
                return true;
            }
        }
    }
    return false;
}
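Plugging the example numbers into those two cull tests (credit $100, min $10, max $70, the same made-up values as above) shows the filter in action; here they are pulled out as a tiny helper:

```cpp
// True when 'item' could still pair with some value in [min, max] to hit
// 'credit' exactly; these are the two continue-tests from the loop above.
bool worth_scanning(int item, int min, int max, int credit) {
    if (item + max < credit) return false; // even paired with max, too low
    if (item + min > credit) return false; // even paired with min, too high
    return true;
}
```

A $29 item fails the first test ($29 + $70 = $99), a $91 item fails the second ($91 + $10 = $101), and a $30 item survives because $30 + $70 lands exactly on $100.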

Results so far:
Version        | Performance (abs) | Speedup | Time to Implement
Naive          | 32M clocks        | 1x      | 15 mins
Less Naive     | 16M clocks        | 2x      | 10 mins
Min / Max Cull | 9M clocks         | 3.5x    | 20 mins

Yes, I have to do an additional loop scan, but given this is an O(n^2) kind of thing, an additional 'n' is small potatoes.  And as it turns out, it is worth it: we are almost 2x faster than 'less naive' and 3.5x faster overall.  This one took a little longer to get working than I'd like to admit, and that's because I tried to get clever and meld the 'min max determination' loop into the main loop.  The logic gets a bit tangled, and really the SIMD stuff is what I came here for.

4. SIMD, Take 1

So one would think you could get a SIMD speedup on this code.  After all, if the inner loop can check one combination of values per iteration, why can't it check, say, 4 at a time (with SSE)?  But try as I might, I could not get the Intel compiler or GCC to auto-vectorize this code.

The reason is in the 'return'...

for (int j = i + 1; j < sz; j++) {
    if (data[i] + data[j] == credit) {
        ans1 = i + 1;
        ans2 = j + 1;
        return true;
    }
}

So you and I both know 'sz' is the array size of 'data'.  And you and I both know that therefore it is impossible for us to read beyond the end of the array.  Unfortunately, the compiler does not know this.  Our function simply says 'data' is a pointer...
  
bool naive(int sz, const int* data, int credit, int& ans1, int& ans2) {

Ok, you say, but surely the compiler can reason it out... 'j' is a simple linear walk, and it is obviously indexing 'data'.  What could go wrong?  Or put another way, if something did go wrong, it would be the programmer's fault for passing a wrong value for 'sz'.  That is, if the code crashes in scalar, it should crash in vector too.

Not quite.  Let's say 'data' really is 1,000 elements, but 'sz' is passed as 1,000,000.  If the answer involves the last real element, the scalar code returns at the right time and gives the right answer.  A vectorized version, which reads several elements per iteration, could march past that point into no-man's land and crash.  Not good.
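One way to convince yourself that the early return is the blocker: restructure the inner scan so it always runs to 'sz' and merely remembers the first hit.  The trip count is then known up front, which is the shape auto-vectorizers want (a sketch only, not the hand-written SIMD coming tomorrow):

```cpp
// Scan the whole range and record the first matching 'j' instead of
// returning early; the loop now has a fixed trip count.
int find_partner(const int* data, int sz, int i, int credit) {
    int hit = -1;
    for (int j = i + 1; j < sz; j++) {
        // Select, not branch: keep the first j whose pair sums to credit.
        bool match = (data[i] + data[j] == credit);
        hit = (hit < 0 && match) ? j : hit;
    }
    return hit; // -1 means no partner for item i
}
```

Of course, always running to the end throws away the early-exit savings that made the scalar versions fast, which is exactly the trade-off to weigh in Part 2.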

So let's try SIMD-ifying this ourselves.  For that you will have to wait for tomorrow.
