Monday, March 3, 2014

SIMD-ifying Google Code Jam "Shop Credit" puzzle - Part 4 - Sorting

Ok, time to take a break from SIMD, at least for now.

1. O(n^2)
One problem with all SIMD approaches thus far is that they have all been O(n^2).  Remember the super naive version?

bool naive(int sz, const int* data, int credit, int& ans1, int& ans2) {
    for (int i = 0; i < sz; i++) {
        for (int j = 0; j < sz; j++) {
            if (i == j) continue;
            if (data[i] + data[j] == credit) {
                ans1 = i < j ? i + 1 : j + 1;
                ans2 = i < j ? j + 1 : i + 1;
                return true;
            }
        }
    }
    return false;
}

We see that for each search we do an 'i' and a 'j' loop, each (potentially) as large as 'sz'.  Hence our O(n^2).

In our SIMD version we had cut that 'j' loop down by the vector width (4 lanes for SSE):

    for (int i = 0; i < sz; i++) {
        // <...>
        int iters = (sz - i) / 4 + 1;
        for (int j = 0; j < iters; j++) {
            // <...>
        }
    }

But n^2 / 4 is, at least for non-trivial sizes of 'n', still dominated by the n^2 term; it is still O(n^2).

2. Sorting
Of the "2 n's" that multiply together, there isn't much we can do about the 'outer' one.  One way or another we will have to scan all the items once on average.  But what about the 'inner' n?

We know that a binary search can find an item in sorted data in log2(n) time.  But it also costs us n*log2(n) time to sort that data first, for a total of n*log2(n) + log2(n).

So, how does n^2 compare with n*log2(n) + log2(n)?  This of course reduces to (n + 1)*log2(n), which is O(n*log2(n)), and for any n > 1 that is less than n^2.  But in case you don't believe me, here's what Wolfram Alpha has to say about it.

So I tried it out and implemented it like this:

bool sort_version(int sz, int* data, int credit, int& ans1, int& ans2) {
    int min = data[0];
    int max = data[0];
    int idxs[sz]; // VLA: works under GCC, but not standard C++
    for (int i = 0; i < sz; i++) {
        idxs[i] = i;
    }

    // Helper (not shown) that co-sorts 'idxs' alongside 'data' so the
    // original 1-based answer indices survive the sort.
    qsort(data, idxs, range(0, sz - 1));

    for (int i = 0; i < sz; i++) {
        min = data[i] < min ? data[i] : min;
        max = data[i] > max ? data[i] : max;
    }

    for (int i = 0; i < sz; i++) {
        if (data[i] + max < credit) continue; // hardly helps
        if (data[i] + min > credit) continue; // helps; 2x

        // Binary search (helper not shown) for the complement in the sorted tail
        int j = find(data, credit - data[i], range(i + 1, sz));
        if (j == -1) continue;
        ans1 = idxs[i] + 1;
        ans2 = idxs[j] + 1;

        if (ans1 > ans2) {
            int tmp = ans1;
            ans1 = ans2;
            ans2 = tmp;
        }
        return true;
    }
    return false;
}
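The qsort(), find(), and range() helpers are not shown in the post.  For anyone who wants to run sort_version(), here is one plausible sketch of them; this is my reconstruction, not the original code.  Note that the calls above use range() inclusively for the sort and half-open for the search, and the sketch follows suit:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Simple bounds carrier for the helper calls in sort_version().
struct range {
    int lo, hi;
    range(int l, int h) : lo(l), hi(h) {}
};

// Sort data[lo..hi] (inclusive) ascending, applying the identical
// permutation to idxs so the original answer positions survive the sort.
void qsort(int* data, int* idxs, range r) {
    int n = r.hi - r.lo + 1;
    std::vector<int> perm(n);
    std::iota(perm.begin(), perm.end(), r.lo);
    std::sort(perm.begin(), perm.end(),
              [&](int a, int b) { return data[a] < data[b]; });
    std::vector<int> d(n), x(n);
    for (int k = 0; k < n; k++) {
        d[k] = data[perm[k]];
        x[k] = idxs[perm[k]];
    }
    std::copy(d.begin(), d.end(), data + r.lo);
    std::copy(x.begin(), x.end(), idxs + r.lo);
}

// Binary search for value in sorted data[lo, hi); returns -1 if absent.
int find(const int* data, int value, range r) {
    const int* first = data + r.lo;
    const int* last  = data + r.hi;
    const int* p = std::lower_bound(first, last, value);
    return (p != last && *p == value) ? int(p - data) : -1;
}
```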

Here are our results, along with our previous work:

Version                   Performance (abs)   Speedup   Time to Implement
Naive (scalar)            32M clocks          1x        15 mins
Less Naive (scalar)       16M clocks          2x        10 mins
Min / Max Cull (scalar)   9M clocks           3.5x      20 mins
Intrinsics (vector)       4M clocks           8x        540 mins
Sort + search (scalar)    5M clocks           6.4x      20 mins

3. Conclusions
Yes, we lost ground to the vector version.  However note:

  1. We are not using vector hardware...yet.  This result really compares against the Min / Max Cull row, which was the best non-vector result we had achieved.  So, we achieved 1.8x by doing sorting / searching.  Who knows?  With a vectorized sort / search we might achieve 12.8x...
  2. This is in some sense a worst-case scenario for sorting.  We are doing n*log(n) work to sort, then log(n) to search, and then throwing all that sorting work away.  If, say, we were tasked with finding lots of item combinations within the same inventory, we could expect O(log(n)) per search, which would crush all results so far.  In fact, I project that if enough searching were done that the sorting cost became negligible, the result would be a gain of 3500x.
  3. The sort / search version tracks O(n*log(n)) very nicely as you can see in the chart below.  The other versions 'get lucky' when they find answers close to the beginning.  The vector version is simply an attenuated result of the scalar version.
  4. Realize also we are dealing with only 1800 points.  For much larger working sets the sort / search version would crush it.
  5. Nonetheless, it is clear that if you have to pick where to invest your tuning time in workloads like this you likely want to do your vectorization work last.  I spent more time on the vectorized version than all the others combined.
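As a rough illustration of point 2 above, if many credit queries hit the same inventory, the sort can be paid for once and every subsequent query only pays for its searches.  A minimal sketch of that shape (the names here are mine, not from the post):

```cpp
#include <algorithm>
#include <vector>

// Pre-sort the inventory once; answer any number of credit queries after.
// Each query scans items and binary-searches the sorted tail for the
// complement, so the per-query cost no longer includes the n*log(n) sort.
struct SortedInventory {
    std::vector<int> vals;  // item prices, ascending
    std::vector<int> orig;  // original 0-based positions, co-sorted

    explicit SortedInventory(const std::vector<int>& items) {
        std::vector<int> perm(items.size());
        for (size_t k = 0; k < perm.size(); k++) perm[k] = int(k);
        std::sort(perm.begin(), perm.end(),
                  [&](int a, int b) { return items[a] < items[b]; });
        for (int p : perm) { vals.push_back(items[p]); orig.push_back(p); }
    }

    // Find two distinct items summing to credit; 1-based indices, ans1 < ans2.
    bool query(int credit, int& ans1, int& ans2) const {
        for (size_t i = 0; i < vals.size(); i++) {
            int want = credit - vals[i];
            auto lo = vals.begin() + i + 1;
            auto p = std::lower_bound(lo, vals.end(), want);
            if (p == vals.end() || *p != want) continue;
            int a = orig[i] + 1, b = orig[p - vals.begin()] + 1;
            ans1 = std::min(a, b);
            ans2 = std::max(a, b);
            return true;
        }
        return false;
    }
};
```

The sort amortizes away as the query count grows, which is the regime where the projected large gains in point 2 would come from.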
