Monday, March 3, 2014

SIMD-ifying Google Code Jam "Shop Credit" puzzle - Part 5, Binary Search

Yes, two in a row today.  If you just read my post on doing a sort & search, you saw I broke the O(n^2)-ness of the previous attempts, but was still slower than the best (vector) version.

1. "Tree" Version

And one reason why is that the sort & search is a bit wasteful.  It does O(n*log(n)) work to sort and then another O(log(n)) to search.  But the sorting work is 'tossed' at the end, as only one query is run against all that sorted data.

So then it occurred to me to 'search as you go' instead.  With a binary tree I can insert an item in log(n) time.  I still do O(n*log(n)) work overall, but maybe I can skip some of the sorting work.  That is, maybe I can skip sorting things I never need to scan.

Here's the code I came up with.  First I defined a struct to hold the data for an item.

struct item {
    int price;
    int idx;
    item* right;    // subtree of larger (or equal) prices
    item* left;     // subtree of strictly smaller prices
};

The 'right' and 'left' pointers define our two child links - left for smaller prices and right for larger (or equal) ones.  Both must start out null.  Next, as before, I do my min/max determination for later culling.

bool tree_version(int sz, item* items, int credit, int& ans1, int& ans2) {

    int min = items[0].price;
    int max = items[0].price;
    for (int i = 0; i < sz; i++) {
        min = items[i].price < min ? items[i].price : min;
        max = items[i].price > max ? items[i].price : max;
    }

Now it's time to build the binary tree.  I'm not being very fancy about it; it's not a balanced tree, for instance.

    for (int i = 1; i < sz; i++) {
        int goal = credit - items[i].price;
        if (goal > max) continue;
        if (goal < min) continue;

        item* it = &items[i];
        item* head = &items[0];

        while (true) {
            if (it->price < head->price) {
                if (head->left) {
                    head = head->left;
                } else {
                    head->left = it;
                    break;
                }
            } else {
                if (head->right) {
                    head = head->right;
                } else {
                    head->right = it;
                    break;
                }
            }
        }
        // look through the current tree
        if (scan(sz, &items[0], goal, ans1, ans2, i)) {
            return 1;
        }
    }
    return 0;
}

Note that the call to 'scan' is inside the insert loop itself.  Since we are discarding the tree after one query, we can abort as soon as we find what we want.

The 'scan' implementation is pretty simple...except for one thing.  In my first version I did the "did I find the answer?" check first.  But then I realized this is the least common case - most of the time we don't find the answer.  By moving it to be the third conditional clause, I branch less often.  This showed up as a 6% gain.

bool scan(int sz, const item* items, int goal, int& ans1, int& ans2,
    int skip_idx) {
    // scan tree
    const item* head = items;
    while (head) {
        if (head->price > goal) {
            // traverse left
            head = head->left;
        } else if (head->price < goal) {
            // traverse right
            head = head->right;
        } else if (head->idx != skip_idx) {
            // found the answer
            ans1 = skip_idx < head->idx ? skip_idx + 1 : head->idx + 1;
            ans2 = skip_idx > head->idx ? skip_idx + 1 : head->idx + 1;
            return 1;
        } else {
            // this is the item I just inserted, which when added to itself
            // gives the goal - not a valid pair.  It was inserted as a leaf,
            // so there is nothing below it and the walk ends here.
            head = head->left;
        }
    }
    return 0;
}
   
2. Results

Version                  | Performance (abs) | Speedup | Time to Implement
-------------------------|-------------------|---------|------------------
Naive (scalar)           | 32M clocks        | 1x      | 15 mins
Less Naive (scalar)      | 16M clocks        | 2x      | 10 mins
Min / Max Cull (scalar)  | 9M clocks         | 3.5x    | 20 mins
Intrinsics (vector)      | 4M clocks         | 8x      | 540 mins
Sort + search (scalar)   | 5M clocks         | 6.4x    | 20 mins
Binary Tree (scalar)     | 3.4M clocks       | 9.4x    | 17 mins

Hooray!  A new record.  And again, there may be vector opportunities to add on top of this.



3. Conclusions

  1. In this case, at least so far, proper use of the scalar hardware has actually surpassed the best vector result.
  2. Moreover, recall that while all 'scalar' versions are plainly-compiled C++, the vector version required tricky, non-portable, intricate intrinsics work.
  3. It would be exciting to see what vector + the BST can do.  Then again, the characteristics of the vector instruction set may preclude the techniques that got us this far.
  4. This is an incremental improvement over the 'sort' version: it takes advantage of 'early out' situations.

