1. "Tree" Version
And one reason why is that sort & search is a bit wasteful. It does O(n log n) work to sort and then another O(log n) to search. But the O(n log n) sorting work is 'tossed' at the end, since only one query is run against it.
So then it occurred to me to instead 'search as you go'. If I use a binary tree, I can insert an item in O(log n) time. In the worst case I still do O(n log n) work, but maybe I can skip some of the sorting work: that is, skip sorting items I never need to scan. If the answer turns up after only k items, I've done roughly O(k log k) work instead of the full O(n log n).
Here's the code I came up with. First I defined a struct to hold the data for an item.
struct item {
    int price;
    int idx;
    item* right = nullptr;  // default to null so a fresh array is a valid tree
    item* left = nullptr;
};
The 'right' and 'left' pointers define the two child links: the left child is smaller in price and the right child is larger (or equal). Next, as before, I do my min / max determination for later culling.
bool tree_version(int sz, item* items, int credit, int& ans1, int& ans2) {
    int min = items[0].price;
    int max = items[0].price;
    for (int i = 0; i < sz; i++) {
        min = items[i].price < min ? items[i].price : min;
        max = items[i].price > max ? items[i].price : max;
    }
Now it's time to build the binary tree. I'm not being very fancy about it; it's not a balanced tree, for instance.
    for (int i = 1; i < sz; i++) {
        int goal = credit - items[i].price;
        if (goal > max) continue;
        if (goal < min) continue;
        item* it = &items[i];
        item* head = &items[0];
        while (true) {
            if (it->price < head->price) {
                if (head->left) {
                    head = head->left;
                } else {
                    head->left = it;
                    break;
                }
            } else {
                if (head->right) {
                    head = head->right;
                } else {
                    head->right = it;
                    break;
                }
            }
        }
        // look through the current tree
        if (scan(sz, &items[0], goal, ans1, ans2, i)) {
            return true;
        }
    }
    return false;
}
Note that the 'scan' call happens inside the insertion loop itself. Since we discard the tree after one query, we can abort as soon as we find what we want.
The 'scan' implementation is pretty simple, except for one thing: in my first version I did the "did I find the answer?" check first. But then I realized this is the least common case; most of the time we don't find the answer. By moving it to the third conditional clause, I branch on it less often. This showed up as a 6% gain.
bool scan(int sz, const item* items, int goal, int& ans1, int& ans2,
          int skip_idx) {
    // scan tree
    const item* head = items;
    while (head) {
        if (head->price > goal) {
            // traverse left
            head = head->left;
        } else if (head->price < goal) {
            // traverse right
            head = head->right;
        } else if (head->idx != skip_idx) {
            // have I found the answer?
            ans1 = skip_idx < head->idx ? skip_idx + 1 : head->idx + 1;
            ans2 = skip_idx > head->idx ? skip_idx + 1 : head->idx + 1;
            return true;
        } else {
            // this is the case where I have found the query item itself,
            // which would pair it with itself, which is not desired; equal
            // prices are inserted to the right, so continue the search there
            head = head->right;
        }
    }
    return false;
}
2. Results
| Version | Performance (abs) | Speedup | Time to Implement |
|---|---|---|---|
| Naive (scalar) | 32M clocks | 1x | 15 mins |
| Less Naive (scalar) | 16M clocks | 2x | 10 mins |
| Min / Max Cull (scalar) | 9M clocks | 3.5x | 20 mins |
| Intrinsics (vector) | 4M clocks | 8x | 540 mins |
| Sort + search (scalar) | 5M clocks | 6.4x | 20 mins |
| Binary Tree (scalar) | 3.4M clocks | 9.4x | 17 mins |
Hooray! A new record. And again, there may be vector opportunities to add on top of this.
3. Conclusions
- In this case, at least so far, proper use of the scalar hardware has actually surpassed the best vector result
- Moreover, recall that while all 'scalar' versions are plainly-compiled C++, the vector version required tricky, non-portable, intricate intrinsics work.
- It would be exciting to see what vector + the BST can do, though the characteristics of the vector instruction set may preclude some of the techniques that got us this far.
- This is an incremental improvement over the 'sort' version that takes advantage of 'early out' situations.