Friday, January 24, 2014

SIMD-ifying Google Code Jam "Shop Credit" puzzle - Part 2

Continuing where I left off last time...

I said I was going to try to SIMD-ify my work.  I'm going to do that, but I'm also going to break the cardinal rule of optimization and not optimize the thing that matters most.  Instead, I will invest unnecessary time in the min/max determination loop:

1. Scalar Version

The 'normal' code that will produce scalar ASM is as simple as can be...

    int min = data[0];
    int max = data[0];

    for (int i = 0; i < sz; i++) {
        min = data[i] < min ? data[i] : min;
        max = data[i] > max ? data[i] : max;
    }  

Let's look at the (scalar) ASM for this loop.  This is the Intel Compiler (ICC) ASM, though the G++ output is almost the same.

..B1.28:                  
  lea    (%rdx,%rdx), %esi    # compute offset for 'data[i]' 
                              #  (cute way to do %rdx*2)
  incl   %edx                 # incr loop counter
  movslq %esi, %rsi           # 'promote' offset
  movl   (%rbx,%rsi,4), %edi  # load data[i]
  cmpl   %r15d, %edi          # compare data[i] to min...
  movl   4(%rbx,%rsi,4), %r8d # load data[i+1]
  cmovl  %edi, %r15d          # ...store if data[i] < min
  cmpl   %r9d, %edi           # compare data[i] to max
  cmovge %edi, %r9d           # ...store if data[i] >= max
  cmpl   %r15d, %r8d          # compare data[i+1] to min
  cmovl  %r8d, %r15d          # ...store if data[i+1] < min
  cmpl   %r9d, %r8d           # compare data[i+1] to max
  cmovge %r8d, %r9d           # ...store if data[i+1] >= max
  cmpl   %eax, %edx           # done? 
  jb     ..B1.28              # ...if not, repeat

We see ICC has unrolled this loop for us; it is checking i and i+1 each loop iteration.  It has hoisted the loads up above most of the comparisons in an effort to hide the load latency via out-of-order magic.  I set up a test with more data than the Google puzzle; here's the time for 1,000,000 elements:

Test                        Time (clocks)                      Speedup
ICC Scalar (unrolled 2x)    2.9M (2.9 cycles per iteration)    1x

2. "Forced to not unroll" version

For the heck of it I hacked up the ASM to not unroll.  The new ASM is:

  sall $1, %eax  # somewhere, eax was set to sz / 2; I need to
                 #  double it again
..B1.28:
  movslq %esi, %rsi            # 'promote' offset
  movl   (%rbx,%rdx,4), %edi   # load data[i]
  incl   %edx                  # incr counter
  cmpl   %r15d, %edi           # compare data[i] to min
  cmovl  %edi, %r15d           # ...store if data[i] < min
  cmpl   %r9d, %edi            # compare data[i] to max
  cmovge %edi, %r9d            # ...store if data[i] >= max
  cmpl   %eax, %edx            # done?
  jb     ..B1.28               # ...if not, repeat

The ASM gets a bit simpler; I no longer need to do an LEA to compute an offset that is 2x my counter; I can just use the counter.  My shift-left-to-double-eax is hacky and would probably break in some cases, but there it is.  So, did performance worsen?

Test                        Time (clocks)    Speedup
ICC Scalar (unrolled 2x)    2.9M             1x
ICC Scalar (not unrolled)   2.8M             1.03x

Nope.  Practically speaking, the unrolling isn't helping or hurting by much.  On the other hand, if the data were not in a nice linear layout, we can imagine how the hoisted load would be well worth this tiny cost.

3. Intrinsic (SSE) Version

This should of course be a perfect SIMD opportunity.  It turns out ICC and G++ will vectorize this code on their own, but I wrote the intrinsic version as an experiment anyway.  Here it is:

    int min = data[0];  // cannot init to 0!
    int max = data[0];

    // how many 'bundles' (full 4-wide chunks) do I have?
    int iters_to_do = sz/4;

    // if I have at least one...
    if (iters_to_do > 0) {

    // load a bundle into min & max
        __m128i _max = _mm_load_si128((__m128i*)&data[0]);
        __m128i _min = _max;

    // compare additional bundles to this one, one at a time
    //  adjusting min / max for each.  Note however this is a 
    //  *per lane* comparison
        for (int j = 1; j < iters_to_do; j++) {
            __m128i these_vals =
               _mm_load_si128((__m128i*)&data[j*4]);

            _min = _mm_min_epu32(_min, these_vals);
            _max = _mm_max_epu32(_max, these_vals);
        }   

     // Now each lane has the min/max for all values that
     //  occupied that lane.  'reduce' those values down to 
     //  1 scalar value    
        unsigned int min1 = _mm_extract_epi32(_min, 0); 
        unsigned int min2 = _mm_extract_epi32(_min, 1); 
        unsigned int min3 = _mm_extract_epi32(_min, 2); 
        unsigned int min4 = _mm_extract_epi32(_min, 3); 
        min = min1 < min ? min1 : min;
        min = min2 < min ? min2 : min;
        min = min3 < min ? min3 : min;
        min = min4 < min ? min4 : min;

        int max1 = _mm_extract_epi32(_max, 0); 
        int max2 = _mm_extract_epi32(_max, 1); 
        int max3 = _mm_extract_epi32(_max, 2); 
        int max4 = _mm_extract_epi32(_max, 3); 
        max = max1 > max ? max1 : max;
        max = max2 > max ? max2 : max;
        max = max3 > max ? max3 : max;
        max = max4 > max ? max4 : max;
    }    

    // epilogue: if the amount of data is not a multiple of 4,
    //  do the cleanup case scalar-style
    for (int i = iters_to_do*4; 
           i < iters_to_do*4 + sz % 4; i++) {
        min = data[i] < min ? data[i] : min;
        max = data[i] > max ? data[i] : max;
    }   


How did we do?  It seems to have paid off.  We were spending 2.9 cycles per element; now we need just over half a clock for the same check.


Test                        Time (clocks)    Speedup
ICC Scalar (unrolled 2x)    2,900,000        1x
ICC Scalar (not unrolled)   2,800,000        1.03x
Intrinsics                  550,000          5.2x


So what kind of ASM did this generate?  The main loop is pretty straightforward:

..B5.2:                         
  movdqu    (%rsi), %xmm1  # load initial bundle
  movl      $1, %r10d      # initialize counter to 1
  movdqa    %xmm1, %xmm0   # copy data to xmm0
  movl      $4, %edx       # init offset into data
  cmpl      $1, %eax       # is there > 1 bundle?
  jle       ..B5.6         # ...no?  bail
                                
..B5.4:                      
  incl      %r10d                # incr counter    
  movdqu    (%rsi,%rdx,4), %xmm2 # load next bundle
  addq      $4, %rdx             # add to offset for next load
  pminud    %xmm2, %xmm0         # compute lane-wise min
  pmaxud    %xmm2, %xmm1         # compute lane-wise max
  cmpl      %eax, %r10d          # am I done?
  jl        ..B5.4               # ...no, do next bundle


(I do notice the loads are unaligned (movdqu) rather than aligned (movdqa).  I hand-modified them to be aligned and saw virtually no difference, which surprised me somewhat, as this is still SSE; in AVX there is no penalty for an unaligned load on aligned data.)

If you read the comments you know that the pminud and pmaxud are lane-wise min / max.  So if you have

xmm2 = | 10 | 14 | 10 | 10 |
xmm0 = | 11 | 12 | 13 | 14 |

Then pmaxud %xmm2, %xmm0 will give you:
xmm0 = | 11 | 14 | 13 | 14 |

Thus we need to then peel out the lanes and reduce.  Horizontal operations are not the strong suit of SSE (nor any SIMD arch I've seen for that matter).  On the other hand, this code executes just once vs. (size / 4).

Lastly there is the epilogue, which gives you the same result as the scalar code.

4. Conclusions / Next time
We achieved some nice gains, but, relatively speaking, it was a lot of work to get the intrinsic code written and debugged.

And yes, the O(N^2) nature of the original code still eats this O(N) preamble for breakfast.  But ASM tweaking is a hobby of mine.

Next time, I will SIMD-ify the main loop.

