Dev Tip: Improving Application Performance Through Bit Manipulation


    by Darryl Gove, Oracle

                                                                                              

    About the Author
    Darryl Gove is a senior principal software engineer in the Oracle Solaris Studio team, working on optimizing applications and benchmarks for current and future processors.  He is also the author of the books Multicore Application Programming, Solaris Application Programming, and The Developer's Edge.  Find his blog at http://blogs.oracle.com/d.

    Introduction

    Bit manipulation, often referred to as bit twiddling, is a class of algorithms that use simple logical or arithmetic operations to perform a complex action more rapidly than the “obvious” implementation. These algorithms usually rely on some characteristic of the binary representation of the values involved, which is why it is often not immediately apparent how they work. In this article we will look at a subset of these algorithms.

     

    Clear last bit set

    We will start with a relatively simple operation: clearing the last bit set in a binary number. The first thing we’ll do is code it in C, in Code 1. The code works by shifting a bit along until it matches a set bit in the value; when that happens, we clear that bit and return the new value.

     

    unsigned long long clearlastbit(unsigned long long value)
    {
      unsigned long long bit = 1;
      if (value == 0) { return 0; }
      while (!(value & bit))
      {
        bit = bit << 1;
      }
      value = value ^ bit;
      return value;
    }

    Code 1: Clear last set bit

     

    Since we are interested in knowing whether we can improve the performance of this code, we need a timing harness. Code 2 uses the Solaris call gethrtime() to record a start and an end time in nanoseconds, and then computes the time per iteration for a given number of iterations.

    #include <stdio.h>
    #include <sys/time.h>

    static double s_time;

    void starttime()
    {
      s_time = 1.0 * gethrtime();
    }

    void endtime(unsigned long long its)
    {
      double e_time = 1.0 * gethrtime();
      printf("Time per iteration %5.2f ns\n", (e_time - s_time) / (1.0 * its));
      s_time = 1.0 * gethrtime();
    }

    Code 2: timing.h

     

    The final thing that we need is a test workload, shown in Code 3, so that we can measure the performance of our code. The workload measures the time it takes to go over COUNT integers and clear all the bits in each one.

     

    #include "timing.h"

    #define COUNT 1000000

    void main()
    {
      starttime();
      for (unsigned long long i = 0; i < COUNT; i++)
      {
        unsigned long long value = i;
        while (value) { value = clearlastbit(value); }
      }
      endtime(COUNT);
    }

    Code 3: Workload

     

    The basic problem with this approach is that we have a loop that goes through the bits in the value until it finds one that is set, and then clears that bit. If the value has N bits, the operation can take up to N iterations of the loop, which is usually described as order N, or O(N). We would ideally like to remove the loop from the operation altogether, which would give a constant-time, or O(1), solution. An approach that does this is shown in Code 4.

     

    unsigned long long clearlastbit2(unsigned long long value)
    {
      return (value & (value - 1));
    }

    Code 4: Using bit manipulation to clear the last set bit

     

    Code 4 performs the last bit clearing operation using one subtraction and an AND operation – a total of two operations with no loop. This will be much faster than the code containing the loop.

     

    There are two things we should do to test this out. The first is to check that the two routines produce the same output for the same input. The second is to measure their performance on the same workload. A test harness to do this is shown in Code 5.

     

    #include "timing.h"

    #define COUNT 1000000

    void main()
    {
      // Correctness test
      for (unsigned long long i = 0; i < COUNT; i++)
      {
        unsigned long long value = i;
        while (value)
        {
          unsigned long long v2 = value;
          value = clearlastbit(value);
          if (value != clearlastbit2(v2))
          {
            printf(" Mismatch input %llx: %llx != %llx\n",
                   v2, value, clearlastbit2(v2));
          }
        }
      }

      // Performance test
      starttime();
      for (unsigned long long i = 0; i < COUNT; i++)
      {
        unsigned long long value = i;
        while (value) { value = clearlastbit(value); }
      }
      endtime(COUNT);

      starttime();
      for (unsigned long long i = 0; i < COUNT; i++)
      {
        unsigned long long value = i;
        while (value) { value = clearlastbit2(value); }
      }
      endtime(COUNT);
    }

    Code 5: Correctness and performance testing

     

    Running the test harness indicates that the bit manipulation code is about 3x faster than the loop based code – on this workload.

     

    The bit twiddling code is faster, but it is not immediately clear how it works. To see why, we have to look at how a number is encoded in binary. Suppose the number is ten, which is 1010 in binary. Subtracting one clears the lowest set bit and sets all the bits below it: ten minus one is nine, which is 1001 in binary, and the second bit is no longer set. Performing an AND operation on the number and one less than the number therefore retains all the bits except that last set bit. In this example we get the result 1000, or 8 in decimal.
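
    As a quick illustration, the following minimal sketch (not part of the original harness) prints the values involved for the input ten:

    #include <stdio.h>

    int main(void)
    {
      unsigned long long value = 10;                     // 1010 in binary
      unsigned long long cleared = value & (value - 1);

      // value - 1 is 9 (1001): subtracting one clears the lowest set bit and
      // sets the bits below it; the AND keeps only the bits common to both.
      printf("%llu & %llu = %llu\n", value, value - 1, cleared);  // prints 10 & 9 = 8
      return 0;
    }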

     

    Population count

    The comparison of the performance of the algorithms to clear the last set bit might seem a bit unfair. The clear-last-bit operation is being called in a loop, and in that situation there is an obviously more efficient algorithm which remembers the last bit cleared and restarts from that point rather than from scratch every iteration. So let’s take a look at a more realistic situation where we want to know the number of bits set in a register. This comes up frequently enough that many processors implement an instruction to perform the operation, but in this example we’re not going to use that instruction. First of all, in Code 6, we’ll implement simple C code to perform the population count operation.

     

    int popc(unsigned long long value)
    {
      unsigned long long bit = 1;
      int popc = 0;
      while (bit)
      {
        if (value & bit) { popc++; }
        bit = bit << 1;
      }
      return popc;
    }

    Code 6: Simple population count

     

    This code works by iterating over the entire value and checking each bit. Given the previous code to clear the last set bit it should be apparent that this could be improved if we were to only look at the bits that were actually set.

     

    An improved version which uses this approach is shown in Code 7. There are two advantages to this approach: first, we only touch the set bits; second, we stop as soon as there are no set bits left in the value. As a result, we perform one iteration of the loop per set bit.

     

    int popc2(unsigned long long value)
    {
      int popc = 0;
      while (value)
      {
        popc++;
        value = value & (value - 1);
      }
      return popc;
    }

    Code 7: Population count using clear last set bit

     

    Again we need a test harness to both check for correctness and compare the performance of the two approaches. A suitable harness is shown in Code 8. Population count operates on 8-byte values, so it’s important that we test the full width. In the harness we achieve this by copying the lower 32 bits into the upper 32 bits.

     

    #include "timing.h"

    #define COUNT 1000000

    void main()
    {
      // Correctness test
      for (unsigned long long i = 0; i < COUNT; i++)
      {
        if (popc(i + (i << 32)) != popc2(i + (i << 32)))
        {
          printf(" Mismatch popc2 input %llx: %d != %d\n",
                 i + (i << 32), popc(i + (i << 32)), popc2(i + (i << 32)));
        }
      }

      // Performance test
      starttime();
      for (unsigned long long i = 0; i < COUNT; i++)
      {
        popc(i + (i << 32));
      }
      endtime(COUNT);

      starttime();
      for (unsigned long long i = 0; i < COUNT; i++)
      {
        popc2(i + (i << 32));
      }
      endtime(COUNT);
    }

    Code 8: Test harness for population count

     

    The performance of the improved code is about twice that of the original code. However, there is another approach which leads to even better performance.

     

    One problem with the existing code is that it contains a loop. This is fine when it takes only a few trips around the loop to get the result, but every iteration of the loop has a few instructions of overhead, and this overhead quickly adds up when the loop has a high trip count. The other problem with loops relates to processor microarchitecture.

     

    Modern processors are "pipelined", which means that one area of logic fetches the next instruction whilst another area of logic executes a previous instruction. It actually takes quite a few cycles for a fetched instruction to reach the stage where it is executed. This is fine for straight-line code, but there is a problem with branch instructions.

     

    When the branch instruction is executed it may change the place from which the next instruction is fetched. However, the processor has already fetched what it thought was the next instruction. To handle this, the processor needs to predict whether a branch is going to be taken or not, and hence modern processors have complex branch prediction units. If the branch predictor is correct, the processor continues executing instructions with no interruption. If the branch predictor is incorrect, the processor needs to throw away all the instructions after the branch and fetch instructions from the correct path. Hence mispredicted branches are somewhat costly.

     

    Now consider what happens in a loop. The processor encounters the branch at the end of the loop, and because it is a loop the branch is predicted to be taken. This prediction is correct on every iteration but the last one. On the last iteration, the processor predicts that the branch to the top will once again be taken, but this prediction is wrong. Consequently, most loops that run for more than one iteration will have at least one mispredicted branch (the final branch at loop exit). This is another good reason to avoid loops whenever possible.

     

    So we might get better performance from this code if we can avoid using a loop. This could be done by unrolling the code: we’re dealing with a fixed number of bits, so we could unroll the loop 64 times (once for each bit) and have loop-less code. The instruction count would then be proportional to the number of bits. However, there is a better approach where the instruction count is proportional to the log of the number of bits, O(log(N)). Code 9 is straight-line code for computing the population count using bit operations.

     

     

    unsigned int popc3(unsigned long long value)
    {
      unsigned long long v2;

      v2 = value >> 1;
      v2 &= 0x5555555555555555;
      value &= 0x5555555555555555;
      value += v2;

      v2 = value >> 2;
      v2 &= 0x3333333333333333;
      value &= 0x3333333333333333;
      value += v2;

      v2 = value >> 4;
      v2 &= 0x0f0f0f0f0f0f0f0f;
      value &= 0x0f0f0f0f0f0f0f0f;
      value += v2;

      v2 = value >> 8;
      v2 &= 0x00ff00ff00ff00ff;
      value &= 0x00ff00ff00ff00ff;
      value += v2;

      v2 = value >> 16;
      v2 &= 0x0000ffff0000ffff;
      value &= 0x0000ffff0000ffff;
      value += v2;

      v2 = value >> 32;
      value += v2;
      return (unsigned int) value;
    }

    Code 9: Population count using bit manipulation

     

    This code first adds pairs of adjacent bits together to make 2-bit numbers. Then it adds pairs of 2-bit numbers together to make 4-bit numbers (notice that the largest value a 4-bit field can hold at this point is 4, so we only actually use 3 of the bits). Next we add pairs of 4-bit numbers, then 8-bit numbers, then 16-bit numbers, before finally adding the two 32-bit halves together to produce the result.

     

    The code is very efficient because we’re using a single scalar operation (add) to compute a vector of results. Because we’re using scalar operations, we have to be very careful to ensure that none of the results overflow into an adjacent lane. Consequently we have quite a few AND operations to zero out the unneeded bits.

    In terms of performance this is about twice as fast as the previous loop-based version of the code, so about four times faster than the original code.
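
    That claim can be checked with a harness along the lines of Code 8. The following sketch is not from the original article; it assumes popc2 (Code 7), popc3 (Code 9), and the timing.h header from Code 2 are available, and compares the two routines before timing popc3 on the same workload.

    #include <stdio.h>
    #include "timing.h"

    #define COUNT 1000000

    int main(void)
    {
      // Correctness: popc3 should agree with popc2 for every test value
      for (unsigned long long i = 0; i < COUNT; i++)
      {
        unsigned long long v = i + (i << 32);   // exercise the full 64-bit width
        if ((unsigned int)popc2(v) != popc3(v))
        {
          printf(" Mismatch popc3 input %llx: %d != %u\n", v, popc2(v), popc3(v));
        }
      }

      // Performance: time popc3 over the same workload as Code 8
      starttime();
      for (unsigned long long i = 0; i < COUNT; i++)
      {
        popc3(i + (i << 32));
      }
      endtime(COUNT);
      return 0;
    }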

     

    Finding zero bytes

     

    The technique of using a scalar register to hold a vector of values is very powerful. Another example of how this can be used is as an efficient way of finding zero values in an array of bytes, which is particularly useful for computing string length. Code 10 shows a simple implementation of a string length routine together with a correctness and performance test harness.

     

     

    #include "timing.h"   unsigned int len(char* array)  {    unsigned int length=0;    while(array[length]) { length++; }    return length;  }   #define COUNT 100000  void main()  {    char array[COUNT];    for (int i=1; i<COUNT; i++)    {      array[i-1]='a';      array[i]=0;      if (i!=len(array)) { printf("Error at %i\n",i); }    }    starttime();    for (int i=1; i<COUNT; i++)    {      array[i-1]='a';      array[i]=0;      len(array);    }    endtime(COUNT);  } 

    Code 10: Simple implementation of string length

     

     

    A more efficient algorithm, attributed to Alan Mycroft, is shown in Code 11. The core loop consumes 8 bytes at a time. To make this work, the code has a pre-loop which ensures that the data being read is aligned to an eight-byte boundary. Although each iteration of the loop handles 8 bytes at a time, the code is only about four times faster than the original. It is worth noting that the core test detects the presence of a zero byte, but does not necessarily indicate which position the zero byte is in, so a final byte-by-byte loop locates it.

     

     

    unsigned int len2(char* array)
    {
      unsigned int length = 0;

      // Handle misaligned data
      while (((unsigned long long)&array[length]) & 7)
      {
        if (array[length] == 0) { return length; }
        length++;
      }

      unsigned long long * p = (unsigned long long *)&array[length];
      unsigned long long v8, v7;
      do
      {
        v8 = *p;
        v7 = v8 - 0x0101010101010101;
        v7 = (v7 & ~v8) & 0x8080808080808080;
        p++;
      }
      while (!v7);

      length = (char*)p - array - 8;
      while (array[length]) { length++; }
      return length;
    }

    Code 11: Alan Mycroft’s algorithm for zero detection

     

    The Mycroft algorithm works because it generates two conditions, both of which need to be true for a byte to be zero. The first condition is that subtracting one from the byte must leave the upper bit of the byte set; this is true either if the byte is greater than 0x80, or if the byte is zero (in which case the subtraction wraps around to 0xff). The second condition is that the inverse of the byte must have the upper bit set; this is only true if the byte is less than 0x80. Both conditions are true only when the byte is zero.

     

    This algorithm can be extended to search for particular values. To do this, the input value needs to be XORed with the value being searched for. After the XOR, the register will contain zero bytes where the target value was present, and non-zero bytes elsewhere.
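
    As a sketch of how this might look, the following routine replicates the target byte across all eight byte lanes, XORs it with each 8-byte chunk, and applies the zero-byte test from Code 11 to the result. The function name findbyte and its interface are mine, not from the original code, and for brevity the sketch assumes the array starts on an 8-byte boundary rather than handling misalignment as Code 11 does.

    // Return the index of the first occurrence of target in an 8-byte aligned
    // array of length bytes, or -1 if it is not present.
    int findbyte(unsigned char* array, unsigned int length, unsigned char target)
    {
      // Replicate the target byte into all eight byte lanes
      unsigned long long t8 = target;
      t8 = t8 | (t8 << 8);
      t8 = t8 | (t8 << 16);
      t8 = t8 | (t8 << 32);

      unsigned long long * p = (unsigned long long*)array;
      unsigned int i = 0;
      while (i + 8 <= length)
      {
        unsigned long long v8 = *p ^ t8;               // matching bytes become zero
        unsigned long long v7 = v8 - 0x0101010101010101;
        if ((v7 & ~v8) & 0x8080808080808080)           // Mycroft zero-byte test
        {
          break;                                       // a match lies in this chunk
        }
        p++;
        i += 8;
      }
      // Byte-by-byte search to locate the exact position (and handle the tail)
      while (i < length)
      {
        if (array[i] == target) { return (int)i; }
        i++;
      }
      return -1;
    }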

     

    It is also easy to change the code to search for different width values. For example, to locate a zero in 4-bit fields, the two constants would need to be changed to 0x1111111111111111 and 0x8888888888888888.
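
    For instance, a 4-bit version of the zero test might look like the following sketch (the helper name haszeronibble is mine; the routine reports whether any 4-bit field of the word is zero):

    // Non-zero result if any 4-bit field of value is zero: the Mycroft test
    // with the constants adjusted for 4-bit lanes.
    unsigned long long haszeronibble(unsigned long long value)
    {
      unsigned long long v7 = value - 0x1111111111111111;
      return (v7 & ~value) & 0x8888888888888888;
    }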

     

    Range comparisons

    Suppose instead of searching for a zero we want to find a value that is greater than a particular target. Code 12 shows code that searches an array for a value that is greater than a target.

     

     

    #include "timing.h"   int range(char * array, unsigned int length, unsigned char target)  {    for (unsigned int i=0; i<length; i++)    {      if (array[i]>target) { return i; }    }    return -1;  } 

    Code 12: Searching an array for a value greater than a target

     

    It is possible to recode this to operate on a vector of values. Code 13 shows a routine that performs the comparison eight bytes at a time.

     

    The code appears quite complex because we have to split the test into two versions. The reason we need two versions is that we are going to do a different set of operations depending on whether the target value is less than 128 or not.

     

    If the target value is less than 128:

     

    When the most significant bit in a byte is set, that byte is greater than the target value. If the most significant bit in the byte is not set, the byte may still be greater than the target, and we determine this by adding an amount that pushes it to 128 or above exactly when it is. The value to add is 127 minus the target value. So the operation we perform when the target value is less than 128 is

     

    ((byte & 0x80) | ((byte & 0x7f) + (127 - target))) & 0x80.

     

    This has a result of 0x80 for all the bytes which are greater than the target value.

     

    If the target value is 128 or greater:

     

    The most significant bit of the byte must be set for that byte to be greater than the target value. In addition, the remaining bits must be large enough that adding 255 minus the target value to them carries into the most significant bit position. So the operation we perform in this case is

     

    ((byte & 0x80) & ((byte & 0x7f) + (255 - target)))
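
    Both formulas can be checked exhaustively against the straightforward comparison. The following sketch is my own check rather than part of the original article; it evaluates the appropriate byte-wise formula for every combination of byte and target and reports any disagreement with byte > target.

    #include <stdio.h>

    int main(void)
    {
      for (int target = 0; target < 256; target++)
      {
        for (int byte = 0; byte < 256; byte++)
        {
          int flagged;
          if (target < 128)
          {
            flagged = (((byte & 0x80) | ((byte & 0x7f) + (127 - target))) & 0x80) != 0;
          }
          else
          {
            flagged = ((byte & 0x80) & ((byte & 0x7f) + (255 - target))) != 0;
          }
          if (flagged != (byte > target))
          {
            printf("Mismatch: byte %d target %d\n", byte, target);
          }
        }
      }
      return 0;
    }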

     

    Once again we get a performance gain of about 4x from performing this operation on a vector of values rather than a single value at a time.

     

    int range2(unsigned char* array, unsigned int length, unsigned char target)
    {
      unsigned int i = 0;

      // Handle misalignment
      while ((length > 0) && ((unsigned long long)&array[i] & 7))
      {
        if (array[i] > target) { return i; }
        i++;
        length--;
      }

      // Optimised code
      unsigned long long * p = (unsigned long long*)&array[i];
      if (target < 128)
      {
        // byte + (127 - target) reaches 128 exactly when byte > target
        unsigned long long v8 = 127 - target;
        v8 = v8 | (v8 << 8);
        v8 = v8 | (v8 << 16);
        v8 = v8 | (v8 << 32);
        while (length > 8)
        {
          unsigned long long v = *p;
          unsigned long long u;
          u = v & 0x8080808080808080; // upper bit
          v = v & 0x7f7f7f7f7f7f7f7f; // lower bits
          v = v + v8;
          unsigned long long r = (v | u) & 0x8080808080808080;
          if (r) { break; }
          length -= 8;
          p++;
          i += 8;
        }
      }
      else
      {
        // for a byte with its top bit set, the low bits plus (255 - target)
        // carry into bit 7 exactly when byte > target
        unsigned long long v8 = 255 - target;
        v8 = v8 | (v8 << 8);
        v8 = v8 | (v8 << 16);
        v8 = v8 | (v8 << 32);
        while (length > 8)
        {
          unsigned long long v = *p;
          unsigned long long u;
          u = v & 0x8080808080808080; // upper bit
          v = v & 0x7f7f7f7f7f7f7f7f; // lower bits
          v = v + v8;
          unsigned long long r = v & u;
          if (r) { break; }
          length -= 8;
          p++;
          i += 8;
        }
      }

      // Handle trailing values
      while (length > 0)
      {
        if (array[i] > target) { return i; }
        i++;
        length--;
      }
      return -1;
    }

    Code 13: Searching an array for a value greater than a target using bit manipulation

     

    Gathering bits

     

    The final operation that we’ll discuss is the situation where we want to produce a bit vector indicating which values meet some criteria. Suppose we need a bit vector indicating all the places in an array where there is a zero; Code 14 shows an example. To make the code more efficient it deals with 8 bytes at a time, and uses bit shifting to put the results into the correct bit positions.

     

    void zeros(unsigned char * array, int length, unsigned char * result)
    {
      for (int i = 0; i < length; i += 8)
      {
        result[i>>3] =
          ((array[i+0]==0) << 7) +
          ((array[i+1]==0) << 6) +
          ((array[i+2]==0) << 5) +
          ((array[i+3]==0) << 4) +
          ((array[i+4]==0) << 3) +
          ((array[i+5]==0) << 2) +
          ((array[i+6]==0) << 1) +
          ((array[i+7]==0) << 0);
      }
    }

    Code 14: Producing a vector indicating locations of zero bytes

     

    To make this operation more efficient, we need to have a set of operations that generate the byte positions where the zero bytes are located. Once we have that result we need to shift it so that the bits occupy the correct bit positions in the output vector. Code 15 shows one approach to solving this.

     

    The first part of the code identifies zero bytes by inverting the input vector; any byte that was zero now has all of its bits set. We then mask out the most significant bit in each byte, so the bytes that were zero now contain 0x7f. Adding one to every byte means that the bytes that initially contained zero now contain 0x80. Unfortunately, the bytes that originally contained 0x80 also now contain 0x80, but we can zero these by ANDing with the inverse of the original bytes. Finally, ANDing with a vector of 0x80s produces a vector of bytes in which the most significant bit is set for every byte that was originally zero.

     

    The next step is to reduce this vector of bytes into a vector of bits. We can do this by shifting the bits together: first combining the most significant bits of adjacent bytes, then combining the resulting pairs of bits from adjacent 16-bit fields, and finally combining the nibbles from the two 32-bit halves.


    The resulting code is about twice as fast as the original.

    void zeros2(unsigned long long* array, int length, unsigned char* result)
    {
      for (int i = 0; i < length; i += 8)
      {
        unsigned long long v, u;
        v = array[i>>3];

        u = ~v;
        u = u & 0x7f7f7f7f7f7f7f7f;
        u = u + 0x0101010101010101;
        v = u & (~v);
        v = v & 0x8080808080808080;

        v = v | (v << 7);
        v = v | (v << 14);
        v = (v >> 56) | (v >> 28);
        result[i>>3] = v;
      }
    }

    Code 15: Using bit manipulation to produce a vector indicating locations of zero bytes

     

    Concluding remarks

    Bit manipulation can be a very effective way of improving application performance because it potentially works on a vector of values, so we obtain multiple results from a single set of instructions. The other advantage of this approach is that we can end up with code that does not contain branch instructions, and this kind of code is usually faster for the processor to execute.